
Support multimodal input data#278

Open
moskomule wants to merge 9 commits into main from feat/multimodal-input

Conversation

@moskomule
Member

Refer to #277.

Add parse_input_utterance and preprocessor to TemplateChatDataset to support multimodal input data.

  • parse_input_utterance parses structured content of the kind used by multimodal LMs
  • preprocessor preprocesses each item before template rendering, e.g., image resizing
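As a rough sketch of the first point, a rendered template string can be parsed into structured content with either of the two methods the PR names (literal_eval and json_loads). The function below is illustrative only; the signature and argument names are assumptions, not the actual diff.

```python
import ast
import json

def parse_input_utterance(rendered: str, method: str = "json_loads"):
    """Parse a template-rendered string into structured message content.

    Hypothetical sketch: the real TemplateChatDataset parameter may be
    wired differently, but both parsing methods behave as shown here.
    """
    if method == "literal_eval":
        # Accepts Python-literal syntax, e.g. single-quoted dicts.
        return ast.literal_eval(rendered)
    if method == "json_loads":
        # Accepts strict JSON, e.g. double-quoted keys and strings.
        return json.loads(rendered)
    raise ValueError(f"Unknown parsing method: {method}")

content = parse_input_utterance(
    '[{"type": "text", "text": "What is in this image?"}]'
)
```

The result is a list of dicts rather than a plain string, which is the shape multimodal chat APIs expect for message content.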

As an example of preprocessor, I created flexeval/multimodal/image_preprocessor.py.

I would like feedback on whether such domain-specific preprocessors should be excluded from flexeval itself (and instead placed alongside users' jsonnet files).

Contributor

Copilot AI left a comment


Pull request overview

This PR adds support for multimodal input data (e.g., text + images) to flexeval, enabling evaluation of Vision Language Models (VLMs) and other multimodal language models. The implementation introduces two key features to TemplateChatDataset: parse_input_utterance to parse structured content from templates into lists of dictionaries (as required by OpenAI's multimodal API format), and preprocessor to preprocess items before template rendering (e.g., image resizing or base64 encoding).
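For reference, the list-of-dicts shape mentioned above is the content format OpenAI's chat API uses for multimodal user messages: a list of typed parts instead of a plain string.

```python
# OpenAI-style multimodal message: "content" is a list of typed parts
# (text and image_url), which is what the parsed template must produce.
# The base64 payload is truncated for illustration.
message = {
    "role": "user",
    "content": [
        {"type": "text", "text": "Describe this image."},
        {
            "type": "image_url",
            "image_url": {"url": "data:image/png;base64,iVBORw0KGgo..."},
        },
    ],
}
```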

Changes:

  • Added parse_input_utterance parameter supporting literal_eval and json_loads parsing methods
  • Added preprocessor parameter accepting a list of preprocessor instances for item transformation
  • Created Preprocessor abstract base class defining the preprocessor interface
  • Implemented ConvertImageToBase64 as an example preprocessor for image handling
  • Added tests for the parse_input_utterance functionality
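The Preprocessor ABC and ConvertImageToBase64 from the list above might look roughly like the sketch below. The class names come from this PR, but the method signature, the item-dict shape, and the constructor parameters are assumptions; resizing is omitted here (the real implementation could use Pillow for that).

```python
import base64
from abc import ABC, abstractmethod

class Preprocessor(ABC):
    """Assumed interface: transform one dataset item dict into another."""

    @abstractmethod
    def __call__(self, item: dict) -> dict:
        """Return a transformed copy of one dataset item."""

class ConvertImageToBase64(Preprocessor):
    """Encode raw image bytes under `image_key` as a base64 data URL.

    Illustrative only: the actual flexeval implementation may differ.
    """

    def __init__(self, image_key: str = "image", mime: str = "image/png"):
        self.image_key = image_key
        self.mime = mime

    def __call__(self, item: dict) -> dict:
        item = dict(item)  # avoid mutating the source item in place
        encoded = base64.b64encode(item[self.image_key]).decode("ascii")
        item[self.image_key] = f"data:{self.mime};base64,{encoded}"
        return item

processed = ConvertImageToBase64()(
    {"image": b"\x89PNG", "question": "What is shown here?"}
)
```

Running each item through such preprocessors before template rendering lets the template simply interpolate the already-encoded data URL into the structured content.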

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 11 comments.

File Description
flexeval/core/chat_dataset/template_based.py — Core implementation of multimodal support: adds the Preprocessor ABC and the parse_input_utterance and preprocessor parameters to TemplateChatDataset and its subclasses
flexeval/multimodal/image_preprocessor.py — Example implementation of an image-to-base64 preprocessor with resizing support
flexeval/multimodal/__init__.py — Module initialization exporting ConvertImageToBase64
tests/core/chat_dataset/test_template_based.py — Test coverage for the parse_input_utterance feature with different parsing methods


moskomule and others added 5 commits February 12, 2026 15:11
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

Labels

enhancement New feature or request

