
CodeFuse-VLM

CodeFuse-VLM is a multimodal LLM (MLLM) framework that provides users with multiple vision encoders, multimodal alignment adapters, and LLMs. With the CodeFuse-VLM framework, users can assemble a custom MLLM suited to their own tasks. As more models are published to the Hugging Face community, more open-source vision encoders and LLMs become available, each with its own specialty; for example, Code Llama is strong at code-related tasks but performs poorly on Chinese tasks. We therefore built the CodeFuse-VLM framework to support multiple vision encoders, multimodal alignment adapters, and LLMs, so the resulting model can be adapted to different types of tasks.

(Figure: CodeFuse-VLM framework overview)

Under the CodeFuse-VLM framework, we used a cross-attention multimodal adapter, the Qwen-14B LLM, and Qwen-VL's vision encoder to train the CodeFuse-VLM-14B model. On multiple benchmarks, CodeFuse-VLM-14B outperforms Qwen-VL and LLaVA-1.5.
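To make the composition concrete, below is a minimal PyTorch sketch of the cross-attention adapter idea: LLM text-token states act as queries over projected vision-encoder features. All module names and dimensions here are hypothetical illustrations, not the repository's actual implementation.

```python
import torch
import torch.nn as nn

class CrossAttentionAdapter(nn.Module):
    """Hypothetical sketch: LLM hidden states attend over vision features."""

    def __init__(self, llm_dim: int, vision_dim: int, num_heads: int = 8):
        super().__init__()
        # Project vision features into the LLM's hidden size.
        self.vision_proj = nn.Linear(vision_dim, llm_dim)
        # Text tokens are queries; projected vision patches are keys/values.
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, text_hidden: torch.Tensor, vision_feats: torch.Tensor) -> torch.Tensor:
        v = self.vision_proj(vision_feats)               # (B, patches, llm_dim)
        attended, _ = self.cross_attn(text_hidden, v, v)  # text attends to vision
        return self.norm(text_hidden + attended)          # residual + norm

# Toy usage: 2 samples, 16 text tokens (dim 4096), 256 vision patches (dim 1024).
adapter = CrossAttentionAdapter(llm_dim=4096, vision_dim=1024)
out = adapter(torch.randn(2, 16, 4096), torch.randn(2, 256, 1024))
print(out.shape)  # torch.Size([2, 16, 4096])
```

One appeal of cross-attention adapters is that they leave the LLM's sequence length unchanged, unlike projection-style adapters that prepend vision tokens to the prompt.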

(Figure: benchmark comparison of CodeFuse-VLM-14B, Qwen-VL, and LLaVA-1.5)

Performance of different MLLMs on common benchmarks:

| Model | MMBench | MMBench-CN | VQAv2 | GQA | TextVQA | VizWiz |
|---|---|---|---|---|---|---|
| LLaVA-1.5 | 67.7 | 63.6 | 80.0 | 63.3 | 61.3 | 53.6 |
| Qwen-VL | 60.6 | 56.7 | 78.2 | 57.5 | 63.8 | 38.9 |
| CodeFuse-VLM-14B | 75.7 | 69.8 | 79.3 | 59.4 | 63.9 | 45.3 |

Our model ranks highly on the MMBench leaderboard: https://mmbench.opencompass.org.cn/leaderboard

Here's our model's demo video:

(Video: CodeFuse-VLM-min.mp4)

Contents

- Install
- Datasets
- Multimodal Alignment
- Visual Instruction Tuning
- Evaluation

Install

Please run sh init_env.sh

Datasets

Here's the table of datasets we used to train CodeFuse-VLM-14B:

| Dataset | Task Type | Number of Samples |
|---|---|---|
| synthdog-en | OCR | 800,000 |
| synthdog-zh | OCR | 800,000 |
| cc3m (downsampled) | Image Caption | 600,000 |
| cc3m (downsampled) | Image Caption | 600,000 |
| SBU | Image Caption | 850,000 |
| Visual Genome VQA (downsampled) | Visual Question Answering (VQA) | 500,000 |
| Visual Genome Region Descriptions (downsampled) | Reference Grounding | 500,000 |
| Visual Genome Objects (downsampled) | Grounded Caption | 500,000 |
| OCR VQA (downsampled) | OCR and VQA | 500,000 |

Please download these datasets from their official websites.

Multimodal Alignment

Please run sh scripts/pretrain.sh or sh scripts/pretrain_multinode.sh
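The usual pattern in LLaVA-style alignment pretraining is to train only the multimodal adapter while the vision encoder and LLM stay frozen. Below is a hedged sketch of that freezing pattern; the helper and its arguments are hypothetical and not this repository's actual API.

```python
import torch.nn as nn

def freeze_for_alignment(vision_encoder: nn.Module, llm: nn.Module, adapter: nn.Module) -> None:
    """Hypothetical helper: leave only the multimodal adapter trainable."""
    for module, trainable in ((vision_encoder, False), (llm, False), (adapter, True)):
        for p in module.parameters():
            p.requires_grad = trainable

# Toy usage with stand-in modules.
enc, llm, adapter = nn.Linear(8, 8), nn.Linear(8, 8), nn.Linear(8, 8)
freeze_for_alignment(enc, llm, adapter)
print(all(not p.requires_grad for p in llm.parameters()))  # True
```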

Visual Instruction Tuning

Please run sh scripts/finetune.sh or sh scripts/finetune_multinode.sh
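Instruction tuning consumes image-grounded conversations. As a rough illustration, a LLaVA-style training sample might look like the snippet below; the exact schema expected by scripts/finetune.sh is an assumption based on LLaVA conventions, not something confirmed by this repository.

```python
# Hypothetical LLaVA-style instruction sample; field names are assumptions.
sample = {
    "id": "000001",
    "image": "coco/train2017/000000000001.jpg",
    "conversations": [
        {"from": "human", "value": "<image>\nWhat is shown in this picture?"},
        {"from": "gpt", "value": "A dog catching a frisbee in a park."},
    ],
}
```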

Evaluation

Please run the Python scripts in the llava/eval/ directory.
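These scripts compare model predictions against benchmark ground truth. As an illustration of the kind of scoring involved, here is a minimal exact-match VQA accuracy function; the JSONL layout and field names are hypothetical, not the actual formats used in llava/eval/.

```python
import json

def vqa_accuracy(pred_path: str, gold_path: str) -> float:
    """Exact-match accuracy between JSONL predictions and ground truth."""
    def load(path: str) -> dict:
        with open(path) as f:
            rows = [json.loads(line) for line in f]
        return {r["question_id"]: r["answer"] for r in rows}

    preds, golds = load(pred_path), load(gold_path)
    hits = sum(
        preds.get(qid, "").strip().lower() == ans.strip().lower()
        for qid, ans in golds.items()
    )
    return hits / len(golds)
```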

CodeFuse-VLM Product Video

Here's a demo video of a front-end code copilot backed by our VLM model:

(Video: -min.mp4)
