Update dependency transformers to v5 #92
Open
This PR contains the following updates:
transformers: ==4.57.1 → ==5.0.0
Release Notes
huggingface/transformers (transformers)
v5.0.0: Transformers v5 (Compare Source)
Transformers v5 release notes
We have a migration guide, continuously updated on the main branch; please check it out in case you're facing issues: migration guide.
Highlights
We are excited to announce the initial release of Transformers v5. This is the first major release in five years, and it is significant: 1200 commits have been pushed to main since the latest minor release. This release removes a lot of long-due deprecations, introduces several refactors that significantly simplify our APIs and internals, and comes with a large number of bug fixes.
We give an overview of our focus for this release in the following blogpost. In these release notes, we'll focus directly on the refactors and new APIs coming with v5.
This release is the full V5 release. It sets in motion something bigger: going forward, starting with v5, we'll now release minor releases every week, rather than every 5 weeks. Expect v5.1 to follow next week, then v5.2 the week that follows, etc.
We're moving forward with this change to ensure you have access to models as soon as they're supported in the library, rather than a few weeks after.
To install this release, upgrade the transformers package with pip.
For us to deliver the best package possible, it is imperative that we have feedback on how the toolkit is currently working for you. Please try it out, and open an issue if you run into anything inconsistent or any bug.
Transformers version 5 is a community endeavor, and we couldn't have shipped such a massive release without the help of the entire community.
Significant API changes
Dynamic weight loading
We introduce a new weight loading API in transformers, which significantly improves on the previous API. This weight loading API is designed to apply operations to the checkpoints loaded by transformers.
Instead of loading the checkpoint exactly as it is serialized within the model, these operations can reshape, merge,
and split the layers according to how they're defined in this new API. These operations are often a necessity when
working with quantization or parallelism algorithms.
This new API is centered around the new WeightConverter class. The weight converter is designed to apply a list of operations on the source keys, resulting in target keys. A common operation done on the attention layers is to fuse the query, key, and value layers. Doing so with this API would amount to defining a conversion like the following:
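The sketch below is illustrative only: WeightConverter and Concatenate are named in these notes, but the import path and exact keyword arguments are assumptions.

```python
# Illustrative sketch only: the import path and argument names are assumptions.
from transformers.core_model_loading import Concatenate, WeightConverter

# Fuse the separate q/k/v projection weights into a single fused qkv projection.
qkv_fusion = WeightConverter(
    ["self_attn.q_proj.weight", "self_attn.k_proj.weight", "self_attn.v_proj.weight"],  # source keys
    "self_attn.qkv_proj.weight",                                                        # target key
    operations=[Concatenate(dim=0)],  # concatenate along the output dimension
)
```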
In this situation, we apply the Concatenate operation, which accepts a list of layers as input and returns a single layer.
This allows us to define a mapping from architecture to a list of weight conversions. These weight conversions can apply arbitrary transformations to the layers themselves. This significantly simplified the from_pretrained method and helped us remove a lot of technical debt that we accumulated over the past few years.
This results in several improvements; see the linked PR below for details.
Linked PR: #41580
Tokenization
Just as we moved towards a single backend library for model definition, we want our tokenizers, and the Tokenizer object, to be a lot more intuitive. With v5, tokenizer definition is much simpler: one can now initialize an empty LlamaTokenizer and train it directly on your corpus. Defining a new tokenizer object should be as simple as writing its class definition.
Once the tokenizer is defined this way, you can instantiate it with Llama5Tokenizer(). Doing this returns an empty, trainable tokenizer that follows the definition of the authors of Llama5 (it does not exist yet 😉).
That is the main motivation behind refactoring tokenization: we want tokenizers to behave similarly to models: trained or empty, and with exactly what is defined in their class definition.
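As a hedged sketch of that workflow (the blank initialization follows the description above; train_new_from_iterator is the existing training helper, and using it on a blank tokenizer like this is an assumption):

```python
from transformers import LlamaTokenizer

# An empty, trainable tokenizer that follows the Llama class definition.
tokenizer = LlamaTokenizer()

corpus = ["first training sentence", "second training sentence"]
# Train a tokenizer of the same type on the corpus and save it.
trained_tokenizer = tokenizer.train_new_from_iterator(corpus, vocab_size=32_000)
trained_tokenizer.save_pretrained("./my-llama-tokenizer")
```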
Backend Architecture Changes: moving away from the slow/fast tokenizer separation
Up to now, transformers maintained two parallel implementations for many tokenizers:
- Slow tokenizers (tokenization_<model>.py): Python-based implementations, often using SentencePiece as the backend.
- Fast tokenizers (tokenization_<model>_fast.py): Rust-based implementations using the 🤗 tokenizers library.
In v5, we consolidate to a single tokenizer file per model: tokenization_<model>.py. This file will use the most appropriate backend available:
- the sentencepiece library, inheriting from PythonBackend;
- 🤗 tokenizers, which basically allows adding tokens;
- MistralCommon's tokenization library (previously known as the MistralCommonTokenizer).
The AutoTokenizer automatically selects the appropriate backend based on available files and dependencies. This is transparent: you continue to use AutoTokenizer.from_pretrained() as before. This allows transformers to be future-proof and modular, so future backends can easily be supported.
Defining a tokenizer outside of the existing backends
We enable users and tokenizer builders to define their own tokenizers from top to bottom. Tokenizers are usually defined using a backend such as tokenizers, sentencepiece, or mistral-common, but we offer the possibility to design the tokenizer at a higher level, without relying on those backends.
To do so, you can import PythonBackend (which was previously known as PreTrainedTokenizer). This class encapsulates all the logic related to added tokens, encoding, and decoding.
If you want something even higher up the stack, then PreTrainedTokenizerBase is what PythonBackend inherits from. It contains the very basic tokenizer API features:
- encode
- decode
- vocab_size
- get_vocab
- convert_tokens_to_ids
- convert_ids_to_tokens
- from_pretrained
- save_pretrained
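As a hedged sketch of a backend-free tokenizer (the import path and the exact set of methods to override are assumptions modeled on the former PreTrainedTokenizer interface):

```python
# Hedged sketch: the import path and the overridden methods mirror the former
# PreTrainedTokenizer interface and are assumptions about the v5 PythonBackend class.
from transformers.tokenization_python import PythonBackend  # assumed import path

class SpaceTokenizer(PythonBackend):
    """Toy whitespace tokenizer built directly on PythonBackend, with no extra backend."""

    def __init__(self, vocab, unk_token="<unk>", **kwargs):
        self._vocab = dict(vocab)
        self._ids_to_tokens = {i: t for t, i in self._vocab.items()}
        super().__init__(unk_token=unk_token, **kwargs)

    @property
    def vocab_size(self):
        return len(self._vocab)

    def get_vocab(self):
        return dict(self._vocab)

    def _tokenize(self, text):
        return text.split()

    def _convert_token_to_id(self, token):
        return self._vocab.get(token, self._vocab[self.unk_token])

    def _convert_id_to_token(self, index):
        return self._ids_to_tokens.get(index, self.unk_token)
```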
API Changes
1. Direct tokenizer initialization with vocab and merges
Starting with v5, we now enable initializing blank, untrained tokenizers-backed tokenizers. Such a tokenizer will follow the definition of the LlamaTokenizer as written in its class definition, and it can then be trained on a corpus as described in the tokenizers documentation.
These tokenizers can also be initialized from vocab and merges (if necessary), like the previous "slow" tokenizers:
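A minimal sketch, assuming vocab and merges are accepted as keyword arguments (the exact argument names are an assumption):

```python
from transformers import LlamaTokenizer

# Tiny in-memory vocabulary and BPE merges, for illustration only.
vocab = {"<unk>": 0, "<s>": 1, "</s>": 2, "he": 3, "llo": 4, "hello": 5}
merges = [("he", "llo")]

tokenizer = LlamaTokenizer(vocab=vocab, merges=merges)
print(tokenizer.get_vocab())
```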
This tokenizer will behave as a Llama-like tokenizer with an updated vocabulary. This allows comparing different tokenizer classes with the same vocab, thereby enabling the comparison of different pre-tokenizers, normalizers, etc.
Note that vocab_file (i.e. a path to a file containing the vocabulary) cannot be used to initialize the LlamaTokenizer, as loading from files is reserved for the from_pretrained method.
2. Simplified decoding API
The batch_decode and decode methods have been unified to reflect the behavior of the encode method. Both single and batch decoding now use the same decode method. See an example of the new behavior below:
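A minimal sketch of the unified behavior described above (gpt2 is only an example checkpoint, and the decoded strings depend on the tokenizer):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint

ids = tokenizer("Hello world").input_ids
print(tokenizer.decode(ids))         # single sequence   -> a string
print(tokenizer.decode([ids, ids]))  # batch of sequences -> a list of strings
```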
We expect encode and decode to behave as two sides of the same coin: encode, process, decode should just work.
3. Unified encoding API
The encode_plus method is deprecated in favor of the single __call__ method.
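For example (a sketch with an illustrative checkpoint):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # example checkpoint

# Previously: tokenizer.encode_plus("Hello world", return_tensors="pt")
encoding = tokenizer("Hello world", return_tensors="pt")
print(encoding.input_ids)
```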
4. apply_chat_template returns BatchEncoding
Previously, apply_chat_template returned input_ids for backward compatibility. Starting with v5, it now consistently returns a BatchEncoding dict like other tokenizer methods.
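A sketch of the new behavior, assuming a chat-capable checkpoint (the model name below is only an example):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM2-135M-Instruct")  # example chat model

messages = [{"role": "user", "content": "Hello!"}]
encoding = tokenizer.apply_chat_template(messages, add_generation_prompt=True, return_tensors="pt")
print(encoding.keys())  # a BatchEncoding, e.g. with input_ids and attention_mask
```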
5. Removed legacy configuration file saving
We simplify the serialization of tokenization attributes:
- special_tokens_map.json: special tokens are now stored in tokenizer_config.json.
- added_tokens.json: added tokens are now stored in tokenizer.json.
- added_tokens_decoder is only stored when there is no tokenizer.json.
When loading older tokenizers, these files are still read for backward compatibility, but new saves use the consolidated format. We're gradually moving towards consolidating attributes to fewer files so that other libraries and implementations may depend on them more reliably.
6. Model-Specific Changes
Several models that had identical tokenizers now import from their base implementation; these modules will eventually be removed altogether.
Removed T5-specific workarounds
The internal _eventually_correct_t5_max_length method has been removed. T5 tokenizers now handle max length consistently with other models.
Testing Changes
A few testing changes specific to tokenizers have been applied:
- Tests for common behaviors (add_tokens, encode, decode) are now centralized and automatically applied across all tokenizers. This reduces test duplication and ensures consistent behavior.
- For legacy implementations, the original BERT Python tokenizer code (including WhitespaceTokenizer, BasicTokenizer, etc.) is preserved in bert_legacy.py for reference purposes.
7. Deprecated / Modified Features
Special Tokens Structure:
- SpecialTokensMixin: merged into PreTrainedTokenizerBase to simplify the tokenizer architecture.
- special_tokens_map: now only stores named special token attributes (e.g., bos_token, eos_token). Use extra_special_tokens for additional special tokens (formerly additional_special_tokens).
- all_special_tokens: includes both named and extra tokens.
- special_tokens_map_extended and all_special_tokens_extended: removed. Access AddedToken objects directly from _special_tokens_map or _extra_special_tokens if needed.
- additional_special_tokens: still accepted for backward compatibility but is automatically converted to extra_special_tokens.
Deprecated Methods:
- sanitize_special_tokens(): already deprecated in v4, removed in v5.
- prepare_seq2seq_batch(): deprecated; use __call__() with the text_target parameter instead.
- BatchEncoding.words(): deprecated; use word_ids() instead.
Removed Methods:
- create_token_type_ids_from_sequences(): removed from the base class. Subclasses that need custom token type ID creation should implement this method directly.
- prepare_for_model(), build_inputs_with_special_tokens(), truncate_sequences(): moved from tokenization_utils_base.py to tokenization_python.py for PythonBackend tokenizers. TokenizersBackend provides model-ready input via tokenize() and encode(), so these methods are no longer needed in the base class.
- _switch_to_input_mode(), _switch_to_target_mode(), as_target_tokenizer(): removed from the base class. Use __call__() with the text_target parameter instead.
- parse_response(): removed from the base class.
Performance
MoE Performance
The v5 release significantly improves the performance of the MoE models, as can be seen in the graphs below. We improve and optimize MoE performance through batched and grouped experts implementations, and we optimize them for decoding using batched_mm.
Core performance
We focus on improving the performance of loading weights on device (which gives speedups of up to 6x in tensor-parallel situations); this is preliminary work that we'll continue in the coming weeks. Some notable improvements:
Library-wide changes with lesser impact
Default dtype update
We have updated the default dtype for all models loaded with from_pretrained to be auto. This will lead to model instantiations respecting the dtype in which the model was saved, rather than forcing it to load in float32. You can, of course, still specify the dtype in which you want to load your model by specifying it as an argument to the from_pretrained method.
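For example (a sketch; gpt2 is only an illustrative checkpoint, and dtype is the keyword that replaced the older torch_dtype):

```python
import torch
from transformers import AutoModelForCausalLM

# Default behavior: dtype="auto", i.e. the checkpoint is loaded in the dtype it was saved in.
model = AutoModelForCausalLM.from_pretrained("gpt2")

# An explicit override still works.
model_bf16 = AutoModelForCausalLM.from_pretrained("gpt2", dtype=torch.bfloat16)
```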
Shard size
The Hugging Face Hub infrastructure has gradually moved to a Xet backend. This will significantly simplify uploads and downloads, with higher download and upload speeds, partial uploads, and, most notably, a higher threshold for accepted file sizes on the Hugging Face Hub.
To reflect this, we're increasing the default shard size of models serialized on the Hub to 50GB (up from 5GB).
use_auth_token
The use_auth_token argument/parameter is deprecated in favor of token everywhere. You should be able to search and replace use_auth_token with token and get the same logic.
Linked PR: #41666
Attention-related features
We decided to remove some features in v5 as they are currently only supported in a few old models and are no longer integrated in current model additions. It's recommended to stick to v4.x in case you need them. The following features are affected:
Updates to supported torch APIs
We dropped support for two torch APIs:
- torchscript, in #41688
- torch.fx, in #41683
Those APIs were deprecated by the PyTorch team, and we're instead focusing on the supported APIs: dynamo and export.
Quantization changes
We clean up the quantization API in transformers, and significantly refactor the weight loading as highlighted
above.
We drop support for two quantization arguments that have been deprecated for some time:
- load_in_4bit
- load_in_8bit
We remove them in favor of the quantization_config argument, which is much more complete. As an example, here is how you would load a 4-bit bitsandbytes model using this argument:
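A sketch using BitsAndBytesConfig (the checkpoint is only an example; bitsandbytes and a CUDA GPU are required):

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-0.5B",  # example checkpoint
    quantization_config=quantization_config,
)
```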
Configuration
- All from_xxx_config methods are deleted. Configs can be initialized from the __init__ method in the same way. See #41314.
- RoPE configuration is unified under config.rope_parameters, including rope_theta and rope_type (see the sketch below). A model's config.rope_parameters is a simple dictionary in most cases, and can also be a nested dict in special cases (i.e. Gemma3 and ModernBert) with different RoPE parameterization for each layer type. Trying to get config.rope_theta will throw an attribute error from now on. See #39847 and #42255.
- Sub-config attributes are no longer exposed on the top-level config (e.g. config.vocab_size). Users are expected to access keys from their respective sub-configs (config.text_config.vocab_size).
- Models that cannot generate (i.e. cannot call model.generate()) will no longer have a generation_config, and model.config.generation_config will throw an attribute error.
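As a hedged sketch of the new RoPE access pattern referenced in the list above (the checkpoint is only an example, and the exact dictionary contents vary per model):

```python
from transformers import AutoConfig

config = AutoConfig.from_pretrained("Qwen/Qwen2.5-0.5B")  # example checkpoint

print(config.rope_parameters)                 # e.g. a dict containing rope_type and rope_theta
theta = config.rope_parameters["rope_theta"]  # config.rope_theta now raises an attribute error
```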
Processing
Tokenization
- Slow tokenizer files (tokenization_<model>.py) will be removed in favor of fast tokenizer files; tokenization_<model>_fast.py will be renamed to tokenization_<model>.py. As fast tokenizers are 🤗 tokenizers-backed, they include a wider range of features that are maintainable and reliable.
- encode_plus --> __call__
- batch_decode --> decode
- apply_chat_template used to return naked input_ids by default rather than a BatchEncoding dict. This was inconvenient - it should return a BatchEncoding dict like tokenizer.__call__(), but we were stuck with it for backward compatibility. The method now returns a BatchEncoding.
Linked PRs:
Processing classes
- Processor attributes are now serialized in processor_config.json as a nested dict, instead of serializing attributes in their own config files. Loading will be supported for all old-format processors (#41474).
- XXXFeatureExtractors classes are completely removed in favor of the XXXImageProcessor class for all vision models (#41174).
- XXXFastImageProcessorKwargs is removed in favor of XXXImageProcessorKwargs, which will be shared between fast and slow processors (#40931).
Modeling
- RotaryEmbeddings layers will start returning a dict of tuples in case the model uses several RoPE configurations (Gemma2, ModernBert). Each value will be a tuple of "cos, sin" per RoPE type.
- The RotaryEmbeddings layer configuration will be unified and accessed via config.rope_parameters. The config attribute for rope_theta might not be accessible anymore for some models, and will instead be in config.rope_parameters['rope_theta']. BC will be supported for a while as much as possible, and in the near future we'll gradually move to the new RoPE format (#39847).
- Multimodal models no longer expose model.language_model directly. It is recommended to either access the module with model.model.language_model or model.get_decoder(). See #42156.
- Models now accept kwargs in their forward methods.
Generate
- Per-decoding-strategy output classes have been removed (e.g. GreedySearchEncoderDecoderOutput). We now only have 4 output classes built from the following matrix: decoder-only vs encoder-decoder, uses beams vs doesn't use beams (#40998).
- If generate doesn't receive any KV cache argument, the default cache class used is now defined by the model (as opposed to always being DynamicCache) (#41505).
- If generation parameters are still stored in config.json for any old model, they will be loaded back into the model's generation config. Users are expected to access or modify generation parameters only through model.generation_config, e.g. model.generation_config.do_sample = True.
Trainer
New Features
- compute_loss_func handling: compute_loss_func now always takes priority over the model's built-in loss computation, giving users consistent control over custom loss functions (see the sketch after this list).
- num_items_in_batch in the prediction step: the num_items_in_batch argument is now passed to compute_loss during prediction_step, enabling proper loss scaling during evaluation.
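A hedged sketch of wiring a custom loss into Trainer, as mentioned in the list above (the (outputs, labels, num_items_in_batch) signature follows the documented compute_loss_func convention; model and train_dataset are placeholders assumed to be defined elsewhere):

```python
import torch
from transformers import Trainer, TrainingArguments

def my_loss(outputs, labels, num_items_in_batch=None):
    # In v5 this custom loss always takes priority over the model's built-in loss,
    # and num_items_in_batch is now also passed during prediction_step.
    logits = outputs.logits
    loss = torch.nn.functional.cross_entropy(
        logits.view(-1, logits.size(-1)), labels.view(-1), reduction="sum"
    )
    return loss / num_items_in_batch if num_items_in_batch else loss

trainer = Trainer(
    model=model,                  # placeholder: your model
    args=TrainingArguments(output_dir="out", report_to="none"),
    train_dataset=train_dataset,  # placeholder: your dataset
    compute_loss_func=my_loss,
)
```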
Breaking Changes
- report_to now defaults to "none".
Removing arguments without a deprecation cycle in TrainingArguments due to low usage:
- mp_parameters -> legacy param that was later on added to the SageMaker trainer
- _n_gpu -> not intended for users to set; we will initialize it correctly instead of putting it in TrainingArguments
- overwrite_output_dir -> replaced by resume_from_checkpoint; it was only used in the example scripts, no impact on Trainer
- logging_dir -> only used for TensorBoard; set the TENSORBOARD_LOGGING_DIR env var instead
- jit_mode_eval -> use use_torch_compile instead, as torchscript is not recommended anymore
- tpu_num_cores -> it is actually better to remove it, as it is not recommended to set the number of cores; by default, all TPU cores are used. Set the TPU_NUM_CORES env var instead
- past_index -> it was only used for a very small number of models with special architectures like Transformer-XL, and it was not documented at all how to train those models
- ray_scope -> only a minor arg for the Ray integration. Set the RAY_SCOPE env var instead
- warmup_ratio -> use warmup_steps instead. We combined both args by allowing float values in warmup_steps
Removing deprecated arguments in TrainingArguments:
- fsdp_min_num_params and fsdp_transformer_layer_cls_to_wrap -> use fsdp_config
- tpu_metrics_debug -> debug
- push_to_hub_token -> hub_token
- push_to_hub_model_id and push_to_hub_organization -> hub_model_id
- include_inputs_for_metrics -> include_for_metrics
- per_gpu_train_batch_size -> per_device_train_batch_size
- per_gpu_eval_batch_size -> per_device_eval_batch_size
- use_mps_device -> mps will be used by default if detected
- fp16_backend and half_precision_backend -> we will only rely on torch.amp, as everything has been upstreamed to torch
- no_cuda -> use_cpu
- include_tokens_per_second -> include_num_input_tokens_seen
- use_legacy_prediction_loop -> we only use the evaluation_loop function from now on
Removing deprecated arguments in Trainer:
- tokenizer in initialization -> processing_class
- model_path in train() -> resume_from_checkpoint
Removed features for Trainer
New defaults for Trainer
- use_cache in the model config will be set to False. You can still change the cache value through the TrainingArguments use_cache argument if needed.
Pipeline
PushToHubMixin
- Removed organization and repo_url from PushToHubMixin. You must pass a repo_id instead.
- Removed ignore_metadata_errors from PushToHubMixin. In practice, if we ignore errors while loading the model card, we won't be able to push the card back to the Hub, so it's better to fail early and not provide the option to fail later.
- push_to_hub methods do not accept **kwargs anymore. All accepted parameters are explicitly documented.
- push_to_hub arguments are now keyword-only to avoid confusion. Only repo_id can be positional since it's the main arg.
- Removed the use_temp_dir argument from push_to_hub. We now use a tmp dir in all cases.
Linked PR: #42391.
CLI
The previously deprecated transformers-cli ... command has been removed; transformers ... is now the only CLI entry point. The transformers CLI has been migrated to Typer, making it easier to maintain and adding some nice features out of the box (improved --help sections, autocompletion).
The biggest breaking change is in transformers chat. This command starts a terminal UI to interact with a chat model. It used to also be able to start a Chat Completion server powered by transformers and chat with it. In this revamped version, that feature has been removed in favor of transformers serve. The goal of splitting transformers chat and transformers serve is to define clear boundaries between client and server code. It helps with maintenance, but also makes the commands less bloated.
transformers chat now has a much simpler signature and works hand in hand with transformers serve: if transformers serve is running on its default endpoint, transformers chat can connect to it directly, and it can also use any OpenAI-API-compatible HTTP endpoint.
Linked PRs:
Removal of the run method
The transformers run command (previously transformers-cli run) is an artifact of the past: it was neither documented nor tested, and isn't part of any public documentation. We're removing it for now; please let us know if this is a command you are using, in which case we should bring it back with better support.
Linked PR: #42447
Environment variables
- TRANSFORMERS_CACHE, PYTORCH_TRANSFORMERS_CACHE, and PYTORCH_PRETRAINED_BERT_CACHE have been removed. Please use HF_HOME instead.
- HUGGINGFACE_CO_EXAMPLES_TELEMETRY, HUGGINGFACE_CO_PREFIX, and HUGGINGFACE_CO_RESOLVE_ENDPOINT have been removed. Please use huggingface_hub.constants.ENDPOINT instead.
Linked PR: #42391.
Requirements update
transformers v5 pins the huggingface_hub version to >=1.0.0. See its migration guide to learn more about this major release. Here are the main aspects to know about:
- huggingface_hub switched from requests to httpx. This change was made to improve performance and to support both synchronous and asynchronous requests the same way. If you are currently catching requests.HTTPError errors in your codebase, you'll need to switch to httpx.HTTPError.
- HTTP_PROXY / HTTPS_PROXY environment variables
- hf_transfer, and therefore HF_HUB_ENABLE_HF_TRANSFER, have been completely dropped in favor of hf_xet. This should be transparent for most users. Please let us know if you notice any downside!
- typer-slim has been added as a required dependency, used to implement both the hf and transformers CLIs.
New model additions in v5
CWM
The Code World Model (CWM) model was proposed in CWM: An Open-Weights LLM for Research on Code Generation with World Models by Meta FAIR CodeGen Team. CWM is an LLM for code generation and reasoning about code that has, in particular, been trained to better represent and reason about how code and commands affect the state of a program or system. Specifically, we mid-trained CWM on a large number of observation-action trajectories from Python execution traces and agentic interactions in containerized environments. We post-trained with extensive multi-task RL in verifiable coding, math, and multi-turn software engineering environments.
SAM3
SAM3 (Segment Anything Model 3) was introduced in SAM 3: Segment Anything with Concepts.
The SAM3 addition adds four new architectures:
SAM3 performs Promptable Concept Segmentation (PCS) on images. PCS takes text and/or image exemplars as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept.
Sam3Tracker and Sam3TrackerVideo perform Promptable Visual Segmentation (PVS) on images. PVS takes interactive visual prompts (points, boxes, masks) or text inputs to segment a specific object instance per prompt. This is the task that SAM 1 and SAM 2 focused on, and SAM 3 improves upon it. Sam3Tracker and Sam3TrackerVideo are updated versions of SAM2 Video that maintain the same API while providing improved performance and capabilities.
SAM3 Video performs Promptable Concept Segmentation (PCS) on videos. PCS takes text as input (e.g., "yellow school bus"), and predicts instance and semantic masks for every single object matching the concept, while preserving object identities across video frames. The model combines a detection module (SAM3) with a tracking module (SAM2-style tracker) to enable robust object tracking across video frames using text prompts.
LFM2 MoE
LFM2-MoE is a Mixture-of-Experts (MoE) variant of LFM2. The LFM2 family is optimized for on-device inference by combining short‑range, input‑aware gated convolutions with grouped‑query attention (GQA) in a layout tuned to maximize quality under strict speed and memory constraints.
LFM2‑MoE keeps this fast backbone and introduces sparse MoE feed‑forward networks to add representational capacity without significantly increasing the active compute path. The first LFM2-MoE release is LFM2-8B-A1B, with 8.3B total parameters and 1.5B active parameters. The model excels in quality (comparable to 3-4B dense models) and speed (faster than other 1.5B class models).
VideoLlama 3
The VideoLLaMA3 model is a major update to VideoLLaMA2 from Alibaba DAMO Academy.
AudioFlamingo 3
Audio Flamingo 3 (AF3) is a fully open large audio–language model designed for robust understanding and reasoning over speech, environmental sounds, and music. AF3 pairs a Whisper-style audio encoder with a causal language model and performs replace-in-place audio–text fusion: the processor aligns post-pool audio frames to a dedicated placeholder token and the model replaces those token slots with projected audio embeddings during the forward pass.
The model checkpoint is available at: nvidia/audio-flamingo-3-hf
Highlights:
Nanochat
NanoChat is a compact decoder-only transformer model designed for educational purposes and efficient training. The model features several fundamental architectural innovations which are common in modern transformer models. Therefore, it is a good model to use as a starting point to understand the principles of modern transformer models. NanoChat is a variant of the Llama architecture, with simplified attention mechanism and normalization layers.
FastVLM
FastVLM is an open-source vision-language model featuring a novel hybrid vision encoder, FastViTHD. Leveraging reparameterizable convolutional layers, scaled input resolution, and a reduced number of visual tokens, FastVLM delivers high accuracy with exceptional efficiency. Its optimized architecture enables deployment even on edge devices, achieving ultra-low TTFT (time to first token) without sacrificing performance.
PaddleOCR-VL
PaddleOCR-VL is a SOTA and resource-efficient model tailored for document parsing. Its core component is PaddleOCR-VL-0.9B, a compact yet powerful vision-language model (VLM) that integrates a NaViT-style dynamic resolution visual encoder with the ERNIE-4.5-0.3B language model to enable accurate element recognition. This innovative model efficiently supports 109 languages and excels in recognizing complex elements (e.g., text, tables, formulas, and charts), while maintaining minimal resource consumption. Through comprehensive evaluations on widely used public benchmarks and in-house benchmarks, PaddleOCR-VL achieves SOTA performance in both page-level document parsing and element-level recognition. It significantly outperforms existing solutions, exhibits strong competitiveness against top-tier VLMs, and delivers fast inference speeds. These strengths make it highly suitable for practical deployment in real-world scenarios.
Perception Encoder Audiovisual
PE Audio (Perception Encoder Audio) is a state-of-the-art multimodal model that embeds audio and text into a shared (joint) embedding space.
The model enables cross-modal retrieval and understanding between audio and text.
Text input
Audio input
The resulting embeddings can be used for:
Jais2
Jais2 is a next-generation Arabic open-weight LLM trained on the richest Arabic-first dataset to date. Built from the ground up with 8B and 70B parameters, Jais 2 understands Arabic the way it's truly spoken across dialects, culture, and modern expression. It is developed by MBZUAI, Inception, and Cerebras Systems and is based on the transformer architecture with modifications including:
Pixio
Pixio is a vision foundation model that uses ViT as a feature extractor for multiple downstream tasks like depth estimation, semantic segmentation, feed-forward 3D reconstruction, robotics, and image classification. It is built on the Masked Autoencoder (MAE) pre-training framework, with four minimal yet critical updates: 1) deeper decoder, 2) larger masking granularity, 3) more class tokens, and 4) web-scale curated training data.
Ernie 4.5 VL MoE
The Ernie 4.5 VL MoE model was released in the Ernie 4.5 Model Family release by Baidu. This family of models contains multiple different architectures and model sizes. The Vision-Language series specifically is composed of a novel multimodal heterogeneous structure, sharing parameters across modalities and dedicating parameters to specific modalities. This becomes especially apparent in the Mixture of Experts (MoE), which is composed of
This architecture has the advantage of enhancing multimodal understanding without compromising, and even improving, performance on text-related tasks. A more detailed breakdown is given in the Technical Report.
- [Ernie 4.5] Ernie VL models by @vasqu in #39585
GLM-ASR
GLM-ASR-Nano-2512 is a robust, open-source speech recognition model with 1.5B parameters. Designed for
real-world complexity, it outperforms OpenAI Whisper V3 on multiple benchmarks while maintaining a compact size.
Key capabilities include:
Exceptional Dialect Support
Beyond standard Mandarin and English, the model is highly optimized for Cantonese (粤语) and other dialects,
effectively bridging the gap in dialectal speech recognition.
Low-Volume Speech Robustness
Specifically trained for "Whisper/Quiet Speech" scenarios. It captures and accurately transcribes extremely
low-volume audio that traditional models often miss.
SOTA Performance
Achieves the lowest average error rate (4.10) among comparable open-source models, showing significant advantages
in Chinese benchmarks (Wenet Meeting, Aishell-1, etc.).
This model was contributed by Eustache Le Bihan and Yuxuan Zhang. You can check the model card for more details, as well as our GitHub repo.
GLM 4.7 Flash
GLM-4.7-Flash offers a new option for lightweight deployment that balances performance and efficiency.
GLM Image
We present GLM-4.1V-Thinking, GLM-4.5V, and GLM-4.6V, a family of vision-language models (VLMs) designed to advance general-purpose multimodal understanding and reasoning. In this report, we share our key findings in the development of the reasoning-centric training framework. We first develop a capable vision foundation model with significant potential through large-scale pre-training, which arguably sets the upper bound for the final performance. We then propose Reinforcement Learning with Curriculum Sampling (RLCS) to unlock the full potential of the model, leading to comprehensive capability enhancement across a diverse range of tasks, including STEM problem solving, video understanding, content recognition, coding, grounding, GUI-based agents, and long document interpretation. In a comprehensive evaluation across 42 public benchmarks, GLM-4.5V achieves state-of-the-art performance on nearly all tasks among open-source models of similar size, and demonstrates competitive or even superior results compared to closed-source models such as Gemini-2.5-Flash on challenging tasks including Coding and GUI Agents. Meanwhile, the smaller GLM-4.1V-9B-Thinking remains highly competitive, achieving superior results to the much larger Qwen2.5-VL-72B on 29 benchmarks. We open-source both GLM-4.1V-9B-Thinking and GLM-4.5V. We further introduce the GLM-4.6V series, open-source multimodal models with native tool use and a 128K context window. Code, models, and more information are released at https://github.com/zai-org/GLM-V
LWDetr
LW-DETR proposes a light-weight Detection Transformer (DETR) architecture designed to compete with and surpass the dominant YOLO series for real-time object detection. It achieves a new state-of-the-art balance between speed (latency) and accuracy (mAP) by combining recent transformer advances with efficient design choices.
The LW-DETR architecture is characterized by its simple and efficient structure: a plain ViT Encoder, a Projector, and a shallow DETR Decoder. It enhances the DETR architecture for efficiency and speed using the following core modifications:
LightOnOCR
LightOnOcr combines a Vision Transformer encoder (Pixtral-based) with a lightweight text decoder (Qwen3-based) distilled from high-quality open VLMs. It is optimized for document parsing tasks, producing accurate, layout-aware text extraction from high-resolution pages.
Bugfixes and improvements
- [JetMoe] Fix jetmoe after #40132 by @ArthurZucker in #41324
- gemma3 by @Sai-Suraj-27 in #41354
- PretrainedConfig to PreTrainedConfig by @Cyrilvallez in #41300
- [ModularChecker] QOL for the modular checker by @ArthurZucker in #41361
- [v5] Remove relative position embeddings (for bert-like models) by @vasqu in #41170
Configuration
📅 Schedule: Branch creation - At any time (no schedule defined), Automerge - At any time (no schedule defined).
🚦 Automerge: Disabled by config. Please merge this manually once you are satisfied.
♻ Rebasing: Whenever PR is behind base branch, or you tick the rebase/retry checkbox.
🔕 Ignore: Close this PR and you won't be reminded about this update again.
This PR was generated by Mend Renovate. View the repository job log.