reduce maxTokens for glm-4-9b-chat to fit 50GB GPU #47
nicole-lihui wants to merge 1 commit into BaizeAI:main
Conversation
Code Review
This pull request aims to resolve an out-of-memory issue on 50GB GPUs for the glm-4-9b-chat model by limiting the token processing capacity. It introduces customRuntimeArgs to set --max-model-len and --max-num-batched-tokens for the vLLM runtime. While this is a good step, there is a significant inconsistency: spec.config.maxTokens is not updated to match the new runtime limit, which could lead to runtime errors and a poor user experience. I recommend aligning these values for consistency.
```yaml
customRuntimeArgs:
- --max-num-batched-tokens=32768 # default
- --max-model-len=32768
```
While adding --max-model-len=32768 correctly configures the vLLM runtime to prevent out-of-memory errors, there is an inconsistency with spec.config.maxTokens, which remains at 128000 on line 7. This can lead to a confusing user experience or runtime errors: a user may request a number of tokens to generate that is valid according to maxTokens but, combined with the prompt length, exceeds the max-model-len limit.
To ensure consistency and prevent unexpected failures, spec.config.maxTokens should be aligned with max-model-len. I recommend reducing spec.config.maxTokens to 32768. Since that line is not part of the current changes, please consider amending this pull request to include it.
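As an illustration of the recommended fix, the two settings could be kept in sync as below. This is a sketch only: the field paths (spec.config.maxTokens, deployments[].customRuntimeArgs) are assumed from the snippets quoted in this PR, not from the full file.

```yaml
# Sketch: keep the advertised generation limit aligned with the vLLM
# runtime's context window so requests cannot exceed --max-model-len.
spec:
  config:
    maxTokens: 32768          # was 128000; match --max-model-len below
deployments:
- customRuntimeArgs:
  - --max-num-batched-tokens=32768
  - --max-model-len=32768     # vLLM rejects prompt+generation beyond this
```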
Reason: The default 128k maxTokens causes OOM on 50GB GPUs for long-context inference.
Force-pushed from fd96c01 to 7d1edc3
```diff
  deployments:
- - customRuntimeArgs: []
+ - customRuntimeArgs:
+   - --max-num-batched-tokens=32768 # Reduce maxTokens from 128k to 32k to fit 50GB GPU and avoid OOM
```
part of #44
The default 128k maxTokens causes OOM on 50GB GPUs for long-context inference.
https://huggingface.co/zai-org/glm-4-9b-chat/blob/main/generation_config.json
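To see why context length matters for GPU memory, a back-of-the-envelope KV-cache estimate helps: the cache grows linearly with context length. The layer/head/dimension values below are illustrative assumptions, not figures from the glm-4-9b-chat model card; check the model's config.json for the real ones, and note that vLLM additionally reserves memory for the weights and many concurrent sequences.

```python
# Rough KV-cache size estimate per sequence, to illustrate why long
# contexts exhaust GPU memory. All architecture numbers are assumptions.

def kv_cache_gib(context_len: int,
                 n_layers: int = 40,      # assumed layer count
                 n_kv_heads: int = 2,     # assumed GQA KV heads
                 head_dim: int = 128,     # assumed head dimension
                 dtype_bytes: int = 2) -> float:  # fp16
    """GiB of KV cache needed for one sequence of `context_len` tokens."""
    # Factor of 2 accounts for storing both keys and values per layer.
    bytes_per_token = 2 * n_layers * n_kv_heads * head_dim * dtype_bytes
    return context_len * bytes_per_token / 2**30

if __name__ == "__main__":
    for ctx in (32_768, 131_072):
        print(f"{ctx:>7} tokens -> {kv_cache_gib(ctx):.2f} GiB per sequence")
```

Under these assumed numbers, a 128k context needs 4x the KV-cache memory of a 32k context per sequence, on top of the ~18GB of fp16 weights for a 9B model, which is why capping max-model-len relieves pressure on a 50GB GPU.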