Labels: bugs, help needed
Issue Description
I can't run the benchmark code in mk mode when batch_size is greater than 1. The model I use is Llama-3.2-1B-Instruct with a batch size of 2; all other parameters of ScriptConfig are left at their default values.
Take the following command as an example:
python megakernels/scripts/generate.py mode=mk prompt="tell me a funny joke about cookies" ntok=100 batch_size=2
The traceback is shown below.
Traceback (most recent call last):
File "/root/Megakernels/megakernels/scripts/generate.py", line 211, in <module>
pydra.run(main)
File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 146, in run
return _apply_overrides_and_call(fn, first_arg_type, args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 118, in _apply_overrides_and_call
return fn(config)
^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/Megakernels/megakernels/scripts/generate.py", line 174, in main
gen.generate(output_tokens, prompt_len, config.ntok - 1)
File "/root/Megakernels/megakernels/generators.py", line 165, in generate
output_ids = self.run(input_ids, pos_id=pos_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/Megakernels/megakernels/generators.py", line 132, in run
self.schedule.globs.hidden_states[:] = hiddens.squeeze(1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: expand(CUDABFloat16Type{[2, 2048]}, size=[2048]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)
Potential cause
I think the problem lies in the shape of BaseGlobals.hidden_states, which is initialized in the make_global() function of Megakernels/megakernels/demos/latency/scheduler.py:
hidden_states=make_buffer(config.hidden_size)
So hidden_states has only one dimension, because config.hidden_size is a model-dependent constant; call it hidden_size. But if our batch size is greater than 1, say n, then in the run function of MK_Generator the input_ids have shape (n, 1), and hiddens has shape (n, 1, hidden_size). After squeeze(1) it is still 2-D, so it cannot be copied into self.schedule.globs.hidden_states, whose shape is (hidden_size,).
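To illustrate the shape mismatch outside of Megakernels, here is a minimal standalone sketch in plain PyTorch (not the project's code; hidden_size=2048 and batch_size=2 are taken from the run above) that triggers the same RuntimeError:

```python
import torch

# Shapes taken from the failing run: Llama-3.2-1B-Instruct has hidden_size = 2048,
# and the command above used batch_size = 2.
hidden_size = 2048
batch_size = 2

# 1-D buffer, mirroring hidden_states = make_buffer(config.hidden_size)
hidden_states = torch.empty(hidden_size, dtype=torch.bfloat16)

# Per-step hidden states for the batch: shape (batch_size, 1, hidden_size)
hiddens = torch.randn(batch_size, 1, hidden_size, dtype=torch.bfloat16)

try:
    # squeeze(1) leaves shape (2, 2048), which cannot be copied into a (2048,) buffer
    hidden_states[:] = hiddens.squeeze(1)
except RuntimeError as e:
    # expand(...): the number of sizes provided (1) must be greater or equal
    # to the number of dimensions in the tensor (2)
    print(e)

# By contrast, a buffer with a leading batch dimension accepts the same copy:
hidden_states_2d = torch.empty(batch_size, hidden_size, dtype=torch.bfloat16)
hidden_states_2d[:] = hiddens.squeeze(1)  # works
```

If this diagnosis is right, I'd guess the fix involves allocating hidden_states (and any other per-token global buffers) with a leading batch dimension, e.g. (batch_size, hidden_size), but I'm not sure how make_buffer and the rest of the scheduler are intended to handle batching.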
Environment
- GPU: H800
- OS: Linux x86_64
- CUDA: 12.8
- Python: 3.12