Failed to run in mk mode when batch_size is greater than 1 #2

@zhendonghua


Labels: bugs, help needed

Issue Description

I can't run the benchmark code in mk mode when batch_size is greater than 1. The model I use is Llama-3.2-1B-Instruct, and the batch size is 2. All other parameters of ScriptConfig are left at their default values.
Take the following command as an example.

python megakernels/scripts/generate.py mode=mk prompt="tell me a funny joke about cookies" ntok=100 batch_size=2          

The traceback is as follows.

Traceback (most recent call last):
  File "/root/Megakernels/megakernels/scripts/generate.py", line 211, in <module>
    pydra.run(main)
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 146, in run
    return _apply_overrides_and_call(fn, first_arg_type, args)
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 118, in _apply_overrides_and_call
    return fn(config)
           ^^^^^^^^^^
  File "/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
    return func(*args, **kwargs)
           ^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/scripts/generate.py", line 174, in main
    gen.generate(output_tokens, prompt_len, config.ntok - 1)
  File "/root/Megakernels/megakernels/generators.py", line 165, in generate
    output_ids = self.run(input_ids, pos_id=pos_id)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/Megakernels/megakernels/generators.py", line 132, in run
    self.schedule.globs.hidden_states[:] = hiddens.squeeze(1)
    ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: expand(CUDABFloat16Type{[2, 2048]}, size=[2048]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)
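For reference, the same class of error can be reproduced with plain PyTorch outside the megakernel (a minimal sketch; the dtype and device differ from the real run): copying a 2-D tensor into a 1-D buffer fails because the source cannot be expanded to the destination's shape.

import torch

# Minimal reproduction of the shape mismatch (CPU float32 for simplicity).
# `buf` stands in for globs.hidden_states, `hiddens` for the model output
# when batch_size=2.
hidden_size = 2048
buf = torch.zeros(hidden_size)              # shape (2048,), one-dimensional buffer
hiddens = torch.randn(2, 1, hidden_size)    # shape (2, 1, 2048)

buf[:] = hiddens.squeeze(1)                 # RuntimeError: expand(..., size=[2048]) ...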

Potential cause

I think the problem lies in the shape of BaseGlobals.hidden_states, which is initialized in the make_global() function of Megakernels/megakernels/demos/latency/scheduler.py.

hidden_states=make_buffer(config.hidden_size)

So hidden_states has only one dimension, because config.hidden_size is a model-specific constant; call it hidden_size. But if the batch size is greater than 1, say n, then in the run() function of MK_Generator the input_ids tensor has shape (n, 1), and hiddens has shape (n, 1, hidden_size). Squeezing dimension 1 gives (n, hidden_size), which cannot be copied into self.schedule.globs.hidden_states (whose shape is (hidden_size,)).
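As a rough illustration of one possible direction (a sketch only, not a patch against the repo's actual make_buffer API, which I haven't checked): if the buffer carried a leading batch dimension, the copy from the squeezed hidden states would be well-defined for any batch size.

import torch

# Hypothetical sketch: allocate the hidden-states buffer with a leading
# batch dimension so the copy from the squeezed model output succeeds.
batch_size, hidden_size = 2, 2048
buf = torch.zeros(batch_size, hidden_size)       # shape (2, 2048)
hiddens = torch.randn(batch_size, 1, hidden_size)

buf[:] = hiddens.squeeze(1)                      # OK: shapes match exactly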

Environment

  • GPU: H800
  • OS: Linux x86_64
  • CUDA: 12.8
  • Python: 3.12
