Labels: bugs, help needed
Issue Description
I can't run the benchmark code in mk mode when batch_size is greater than 1. The model I use is Llama-3.2-1B-Instruct with a batch size of 2; all other parameters of ScriptConfig are left at their default values.
Take the following command as an example:
python megakernels/scripts/generate.py mode=mk prompt="tell me a funny joke about cookies" ntok=100 batch_size=2
The traceback is shown below.
Traceback (most recent call last):
File "/root/Megakernels/megakernels/scripts/generate.py", line 211, in <module>
pydra.run(main)
File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 146, in run
return _apply_overrides_and_call(fn, first_arg_type, args)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/pydra/cli.py", line 118, in _apply_overrides_and_call
return fn(config)
^^^^^^^^^^
File "/venv/lib/python3.12/site-packages/torch/utils/_contextlib.py", line 116, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/root/Megakernels/megakernels/scripts/generate.py", line 174, in main
gen.generate(output_tokens, prompt_len, config.ntok - 1)
File "/root/Megakernels/megakernels/generators.py", line 165, in generate
output_ids = self.run(input_ids, pos_id=pos_id)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/Megakernels/megakernels/generators.py", line 132, in run
self.schedule.globs.hidden_states[:] = hiddens.squeeze(1)
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^^^
RuntimeError: expand(CUDABFloat16Type{[2, 2048]}, size=[2048]): the number of sizes provided (1) must be greater or equal to the number of dimensions in the tensor (2)
Potential cause
I think the problem lies in the shape of BaseGlobals.hidden_states, which is initialized in the make_global() function of Megakernels/megakernels/demos/latency/scheduler.py:
hidden_states=make_buffer(config.hidden_size)
So hidden_states has only one dimension, because config.hidden_size is a model-dependent constant; call it hidden_size. But if our batch size is greater than 1, say n, then in the run function of MK_Generator the input_ids have shape (n, 1), and hiddens has shape (n, 1, hidden_size). After squeeze(1) it is still 2-D, so it cannot be copied into self.schedule.globs.hidden_states, whose shape is (hidden_size,).
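To illustrate the shape mismatch outside of Megakernels, here is a minimal standalone sketch in plain PyTorch (not the project's code; hidden_size=2048 and batch_size=2 are taken from the run above) that triggers the same RuntimeError:

```python
import torch

# Shapes taken from the failing run: Llama-3.2-1B-Instruct has hidden_size = 2048,
# and the command above used batch_size = 2.
hidden_size = 2048
batch_size = 2

# 1-D buffer, mirroring hidden_states = make_buffer(config.hidden_size)
hidden_states = torch.empty(hidden_size, dtype=torch.bfloat16)

# Per-step hidden states for the batch: shape (batch_size, 1, hidden_size)
hiddens = torch.randn(batch_size, 1, hidden_size, dtype=torch.bfloat16)

try:
    # squeeze(1) leaves shape (2, 2048), which cannot be copied into a (2048,) buffer
    hidden_states[:] = hiddens.squeeze(1)
except RuntimeError as e:
    # expand(...): the number of sizes provided (1) must be greater or equal
    # to the number of dimensions in the tensor (2)
    print(e)

# By contrast, a buffer with a leading batch dimension accepts the same copy:
hidden_states_2d = torch.empty(batch_size, hidden_size, dtype=torch.bfloat16)
hidden_states_2d[:] = hiddens.squeeze(1)  # works
```

If this diagnosis is right, I'd guess the fix involves allocating hidden_states (and any other per-token global buffers) with a leading batch dimension, e.g. (batch_size, hidden_size), but I'm not sure how make_buffer and the rest of the scheduler are intended to handle batching.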
Environment
- GPU: H800
- OS: Linux x86_64
- CUDA: 12.8
- Python: 3.12