The typical onnxruntime flow via `llm-kvc.py` is the following (very simplified):
- Find the onnx model in the huggingface cache, as specified via the CLI flags
- Run a compilation step (really transpilation + compilation): the input is the onnx model, and the output is a `.bundle` with an ELF in it. Uses et-glow, gcc, llvm, neuralizer, etc.
- Run inference. Uses InferenceServer.
If you run the same model a second time, the process recognizes that it has already been converted from onnx to `.bundle` and runs just the inference step. This takes about 1/10 the time.
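The cache-hit behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual `llm-kvc.py` code: the helper names (`resolve_bundle`, `compile_if_needed`) and the content-hash keying scheme are assumptions.

```python
import hashlib
from pathlib import Path


def resolve_bundle(onnx_path: Path, cache_dir: Path) -> Path:
    """Map an onnx model to its cached .bundle path, keyed by content hash.

    Hypothetical helper; the real pipeline may key the cache differently.
    """
    digest = hashlib.sha256(onnx_path.read_bytes()).hexdigest()[:16]
    return cache_dir / f"{onnx_path.stem}-{digest}.bundle"


def compile_if_needed(onnx_path: Path, cache_dir: Path) -> Path:
    bundle = resolve_bundle(onnx_path, cache_dir)
    if bundle.exists():
        # Cache hit: skip the expensive transpile+compile step and go
        # straight to inference (roughly 10x faster per the issue).
        return bundle
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder for the real et-glow/gcc/llvm/neuralizer pipeline,
    # which would produce a .bundle containing an ELF.
    bundle.write_bytes(b"placeholder bundle contents")
    return bundle
```

Keying the bundle by a hash of the onnx file (rather than name alone) would also invalidate the cache automatically when the model changes.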
The current nekko CLI does not recognize that there are intermediate artifacts to capture. We need to address this as follows:
- decide on a default location for the compiled-artifacts cache directory
- add a CLI option to override that location
- mount the location into the container
- configure the `llm-kvc.py` script to support controlling that location
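The first three steps above might wire together as in this sketch. The flag name `--artifact-cache`, the default path, and the in-container mount point are all assumptions for illustration, not decided names:

```python
import argparse
from pathlib import Path

# Assumed default; the actual cache location is still to be decided.
DEFAULT_CACHE = Path.home() / ".cache" / "nekko" / "bundles"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="nekko")
    parser.add_argument(
        "--artifact-cache",
        type=Path,
        default=DEFAULT_CACHE,
        help="Directory for compiled .bundle artifacts (hypothetical flag name)",
    )
    return parser.parse_args(argv)


def container_mount_arg(cache_dir: Path, container_path: str = "/cache/bundles") -> str:
    # Build a docker/podman -v style bind-mount argument so the compile
    # step running inside the container can read and write the host cache.
    return f"{cache_dir.resolve()}:{container_path}"
```

The container mount point could then be passed through to `llm-kvc.py` (the fourth step) via an environment variable or script flag.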
cc @jerenkrantz