The typical onnxruntime flow via `llm-kvc.py` is the following (very simplified):
- Find the onnx model in the huggingface cache, as specified via the CLI flags
- Run a compilation step (really transpilation + compilation): the input is the onnx model, and the output is a `.bundle` with an ELF in it. Uses et-glow, gcc, llvm, neuralizer, etc.
- Run inference. Uses InferenceServer.
If you run the same model a second time, the process recognizes that it has already been converted from onnx to `.bundle` and runs just the inference step. This takes about 1/10 the time.
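The cache-hit behavior described above can be sketched as follows. This is a hypothetical illustration, not the actual `llm-kvc.py` code: the helper names (`resolve_bundle`, `compile_if_needed`) and the content-hash keying scheme are assumptions.

```python
import hashlib
from pathlib import Path


def resolve_bundle(onnx_path: Path, cache_dir: Path) -> Path:
    """Map an onnx model to its cached .bundle path, keyed by content hash.

    Hypothetical helper; the real pipeline may key the cache differently.
    """
    digest = hashlib.sha256(onnx_path.read_bytes()).hexdigest()[:16]
    return cache_dir / f"{onnx_path.stem}-{digest}.bundle"


def compile_if_needed(onnx_path: Path, cache_dir: Path) -> Path:
    bundle = resolve_bundle(onnx_path, cache_dir)
    if bundle.exists():
        # Cache hit: skip the expensive transpile+compile step and go
        # straight to inference (roughly 10x faster per the issue).
        return bundle
    cache_dir.mkdir(parents=True, exist_ok=True)
    # Placeholder for the real et-glow/gcc/llvm/neuralizer pipeline,
    # which would produce a .bundle containing an ELF.
    bundle.write_bytes(b"placeholder bundle contents")
    return bundle
```

Keying the bundle by a hash of the onnx file (rather than name alone) would also invalidate the cache automatically when the model changes.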
The current nekko CLI does not recognize that there are intermediate artifacts to capture. We need to address this as follows:
- decide on a default location for the compiled-artifacts cache directory
- add a CLI option to override that location
- mount the location into the container
- configure the `llm-kvc.py` script to support controlling that location
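The first three steps above might wire together as in this sketch. The flag name `--artifact-cache`, the default path, and the in-container mount point are all assumptions for illustration, not decided names:

```python
import argparse
from pathlib import Path

# Assumed default; the actual cache location is still to be decided.
DEFAULT_CACHE = Path.home() / ".cache" / "nekko" / "bundles"


def parse_args(argv=None):
    parser = argparse.ArgumentParser(prog="nekko")
    parser.add_argument(
        "--artifact-cache",
        type=Path,
        default=DEFAULT_CACHE,
        help="Directory for compiled .bundle artifacts (hypothetical flag name)",
    )
    return parser.parse_args(argv)


def container_mount_arg(cache_dir: Path, container_path: str = "/cache/bundles") -> str:
    # Build a docker/podman -v style bind-mount argument so the compile
    # step running inside the container can read and write the host cache.
    return f"{cache_dir.resolve()}:{container_path}"
```

The container mount point could then be passed through to `llm-kvc.py` (the fourth step) via an environment variable or script flag.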
cc @jerenkrantz