Take the output logits and decode the model’s prediction. Apply a softmax to the logits to obtain a probability distribution (or, since softmax preserves the ordering of the logits, simply take the argmax of the logits as the predicted token for greedy decoding). Convert the selected token ID back to a text string using the tokenizer from step 4. If the goal is to generate multi-token outputs (as is typical in language model inference), implement a generation loop: append the predicted token to the input sequence and feed the last $N$ tokens (or the entire sequence, if it is still under 4096 tokens) back into the model to compute the next token. Repeat until an end-of-sequence token is produced or the desired length is reached. Ensure the context never exceeds 4096 tokens; if it does, drop the oldest tokens and keep a sliding window over the most recent ones (as in streaming generation). This step yields the final decoded text output from the model; a sketch of the loop follows below.
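
Here is a minimal sketch of such a greedy generation loop. It assumes a `model` callable that takes a list of token IDs and returns logits of shape `(seq_len, vocab_size)`, a `tokenizer` with `encode`/`decode` methods as set up in step 4, and an end-of-sequence ID `eos_id`; these names are placeholders for whatever your runtime provides, not a specific library's API.

```python
import numpy as np

MAX_CONTEXT = 4096  # model's context window (assumed from this guide)

def generate(model, tokenizer, prompt, eos_id, max_new_tokens=256):
    """Greedy decoding: repeatedly pick the argmax token and append it."""
    token_ids = tokenizer.encode(prompt)
    for _ in range(max_new_tokens):
        # Keep only the most recent MAX_CONTEXT tokens (sliding window).
        context = token_ids[-MAX_CONTEXT:]
        logits = model(context)        # shape: (len(context), vocab_size)
        next_logits = logits[-1]       # logits for the next-token position
        # Softmax is unnecessary for greedy decoding: argmax over logits
        # selects the same token as argmax over probabilities.
        next_id = int(np.argmax(next_logits))
        token_ids.append(next_id)
        if next_id == eos_id:          # stop on end-of-sequence
            break
    return tokenizer.decode(token_ids)
```

Re-running the full model on the whole window each step is the simplest approach; a production implementation would typically cache attention keys and values so only the new token is processed per step.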