Description
The biggest usability problem right now is the massive latency. On an M1 Pro MacBook Pro, each inference can take as long as a second. This is a very poor user experience.
It appears that the vast majority of the latency comes from the autoregressive generation stage of the transformer, rather than from a single forward pass as I previously thought.
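To illustrate why generation dominates: greedy decoding runs one full forward pass per emitted token, so latency scales with suggestion length. This is a minimal sketch, not the extension's actual code; `forward` is a hypothetical stand-in for the real model, with a sleep standing in for compute cost.

```python
import time

CALLS = 0

def forward(tokens):
    """Stand-in for one transformer forward pass (hypothetical model)."""
    global CALLS
    CALLS += 1
    time.sleep(0.002)  # simulated per-pass compute cost
    return len(tokens)  # dummy "next token"

def generate(prompt, max_new_tokens):
    """Greedy autoregressive decoding: one full forward per new token."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(forward(tokens))
    return tokens

start = time.perf_counter()
generate([1, 2, 3], max_new_tokens=20)
elapsed = time.perf_counter() - start
print(f"{CALLS} forward passes, {elapsed * 1e3:.0f} ms total")
```

Even if a single forward pass feels fast, a 20-token suggestion costs roughly 20x that, which is consistent with the ~1 s inferences observed above.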
This suggests a potential upgrade to the user experience. If I can see that the model's suggestions are bad, I should be able to start typing immediately and stop the model from wasting time rolling out a bad completion. If it's doing well, I can let it run until I stop liking the output, then press Tab to accept what's there and halt further generation. By giving the user visibility into what the model is doing as it happens, we hide the latency inside the user's own reading of the output, i.e. we reduce the time the user spends sitting around getting angry at the extension.
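The mechanism above amounts to streaming tokens to the user and checking a cancellation signal between forward passes. Here is one way it could be sketched (again hypothetical: `forward`, `generate_stream`, and the `threading.Event`-based cancellation are illustrative assumptions, not the extension's real API):

```python
import threading
import time

def forward(tokens):
    """Stand-in for one transformer forward pass (hypothetical model)."""
    time.sleep(0.002)  # simulated compute cost
    return len(tokens)  # dummy "next token"

def generate_stream(prompt, max_new_tokens, cancel):
    """Yield tokens one at a time, checking for cancellation between
    forward passes so a keystroke (or Tab) can stop the rollout early."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        if cancel.is_set():
            return  # user typed over the suggestion; stop wasting compute
        tok = forward(tokens)
        tokens.append(tok)
        yield tok

cancel = threading.Event()
emitted = []
for i, tok in enumerate(generate_stream([1, 2, 3], 50, cancel)):
    emitted.append(tok)  # in the extension this would render incrementally
    if i == 4:
        cancel.set()  # simulate the user rejecting after five tokens
print(f"stopped after {len(emitted)} of 50 tokens")
```

The key property is that cancellation is checked every iteration, so the worst-case wasted work after the user acts is a single forward pass rather than the whole rollout.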
Architecture pending