draft PR to discuss MAKER impl in effectful-llm #404
base: staging-llm
Conversation
eb8680 left a comment:
This is neat, it seems like there are several takeaways. Alongside this iterative solver, it would also be cool to implement a recursive solver that is allowed to call itself as a tool.
tests/test_maker.py (outdated):

```python
    return steps


class MicroAgent:
```
It seems like MicroAgent should just be a single stateless Template with a structured output type annotation?
```python
@Template.define
def predict_next_state(state: GameState) -> Step:
    """..."""
```

Here the behavior of parse_response, has_no_red_flags and get_vote should all be implicit in the annotation and the semantics of structured output generation.
Yes, that makes sense. parse_response and has_no_red_flags could be expressed as annotations on the return type, probably raising some kind of exception, and get_no_vote could be a handler, like Dat's LLMRetryHandler.
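A minimal stdlib sketch of that idea, with hypothetical names (`RedFlagError`, `no_red_flags`, and `run_validated` are stand-ins, not effectful-llm APIs): the validator rides along on the return type as `Annotated` metadata and raises on a red flag, while a small retry wrapper plays the role of something like LLMRetryHandler.

```python
from dataclasses import dataclass
from typing import Annotated, Callable, get_args, get_type_hints


class RedFlagError(Exception):
    """Raised when a generated Step fails a red-flag check."""


@dataclass
class Step:
    source: int
    target: int


def no_red_flags(step: Step) -> Step:
    # Hypothetical check: pegs must be distinct and in range.
    if step.source == step.target or not {step.source, step.target} <= {0, 1, 2}:
        raise RedFlagError(f"red flag: {step}")
    return step


# The return annotation carries the validator as Annotated metadata.
def predict_next_state() -> Annotated[Step, no_red_flags]:
    return Step(source=0, target=2)  # stand-in for the actual LLM call


def run_validated(fn: Callable[[], Step], retries: int = 3) -> Step:
    """Toy analogue of a retry handler: rerun fn until validation passes."""
    # Pull the validator back out of the Annotated return type.
    _, validator = get_args(get_type_hints(fn, include_extras=True)["return"])
    last_error = None
    for _ in range(retries):
        try:
            return validator(fn())
        except RedFlagError as e:
            last_error = e
    raise last_error
```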
parse_response should just be the default behavior of structured output generation. has_no_red_flags would follow either from a more specific static Step type or from making it a Pydantic model with the runtime validation logic from has_no_red_flags.
So part of the reason I didn't use the default structured output generation functionality is that the paper seems to claim that whether the model fails to produce syntactically valid output provides signal about whether it's reasoning correctly. Constrained output decoding perturbs the distribution, so even if the model is reasoning incorrectly it will still produce syntactically well-formed output.
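To make that concrete, here is a stdlib sketch of the unconstrained-decoding version (`parse_response` and `vote` are illustrative names, not the PR's actual code): samples that fail to parse abstain from the vote, so syntactic failure carries signal instead of being masked by constrained decoding.

```python
import json
from collections import Counter


def parse_response(text: str) -> tuple[int, int]:
    """Strict, unconstrained parse: malformed output raises instead of
    being coerced into a syntactically valid shape."""
    data = json.loads(text)  # raises on syntactically invalid output
    return (data["source"], data["target"])


def vote(samples: list[str]) -> tuple[int, int]:
    """Failed parses are treated as a red flag and simply don't vote."""
    ballots = []
    for s in samples:
        try:
            ballots.append(parse_response(s))
        except (json.JSONDecodeError, KeyError, TypeError):
            continue  # abstain: the failure itself is the signal
    [(winner, _count)] = Counter(ballots).most_common(1)
    return winner


samples = [
    '{"source": 0, "target": 2}',
    'move disk from 0 to 2',        # malformed: abstains
    '{"source": 0, "target": 2}',
    '{"source": 1, "target": 2}',
]
# vote(samples) -> (0, 2)
```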
tests/test_maker.py (outdated):

```python
        else:
            self.visualise_text()

    def apply(self, step: Step) -> Optional["GameState"]:
```
Why is the return type of apply an Optional? What does None mean in this context?
I wrote the GameState class first and fleshed out an API that was maybe overly cautious in what it accepts. Here None represents an invalid move, though elsewhere we check that results are valid moves, so an invalid move is never actually supplied to apply.
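For reference, a self-contained sketch of what that Optional contract might look like for Hanoi (the top-first peg representation here is my own assumption, not the PR's):

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Step:
    source: int
    target: int


@dataclass(frozen=True)
class GameState:
    # Each peg lists its disks top-first, e.g. (1, 2, 3) has disk 1 on top.
    pegs: tuple[tuple[int, ...], ...]

    def apply(self, step: Step) -> Optional["GameState"]:
        """Return the successor state, or None for an invalid move."""
        src, tgt = self.pegs[step.source], self.pegs[step.target]
        if not src:                   # nothing to move from an empty peg
            return None
        if tgt and tgt[0] < src[0]:   # can't place a disk on a smaller one
            return None
        pegs = list(self.pegs)
        pegs[step.source] = src[1:]
        pegs[step.target] = (src[0],) + tgt
        return GameState(tuple(pegs))
```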
```python
        return total_moves

    def is_done(self) -> bool:
```
I don't see this called anywhere?
Oh, good catch. I simplified the solve_hanoi function before committing everything and in that process removed the check for completion.
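For illustration, the shape of the loop with the completion check restored (a toy sketch; solve_hanoi's real signature and state type are assumptions here, and ToyState is a stand-in for GameState):

```python
def solve_hanoi(state, predict_next_step, max_steps=1000):
    """Iterative solver that stops when the goal is reached,
    rather than running for a fixed move budget."""
    steps = []
    for _ in range(max_steps):
        if state.is_done():  # the completion check that was dropped pre-commit
            return steps
        step = predict_next_step(state)
        state = state.apply(step)
        steps.append(step)
    raise RuntimeError("did not reach the goal within max_steps")


class ToyState:
    """Trivial stand-in state: done after n applications."""
    def __init__(self, n: int):
        self.n = n

    def is_done(self) -> bool:
        return self.n == 0

    def apply(self, step) -> "ToyState":
        return ToyState(self.n - 1)
```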
Force-pushed from 805eaae to 9dbdbd0.
```python
    raise NotImplementedError


def build_validated_model(game_state: GameState) -> type[Step]:
```
I guess this kind of dynamic class creation is allowed, but it's not very Pythonic. If this kind of dependent type checking is something we want to support/encourage, we should think about how to make it easier and more idiomatic to express.
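For comparison, a stdlib version of the pattern under discussion (the names and the valid-move-set argument are my assumptions; a Pydantic `create_model` plus a validator would be the more likely real implementation): the subclass is created dynamically and its constructor rejects moves that are invalid in the current state.

```python
from dataclasses import dataclass


@dataclass
class Step:
    source: int
    target: int


def build_validated_model(valid_moves: set[tuple[int, int]]) -> type[Step]:
    """Dynamically build a Step subclass that only admits valid moves."""

    def __post_init__(self) -> None:
        if (self.source, self.target) not in valid_moves:
            raise ValueError(f"invalid move: {self.source} -> {self.target}")

    # type() creates the subclass; re-applying @dataclass regenerates
    # __init__ so that it calls the new __post_init__ hook.
    return dataclass(type("ValidStep", (Step,), {"__post_init__": __post_init__}))
```

This keeps the dependent check at construction time, so any instance of the generated class is valid for the state it was built from, at the cost of the dynamic class creation being questioned above.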
Yes, that makes sense. I'll open an issue to discuss.
```python
def predict_next_step(game_state: GameState) -> Step:
    ValidStep = build_validated_model(game_state)
```
Perhaps a simpler alternative design would be to make the validation logic accessible as a tool that predict_next_step_inner is expected to call before it returns a result? Then Step would be simpler and predict_next_step_inner could be moved back up to module scope.
That makes sense, though the semantics I had been considering for tool calling is that the list of tools provided to a template is a list of things the LLM may use in its execution, without any guarantee that any of them will be used. Since this invariant is always required for the rest of the code, it makes sense to enforce it outside the LLM call.
Force-pushed from 8a75d14 to 86f5482.
Note that #479 should be reverted in this branch.
Edit: updated with the latest changes to effectful; we can implement the logic in the following snippet:
I don't think this necessarily should even live in effectful, but regardless it's an interesting case study to kick the tires on potential async/concurrent interfaces for effectful, so I'm making this draft PR to collect discussion about it.
Relevant snippets from the implementation (and links to the paper):
- Algorithm 3
- Algorithm 2
- Algorithm 1