Conversation

Contributor

@kiranandcode kiranandcode commented Nov 20, 2025

Edit: updated with the latest changes to effectful; we can now implement the logic in the following snippet:

def predict_next_step(game_state: GameState) -> Step:
    ValidStep = build_validated_model(game_state)

    @Template.define
    def predict_next_step_inner(game_state) -> ValidStep:
        """
        Given the state of the game of towers of Hanoi as follows:
        {game_state}
        Predict the next step to complete the game (moving all disks to the rightmost tower).
        Give a reasoning for your prediction, and return the step following the format:
        <step>start,end</step>
        where start and end are zero-based indices for the towers to move. Be concise and avoid wordy answers.
        """
        raise NotHandled

    s = predict_next_step_inner(game_state)
    return Step(start=s.start, end=s.end)  # convert the validated output back to a plain Step


def solve_hanoi(state: GameState):
    log = []

    for i in itertools.count():
        print(f"step {i} - {state}")
        with handler(KAheadSampler()), handler(RetryLLMHandler()):
            step = predict_next_step(state)
        # track the step at each point
        if new_state := state.apply(step):
            log.append((state, step))

        state = new_state or state
        state.visualise()
        if state.is_done():
            break

I don't think this necessarily should even live in effectful, but regardless it's an interesting case study for kicking the tires on potential async/concurrent interfaces for effectful, so I'm making this draft PR to collect discussion about it.

Relevant snippets from the implementation (and links to the paper):

Algorithm 3

class Agent:
    ...
    def has_no_red_flags(self, response: str) -> Step | None:
        if len(response) > 450:
            return None
        step = self.parse_response(response)
        if not step:
            return None
        n_towers = len(self.game_state.towers)
        if not (0 <= step.start < n_towers and 0 <= step.end < n_towers):
            return None
        if step not in self.game_state.valid_steps():
            return None
        return step

    def get_vote(self):  # algorithm 3
        while True:
            resp = self.predict_next_step()
            if step := self.has_no_red_flags(resp):
                return step

Algorithm 2

class FirstToMoveAheadSelector:
    def do_voting(self) -> Step:  # algorithm 2
        # run n agents in parallel repeatedly until one step is k votes ahead
        while True:
            # submit a batch of votes and tally them as they complete
            for fut in futures.as_completed(
                self.executor.submit(agent.get_vote) for agent in self.agents
            ):
                vote = fut.result()
                self.votes[vote] += 1
                max_other_votes = max(
                    (self.votes[o_vote] for o_vote in self.votes if o_vote != vote),
                    default=0,
                )
                if self.votes[vote] >= max_other_votes + self.k:
                    return vote
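
For completeness, the fields this snippet assumes could be set up roughly as follows (a sketch; the agent count, the lead threshold k, and the thread-pool executor are assumptions, and Step is assumed to be hashable so it can key the Counter):

from collections import Counter
from concurrent import futures

class FirstToMoveAheadSelector:
    def __init__(self, game_state: GameState, n_agents: int = 8, k: int = 3):
        self.agents = [Agent(game_state) for _ in range(n_agents)]
        self.votes: Counter[Step] = Counter()
        self.k = k  # required lead over the runner-up step
        self.executor = futures.ThreadPoolExecutor(max_workers=n_agents)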

Algorithm 1

def solve_hanoi(state: GameState):
    log = []

    for i in itertools.count():
        print(f"step {i} - {state}")
        step = FirstToMoveAheadSelector(state).do_voting()
        # track the step at each point
        log.append((state, step))

        state = state.apply(step)
        state.visualise()

Contributor

@eb8680 eb8680 left a comment

This is neat; it seems like there are several takeaways. Alongside this iterative solver, it would also be cool to implement a recursive solver that is allowed to call itself as a tool.

return steps


class MicroAgent:
Contributor

It seems like MicroAgent should just be a single stateless Template with a structured output type annotation?

@Template.define
def predict_next_state(state: GameState) -> Step:
    """..."""

Here the behavior of parse_response, has_no_red_flags and get_vote should all be implicit in the annotation and the semantics of structured output generation.

Contributor Author

Yes, that makes sense. parse_response and has_no_red_flags could be expressed as annotations on the return type, probably raising some kind of exception, and get_vote could be a handler, like Dat's LLMRetryHandler.

Contributor

parse_response should just be the default behavior of structured output generation. has_no_red_flags would follow either from a more specific static Step type, or from making Step a Pydantic model that carries the runtime validation logic from has_no_red_flags.
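
For concreteness, a minimal sketch of the Pydantic version (assuming Pydantic v2; the n_towers field is a hypothetical stand-in for the game-state context):

from pydantic import BaseModel, model_validator

class Step(BaseModel):
    start: int
    end: int
    n_towers: int = 3  # hypothetical: how many towers are in play

    @model_validator(mode="after")
    def indices_in_range(self) -> "Step":
        # the index checks from has_no_red_flags, enforced at parse time
        if not (0 <= self.start < self.n_towers and 0 <= self.end < self.n_towers):
            raise ValueError("tower index out of range")
        return self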

Contributor Author

@kiranandcode kiranandcode Nov 21, 2025

So part of the reason I didn't use the default structured output generation functionality is that the paper seems to claim that whether the model produces syntactically valid output provides signal about whether it's reasoning correctly. Constrained output decoding perturbs the distribution, so even if the model is reasoning incorrectly it will still produce syntactically well-formed output.

else:
    self.visualise_text()

def apply(self, step: Step) -> Optional["GameState"]:
Contributor

Why is the return type of apply an Optional? What does None mean in this context?

Contributor Author

I wrote the GameState class first and fleshed out an API that was maybe overly cautious in what it accepts. Here None represents an invalid move, though elsewhere we check that results are valid moves, so an invalid move is never actually supplied to apply.
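
For reference, a minimal sketch of those semantics (the tuple-of-tuples tower representation is an assumption, with disks listed bottom-to-top and smaller numbers meaning smaller disks):

from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class GameState:
    towers: tuple[tuple[int, ...], ...]

    def apply(self, step: "Step") -> Optional["GameState"]:
        src = list(self.towers[step.start])
        dst = list(self.towers[step.end])
        # None signals an invalid move: an empty source tower, or a
        # larger disk placed on a smaller one
        if not src or (dst and dst[-1] < src[-1]):
            return None
        dst.append(src.pop())
        towers = list(self.towers)
        towers[step.start], towers[step.end] = tuple(src), tuple(dst)
        return GameState(tuple(towers))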


return total_moves

def is_done(self) -> bool:
Contributor

I don't see this called anywhere?

Contributor Author

Oh, good catch. I simplified the solve_hanoi function before committing everything, and in the process removed the check for completion.

@kiranandcode kiranandcode changed the base branch from kg-futures-proposal to staging-llm December 10, 2025 01:28
raise NotImplementedError


def build_validated_model(game_state: GameState) -> type[Step]:
Contributor

I guess this kind of dynamic class creation is allowed, but it's not very Pythonic. If this kind of dependent type checking is something we want to support/encourage, we should think about how to make it easier and more idiomatic to express.
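
For context, the dynamic-class pattern in question looks roughly like this (a sketch using pydantic.create_model; the exact shape of build_validated_model in the PR may differ):

from pydantic import create_model, field_validator

def build_validated_model(game_state: GameState) -> type[Step]:
    n_towers = len(game_state.towers)

    def index_in_range(cls, v: int) -> int:
        if not 0 <= v < n_towers:
            raise ValueError(f"tower index must be in [0, {n_towers})")
        return v

    # a fresh subclass of Step whose validators close over game_state
    return create_model(
        "ValidStep",
        __base__=Step,
        __validators__={
            "check_start": field_validator("start")(index_in_range),
            "check_end": field_validator("end")(index_in_range),
        },
    )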

Contributor Author

Yes, that makes sense. I'll open an issue to discuss.



def predict_next_step(game_state: GameState) -> Step:
    ValidStep = build_validated_model(game_state)
Contributor

Perhaps a simpler alternative design would be to make the validation logic accessible as a tool that predict_next_step_inner is expected to call before it returns a result? Then Step would be simpler and predict_next_step_inner could be moved back up to module scope.
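
Roughly like the following sketch (the Tool.define decorator here is hypothetical, standing in for whatever effectful's actual tool-registration interface is):

def make_validate_step_tool(game_state: GameState):
    @Tool.define  # hypothetical decorator; effectful's real tool API may differ
    def validate_step(start: int, end: int) -> str:
        """Report whether moving the top disk from tower `start` to
        tower `end` is legal in the current game state."""
        if Step(start=start, end=end) in game_state.valid_steps():
            return "ok"
        return "invalid move"

    return validate_step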

Contributor Author

That makes sense, though the semantics I had been considering for tool calling are that the list of tools provided to a template is a list of things the LLM may use during its execution, without any guarantee that any of them will actually be used. Since the rest of the code always requires this invariant, it makes sense to enforce it outside the LLM call.

Contributor

@jfeser jfeser commented Dec 31, 2025

Note that #479 should be reverted in this branch.
