Merged
37 commits
9b3202f
integrated humaneval into eval pipeline
maharajamihir Jan 9, 2026
0a1ce27
feat: integrate humaneval and ifeval
maharajamihir Jan 13, 2026
16eb18c
removed ifeval from eval suite
maharajamihir Jan 14, 2026
e1371de
final fixes
maharajamihir Jan 15, 2026
4a8af67
address comments from pr
maharajamihir Jan 15, 2026
efef7fb
resolve merge conflicts
maharajamihir Jan 15, 2026
0f7d45f
address comments from pr review and some fixes
maharajamihir Jan 19, 2026
b8aae3c
fix: avg @ n divide by number of samples
maharajamihir Jan 19, 2026
d96c6db
fix final slop
maharajamihir Jan 19, 2026
d35d688
rename files
maharajamihir Jan 19, 2026
c97a0b1
feat: add assertions to testcases
avocadoali Jan 20, 2026
a544188
added new testcases
maharajamihir Jan 20, 2026
55dc4f6
fixed evals making them in same format as serializer output
maharajamihir Jan 22, 2026
655e35e
reformat evals one last time
maharajamihir Jan 22, 2026
f5a0d93
feat: convert evals to YAML state-based format
maharajamihir Jan 22, 2026
7a2506a
resolve merge conflicts
maharajamihir Jan 22, 2026
da02c7c
cleanup yamls
maharajamihir Jan 22, 2026
c077bb5
remove redundant mds
maharajamihir Jan 22, 2026
deb4dc0
add description to each eval
maharajamihir Jan 26, 2026
390e923
feat: visualize yaml files (#39)
avocadoali Jan 27, 2026
341924a
add: syntax highlighting in visualizer
avocadoali Jan 27, 2026
62eb910
add edit mode and cursor mode
avocadoali Jan 27, 2026
e58c21b
added converted zeta evals
maharajamihir Jan 27, 2026
020f534
refactor code to cope with new eval format
maharajamihir Jan 27, 2026
969c076
finished massive refactor
maharajamihir Jan 28, 2026
e254d4e
fix zeta test cases
avocadoali Jan 29, 2026
ca0724d
added wandb support
maharajamihir Jan 29, 2026
62a0604
omega-refactor
maharajamihir Jan 29, 2026
d257a39
added pytests
maharajamihir Jan 29, 2026
35c5295
Merge branch 'cope-with-new-format' into integrate-zed-evals
avocadoali Jan 29, 2026
f2babbe
merge with main
avocadoali Feb 6, 2026
a6304e9
rename pass_at metrics for clarity
avocadoali Feb 10, 2026
062d3eb
chore: file structure
avocadoali Feb 10, 2026
6f36e0d
fix: silent failure
avocadoali Feb 11, 2026
e0ed9d8
fix: imports
avocadoali Feb 11, 2026
f85be4a
update: proofreading testcases
avocadoali Feb 11, 2026
7d250bf
update: handcrafted task visualizer
avocadoali Feb 11, 2026
1,107 changes: 1,107 additions & 0 deletions data/eval/handcrafted/add_import_after_use.yaml

Large diffs are not rendered by default.

154 changes: 0 additions & 154 deletions data/eval/handcrafted/add_import_easy.md

This file was deleted.
