0121 maxsim support by FarmersWrap · Pull Request #211 · texttron/tevatron

FarmersWrap · 2026-01-25T18:03:33Z

Adds fully random chunking and passage-level chunking for training.
Encodes passages as chunks for evaluation.
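A minimal sketch of how chunk-level scores could be reduced to passage-level scores, assuming "maxsim" here means that a passage is scored by its best-matching chunk. The function name `maxsim_scores` and the plain-list vector representation are illustrative only and are not taken from this PR's code.

```python
from typing import Dict, List, Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_scores(
    query_vec: Sequence[float],
    chunk_vecs: List[Sequence[float]],
    chunk_docids: List[str],
) -> Dict[str, float]:
    """Score each passage by the maximum query-chunk similarity.

    chunk_vecs[i] is the embedding of one chunk and chunk_docids[i] is the
    passage it came from (hypothetical layout, assumed for this sketch).
    """
    best: Dict[str, float] = {}
    for vec, docid in zip(chunk_vecs, chunk_docids):
        score = dot(query_vec, vec)
        # Keep only the best chunk score seen so far for this passage.
        if docid not in best or score > best[docid]:
            best[docid] = score
    return best


scores = maxsim_scores(
    [1, 0],
    [[0.2, 0.9], [0.8, 0.1], [0.5, 0.5]],
    ["d1", "d1", "d2"],
)
```

In this toy input, passage `d1` has two chunks and is represented by the stronger of the two.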

For training (train):

--passage_chunk_size 256 # fixed-size chunks; disabled when set to 0
--passage_chunk_size_range 32,256 # chunk size is fixed within a passage, but picked between 32 and 256 for each passage
--passage_chunk_size_range 32,256
--passage_chunk_size_variable # fully random chunk sizes, each picked between 32 and 256
--encode_use_pre_chunked # not fully tested; trains on a dataset of (doc_id, chunks)
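The three chunking modes above can be sketched roughly as follows. This is an illustration under stated assumptions, not the PR's implementation: the function `chunk_token_ids` and its parameters are hypothetical, sizes are assumed to be token counts, and the range is assumed to take precedence over the fixed size when both are given.

```python
import random
from typing import List, Optional, Sequence, Tuple


def chunk_token_ids(
    token_ids: Sequence[int],
    chunk_size: int = 0,
    size_range: Optional[Tuple[int, int]] = None,
    variable: bool = False,
    rng: Optional[random.Random] = None,
) -> List[List[int]]:
    """Split one passage's token ids into consecutive chunks.

    - size_range is None, chunk_size == 0: no chunking (one chunk).
    - size_range is None, chunk_size > 0: fixed-size chunks.
    - size_range set, variable False: one size drawn per passage.
    - size_range set, variable True: a new size drawn for every chunk.
    """
    rng = rng or random.Random()
    if size_range is None:
        if chunk_size <= 0:
            return [list(token_ids)]
        per_passage_size: Optional[int] = chunk_size
    else:
        low, high = size_range
        # randint is inclusive on both ends.
        per_passage_size = None if variable else rng.randint(low, high)

    chunks: List[List[int]] = []
    start = 0
    while start < len(token_ids):
        size = per_passage_size
        if size is None:
            size = rng.randint(low, high)  # fully random: redraw per chunk
        chunks.append(list(token_ids[start:start + size]))
        start += size
    return chunks


ids = list(range(600))
fixed = chunk_token_ids(ids, chunk_size=256)
per_passage = chunk_token_ids(ids, size_range=(32, 256), rng=random.Random(0))
fully_random = chunk_token_ids(
    ids, size_range=(32, 256), variable=True, rng=random.Random(0)
)
```

In every mode the last chunk may be shorter than the drawn size, since it simply takes whatever tokens remain.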

For evaluation encoding (encode):

# fixed chunk size; 0 means no chunking
--passage_chunk_size 0
# pick a random size between 32 and 64 tokens to chunk a passage
--passage_chunk_size_range "32,64"
# requires a pre-chunked JSONL file
--dataset_name json
--dataset_path "${prechunked_corpus_jsonl}"
--dataset_split train
--encode_use_pre_chunked

prechunked.jsonl: {"docid": "q-en-0-pos-0", "chunks": ["a", "b", "c"]}
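As a rough illustration of consuming this format, the sketch below parses pre-chunked JSONL lines and flattens them into a list of chunk texts while remembering which passage each chunk belongs to. The helper names `iter_prechunked` and `flatten_chunks` are hypothetical and are not part of this PR; only the `docid` and `chunks` fields come from the example line above.

```python
import json
from typing import Iterable, Iterator, List, Tuple


def iter_prechunked(lines: Iterable[str]) -> Iterator[Tuple[str, List[str]]]:
    """Yield (docid, chunks) pairs from pre-chunked JSONL lines."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines
        record = json.loads(line)
        yield record["docid"], list(record["chunks"])


def flatten_chunks(
    records: Iterable[Tuple[str, List[str]]],
) -> Tuple[List[str], List[Tuple[str, int]]]:
    """Flatten chunks for batch encoding, tracking (docid, chunk index)."""
    texts: List[str] = []
    owners: List[Tuple[str, int]] = []
    for docid, chunks in records:
        for index, chunk in enumerate(chunks):
            texts.append(chunk)
            owners.append((docid, index))
    return texts, owners


sample = ['{"docid": "q-en-0-pos-0", "chunks": ["a", "b", "c"]}', ""]
texts, owners = flatten_chunks(iter_prechunked(sample))
```

Keeping the `(docid, chunk index)` pairs alongside the texts lets chunk embeddings be grouped back by passage after encoding.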


Ryan Yu added 30 commits December 10, 2025 03:36

rewrote passage chunking

cdaf034

added logic for left padding

1536517

added search

dccfaf8

changed the tokenizer logic

1769715

added train collator debug

16d604a

traincollator is done

61dbf6e

fixed some comments

6ec22a9

modified chunkedencoder

40eddf8

Modified forward and model

0ebcf37

Modified inference on chunked passage, in progress

e7e3bc3

fixed a chunk size not passed to model

2d9939f

changed eos to sep

ed3e302

added logs

add3832

added some scripts

249dd9d

added tests

9efb43b

added tests

9c37e29

added log

a18c578

review

f0ee786

added collator helper functions

3c1752d

added padding helper

1fa3c1e

Tested the collator

224e185

Reviewed forward and maxsim

a60b20b

dataset uses random negative for cases have less negatives

32a1976

added some prints

1b4163e

removed one breakpoint

fc16311

Added random chunks

22868d2

Added full randomization

d389bc6

removed useless variables

338fe92

Refactored the randomization

190f937

added search time passage encoding with chunks

0a9626f

Added random chunk size for eval, and tests for prechunked passages a…

b88e5c1

…nd random chunking

FarmersWrap wants to merge 31 commits into texttron:main from FarmersWrap:0121-chunk-encode
