0121 maxsim support #211

Open
FarmersWrap wants to merge 31 commits into texttron:main from FarmersWrap:0121-chunk-encode


FarmersWrap commented Jan 25, 2026

  1. Adds fully random chunking and passage-level chunking for training.
  2. Adds chunk encoding for evaluation.

For training (train):

  1. --passage_chunk_size 256 # fixed-size chunks; chunking is disabled when set to 0.
  2. --passage_chunk_size_range 32,256 # one chunk size is picked between 32 and 256 and used for every chunk within a passage.
  3. --passage_chunk_size_range 32,256
    --passage_chunk_size_variable # fully random chunks; each chunk size is picked between 32 and 256.
  4. --encode_use_pre_chunked # not fully tested; trains on a dataset of (doc_id, chunks) records.
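
The first three modes can be illustrated with a minimal sketch. This is not the code in this PR: the function `chunk_tokens` and its parameters are hypothetical names, and it assumes that sizes are counted in tokens, that the range is inclusive, and that a range takes precedence over the fixed size when both are given.

```python
import random
from typing import List, Optional, Sequence, Tuple


def chunk_tokens(
    tokens: Sequence[int],
    chunk_size: int = 0,
    size_range: Optional[Tuple[int, int]] = None,
    variable: bool = False,
    rng: Optional[random.Random] = None,
) -> List[List[int]]:
    """Split one passage's token ids into chunks (hypothetical helper).

    chunk_size  ~ --passage_chunk_size (0 disables chunking)
    size_range  ~ --passage_chunk_size_range (assumed inclusive)
    variable    ~ --passage_chunk_size_variable
    """
    rng = rng or random.Random()

    # No range and no positive fixed size: keep the passage as a single chunk.
    if size_range is None and chunk_size <= 0:
        return [list(tokens)]

    # Assumption: a range, when given, overrides the fixed chunk size.
    lo, hi = size_range if size_range is not None else (chunk_size, chunk_size)
    if lo < 1 or hi < lo:
        raise ValueError(f"invalid chunk size range: {lo},{hi}")

    # Mode 2: sample one size and reuse it for the whole passage.
    per_passage_size = rng.randint(lo, hi)

    chunks: List[List[int]] = []
    start = 0
    while start < len(tokens):
        # Mode 3: resample the size for every chunk.
        size = rng.randint(lo, hi) if variable else per_passage_size
        chunks.append(list(tokens[start:start + size]))
        start += size
    return chunks
```

In every mode the last chunk may be shorter than the sampled size, because it simply takes whatever tokens remain.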

For evaluation encoding (encode):

  1. Fixed chunk size; 0 means no chunking.
    --passage_chunk_size 0
  2. Pick a random size between 32 and 64 tokens and chunk each passage with it.
    --passage_chunk_size_range "32,64"
  3. Use a pre-chunked JSONL corpus.
    --dataset_name json
    --dataset_path "${prechunked_corpus_jsonl}"
    --dataset_split train
    --encode_use_pre_chunked

Each line of prechunked.jsonl is one document: {"docid": "q-en-0-pos-0", "chunks": ["a", "b", "c"]}
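
A file in this shape could be produced and sanity-checked with a short script like the one below. It is only a sketch: `write_prechunked` and `read_prechunked` are hypothetical helpers, not part of this PR, and the only schema assumed is the one shown in the example line (a string `docid` and a list `chunks`).

```python
import json
from typing import Dict, Iterator, List


def write_prechunked(path: str, docs: Dict[str, List[str]]) -> None:
    """Write one JSON object per line: {"docid": ..., "chunks": [...]}."""
    with open(path, "w", encoding="utf-8") as f:
        for docid, chunks in docs.items():
            record = {"docid": docid, "chunks": chunks}
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


def read_prechunked(path: str) -> Iterator[dict]:
    """Yield records, rejecting lines that lack the two expected fields."""
    with open(path, encoding="utf-8") as f:
        for line_no, line in enumerate(f, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            if not isinstance(record.get("docid"), str):
                raise ValueError(f"line {line_no}: missing or non-string 'docid'")
            if not isinstance(record.get("chunks"), list):
                raise ValueError(f"line {line_no}: missing or non-list 'chunks'")
            yield record
```

Validating the file before passing it to --dataset_path catches malformed lines early rather than partway through encoding.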
