This repository hosts the supplementary material for the LogiNumSynth project. Our work introduces a flexible natural language problem synthesizer that generates tasks requiring proficiency in joint logical-numerical reasoning. The repository contains data synthesis, model evaluation, and fine-tuning scripts.
This README provides instructions for:
- Data Synthesis: How to synthesize data using LogiNumSynth.
- Model Evaluation: How to evaluate models on synthesized data.
- Model Fine-tuning: How to fine-tune models on the synthesized data.
Data synthesis with LogiNumSynth involves two sequential steps:
- Template-based Synthesis: Use code in synthesizer/ to synthesize template-based descriptions along with their formal representations. The resources/ folder provides pools and templates for synthesis.
- Natural Language Conversion: Use code in nl-tuning/ to convert the template-based descriptions into more natural language descriptions using a large language model.
To synthesize the same dataset configurations as described in our paper, run:
cd synthesizer && python main.py # for EL-EN, EL-HN, HL-EN and HL-HN
cd synthesizer && python main-train.py # for EL-Train and HL-Train
cd synthesizer && python main-exhl-hn.py # for exHL-HN
To customize the synthesis process, refer to the main*.py files to modify configurations (detailed in Appendix E.3 of our paper). Here's the minimal code for synthesis:
from synthesizer.pool import PoolFactory
from synthesizer.template import TemplateFactory
from synthesizer.theory import Theory
# The expression classes used below also need to be imported from the
# synthesizer package; the exact module path may differ in your checkout.
from synthesizer.expression import ConstantExpression, IdentityExpression, LinearExpression, BinaryExpression
pool_factory = PoolFactory("../resources/pools.json")
template_factory = TemplateFactory("../resources/templates.json")
entities = pool_factory.get_entity_pool(10)
attributes = pool_factory.get_attribute_pool(15)
relations = pool_factory.get_relation_pool(10)
numerical_hard_expression = {
"normal": {ConstantExpression: 0, IdentityExpression: 0, LinearExpression: 1, BinaryExpression: 1},
"binary": {ConstantExpression: 1, IdentityExpression: 1, LinearExpression: 1}
}
theory = Theory(template_factory,
entities,
attributes,
relations,
fact_num=15,
rule_num=15,
depth=3,
condition_num_interval=(1, 3),
expression_weights=numerical_hard_expression,
interval=(-100, 100))
data = theory.to_json()
data["id"] = "xxx"Instantiating the Theory class synthesizes a sample with template-based descriptions. The parameters are:
- template_factory: Factory to load templates from resources/templates.json
- entities, attributes, relations: Sets of entities, attributes, and relations sampled from the pools
- fact_num, rule_num: Number of facts and rules to be synthesized
- depth: Depth of the reasoning process
- condition_num_interval: Range of the number of conditions in each rule
- expression_weights: Weights of different types of numerical expressions
- interval: Range of operand values
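For example, here is a minimal sketch for synthesizing a batch of samples and saving them to a JSONL file, reusing the configuration above (it assumes theory.to_json() returns a JSON-serializable dict):
import json
# Minimal sketch: synthesize a batch of samples and write them to a JSONL file.
# Reuses pool_factory, template_factory, and numerical_hard_expression from above.
with open("my-dataset.jsonl", "w") as f:
    for i in range(100):
        # Optionally resample the pools for each example.
        entities = pool_factory.get_entity_pool(10)
        attributes = pool_factory.get_attribute_pool(15)
        relations = pool_factory.get_relation_pool(10)
        theory = Theory(template_factory,
                        entities,
                        attributes,
                        relations,
                        fact_num=15,
                        rule_num=15,
                        depth=3,
                        condition_num_interval=(1, 3),
                        expression_weights=numerical_hard_expression,
                        interval=(-100, 100))
        data = theory.to_json()
        data["id"] = f"sample-{i}"
        f.write(json.dumps(data) + "\n")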
If you want to extend the pools and templates, modify resources/pools.json and resources/templates.json. If you want to extend the synthesizer itself (e.g., the numerical expressions or logical operators), modify the code in the synthesizer/ folder.
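Before editing, you can inspect the structure of the existing files, for instance:
import json
# Inspect resources/pools.json before extending it; new entries should follow
# the same structure as the existing ones.
with open("resources/pools.json") as f:
    pools = json.load(f)
print(type(pools).__name__)
if isinstance(pools, dict):
    print(sorted(pools.keys()))  # top-level pool categories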
To convert the template-based descriptions into more natural language descriptions, please run:
cd nl-tuning && bash run_llm_tuning.sh
We have already synthesized several datasets using LogiNumSynth (as described in our paper). These can be found in the data/ folder, with corresponding few-shot examples in the prompt/ folder:
Available Datasets:
- EL-EN: Easy Logical and Easy Numerical reasoning tasks, named el-en.jsonl
- EL-HN: Easy Logical and Hard Numerical reasoning tasks, named el-hn.jsonl
- HL-EN: Hard Logical and Easy Numerical reasoning tasks, named hl-en.jsonl
- HL-HN: Hard Logical and Hard Numerical reasoning tasks, named hl-hn.jsonl
- exHL-HN: extremely Hard Logical and Hard Numerical reasoning tasks composed of 4 subtasks, named depth-7.jsonl, depth-8.jsonl, depth-9.jsonl, and depth-10.jsonl
- EL-Train: Easy Logical but Hard Numerical reasoning tasks for training, named train-el.jsonl
- EN-Train: Easy Numerical but Hard Logical reasoning tasks for training, named train-en.jsonl
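Each dataset is stored in JSON Lines format (one JSON object per line), so it can be loaded with a few lines of Python; the exact per-record fields depend on the synthesizer output:
import json
# Load a synthesized dataset and inspect the available fields.
with open("data/el-en.jsonl") as f:
    examples = [json.loads(line) for line in f]
print(len(examples), "examples")
print(sorted(examples[0].keys()))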
Model evaluation involves two steps:
- Model Inference: Use code in llm-evaluation/ or llm-evaluation-api/ to evaluate models on the synthesized data.
- Output Scoring: Use code in answer-conclude/ to score the model outputs for answer accuracy and process accuracy.
If you want to evaluate open-source models deployed locally, please configure llm-evaluation/run_llm_vllm_loop.sh and run:
cd llm-evaluation && bash run_llm_vllm_loop.sh
If you want to evaluate models via APIs, please configure llm-evaluation-api/do_normal_call.py and run:
cd llm-evaluation-api && python do_normal_call.py
You can either use the API provided by llm-evaluation-api/normal_api.py or implement it yourself. The instruction and few-shot examples are provided in prompt/.
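If you implement the call yourself, a minimal sketch using the OpenAI Python client against an OpenAI-compatible endpoint might look like the following; the model name, base URL, API key, and prompt file here are placeholders, and the request logic actually used in our experiments is the one in llm-evaluation-api/normal_api.py:
from openai import OpenAI
# Illustrative only: a minimal OpenAI-compatible chat call.
client = OpenAI(base_url="https://your-endpoint/v1", api_key="YOUR_API_KEY")
with open("../prompt/your-fewshot-prompt.txt") as f:  # instruction + few-shot examples
    system_prompt = f.read()
def query_model(problem: str, model: str = "your-model-name") -> str:
    """Send one synthesized problem to the model and return its raw answer text."""
    response = client.chat.completions.create(
        model=model,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": problem},
        ],
        temperature=0.0,
    )
    return response.choices[0].message.content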
To score the model outputs for answer accuracy and process accuracy, you first need to call an LLM to structure the model outputs into a JSON format. To do so, please configure answer-conclude/run_conclude_batch.sh and run:
cd answer-conclude && bash run_conclude_batch.sh
Then, you can score the model outputs by running:
cd answer-conclude && bash run_score.sh
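As a toy illustration of what answer-accuracy scoring over the structured outputs amounts to, here is a sketch; the field names below are assumptions for illustration only, and the actual scoring is implemented by the scripts in answer-conclude/:
import json
# Toy sketch of answer accuracy over structured outputs (JSONL).
# Field names ("predicted_answer", "gold_answer") are illustrative assumptions.
def answer_accuracy(path: str) -> float:
    correct, total = 0, 0
    with open(path) as f:
        for line in f:
            record = json.loads(line)
            total += 1
            if record["predicted_answer"] == record["gold_answer"]:
                correct += 1
    return correct / total if total else 0.0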
Fine-tuning involves two steps:
- Supervised Fine-tuning: Use code in sft/ to run supervised fine-tuning (optionally with RecAdam) on the synthesized data.
- Benchmark Evaluation: Use code in sft-eval/ to evaluate the fine-tuned models on external numerical/logical reasoning benchmarks.
If you want to fine-tune a model on the synthesized data, please configure sft/train_swanlab.sh and run:
cd sft && bash train_swanlab.sh
You can enable RecAdam (to mitigate catastrophic forgetting) by setting the flag below to true; keep it false to use the standard optimizer.
USE_RECALL_ADAM=false # set to true to enable RecAdam
You can also use SwanLab for experiment tracking and model management. Please edit the following configurations in sft/train_swanlab.sh:
PROJECT_NAME=${2:-"LogiNumSynth"} # your SwanLab project name
SWANLAB_MODE=${3:-"cloud"} # cloud, offline, disabled, or local
API_KEY=${4:-""} # your SwanLab API keyFor advanced training configuration (evaluation/save/generation behavior, batch sizes, etc.), modify sft/sft.py directly where training_args is constructed. Example:
# === Configure evaluation/save strategy early (before swanlab.init) ===
training_args.evaluation_strategy = IntervalStrategy.STEPS
training_args.eval_steps = 20
training_args.save_strategy = IntervalStrategy.STEPS
training_args.save_steps = 625
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "accuracy"
training_args.greater_is_better = True
# === Use generation-based evaluation to avoid accumulating logits ===
training_args.predict_with_generate = True
training_args.generation_max_new_tokens = 4096
training_args.generation_num_beams = 1
training_args.generation_do_sample = False
# Keep eval batch small and clear intermediate tensors quickly
training_args.per_device_eval_batch_size = 8
training_args.eval_accumulation_steps = 1
training_args.dataloader_pin_memory = False
We provide external benchmarks under sft-eval/datasets/, grouped into Logical and Numerical categories, with standard splits val.jsonl and test.jsonl (some datasets keep the original .json where the source format is preserved). Datasets with only a test split (e.g., mawps, aime24, rulearena) include just test.jsonl.
Numerical / mathematical benchmarks (paths):
- sft-eval/datasets/gsm8k/main/: val.jsonl, test.jsonl
- sft-eval/datasets/math/: val.jsonl, test.jsonl
- sft-eval/datasets/mathqa/: val.json, test.json
- sft-eval/datasets/SVAMP/data/: val.jsonl, test.jsonl
- sft-eval/datasets/mawps/: test.jsonl
- sft-eval/datasets/aime24/: test.jsonl
Formal deductive logical reasoning benchmarks (paths):
- sft-eval/datasets/ruletaker/data/: val.jsonl, test.jsonl
- sft-eval/datasets/proofwriter/data/: val.jsonl, test.jsonl
- sft-eval/datasets/folio/: val.jsonl, test.jsonl
- sft-eval/datasets/fld/data/: val.jsonl, test.jsonl
Complex logical reasoning and joint logical-numerical reasoning benchmarks (paths):
- sft-eval/datasets/logiqa/: val.jsonl, test.jsonl
- sft-eval/datasets/reclor/: val.json, test.json
- sft-eval/datasets/abductionr/data/: val.jsonl, test.jsonl
- sft-eval/datasets/rulearena/: airline.jsonl, nba.jsonl, tax.jsonl
If you want to evaluate a fine-tuned model on the external benchmarks, please configure sft-eval/run_test.sh and run:
cd sft-eval && bash run_test.sh