
LogiNumSynth: Synthesizing Joint Logical-Numerical Reasoning Problems for Language Models

This repository hosts the supplementary material for the LogiNumSynth project. Our work introduces a flexible synthesizer that generates natural language problems requiring proficiency in joint logical-numerical reasoning. The repository contains scripts for data synthesis, model evaluation, and fine-tuning.

This README provides instructions for:

  1. Data Synthesis: How to synthesize data using LogiNumSynth.
  2. Model Evaluation: How to evaluate models on synthesized data.
  3. Model Fine-tuning: How to fine-tune models on the synthesized data.

1. Data Synthesis with LogiNumSynth

Data synthesis with LogiNumSynth involves two sequential steps:

  1. Template-based Synthesis: Use code in synthesizer/ to synthesize template-based descriptions along with their formal representations. The resources/ folder provides pools and templates for synthesis.
  2. Natural Language Conversion: Use code in nl-tuning/ to convert the template-based descriptions into more natural language descriptions using a large language model.

Step 1: Synthesize Template-based Descriptions

To synthesize the same dataset configurations as described in our paper, run:

cd synthesizer && python main.py # for EL-EN, EL-HN, HL-EN and HL-HN
cd synthesizer && python main-train.py # for EL-Train and HL-Train
cd synthesizer && python main-exhl-hn.py # for exHL-HN

Customizing Synthesis

To customize the synthesis process, refer to the main*.py files to modify configurations (detailed in Appendix E.3 of our paper). Here's the minimal code for synthesis:

from synthesizer.pool import PoolFactory
from synthesizer.template import TemplateFactory
from synthesizer.theory import Theory
# The expression classes used below are also defined in the synthesizer package;
# adjust this import to the module where they actually live.
from synthesizer.expression import (BinaryExpression, ConstantExpression,
                                    IdentityExpression, LinearExpression)

pool_factory = PoolFactory("../resources/pools.json")
template_factory = TemplateFactory("../resources/templates.json")

entities = pool_factory.get_entity_pool(10)
attributes = pool_factory.get_attribute_pool(15)
relations = pool_factory.get_relation_pool(10)

numerical_hard_expression = {
    "normal": {ConstantExpression: 0, IdentityExpression: 0, LinearExpression: 1, BinaryExpression: 1},
    "binary": {ConstantExpression: 1, IdentityExpression: 1, LinearExpression: 1}
}

theory = Theory(template_factory, 
                entities, 
                attributes, 
                relations, 
                fact_num=15, 
                rule_num=15, 
                depth=3,
                condition_num_interval=(1, 3), 
                expression_weights=numerical_hard_expression, 
                interval=(-100, 100))
data = theory.to_json()
data["id"] = "xxx"

Instantiating the Theory class synthesizes a sample with template-based descriptions. The parameters are:

  • template_factory: Factory to load templates from resources/templates.json
  • entities, attributes, relations: Sets of entities, attributes, and relations sampled from the pools
  • fact_num, rule_num: Number of facts and rules to be synthesized
  • depth: Depth of the reasoning process
  • condition_num_interval: Range of the number of conditions in each rule
  • expression_weights: Weights of different types of numerical expressions
  • interval: Range of operand values

If you want to extend the pools and templates, modify resources/pools.json and resources/templates.json. If you want to extend the synthesizer itself (e.g., new numerical expression types or logical operators), modify the code in the synthesizer/ folder.

Step 2: Convert template-based descriptions into more natural language descriptions

To convert the template-based descriptions into more natural language descriptions, please run:

cd nl-tuning && bash run_llm_tuning.sh
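
The script drives this conversion with a large language model. As a rough illustration of the idea (not the script's actual configuration), assuming an OpenAI-compatible endpoint and an illustrative model name:

from openai import OpenAI  # assumes an OpenAI-compatible endpoint is available

client = OpenAI()

def naturalize(template_text: str) -> str:
    """Ask an LLM to rewrite a template-based description as fluent English."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # illustrative model name, not the one used in the paper
        messages=[
            {"role": "system",
             "content": "Rewrite the following facts and rules as fluent, natural "
                        "English without changing their meaning."},
            {"role": "user", "content": template_text},
        ],
    )
    return response.choices[0].message.content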

Pre-synthesized Datasets

We have already synthesized several datasets using LogiNumSynth (as described in our paper). These can be found in the data/ folder, with corresponding few-shot examples in the prompt/ folder:

Available Datasets:

  • EL-EN: Easy Logical and Easy Numerical reasoning tasks (el-en.jsonl)
  • EL-HN: Easy Logical and Hard Numerical reasoning tasks (el-hn.jsonl)
  • HL-EN: Hard Logical and Easy Numerical reasoning tasks (hl-en.jsonl)
  • HL-HN: Hard Logical and Hard Numerical reasoning tasks (hl-hn.jsonl)
  • exHL-HN: Extremely Hard Logical and Hard Numerical reasoning tasks, composed of four subtasks (depth-7.jsonl, depth-8.jsonl, depth-9.jsonl, depth-10.jsonl)
  • EL-Train: Easy Logical but Hard Numerical reasoning tasks for training (train-el.jsonl)
  • EN-Train: Easy Numerical but Hard Logical reasoning tasks for training (train-en.jsonl)
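
Each dataset file is in JSONL format (one JSON object per line), so it can be inspected directly, for example:

import json

# Load a pre-synthesized dataset and inspect the fields of the first sample.
with open("data/el-en.jsonl", encoding="utf-8") as f:
    samples = [json.loads(line) for line in f]

print(len(samples))
print(samples[0].keys())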

2. Model Evaluation on Synthesized Data

Model evaluation involves two steps:

  1. Model Inference: Use code in llm-evaluation/ or llm-evaluation-api/ to evaluate models on the synthesized data.
  2. Output Scoring: Use code in answer-conclude/ to score the model outputs for answer accuracy and process accuracy.

Step 1: Evaluate models on the synthesized data

If you want to evaluate open-source models deployed locally, please configure llm-evaluation/run_llm_vllm_loop.sh and run:

cd llm-evaluation && bash run_llm_vllm_loop.sh

If you want to evaluate models via APIs, please configure llm-evaluation-api/do_normal_call.py and run:

cd llm-evaluation-api && python do_normal_call.py

You can either use the API wrapper provided by llm-evaluation-api/normal_api.py or implement the call yourself. Instructions and few-shot examples are provided in prompt/.
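
If you implement the call yourself, the evaluation loop is roughly the following sketch (call_model is a placeholder for normal_api.py or your own client, and the field names are illustrative):

import json

def evaluate(dataset_path: str, prompt_path: str, call_model, output_path: str):
    """Prepend the instruction/few-shot prompt to every sample and collect model outputs."""
    with open(prompt_path, encoding="utf-8") as f:
        few_shot_prompt = f.read()
    with open(dataset_path, encoding="utf-8") as f_in, \
         open(output_path, "w", encoding="utf-8") as f_out:
        for line in f_in:
            sample = json.loads(line)
            # "question" is an illustrative key; inspect the JSONL for the real field names.
            query = few_shot_prompt + "\n\n" + sample.get("question", "")
            sample["model_output"] = call_model(query)
            f_out.write(json.dumps(sample, ensure_ascii=False) + "\n")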

Step 2: Score the model outputs for answer accuracy and process accuracy

To score the model outputs for answer accuracy and process accuracy, you first need to call an LLM to structure the model outputs into JSON format. To do so, configure answer-conclude/run_conclude_batch.sh and run:

cd answer-conclude && bash run_conclude_batch.sh

Then, you can score the model outputs by running:

cd answer-conclude && bash run_score.sh
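
Answer accuracy then reduces to an exact-match ratio over the structured outputs. A minimal sketch, assuming hypothetical predicted_answer and gold_answer fields (the real field names are produced by the concluding step above):

import json

def answer_accuracy(path: str) -> float:
    """Exact-match answer accuracy over structured outputs (field names are hypothetical)."""
    correct = total = 0
    with open(path, encoding="utf-8") as f:
        for line in f:
            record = json.loads(line)
            total += 1
            if str(record["predicted_answer"]).strip() == str(record["gold_answer"]).strip():
                correct += 1
    return correct / total if total else 0.0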

3. Model Fine-tuning on Synthesized Data

Fine-tuning involves two steps:

  1. Supervised Fine-tuning: Use code in sft/ to run supervised fine-tuning (optionally with RecAdam) on the synthesized data.
  2. Benchmark Evaluation: Use code in sft-eval/ to evaluate the fine-tuned models on external numerical/logical reasoning benchmarks.

Step 1: Run supervised fine-tuning (optionally with RecAdam) on the synthesized data

If you want to fine-tune a model on the synthesized data, please configure sft/train_swanlab.sh and run:

cd sft && bash train_swanlab.sh

You can enable RecAdam (to mitigate catastrophic forgetting) by setting the flag below to true; keep it false to use the standard optimizer.

USE_RECALL_ADAM=false  # set to true to enable RecAdam

You can also use SwanLab for experiment tracking and model management. Please edit the following configurations in sft/train_swanlab.sh:

PROJECT_NAME=${2:-"LogiNumSynth"}  # your SwanLab project name
SWANLAB_MODE=${3:-"cloud"}  # cloud, offline, disabled, or local
API_KEY=${4:-""} # your SwanLab API key
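
These shell variables map onto SwanLab's Python API inside the training script. As a rough sketch of the correspondence (the actual wiring in sft/sft.py may differ):

import swanlab

swanlab.login(api_key="<your-api-key>")  # corresponds to API_KEY
swanlab.init(
    project="LogiNumSynth",              # corresponds to PROJECT_NAME
    mode="cloud",                        # corresponds to SWANLAB_MODE
)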

For advanced training configuration (evaluation/save/generation behavior, batch sizes, etc.), modify sft/sft.py directly where training_args is constructed. Example:

from transformers import IntervalStrategy  # if not already imported in sft/sft.py

# === Configure evaluation/save strategy early (before swanlab.init) ===
training_args.evaluation_strategy = IntervalStrategy.STEPS
training_args.eval_steps = 20
training_args.save_strategy = IntervalStrategy.STEPS
training_args.save_steps = 625
training_args.load_best_model_at_end = True
training_args.metric_for_best_model = "accuracy"
training_args.greater_is_better = True
# === Use generation-based evaluation to avoid accumulating logits ===
training_args.predict_with_generate = True
training_args.generation_max_new_tokens = 4096
training_args.generation_num_beams = 1
training_args.generation_do_sample = False
# Keep eval batch small and clear intermediate tensors quickly
training_args.per_device_eval_batch_size = 8
training_args.eval_accumulation_steps = 1
training_args.dataloader_pin_memory = False

Step 2: Evaluate the fine-tuned models on the external numerical/logical reasoning benchmarks

We provide external benchmarks under sft-eval/datasets/, grouped into logical and numerical reasoning. Most datasets use the standard val.jsonl and test.jsonl splits (some keep the original .json where the source format is preserved). Datasets with only a test split (e.g., mawps, aime24, rulearena) include just test.jsonl.

Numerical / mathematical benchmarks (paths):

  • sft-eval/datasets/gsm8k/main/val.jsonl, test.jsonl
  • sft-eval/datasets/math/val.jsonl, test.jsonl
  • sft-eval/datasets/mathqa/val.json, test.json
  • sft-eval/datasets/SVAMP/data/val.jsonl, test.jsonl
  • sft-eval/datasets/mawps/test.jsonl
  • sft-eval/datasets/aime24/test.jsonl

Formal deductive logical reasoning benchmarks (paths):

  • sft-eval/datasets/ruletaker/data/val.jsonl, test.jsonl
  • sft-eval/datasets/proofwriter/data/val.jsonl, test.jsonl
  • sft-eval/datasets/folio/val.jsonl, test.jsonl
  • sft-eval/datasets/fld/data/val.jsonl, test.jsonl

Complex logical reasoning and joint logical-numerical reasoning benchmarks (paths):

  • sft-eval/datasets/logiqa/val.jsonl, test.jsonl
  • sft-eval/datasets/reclor/val.json, test.json
  • sft-eval/datasets/abductionr/data/val.jsonl, test.jsonl
  • sft-eval/datasets/rulearena/airline.jsonl, nba.jsonl, tax.jsonl
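
All of these benchmarks ship as either JSON or JSONL files; a small helper for loading both formats (a sketch, not part of the evaluation scripts):

import json
from pathlib import Path

def load_split(path: str):
    """Load a benchmark split, handling both .jsonl and .json files."""
    p = Path(path)
    with p.open(encoding="utf-8") as f:
        if p.suffix == ".jsonl":
            return [json.loads(line) for line in f if line.strip()]
        return json.load(f)  # .json files keep the source's original structure

examples = load_split("sft-eval/datasets/gsm8k/main/test.jsonl")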

If you want to evaluate a fine-tuned model on the external benchmarks, please configure sft-eval/run_test.sh and run:

cd sft-eval && bash run_test.sh
