XPath Agent is an advanced language-model-based crawler designed to efficiently extract information from massive HTML files.
To reproduce the results presented in the associated paper, follow the steps below using the provided command-line interface script.
Ensure you have the necessary dependencies installed.
```bash
# create a new python environment via conda
conda create -n feilian python=3.10
# install dependencies via pip
python -m pip install -r requirements.txt
```

Configure environment variables. For example, if you are using the OpenAI API, make sure the following environment variable is set correctly:
```bash
OPENAI_API_KEY=sk-xxx
```
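Alternatively, since the CLI accepts an environment file (`--env`, default `.env`), the key can live in a `.env` file at the repository root. This is a minimal sketch; the `OPENAI_BASE_URL` line is only an assumption for users of OpenAI-compatible endpoints and may not be read by the script:

```bash
# .env -- loaded by scripts/experiment_cli.py via --env (default: .env)
OPENAI_API_KEY=sk-xxx
# optional, only for OpenAI-compatible proxy endpoints (assumption)
OPENAI_BASE_URL=https://api.openai.com/v1
```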
Download SWDE & SWDE_Extended to the `data` folder and unzip them:

- SWDE: https://academictorrents.com/details/411576c7e80787e4b40452360f5f24acba9b5159
- SWDE_Extended: https://github.com/cdlockard/expanded_swde
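Since the CLI's `--data_dir` default is `data/swde`, a layout along the following lines is a reasonable target. This is a sketch under stated assumptions: the SWDE archive is fetched separately with a torrent client, the archive name `swde.zip` and the destination paths are placeholders, and SWDE_Extended is obtained by cloning its repository:

```bash
mkdir -p data
# SWDE: download via the torrent link above, then unpack
# (archive name and destination directory are assumptions)
unzip swde.zip -d data/swde
# SWDE_Extended: clone the expanded ground-truth repository
git clone https://github.com/cdlockard/expanded_swde.git data/expanded_swde
```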
Use the following command to run the experiment:
```bash
python scripts/experiment_cli.py --dataset <DATASET> --ie_model <IE_MODEL> --program_model <PROGRAM_MODEL> --name <EXPERIMENT_NAME>
```

- `--dataset`: Specifies the dataset to use. Options are `SWDE` or `SWDE_Extended`.
- `--data_dir`: The directory where the dataset is stored. Default is `data/swde`.
- `--ie_model`: The information extraction model to be used.
- `--program_model`: The program model to be used.
- `--ablation`: Optional. Specifies the ablation study option.
- `--eval_sample_size`: The number of samples to evaluate. Default is `32`.
- `--output`: The directory where the results will be saved. Default is `.tmp`.
- `--env`: The environment file to load variables from. Default is `.env`.
- `--seed`: The random seed for reproducibility. Default is `0`.
- `--name`: A unique identifier for the experiment.
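For reference, a full invocation could look like the example below. The model identifiers and experiment name are placeholders chosen for illustration, not values mandated by the script:

```bash
python scripts/experiment_cli.py \
    --dataset SWDE \
    --data_dir data/swde \
    --ie_model gpt-4o-mini \
    --program_model gpt-4o \
    --eval_sample_size 32 \
    --seed 0 \
    --output .tmp \
    --name swde_gpt4o_baseline
```

Results are written to the directory given by `--output` (default `.tmp`).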
