🌱 SePar: Sequence Labeling algorithms for Parsing

Hi 👋 This is a Python implementation of sequence-labeling algorithms for Dependency, Constituency and Graph Parsing.

It is also the official repository of the following papers:

Please feel free to reach out if you want to collaborate or add new parsers to SePar!

Installation

This code was tested with Python >= 3.12 on a GPU system with NVIDIA drivers (>= 535) and CUDA (>= 12.4) installed. Use requirements.txt to install the dependencies in an existing environment:

pip install -r requirements.txt
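
A minimal setup sketch, assuming you want an isolated environment (any existing Python >= 3.12 environment also works):

python3 -m venv .venv && source .venv/bin/activate
pip install -r requirements.txt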

Evaluation

Our code allows running the official evaluation of constituency (Sekine and Collins, 1997) and graph (Oepen et al., 2015) parsers. Please follow the instructions to download and install the EVALB executable and the SDP toolkit. For constituency and graph parsers, use the --evalb argument to compute the official evaluation; a sketch follows the list below.

  • For constituency parsers, the --evalb argument requires three paths: (i) the EVALB executable, (ii) the labeled parameter file and (iii) the unlabeled parameter file.
  • For graph parsers, the --evalb argument is the path to the run.sh file from the SDP toolkit.
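
As an illustration, hedged sketches of both cases (every path below, including the trained model, configuration file, input data, EVALB installation and SDP toolkit, is a placeholder for your local setup):

# Constituency: EVALB executable, labeled and unlabeled parameter files.
python3 run.py con-tetra --load results/con-tetra/parser.pt -c configs/xlm.ini -d 0 \
    --evalb EVALB/evalb EVALB/labeled.prm EVALB/unlabeled.prm \
    eval data/test.ptb

# Graph: path to the run.sh script of the SDP toolkit.
python3 run.py grp-idx --load results/grp-idx/parser.pt -c configs/xlm.ini -d 0 \
    --evalb sdp/run.sh \
    eval data/test.sdp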

Usage

You can train, evaluate and predict with different parsers from the terminal with the run.py script. Each parser has a string identifier that is passed as the first argument of run.py. The following table shows the available parsers and their configuration (modifiable through terminal arguments).

| Identifier | Parser | Paper | Arguments | Default |
|---|---|---|---|---|
| dep-idx | Absolute and relative indexing | Strzyz et al. (2019) | rel$\in${true, false} | false |
| dep-pos | PoS-tag relative indexing | Strzyz et al. (2019) | | |
| dep-bracket | Bracketing encoding ($k$-planar) | Strzyz et al. (2020) | k$\in\mathbb{N}$ | 1 |
| dep-bit4 | $4$-bit projective encoding | Gómez-Rodríguez et al. (2023) | proj$\in${None, head, head+path, path} | None |
| dep-bit7 | $7$-bit $2$-planar encoding | Gómez-Rodríguez et al. (2023) | | |
| dep-hexa | Hexa-Tagging | Amini et al. (2023) | proj$\in${head, head+path, path} | head |
| dep-hier | Hierarchical Bracketing | Ezquerro et al. (2025a) | variant$\in${proj, head, head+path, path, nonp} | proj |
| dep-eager | Arc-Eager system | Nivre and Fernández-González (2002) | stack$\in\mathbb{N}$, buffer$\in\mathbb{N}$, proj$\in${None, head, head+path, path} | 1, 1, None |
| dep-biaffine | Biaffine dependency parser | Dozat and Manning (2017) | | |
| con-idx | Absolute and relative indexing | Gómez-Rodríguez and Vilares (2018) | rel$\in${true, false} | false |
| con-tetra | Tetra-Tagging | Kitaev and Klein (2020) | | |
| grp-idx | Absolute and relative indexing | Ezquerro et al. (2024) | rel$\in${true, false} | false |
| grp-bracket | Bracketing encoding ($k$-planar) | Ezquerro et al. (2024) | k$\in\mathbb{N}$ | 2 |
| grp-hier | Hierarchical bracketing encoding | Ezquerro et al. (2025b) | | |
| grp-bit4k | $4k$-bit encoding | Ezquerro et al. (2024) | k$\in\mathbb{N}$ | 3 |
| grp-bit6k | $6k$-bit encoding | Ezquerro et al. (2024) | k$\in\mathbb{N}$ | 3 |
| grp-cov | Covington | Covington (2001) | | |
| grp-biaffine | Biaffine graph parser | Dozat and Manning (2018) | | |

Training

To train a parser from scratch, invoke the run.py script with the following syntax:

python3 run.py <parser-identifier> <specific-args> \
    -c <conf> -d <device> (--load <pt-path> --seed <seed> --evalb <*evalb-paths>) \
    train --train <train-path> --dev <dev-path> --test <test-paths> \
    -o <output-folder> (--run-name <run-name>)

where:

  • <parser-identifier> is the identifier specified in the table above (e.g. dep-idx),
  • <specific-args> are the specific arguments of each parser (e.g. --rel for dep-idx),
  • <conf> is the model configuration file (see some examples in the configs folder),
  • <device> is the CUDA integer device,
  • <train-path>, <dev-path> and <test-paths> are the paths to the training, development and test sets (multiple test paths are possible),
  • <output-folder> is a folder to store the training results (including the parser.pt file).

And optionally:

  • <pt-path>: Path to an existing .pt file to load the parser from.
  • <seed>: Random seed. By default, the code uses seed 123.
  • <*evalb-paths>: Paths to the evaluation scripts. This option is only available for constituency and graph parsers (see the Evaluation instructions).
  • <run-name>: wandb run identifier.
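
For instance, a hedged sketch of a full training run (the configuration file, data paths and output folder are placeholders, not files shipped with the repository):

python3 run.py dep-idx --rel \
    -c configs/xlm.ini -d 0 --seed 123 \
    train --train data/train.conllu --dev data/dev.conllu --test data/test.conllu \
    -o results/dep-idx --run-name dep-idx-example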

W&B logging: SePar also allows model debugging with wandb. Please follow these instructions to create an account and connect it to your local installation. Note that SePar still works without a wandb account.

Distributed training

SePar supports distributed training with FSDP2 by launching the run.py script with torchrun. Use the CUDA_VISIBLE_DEVICES environment variable to control which GPUs are visible.

CUDA_VISIBLE_DEVICES=<devices> torchrun --nproc_per_node <num-devices> \
    run.py <parser-identifier> <specific-args> \
    -c <conf> (--load <pt-path> --seed <seed>) \
    train --train <train-path> --dev <dev-path> --test <test-paths> \
    -o <output-folder> (--run-name <run-name>)

where <devices> is the comma-separated list of GPU identifiers and <num-devices> is the number of GPUs used.
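
For instance, a hedged sketch of the same training run distributed over two GPUs (paths and configuration are placeholders):

CUDA_VISIBLE_DEVICES=0,1 torchrun --nproc_per_node 2 \
    run.py dep-idx --rel -c configs/xlm.ini \
    train --train data/train.conllu --dev data/dev.conllu --test data/test.conllu \
    -o results/dep-idx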

Warning

As introduced in this tutorial, FSDP2 requires manually specifying which modules or layers are sharded across GPUs to obtain a good parameter distribution. In separ/utils/shard.py we include a function recursive_shard() that only shards large Transformer layers (specifically, those of the pretrained models referenced in the configs folder). We suggest manually adding more layers when training with other LLMs. Do not hesitate to reach out to us if you need any help!

Evaluation

Evaluation with a trained parser is also performed with the run.py script.

python3 run.py <parser-identifier> <specific-args> --load <pt-path> -c <conf> -d <device> \
    eval <input> (--output <output> --batch-size <batch-size>)

where:

  • <parser-identifier> is the identifier specified in the table above (e.g. dep-idx),
  • <specific-args> are the specific arguments of each parser (e.g. --rel for dep-idx),
  • <pt-path> is the path where the parser has been stored (e.g. the parser.pt file created after training),
  • <conf> is the model configuration file (see some examples in the configs folder),
  • <device> is the CUDA integer device (e.g. 0),
  • <input> is the annotated file used for evaluation.

And optionally:

  • <output>: Folder where the resulting metrics are stored.
  • <batch-size>: Inference batch size. By default, it is set to 100.
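
For instance, a hedged sketch evaluating the parser trained above (all paths are placeholders):

python3 run.py dep-idx --rel --load results/dep-idx/parser.pt -c configs/xlm.ini -d 0 \
    eval data/test.conllu --output results/dep-idx --batch-size 50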

Prediction

Prediction with a trained parser is also performed with the run.py script.

python3 run.py <parser-identifier> <specific-args> --load <pt-path> -c <conf> -d <device> \
    predict <input> <output> (--batch-size <batch-size>)

where:

  • <parser-identifier> is the identifier specified in the table above (e.g. dep-idx),
  • <specific-args> are the specific arguments of each parser (e.g. --rel for dep-idx),
  • <pt-path> is the path where the parser has been stored (e.g. the parser.pt file created after training),
  • <conf> is the model configuration file (see some examples in the configs folder),
  • <device> is the CUDA integer device (e.g. 0),
  • <input> is the input file to parse,
  • <output> is the path where the predicted file is stored.

And optionally:

  • <batch-size>: Inference batch size. By default, it is set to 100.
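
For instance, a hedged sketch parsing a file with the model trained above (all paths are placeholders):

python3 run.py dep-idx --rel --load results/dep-idx/parser.pt -c configs/xlm.ini -d 0 \
    predict data/test.conllu results/dep-idx/predictions.conllu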

Reproducibility and examples

Check the docs folder for specific examples of running different dependency (docs/dep.md), constituency (docs/con.md) and semantic (docs/grp.md) parsers. Each document contains specific instructions to reproduce the results of the original papers.

The docs/examples.ipynb notebook includes some examples of how to use the implemented classes and methods to parse and linearize input graphs/trees.
