Skip to content

lorrainea/rrBDA-index

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

114 Commits
 
 
 
 
 
 
 
 
 
 

Repository files navigation

rrBDA-index: Text Indexing for Long Patterns

Description

This repository maintains a time- and space-efficient construction algorithm of the rrBDA-index, a new variant of the BDA-index for long patterns introduced by Loukides, Pissis, and Sweering. This new construction relies on an algorithm for computing a randomized counterpart of reduced bd-anchors. It comes in two flavours: a semi-external memory implementation and an internal memory implementation.

Requirements

  • A GNU/Linux system
  • A modern C++17 ready compiler

How to use

INPUT: A file containing the text and a file containing a set of patterns seperated by a new line.

OUTPUT: A file containing the set of patterns and the starting position of their occurrences within the text.

Installation

For the construction that uses internal memory only and 32-bit integers, compile as follows:

cd rrBDA-index_int
./pre-install.sh
make -f Makefile.32-bit.gcc

For the construction that uses both external and internal memory and 32-bit integers, compile as follows:

cd rrBDA-index_ext
./pre-install.sh
make -f Makefile.32-bit.gcc

In the same directories, you can find makefiles for 64-bit integers.

Usage

./rrbda-index_int <text_file> <ell> <pattern_file> <block_size> <output_filename> <index_filename>
./rrbda-index_ext <text_file> <ell> <pattern_file> <block_size> <ram_use> <output_filename> <index_filename>

<text_file> - name of input text file.
<ell> - lower bound on the length of input patterns to consider. 
<pattern_file> - name of input file containing the patterns.
<block_size> - size of block to use for constructing the bd-anchors (bytes).
<ram_use> - RAM usage for external SA and LCP array construction (MiB).
<output_filename> - name of output file, where pattern occurrences will be output.
<index_filename> - name of the index file to be used (if it exists) otherwise to be created.

Examples

For the construction that uses internal memory only, run as follows:

 $ ./rrbda-index_int ./data/text 3 ./data/patterns 10 out index

For the construction that uses both external and internal memory, run as follows:

 $ ./rrbda-index_ext ./data/text 3 ./data/patterns 10 1024 out index

Note that the exact same index is constructed in the end.

Evaluation

We have conducted an extensive evaluation of different text indexes. We give a teaser below using the whole human genome (version GRCh38) and a set of 500K patterns (randomly selected from the text).

The plot below depicts the indexes size for growing , the lower bound on the supported pattern lengths.

size

The plot below depicts the average pattern matching time of the indexes for growing pattern length |P|.

query_time

Citation

Lorraine A. K. Ayad, Grigorios Loukides, and Solon P. Pissis. 2024. Text indexing for long patterns using locally consistent anchors. The VLDB Journal [doi].

License

GNU GPLv3 License; Copyright (C) 2024 Lorraine A. K. Ayad, Grigorios Loukides and Solon P. Pissis.

About

Text Indexing for Long Patterns

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Contributors 2

  •  
  •