RLB (Run-Length and BWT) is a high-efficiency compression and retrieval algorithm, leveraging the principles of the Burrows-Wheeler Transform (BWT) combined with Run-Length Encoding (RLE). This project focuses on providing a robust solution for compressing text files and facilitating efficient keyword-based retrieval of records without the need for full decompression.
- Efficient Compression: Utilizes BWT and RLE to compress text files into a compact .rlb format.
- Selective Retrieval: Allows for keyword-based searches within compressed files, enabling the retrieval of specific records without complete decompression.
- Record-Based Structure: Each [xxx] in the text file is treated as an individual record, making it well-suited for datasets with delineated entries.
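To make the BWT plus RLE combination concrete, here is a minimal Python sketch of the general technique. It is not RLB's implementation and does not produce the .rlb byte layout, which this README does not specify. The function names, the `"\0"` sentinel, and the naive rotation sort are assumptions chosen for readability.

```python
def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive Burrows-Wheeler Transform: last column of the sorted rotations."""
    if sentinel in text:
        raise ValueError("sentinel must not occur in the input")
    s = text + sentinel
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(rot[-1] for rot in rotations)


def rle(s: str) -> list[tuple[str, int]]:
    """Run-length encode a string into (character, run length) pairs."""
    runs: list[tuple[str, int]] = []
    for ch in s:
        if runs and runs[-1][0] == ch:
            runs[-1] = (ch, runs[-1][1] + 1)
        else:
            runs.append((ch, 1))
    return runs


def unrle(runs: list[tuple[str, int]]) -> str:
    """Expand (character, run length) pairs back into a string."""
    return "".join(ch * n for ch, n in runs)


def inverse_bwt(last: str, sentinel: str = "\0") -> str:
    """Invert the transform by following the last-to-first column mapping."""
    # A stable sort of positions by character pairs each first-column
    # entry with the matching occurrence in the last column.
    order = sorted(range(len(last)), key=lambda i: last[i])
    row = last.index(sentinel)  # the row holding the unrotated text
    out = []
    for _ in range(len(last) - 1):
        row = order[row]
        out.append(last[row])
    return "".join(out)
```

BWT tends to group identical characters into runs, which is what makes the RLE step worthwhile. The transform is reversible, so `inverse_bwt(unrle(rle(bwt(text))))` returns the original `text`.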
The RLB project can compress text files of the following format:
[8]Computers in industry[9]Data compression[10]Integration[11]Big data indexing

Each [xxx] is recognized as a separate record. The compression process outputs a .rlb file, which contains the compressed data.
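The record layout above can be illustrated with a short Python sketch that splits such a text into (identifier, body) pairs. It assumes that identifiers are decimal integers and that record bodies never contain `[`. The README does not state either rule, and `split_records` is a hypothetical helper, not part of RLB.

```python
import re

# Assumption: a record is "[<decimal id>]" followed by text up to the next "[".
RECORD_RE = re.compile(r"\[(\d+)\]([^\[]*)")


def split_records(text: str) -> list[tuple[int, str]]:
    """Return (identifier, body) pairs for every record found in text."""
    return [(int(m.group(1)), m.group(2)) for m in RECORD_RE.finditer(text)]


sample = "[8]Computers in industry[9]Data compression[10]Integration[11]Big data indexing"
records = split_records(sample)
# records[0] is (8, "Computers in industry")
```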
RLB supports keyword-based searches within the compressed .rlb files. This feature allows users to efficiently search for and retrieve specific records based on keywords, without the need to decompress the entire file.
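One standard way to search BWT-transformed text without reconstructing it is backward search, sketched below in Python. Whether RLB's search program works this way is not stated in this README, so treat it as an illustration of the idea only. The names `bwt` and `count_occurrences` and the `"\0"` sentinel are assumptions, and the sketch operates on the plain BWT string rather than on a run-length-encoded file.

```python
from collections import Counter


def bwt(text: str, sentinel: str = "\0") -> str:
    """Naive BWT via sorted rotations; sentinel must not occur in text."""
    s = text + sentinel
    return "".join(rot[-1] for rot in sorted(s[i:] + s[:i] for i in range(len(s))))


def count_occurrences(last: str, pattern: str) -> int:
    """Count matches of pattern in the original text using only its BWT."""
    counts = Counter(last)
    # first_row[c]: number of characters in the text that sort before c,
    # i.e. the first row of the sorted rotations that starts with c.
    first_row = {}
    total = 0
    for c in sorted(counts):
        first_row[c] = total
        total += counts[c]

    def occ(c: str, i: int) -> int:
        # Occurrences of c in last[:i]. A linear scan keeps the sketch
        # short; a practical index would store precomputed checkpoints.
        return last.count(c, 0, i)

    lo, hi = 0, len(last)  # half-open range of rows matching the suffix so far
    for c in reversed(pattern):
        if c not in first_row:
            return 0
        lo = first_row[c] + occ(c, lo)
        hi = first_row[c] + occ(c, hi)
        if lo >= hi:
            return 0
    return hi - lo
```

The pattern is consumed from its last character to its first, narrowing a range of sorted rotations at each step. Mapping each match back to the record that contains it takes additional bookkeeping that is omitted here.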
This project includes a script example.sh that demonstrates basic usage to compress and search in example/dummy.txt.
Compress

$ ./bin/compress <input txt file> <output rlb file>

e.g.

$ ./bin/compress example/dummy.txt example/dummy.rlb

Search

$ ./bin/search <compressed rlb file> <index file> <keyword to search>

e.g.

$ ./bin/search example/dummy.rlb example/dummy.idx "in"

Using the .idx File for Accelerated Search
Our search program leverages an index file (with the extension .idx) to enhance and accelerate the search process. Here’s how it works:
- When the .idx File Exists: If an .idx file is already present at the specified path, the search program uses it to speed up the search. The .idx file contains pre-processed data that allows quicker lookup and retrieval.
- When the .idx File Does Not Exist: If no .idx file is found at the given path, the first search may take longer, because the program also builds the index data and saves it to a new .idx file at the specified path. Subsequent searches then use this file to run faster.
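The two cases above follow a common load-or-build caching pattern, sketched here in Python. The contents and on-disk format of a real .idx file are not documented in this README, so the JSON encoding, the `load_or_build_index` helper, and the placeholder `build_index` function are all assumptions made for illustration.

```python
import json
import os
import tempfile


def load_or_build_index(idx_path, build):
    """Return (index, reused): load idx_path if it exists, else build and save it."""
    if os.path.exists(idx_path):
        with open(idx_path, "r", encoding="utf-8") as f:
            return json.load(f), True
    index = build()  # the slow path, taken only on the first search
    with open(idx_path, "w", encoding="utf-8") as f:
        json.dump(index, f)
    return index, False


def build_index():
    # Placeholder for the pre-processing step; the real data is unspecified.
    return {"a": 3, "n": 2}


with tempfile.TemporaryDirectory() as tmp:
    idx_path = os.path.join(tmp, "dummy.idx")
    first, reused_first = load_or_build_index(idx_path, build_index)
    second, reused_second = load_or_build_index(idx_path, build_index)
```

The first call builds and saves the index, and the second call reads it back from disk, which mirrors the slower first search and faster later searches described above.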