This project demonstrates the process of creating and preparing a dataset for Large Language Model (LLM) training. It covers extracting text from a variety of sources, cleaning and preprocessing the data, structuring it into JSON and CSV formats, storing it on the Hugging Face Hub and Google Drive, and applying basic evaluation and versioning techniques.
## Table of Contents

- Installation
- Usage
- Data Sources
- Data Preprocessing
- Dataset Structure
- Data Storage
- Dataset Evaluation
- Dataset Versioning
- Contributing
- License
## Installation

Install the required libraries:
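The repository does not pin an exact dependency list, so the command below is an assumption covering the tools referenced in this README (OCR, PDF and DOCX parsing, Firecrawl, tiktoken, Hugging Face, word clouds, DVC); adjust package names and versions to your environment.

```bash
# pytesseract also requires the Tesseract OCR engine to be installed on the system.
pip install pytesseract pillow pdfplumber python-docx firecrawl-py \
    tiktoken datasets huggingface_hub pandas wordcloud matplotlib "dvc[gdrive]"
```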
## Usage

- Data Collection: Extract text from scanned documents (OCR), PDFs, DOCX files, and websites.
- Data Preprocessing: Clean and preprocess the extracted text data.
- Dataset Structuring: Convert the data to JSON and CSV formats.
- Data Storage: Store the dataset on Hugging Face and Google Drive.
- Dataset Evaluation: Perform basic evaluation checks on the dataset.
- Dataset Versioning: Use DVC for dataset versioning.
## Data Sources

- Scanned documents (images)
- PDF documents
- DOCX (MS Word) documents
- Web scraping using Firecrawl
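A minimal extraction sketch for the first three source types is shown below; the file paths and function names are illustrative, and the Firecrawl step is only indicated in a comment because its client interface varies between versions.

```python
import pdfplumber          # PDF text extraction
import pytesseract         # OCR for scanned images (requires the Tesseract binary)
from PIL import Image
from docx import Document  # python-docx for MS Word files

def extract_from_image(path: str) -> str:
    """Run OCR on a scanned document image."""
    return pytesseract.image_to_string(Image.open(path))

def extract_from_pdf(path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def extract_from_docx(path: str) -> str:
    """Join the paragraphs of a DOCX file."""
    return "\n".join(p.text for p in Document(path).paragraphs)

# Web pages are scraped with Firecrawl; the client call is roughly
# FirecrawlApp(api_key=...).scrape_url(url), but check the installed
# firecrawl-py version for the exact signature.
```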
## Data Preprocessing

- Cleaning the text using regular expressions.
- Sentence splitting and tokenization using `tiktoken`.
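As a rough illustration, the cleaning and tokenization steps could look like the sketch below; the regular expressions are examples rather than the project's exact rules.

```python
import re
import tiktoken

def clean_text(text: str) -> str:
    """Apply simple regex-based cleanup (illustrative rules only)."""
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    text = re.sub(r"[^\w\s.,;:!?-]", "", text)  # drop unusual symbols
    return text.strip()

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitting on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

encoder = tiktoken.get_encoding("cl100k_base")  # example BPE encoding

cleaned = clean_text("Raw   text,   straight from OCR!!  Needs cleanup.")
for sentence in split_sentences(cleaned):
    print(sentence, "->", len(encoder.encode(sentence)), "tokens")
```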
## Dataset Structure

- JSON Format: Stores the text and source of each data point.
- CSV Format: Stores the data in a tabular format.
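A small sketch of how records might be serialized to both formats; the field names `text` and `source` follow the description above, and the file names are placeholders.

```python
import csv
import json

# Each data point keeps the extracted text and where it came from.
records = [
    {"text": "First cleaned passage ...", "source": "report.pdf"},
    {"text": "Second cleaned passage ...", "source": "https://example.com"},
]

# JSON: one object per data point.
with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# CSV: the same records in tabular form.
with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "source"])
    writer.writeheader()
    writer.writerows(records)
```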
## Data Storage

- Hugging Face: The dataset is uploaded to the Hugging Face Hub.
- Google Drive: The dataset is saved to Google Drive.
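Uploading might look like the following sketch; the repository id, file paths, and the Colab-style Drive mount are assumptions about the environment rather than fixed parts of the project.

```python
from datasets import Dataset

# Push the CSV to the Hugging Face Hub (requires a logged-in HF token).
dataset = Dataset.from_csv("dataset.csv")
dataset.push_to_hub("your-username/llm-training-dataset")  # placeholder repo id

# In Google Colab, the Drive copy is typically made by mounting the drive first:
# from google.colab import drive
# drive.mount("/content/drive")
# dataset.to_csv("/content/drive/MyDrive/llm-training-dataset/dataset.csv")
```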
## Dataset Evaluation

- Basic Statistics: Word counts and frequency analysis.
- Data Bias Detection: Using word clouds for visualization.
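The checks below illustrate one way to compute these statistics with `collections.Counter` and the `wordcloud` package; the column name `text` and the CSV path match the structure sketched earlier and are assumptions.

```python
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv("dataset.csv")
all_text = " ".join(df["text"].astype(str))

# Basic statistics: total and per-record word counts, most frequent words.
words = all_text.lower().split()
print("total words:", len(words))
print("mean words per record:", df["text"].str.split().str.len().mean())
print("top 10 words:", Counter(words).most_common(10))

# Bias inspection: a word cloud gives a quick view of over-represented terms.
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```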
## Dataset Versioning

- DVC (Data Version Control): Used to track and manage changes to the dataset.
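A typical DVC workflow for a dataset like this is sketched below; the remote name and Google Drive folder id are placeholders.

```bash
# Initialise DVC in the existing git repository and start tracking the dataset file.
dvc init
dvc add dataset.csv
git add dataset.csv.dvc .gitignore
git commit -m "Track dataset with DVC"

# Optional: push the data to a remote, e.g. a Google Drive folder.
dvc remote add -d gdrive_remote gdrive://<folder-id>
dvc push
```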
## Contributing

Contributions are welcome! Please follow the standard GitHub workflow for submitting pull requests.
## License

This project is licensed under the MIT License.