This project demonstrates the process of creating and preparing a dataset for Large Language Model (LLM) training. It covers extracting text from a variety of sources, cleaning and preprocessing the data, structuring it into JSON and CSV formats, storing it on the Hugging Face Hub and Google Drive, and applying basic evaluation and versioning techniques.
## Table of Contents

- Installation
- Usage
- Data Sources
- Data Preprocessing
- Dataset Structure
- Data Storage
- Dataset Evaluation
- Dataset Versioning
- Contributing
- License
## Installation

Install the required libraries:
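The repository does not pin an exact dependency list, so the command below is an assumption covering the tools referenced in this README (OCR, PDF and DOCX parsing, Firecrawl, tiktoken, Hugging Face, word clouds, DVC); adjust package names and versions to your environment.

```bash
# pytesseract also requires the Tesseract OCR engine to be installed on the system.
pip install pytesseract pillow pdfplumber python-docx firecrawl-py \
    tiktoken datasets huggingface_hub pandas wordcloud matplotlib "dvc[gdrive]"
```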
## Usage

- Data Collection: Extract text from scanned documents (OCR), PDFs, DOCX files, and websites.
- Data Preprocessing: Clean and preprocess the extracted text data.
- Dataset Structuring: Convert the data to JSON and CSV formats.
- Data Storage: Store the dataset on Hugging Face and Google Drive.
- Dataset Evaluation: Perform basic evaluation checks on the dataset.
- Dataset Versioning: Use DVC for dataset versioning.
## Data Sources

- Scanned documents (images)
- PDF documents
- DOCX (MS Word) documents
- Web scraping using Firecrawl
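A minimal extraction sketch for the first three source types is shown below; the file paths and function names are illustrative, and the Firecrawl step is only indicated in a comment because its client interface varies between versions.

```python
import pdfplumber          # PDF text extraction
import pytesseract         # OCR for scanned images (requires the Tesseract binary)
from PIL import Image
from docx import Document  # python-docx for MS Word files

def extract_from_image(path: str) -> str:
    """Run OCR on a scanned document image."""
    return pytesseract.image_to_string(Image.open(path))

def extract_from_pdf(path: str) -> str:
    """Concatenate the text of every page in a PDF."""
    with pdfplumber.open(path) as pdf:
        return "\n".join(page.extract_text() or "" for page in pdf.pages)

def extract_from_docx(path: str) -> str:
    """Join the paragraphs of a DOCX file."""
    return "\n".join(p.text for p in Document(path).paragraphs)

# Web pages are scraped with Firecrawl; the client call is roughly
# FirecrawlApp(api_key=...).scrape_url(url), but check the installed
# firecrawl-py version for the exact signature.
```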
## Data Preprocessing

- Cleaning the text using regular expressions.
- Sentence splitting and tokenization using `tiktoken`.
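As a rough illustration, the cleaning and tokenization steps could look like the sketch below; the regular expressions are examples rather than the project's exact rules.

```python
import re
import tiktoken

def clean_text(text: str) -> str:
    """Apply simple regex-based cleanup (illustrative rules only)."""
    text = re.sub(r"\s+", " ", text)          # collapse runs of whitespace
    text = re.sub(r"[^\w\s.,;:!?-]", "", text)  # drop unusual symbols
    return text.strip()

def split_sentences(text: str) -> list[str]:
    """Naive sentence splitting on terminal punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

encoder = tiktoken.get_encoding("cl100k_base")  # example BPE encoding

cleaned = clean_text("Raw   text,   straight from OCR!!  Needs cleanup.")
for sentence in split_sentences(cleaned):
    print(sentence, "->", len(encoder.encode(sentence)), "tokens")
```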
## Dataset Structure

- JSON Format: Stores the text and source of each data point.
- CSV Format: Stores the data in a tabular format.
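A small sketch of how records might be serialized to both formats; the field names `text` and `source` follow the description above, and the file names are placeholders.

```python
import csv
import json

# Each data point keeps the extracted text and where it came from.
records = [
    {"text": "First cleaned passage ...", "source": "report.pdf"},
    {"text": "Second cleaned passage ...", "source": "https://example.com"},
]

# JSON: one object per data point.
with open("dataset.json", "w", encoding="utf-8") as f:
    json.dump(records, f, ensure_ascii=False, indent=2)

# CSV: the same records in tabular form.
with open("dataset.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["text", "source"])
    writer.writeheader()
    writer.writerows(records)
```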
## Data Storage

- Hugging Face: The dataset is uploaded to the Hugging Face Hub.
- Google Drive: The dataset is saved to Google Drive.
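Uploading might look like the following sketch; the repository id, file paths, and the Colab-style Drive mount are assumptions about the environment rather than fixed parts of the project.

```python
from datasets import Dataset

# Push the CSV to the Hugging Face Hub (requires a logged-in HF token).
dataset = Dataset.from_csv("dataset.csv")
dataset.push_to_hub("your-username/llm-training-dataset")  # placeholder repo id

# In Google Colab, the Drive copy is typically made by mounting the drive first:
# from google.colab import drive
# drive.mount("/content/drive")
# dataset.to_csv("/content/drive/MyDrive/llm-training-dataset/dataset.csv")
```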
## Dataset Evaluation

- Basic Statistics: Word counts and frequency analysis.
- Data Bias Detection: Using word clouds for visualization.
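The checks below illustrate one way to compute these statistics with `collections.Counter` and the `wordcloud` package; the column name `text` and the CSV path match the structure sketched earlier and are assumptions.

```python
from collections import Counter

import matplotlib.pyplot as plt
import pandas as pd
from wordcloud import WordCloud

df = pd.read_csv("dataset.csv")
all_text = " ".join(df["text"].astype(str))

# Basic statistics: total and per-record word counts, most frequent words.
words = all_text.lower().split()
print("total words:", len(words))
print("mean words per record:", df["text"].str.split().str.len().mean())
print("top 10 words:", Counter(words).most_common(10))

# Bias inspection: a word cloud gives a quick view of over-represented terms.
cloud = WordCloud(width=800, height=400, background_color="white").generate(all_text)
plt.imshow(cloud, interpolation="bilinear")
plt.axis("off")
plt.show()
```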
## Dataset Versioning

- DVC (Data Version Control): Used to track and manage changes to the dataset.
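A typical DVC workflow for a dataset like this is sketched below; the remote name and Google Drive folder id are placeholders.

```bash
# Initialise DVC in the existing git repository and start tracking the dataset file.
dvc init
dvc add dataset.csv
git add dataset.csv.dvc .gitignore
git commit -m "Track dataset with DVC"

# Optional: push the data to a remote, e.g. a Google Drive folder.
dvc remote add -d gdrive_remote gdrive://<folder-id>
dvc push
```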
## Contributing

Contributions are welcome! Please follow the standard GitHub workflow for submitting pull requests.
## License

This project is licensed under the MIT License.