
Transform YouTube videos into language learning materials by extracting transcripts and adapting them to different proficiency levels


roy989898/YouTube-Transcript-Processor


🎥 YouTube Transcript Processor

Transform YouTube videos into language learning materials by extracting transcripts and adapting them to different proficiency levels


✨ Features

  • 📝 Automatic Transcript Extraction - Fetch transcripts from YouTube videos
  • 🌏 Multi-language Support - Specialized for Cantonese (粵語) transcripts
  • 🧠 AI-Powered Processing - Transform content to match different language proficiency levels
  • ✂️ Smart Text Chunking - Intelligently split content based on token limits
  • 📊 Token Counting - Precise token management using tiktoken
  • 💾 File Output - Save processed results to text files
  • 🎧 Podcast Integration - Output text files can be uploaded to ElevenReader to turn YouTube videos into English podcasts

🚀 Quick Start

Basic Usage

from main import process_youtube

# Process a YouTube video
video_url = "https://www.youtube.com/watch?v=YOUR_VIDEO_ID"
results = process_youtube(video_url, level="b1", max_tokens=4000)

# Results are automatically saved to text files

Command Line Usage

python main.py

📋 How It Works

  1. 🔗 URL Parsing - Extracts video ID from YouTube URLs
  2. 📜 Transcript Retrieval - Fetches Cantonese transcripts using the YouTube Transcript API
  3. ✂️ Smart Chunking - Splits text into manageable chunks while preserving sentence integrity
  4. 🤖 AI Processing - Sends each chunk to an AI model for language level adaptation
  5. 💾 File Export - Saves processed content to organized text files
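As an illustration of step 1, here is a minimal sketch of pulling a video ID out of common YouTube URL shapes using only the standard library. The function name `extract_video_id` and the set of URL forms it accepts are assumptions for this example; the parsing in `main.py` may differ.

```python
from typing import Optional
from urllib.parse import parse_qs, urlparse


def extract_video_id(url: str) -> Optional[str]:
    """Return the video ID from a YouTube URL, or None if none is found.

    Hypothetical helper: handles watch, youtu.be, embed and shorts URLs.
    """
    parsed = urlparse(url)
    host = (parsed.hostname or "").lower()

    # Short links look like https://youtu.be/<id>
    if host == "youtu.be":
        video_id = parsed.path.lstrip("/").split("/")[0]
        return video_id or None

    if host in {"youtube.com", "www.youtube.com", "m.youtube.com"}:
        # Standard links carry the ID in the "v" query parameter
        if parsed.path == "/watch":
            values = parse_qs(parsed.query).get("v")
            return values[0] if values else None
        # Embed and Shorts links carry the ID in the path
        for prefix in ("/embed/", "/shorts/"):
            if parsed.path.startswith(prefix):
                video_id = parsed.path[len(prefix):].split("/")[0]
                return video_id or None

    return None
```

Returning `None` for unrecognized URLs lets the caller decide whether to stop or report an error before any transcript request is made.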

🛠️ Configuration

Language Levels

  • a1 - Beginner
  • a2 - Elementary
  • b1 - Intermediate
  • b2 - Upper Intermediate
  • c1 - Advanced
  • c2 - Proficient
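The level codes above can be kept in a small lookup table so that a mistyped level fails early. This is only a sketch: `LEVELS` and `normalize_level` are hypothetical names, and the project itself may validate the `level` argument differently or not at all.

```python
# Level codes and labels as listed in this README
LEVELS = {
    "a1": "Beginner",
    "a2": "Elementary",
    "b1": "Intermediate",
    "b2": "Upper Intermediate",
    "c1": "Advanced",
    "c2": "Proficient",
}


def normalize_level(level: str) -> str:
    """Return the lowercase level code, or raise ValueError if unknown."""
    key = level.strip().lower()
    if key not in LEVELS:
        valid = ", ".join(sorted(LEVELS))
        raise ValueError(f"Unknown level {level!r}; expected one of: {valid}")
    return key
```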

Token Limits

Default: 4000 tokens per chunk

  • Adjustable based on your AI model's context window
  • Automatically handles sentences that exceed token limits
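The following sketch shows one way sentence-preserving chunking under a token limit could work. It is not the project's implementation: `split_sentences`, `pack`, and `chunk_text` are hypothetical names, the default counter simply counts whitespace-separated words, and the fallback for oversized sentences (splitting on whitespace) is an assumption. A tiktoken-based counter can be passed in as `count_tokens`.

```python
import re
from typing import Callable, Iterable, List


def split_sentences(text: str) -> List[str]:
    """Split on Chinese and Western sentence-ending punctuation."""
    parts = re.findall(r"[^。！？!?.]+[。！？!?.]*", text)
    return [p.strip() for p in parts if p.strip()]


def pack(units: Iterable[str], max_tokens: int,
         count_tokens: Callable[[str], int]) -> List[str]:
    """Greedily join units (space-separated) while staying within max_tokens."""
    chunks: List[str] = []
    current = ""
    for unit in units:
        candidate = f"{current} {unit}".strip()
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(current)
            current = unit
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


def chunk_text(text: str, max_tokens: int = 4000,
               count_tokens: Callable[[str], int] = lambda s: len(s.split())
               ) -> List[str]:
    """Split text into chunks, keeping sentences whole where possible."""
    units: List[str] = []
    for sentence in split_sentences(text):
        if count_tokens(sentence) > max_tokens:
            # Assumed fallback: break an oversized sentence on whitespace
            units.extend(pack(sentence.split(), max_tokens, count_tokens))
        else:
            units.append(sentence)
    return pack(units, max_tokens, count_tokens)


# To count real tokens instead of words (encoding name is an assumption):
#   import tiktoken
#   enc = tiktoken.get_encoding("cl100k_base")
#   chunks = chunk_text(text, 4000, lambda s: len(enc.encode(s)))
```

Note that the whitespace fallback does not help for text without spaces, such as unsegmented Chinese; a character-level split would be needed there.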

📁 Project Structure

youtube-transcript-processor/
├── main.py             # Main processing script
├── robot.py            # AI model interface
├── text/               # Output directory for processed files
├── requirements.txt    # Python dependencies
└── README.md           # This file
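To show how processed chunks might end up in the `text/` directory, here is a minimal sketch. The helper name `save_chunks` and the `<video_id>_<level>.txt` file naming are assumptions for illustration; the actual naming used by `main.py` is not documented here.

```python
from pathlib import Path
from typing import List


def save_chunks(chunks: List[str], video_id: str, level: str,
                out_dir: str = "text") -> Path:
    """Write all chunks to one UTF-8 text file and return its path.

    Hypothetical helper; the file name pattern is an assumption.
    """
    directory = Path(out_dir)
    directory.mkdir(parents=True, exist_ok=True)  # create text/ if missing
    path = directory / f"{video_id}_{level}.txt"
    # Separate chunks with a blank line so they stay readable
    path.write_text("\n\n".join(chunks), encoding="utf-8")
    return path
```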

🔧 API Reference

process_youtube(link, level, max_tokens=4000, is_chinese=True)

Parameters:

  • link (str): YouTube video URL
  • level (str): Target language proficiency level
  • max_tokens (int): Maximum tokens per chunk
  • is_chinese (bool): Enable Chinese text processing

Returns:

  • List of processed text chunks

get_youtube_transcript(video_url)

Parameters:

  • video_url (str): YouTube video URL

Returns:

  • Full transcript text (str), or None if an error occurs
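The "text or None" contract can be sketched as below. To keep the example independent of any particular version of the transcript library, the fetching step is passed in as a callable. `transcript_text` and `fetch_segments` are hypothetical names, and the segment shape (an iterable of mappings with a `"text"` key) is an assumption.

```python
from typing import Callable, Iterable, Mapping, Optional


def transcript_text(
    video_id: str,
    fetch_segments: Callable[[str], Iterable[Mapping[str, str]]],
) -> Optional[str]:
    """Join transcript segments into one string; return None on any error."""
    try:
        segments = fetch_segments(video_id)
        # Assumed segment shape: each item exposes its caption under "text"
        pieces = [seg["text"].strip() for seg in segments]
        return " ".join(p for p in pieces if p)
    except Exception as exc:
        # Covers missing transcripts, network failures, malformed segments
        print(f"Could not fetch transcript for {video_id}: {exc}")
        return None
```

In real use, `fetch_segments` would wrap the transcript library call and request Cantonese captions; the exact call and language codes depend on the installed library version.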

🎯 Use Cases

  • Language Learning - Adapt YouTube content to your proficiency level
  • Content Creation - Generate educational materials from videos
  • Research - Process video content for analysis
  • Accessibility - Create readable transcripts from video content
  • 🎧 Podcast Creation - Use with ElevenReader to transform YouTube videos into English podcasts for on-the-go learning

🔍 Example Output

Input: Complex Cantonese YouTube video
Output: Simplified text adapted to B1 level with proper sentence structure and vocabulary

Original: 今日我哋要講嘅係一個好複雜嘅概念...
Processed: Today what we're going to talk about is a very complex concept...

🤝 Contributing

  1. Fork the repository
  2. Create your feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add some AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

📞 Support

If you encounter any issues or have questions:

  • Open an issue on GitHub
  • Check the Wiki for detailed documentation
  • Join our Discussions for community support

🔄 Integration with ElevenReader

Transform your processed transcripts into engaging audio content:

  1. Process YouTube Video - Extract and adapt transcript using this tool
  2. Export Text File - Save the processed content to a text file
  3. Upload to ElevenReader - Visit ElevenReader.io and upload your text file
  4. Generate Podcast - Convert your adapted transcript into an English podcast
  5. Listen & Learn - Enjoy your personalized audio content on any device

Perfect Workflow:

YouTube Video → Transcript Extraction → AI Processing → Text File → ElevenReader → English Podcast

This integration allows you to:

  • Turn a YouTube video that has a transcript into an English learning transcript
  • Create audio content at your desired proficiency level with ElevenReader
  • Learn through multiple modalities (reading + listening)

Made with ❤️ for language learners worldwide
