A high-performance parallel web scraper for extracting healthcare professional data from MedPages.info.
- 🚀 Parallel Processing: Leverages all CPU cores for maximum performance
- 📊 Comprehensive Coverage: Scrapes 100+ medical service categories
- 🔄 Automated Updates: GitHub Actions workflow for scheduled scraping
- 💾 JSON Output: Clean, structured data in JSON format
- 📦 CI/CD Ready: Automatic artifact upload and repository commits
- Python 3.10+
- Multi-core processor (automatically detected and utilized)
- Internet connection
# Clone the repository
git clone https://github.com/adgsenpai/ExtractingMedPages.git
cd ExtractingMedPages
# Install dependencies
pip install -r requirements.txt

# Run the scraper
python scrapeAll.py

The script will:
- Auto-detect your CPU cores
- Spawn the optimal number of worker threads (2x CPU cores, capped at 16)
- Scrape all services in parallel
- Save results to medpages_all_data.json (see the driver sketch below)
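The parallel driver has roughly the following shape. This is a minimal sketch, not the exact code in scrapeAll.py; the run_all helper name is illustrative, while the scrape_listing signature matches the one shown under the configuration notes further down.

```python
# Minimal sketch of the parallel driver; see scrapeAll.py for the real implementation.
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(services, scrape_listing):
    """services is a list of (service_name, service_code) pairs."""
    cpu_count = os.cpu_count() or 1
    max_workers = min(cpu_count * 2, 16)  # 2x CPU cores, capped at 16
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(scrape_listing, name, code, len(services)): name
            for name, code in services
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    with open("medpages_all_data.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results
```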
- Sequential: ~150 seconds for 100 services
- Parallel (8 cores): ~20-30 seconds for 100 services
- Speed improvement: 5-7x faster
The scraper reads from two CSV files (loaded roughly as sketched below):
- medpages_full_links.csv - General medical practitioners
- medpages_mental_health.csv - Mental health professionals
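A minimal loading sketch, assuming both files are combined into one service list and use the Name and Full URL columns shown in the CSV format section further down:

```python
# Sketch: combine both input CSVs into one list of (name, url) pairs.
import csv

def load_services(paths=("medpages_full_links.csv", "medpages_mental_health.csv")):
    services = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                services.append((row["Name"], row["Full URL"]))
    return services
```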
{
"Service Name": [
{
"name": "Dr. John Doe",
"title": "Clinical Psychologist",
"location": "Cape Town, Western Cape",
"description": "Specializes in...",
"profile_url": "https://www.medpages.info/sf/...",
"latitude": -33.9249,
"longitude": 18.4241
}
]
}

- Manual: Via workflow_dispatch
- Scheduled: Every Sunday at 00:00 UTC
- Automatic: On push to main (if scraper files change)
- ✅ Runs the scraper on Ubuntu (latest)
- 📊 Generates summary statistics
- 📤 Uploads JSON as artifact (90-day retention)
- 🔄 Commits and pushes data back to repository
- Go to the Actions tab in GitHub
- Select the Scrape MedPages Data workflow
- Click Run workflow
- Download the artifact from the workflow run (it can be loaded as sketched below)
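The downloaded file follows the JSON structure shown above, so it can be inspected with a few lines of Python. This sketch assumes the default medpages_all_data.json filename:

```python
# Sketch: inspect the downloaded artifact.
import json

with open("medpages_all_data.json", encoding="utf-8") as f:
    data = json.load(f)

for service, professionals in sorted(data.items()):
    print(f"{service}: {len(professionals)} professionals")
```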
The scraper provides detailed statistics (roughly as sketched after this list):
- Total services scraped
- Total professionals found
- Execution time
- Average time per service
- CPU cores utilized
- Worker threads spawned
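One way such a summary could be assembled from the results dictionary. The helper name, labels, and exact wording below are illustrative, not the actual output of scrapeAll.py:

```python
# Sketch: print run statistics after scraping (labels are illustrative).
import time

def print_summary(results, start_time, cpu_count, max_workers):
    elapsed = time.time() - start_time
    total_services = len(results)
    total_professionals = sum(len(v) for v in results.values())
    print(f"Services scraped:      {total_services}")
    print(f"Professionals found:   {total_professionals}")
    print(f"Execution time:        {elapsed:.1f}s")
    print(f"Avg time per service:  {elapsed / max(total_services, 1):.2f}s")
    print(f"CPU cores utilized:    {cpu_count}")
    print(f"Worker threads:        {max_workers}")
```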
Edit scrapeAll.py:
max_workers = min(cpu_count * 2, 16)  # Adjust multiplier or max

def scrape_listing(service_name, service_code, total_services, delay=0.5):  # Adjust the default delay

Increase the delay if you encounter rate limiting.
Your CSV files should have these columns:
Name,Full URL
Service Name,https://www.medpages.info/sf/index.php?page=listing&servicecode=123...
The scraper automatically extracts the service code from URLs.
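Extraction can be as simple as parsing the servicecode query parameter out of each URL. This is only a sketch; the real script may use a regex or a different helper name:

```python
# Sketch: pull the servicecode parameter out of a listing URL.
from urllib.parse import urlparse, parse_qs

def extract_service_code(url):
    params = parse_qs(urlparse(url).query)
    return params.get("servicecode", [None])[0]

# extract_service_code("https://www.medpages.info/sf/index.php?page=listing&servicecode=123")
# -> "123"
```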
- requests: HTTP client
- BeautifulSoup4: HTML parsing (see the fetch-and-parse sketch below)
- ThreadPoolExecutor: Parallel processing
- GitHub Actions: CI/CD automation
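A hedged sketch of how a listing page might be fetched and parsed with requests and BeautifulSoup4. The CSS selector and field handling below are assumptions; the real markup on MedPages.info determines what scrapeAll.py actually selects:

```python
# Sketch only: fetch one listing page and pull out text nodes.
# The selector below is a placeholder, not MedPages.info's real markup.
import requests
from bs4 import BeautifulSoup

def fetch_listing(url, timeout=30):
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    entries = []
    for card in soup.select("div.listing-entry"):  # placeholder selector
        entries.append({
            "name": card.get_text(strip=True),
            "profile_url": (card.find("a") or {}).get("href"),
        })
    return entries
```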
- Increase the number of workers for faster scraping (if the server allows it)
- Reduce the delay between requests (use responsibly)
- Use an SSD for faster I/O
- Run on a dedicated server for uninterrupted execution
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is for educational purposes. Please respect MedPages.info's terms of service and rate limits.
This scraper is intended for research and data analysis purposes. Always:
- Respect robots.txt
- Use reasonable rate limiting
- Comply with website terms of service
- Don't overload servers
- Check internet connection
- Verify service codes in CSV files
- Increase the timeout in requests.get(..., timeout=30)
- Increase delay between requests
- Reduce number of workers
- Add exponential backoff (see the sketch after this list)
- Process services in batches
- Write to file incrementally
- Reduce max_workers
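One way to combine a longer timeout, a per-request delay, and exponential backoff, covering several of the items above. This is a sketch; the retry count and delays are assumptions, not values from scrapeAll.py:

```python
# Sketch: retry a request with exponential backoff and a generous timeout.
import time
import requests

def get_with_backoff(url, retries=4, base_delay=0.5, timeout=30):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()  # raises on 4xx/5xx, including 429 rate limits
            return response
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
```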
For issues or questions, please open an issue on GitHub.
Made with ❤️ by ADGSENPAI