A high-performance parallel web scraper for extracting healthcare professional data from MedPages.info.
- 🚀 Parallel Processing: Leverages all CPU cores for maximum performance
- 📊 Comprehensive Coverage: Scrapes 100+ medical service categories
- 🔄 Automated Updates: GitHub Actions workflow for scheduled scraping
- 💾 JSON Output: Clean, structured data in JSON format
- 📦 CI/CD Ready: Automatic artifact upload and repository commits
- Python 3.10+
- Multi-core processor (automatically detected and utilized)
- Internet connection
# Clone the repository
git clone https://github.com/adgsenpai/ExtractingMedPages.git
cd ExtractingMedPages
# Install dependencies
pip install -r requirements.txt

# Run the scraper
python scrapeAll.py

The script will:
- Auto-detect your CPU cores
- Spawn the optimal number of worker threads (2x CPU cores, capped at 16)
- Scrape all services in parallel
- Save results to medpages_all_data.json (see the driver sketch below)
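The parallel driver has roughly the following shape. This is a minimal sketch, not the exact code in scrapeAll.py; the run_all helper name is illustrative, while the scrape_listing signature matches the one shown under the configuration notes further down.

```python
# Minimal sketch of the parallel driver; see scrapeAll.py for the real implementation.
import json
import os
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(services, scrape_listing):
    """services is a list of (service_name, service_code) pairs."""
    cpu_count = os.cpu_count() or 1
    max_workers = min(cpu_count * 2, 16)  # 2x CPU cores, capped at 16
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            pool.submit(scrape_listing, name, code, len(services)): name
            for name, code in services
        }
        for future in as_completed(futures):
            results[futures[future]] = future.result()
    with open("medpages_all_data.json", "w", encoding="utf-8") as f:
        json.dump(results, f, ensure_ascii=False, indent=2)
    return results
```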
- Sequential: ~150 seconds for 100 services
- Parallel (8 cores): ~20-30 seconds for 100 services
- Speed improvement: 5-7x faster
The scraper reads from two CSV files (loaded roughly as sketched below):
- medpages_full_links.csv - General medical practitioners
- medpages_mental_health.csv - Mental health professionals
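A minimal loading sketch, assuming both files are combined into one service list and use the Name and Full URL columns shown in the CSV format section further down:

```python
# Sketch: combine both input CSVs into one list of (name, url) pairs.
import csv

def load_services(paths=("medpages_full_links.csv", "medpages_mental_health.csv")):
    services = []
    for path in paths:
        with open(path, newline="", encoding="utf-8") as f:
            for row in csv.DictReader(f):
                services.append((row["Name"], row["Full URL"]))
    return services
```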
{
"Service Name": [
{
"name": "Dr. John Doe",
"title": "Clinical Psychologist",
"location": "Cape Town, Western Cape",
"description": "Specializes in...",
"profile_url": "https://www.medpages.info/sf/...",
"latitude": -33.9249,
"longitude": 18.4241
}
]
}

- Manual: Via workflow_dispatch
- Scheduled: Every Sunday at 00:00 UTC
- Automatic: On push to main (if scraper files change)
- ✅ Runs the scraper on Ubuntu (latest)
- 📊 Generates summary statistics
- 📤 Uploads JSON as artifact (90-day retention)
- 🔄 Commits and pushes data back to repository
- Go to the Actions tab in GitHub
- Select the Scrape MedPages Data workflow
- Click Run workflow
- Download the artifact from the workflow run (it can be loaded as sketched below)
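The downloaded file follows the JSON structure shown above, so it can be inspected with a few lines of Python. This sketch assumes the default medpages_all_data.json filename:

```python
# Sketch: inspect the downloaded artifact.
import json

with open("medpages_all_data.json", encoding="utf-8") as f:
    data = json.load(f)

for service, professionals in sorted(data.items()):
    print(f"{service}: {len(professionals)} professionals")
```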
The scraper provides detailed statistics (roughly as sketched after this list):
- Total services scraped
- Total professionals found
- Execution time
- Average time per service
- CPU cores utilized
- Worker threads spawned
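One way such a summary could be assembled from the results dictionary. The helper name, labels, and exact wording below are illustrative, not the actual output of scrapeAll.py:

```python
# Sketch: print run statistics after scraping (labels are illustrative).
import time

def print_summary(results, start_time, cpu_count, max_workers):
    elapsed = time.time() - start_time
    total_services = len(results)
    total_professionals = sum(len(v) for v in results.values())
    print(f"Services scraped:      {total_services}")
    print(f"Professionals found:   {total_professionals}")
    print(f"Execution time:        {elapsed:.1f}s")
    print(f"Avg time per service:  {elapsed / max(total_services, 1):.2f}s")
    print(f"CPU cores utilized:    {cpu_count}")
    print(f"Worker threads:        {max_workers}")
```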
Edit scrapeAll.py:
max_workers = min(cpu_count * 2, 16)  # Adjust multiplier or max

def scrape_listing(service_name, service_code, total_services, delay=0.5):  # Adjust the default delay

Increase the delay if you encounter rate limiting.
Your CSV files should have these columns:
Name,Full URL
Service Name,https://www.medpages.info/sf/index.php?page=listing&servicecode=123...
The scraper automatically extracts the service code from URLs.
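Extraction can be as simple as parsing the servicecode query parameter out of each URL. This is only a sketch; the real script may use a regex or a different helper name:

```python
# Sketch: pull the servicecode parameter out of a listing URL.
from urllib.parse import urlparse, parse_qs

def extract_service_code(url):
    params = parse_qs(urlparse(url).query)
    return params.get("servicecode", [None])[0]

# extract_service_code("https://www.medpages.info/sf/index.php?page=listing&servicecode=123")
# -> "123"
```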
- requests: HTTP client
- BeautifulSoup4: HTML parsing (see the fetch-and-parse sketch below)
- ThreadPoolExecutor: Parallel processing
- GitHub Actions: CI/CD automation
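A hedged sketch of how a listing page might be fetched and parsed with requests and BeautifulSoup4. The CSS selector and field handling below are assumptions; the real markup on MedPages.info determines what scrapeAll.py actually selects:

```python
# Sketch only: fetch one listing page and pull out text nodes.
# The selector below is a placeholder, not MedPages.info's real markup.
import requests
from bs4 import BeautifulSoup

def fetch_listing(url, timeout=30):
    response = requests.get(url, timeout=timeout)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")
    entries = []
    for card in soup.select("div.listing-entry"):  # placeholder selector
        entries.append({
            "name": card.get_text(strip=True),
            "profile_url": (card.find("a") or {}).get("href"),
        })
    return entries
```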
- Increase the number of workers for faster scraping (if the server allows it)
- Reduce the delay between requests (use responsibly)
- Use an SSD for faster I/O
- Run on a dedicated server for uninterrupted execution
- Fork the repository
- Create a feature branch
- Make your changes
- Submit a pull request
This project is for educational purposes. Please respect MedPages.info's terms of service and rate limits.
This scraper is intended for research and data analysis purposes. Always:
- Respect robots.txt
- Use reasonable rate limiting
- Comply with website terms of service
- Don't overload servers
- Check internet connection
- Verify service codes in CSV files
- Increase the timeout in requests.get(..., timeout=30)
- Increase delay between requests
- Reduce number of workers
- Add exponential backoff (see the sketch after this list)
- Process services in batches
- Write to file incrementally
- Reduce max_workers
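One way to combine a longer timeout, a per-request delay, and exponential backoff, covering several of the items above. This is a sketch; the retry count and delays are assumptions, not values from scrapeAll.py:

```python
# Sketch: retry a request with exponential backoff and a generous timeout.
import time
import requests

def get_with_backoff(url, retries=4, base_delay=0.5, timeout=30):
    for attempt in range(retries):
        try:
            response = requests.get(url, timeout=timeout)
            response.raise_for_status()  # raises on 4xx/5xx, including 429 rate limits
            return response
        except requests.exceptions.RequestException:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(base_delay * (2 ** attempt))  # 0.5s, 1s, 2s, 4s, ...
```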
For issues or questions, please open an issue on GitHub.
Made with ❤️ by ADGSENPAI