By Sarah Marion
A powerful Python tool for extracting comprehensive metadata from websites. This script goes beyond basic metadata to extract emails, phone numbers, social media links, named entities, and more.
- Features
- Installation
- Usage
- Project Structure
- Output Format
- Contributing
- Ethical Considerations
- License
- Acknowledgments
- Support
## Features

- Comprehensive Metadata Extraction: Title, meta tags, descriptions, etc.
- Contact Information: Email addresses and phone numbers
- Named Entity Recognition: People, locations, and organizations using spaCy
- Social Media Links: Detection of links to popular social platforms
- Image & Link Extraction: All images and hyperlinks on the page
- Robust Error Handling: Retry mechanism with exponential backoff
- Respectful Crawling: Configurable delays between requests
- Logging: Comprehensive logging to both file and console
- Configurable: Multiple options via command line arguments
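The retry mechanism with exponential backoff mentioned above can be pictured with a minimal sketch. This is an illustration of the technique, not the package's actual implementation; the `fetch_with_retries` function and the `flaky` fetch callable are hypothetical:

```python
import time

def fetch_with_retries(fetch, url, max_retries=2, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate the last error
            # back off: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * (2 ** attempt))

# Example: a fetch function that fails twice, then succeeds
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return f"ok: {url}"

result = fetch_with_retries(flaky, "https://example.com", max_retries=3, base_delay=0.01)
```

With this shape, transient failures are absorbed silently up to the retry budget, and only the final failure surfaces to the caller.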
## Installation

Install from PyPI:

```shell
pip install metadata-scraper
```

Or install from source:

- Clone the repository:

  ```shell
  git clone https://github.com/Sarah-Marion/metadata-scraper.git
  cd metadata-scraper
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  python -m spacy download en_core_web_sm
  ```

## Usage

- Create a file named `urls.txt` with one URL per line
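  For example, `urls.txt` might contain (the URLs here are illustrative):

  ```
  https://example.com
  https://www.wikipedia.org
  ```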
- Run the script:
  ```shell
  python -m metadata_scraper.scraper
  ```

Command-line options:

```
usage: scraper.py [-h] [-i INPUT] [-o OUTPUT] [-t TIMEOUT] [-r RETRIES] [-d DELAY] [-v]

Advanced web metadata extraction tool

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file containing URLs (one per line) (default: urls.txt)
  -o OUTPUT, --output OUTPUT
                        Output JSON file (default: metadata_output.json)
  -t TIMEOUT, --timeout TIMEOUT
                        Request timeout in seconds (default: 15)
  -r RETRIES, --retries RETRIES
                        Number of retries for failed requests (default: 2)
  -d DELAY, --delay DELAY
                        Delay between requests in seconds (default: 1.0)
  -v, --verbose         Enable verbose output (default: False)
```

You can also use the scraper programmatically:

```python
from metadata_scraper import MetadataScraper

scraper = MetadataScraper(timeout=15, max_retries=2, delay=1.0)
metadata = scraper.extract_metadata("https://example.com")
print(metadata)
```

## Project Structure

```
metadata-scraper/
├── metadata_scraper/
│   ├── __init__.py
│   └── scraper.py
├── examples/
│   ├── urls.txt
│   └── sample_output.json
├── tests/
│   └── __init__.py
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
└── .gitignore
```
## Output Format

```json
{
  "https://example.com": {
    "url": "https://example.com",
    "title": "Example Domain",
    "meta_tags": {
      "description": "Example domain description",
      "viewport": "width=device-width, initial-scale=1.0"
    },
    "emails": ["contact@example.com"],
    "phones": ["+1-555-123-4567"],
    "names": ["John Doe"],
    "locations": ["New York", "USA"],
    "organizations": ["Example Inc."],
    "social_links": {
      "twitter": ["https://twitter.com/example"],
      "github": ["https://github.com/example"]
    },
    "images": ["https://example.com/image.jpg"],
    "links": ["https://example.com/about"],
    "status_code": 200,
    "timestamp": 1634567890.123456,
    "error": null
  }
}
```

## Contributing

I welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Please make sure to update tests as appropriate.
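New tests can follow the usual pytest shape. As a sketch only: the `extract_emails` helper below is a hypothetical stand-in for the scraper's email extraction, not part of the package's public API, and the regex is a simplified illustration:

```python
import re

def extract_emails(text):
    """Toy stand-in for the scraper's email extraction logic."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

def test_extract_emails_finds_contact_address():
    html = "<p>Reach us at contact@example.com or call +1-555-123-4567.</p>"
    assert extract_emails(html) == ["contact@example.com"]

test_extract_emails_finds_contact_address()
```

Keeping extraction helpers pure (plain text in, list out) like this makes them easy to unit-test without any network access.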
## Ethical Considerations

- Use this tool responsibly and only on websites you have permission to scrape
- Respect robots.txt files and website terms of service
- Implement delays between requests to avoid overwhelming servers
- Consider caching results to avoid repeated requests to the same sites
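The last two points can be combined in a small wrapper. A minimal sketch, assuming you supply your own `fetch` callable; the `PoliteFetcher` class and its parameters are illustrative, not part of this package:

```python
import time

class PoliteFetcher:
    """Cache results per URL and wait `delay` seconds between real fetches."""

    def __init__(self, fetch, delay=1.0):
        self.fetch = fetch
        self.delay = delay
        self.cache = {}
        self._last_fetch = 0.0

    def get(self, url):
        if url in self.cache:          # cache hit: no request, no delay
            return self.cache[url]
        wait = self.delay - (time.monotonic() - self._last_fetch)
        if wait > 0:
            time.sleep(wait)           # throttle consecutive real fetches
        result = self.fetch(url)
        self._last_fetch = time.monotonic()
        self.cache[url] = result
        return result

# Example: the second request for the same URL never triggers a fetch
calls = []
fetcher = PoliteFetcher(lambda url: calls.append(url) or f"<html for {url}>", delay=0.01)
first = fetcher.get("https://example.com")
second = fetcher.get("https://example.com")
```

Injecting the `fetch` callable keeps the throttling and caching logic independent of any particular HTTP library.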
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Built with Requests for HTTP requests
- Uses BeautifulSoup for HTML parsing
- Leverages spaCy for named entity recognition
- Inspired by various web scraping tools and techniques
## Support

If you have any questions or issues, please open an issue on GitHub or contact Sarah Marion.