
# 🦊 FoxScrape - Advanced Metadata Scraper

By Sarah Marion



A powerful Python tool for extracting comprehensive metadata from websites. This script goes beyond basic metadata to extract emails, phone numbers, social media links, named entities, and more.

## 📋 Table of Contents

- Features
- Installation
- Usage
- Project Structure
- Output Format
- Contributing
- Ethical Considerations
- License
- Support
- Acknowledgments

## Features

- **Comprehensive Metadata Extraction**: Title, meta tags, descriptions, etc.
- **Contact Information**: Email addresses and phone numbers
- **Named Entity Recognition**: People, locations, and organizations using spaCy
- **Social Media Links**: Detection of links to popular social platforms
- **Image & Link Extraction**: All images and hyperlinks on the page
- **Robust Error Handling**: Retry mechanism with exponential backoff
- **Respectful Crawling**: Configurable delays between requests
- **Logging**: Comprehensive logging to both file and console
- **Configurable**: Multiple options via command line arguments
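The contact-information extraction can be approximated with regular expressions. The sketch below is illustrative only — the patterns here are my own and may differ from what the package actually uses:

```python
import re

# Hypothetical patterns, not the package's actual regexes.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def extract_contacts(text):
    """Return de-duplicated emails and phone-like strings found in text."""
    return {
        "emails": sorted(set(EMAIL_RE.findall(text))),
        "phones": [p.strip() for p in PHONE_RE.findall(text)],
    }
```

Real-world phone matching is considerably messier than one regex; a production tool would typically normalize and validate candidates as well.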

## Installation

### Using pip

```bash
pip install metadata-scraper
```

### From source

1. Clone the repository:

   ```bash
   git clone https://github.com/Sarah-Marion/metadata-scraper.git
   cd metadata-scraper
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   python -m spacy download en_core_web_sm
   ```

## Usage

### Basic Usage

1. Create a file named `urls.txt` with one URL per line
2. Run the script:

   ```bash
   python -m metadata_scraper.scraper
   ```
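The input file is plain text with one URL per line. A small helper like the following (hypothetical — not part of the package) shows how such a file might be read, skipping blank lines and `#` comments:

```python
from pathlib import Path

def load_urls(path="urls.txt"):
    """Return the non-empty, non-comment lines of path, one URL each."""
    lines = Path(path).read_text().splitlines()
    return [l.strip() for l in lines if l.strip() and not l.strip().startswith("#")]
```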

### Command Line Options

```text
usage: scraper.py [-h] [-i INPUT] [-o OUTPUT] [-t TIMEOUT] [-r RETRIES] [-d DELAY] [-v]

Advanced web metadata extraction tool

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file containing URLs (one per line) (default: urls.txt)
  -o OUTPUT, --output OUTPUT
                        Output JSON file (default: metadata_output.json)
  -t TIMEOUT, --timeout TIMEOUT
                        Request timeout in seconds (default: 15)
  -r RETRIES, --retries RETRIES
                        Number of retries for failed requests (default: 2)
  -d DELAY, --delay DELAY
                        Delay between requests in seconds (default: 1.0)
  -v, --verbose         Enable verbose output (default: False)
```
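The retry behaviour behind `--retries` and `--delay` can be sketched as exponential backoff. This is a hedged illustration of the general technique, not the package's actual implementation:

```python
import random
import time

def fetch_with_retries(fetch, max_retries=2, base_delay=1.0):
    """Call fetch(); on failure wait base_delay * 2**attempt (plus jitter), then retry."""
    for attempt in range(max_retries + 1):
        try:
            return fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: surface the last error
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, 0.1))
```

With the defaults above, a failing request is retried twice, waiting roughly 1 s and then 2 s between attempts.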

### As a Python Module

```python
from metadata_scraper import MetadataScraper

scraper = MetadataScraper(timeout=15, max_retries=2, delay=1.0)
metadata = scraper.extract_metadata("https://example.com")
print(metadata)
```

## 📁 Project Structure

```text
metadata-scraper/
├── metadata_scraper/
│   ├── __init__.py
│   └── scraper.py
├── examples/
│   ├── urls.txt
│   └── sample_output.json
├── tests/
│   └── __init__.py
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
└── .gitignore
```

## 📊 Output Format

```json
{
  "https://example.com": {
    "url": "https://example.com",
    "title": "Example Domain",
    "meta_tags": {
      "description": "Example domain description",
      "viewport": "width=device-width, initial-scale=1.0"
    },
    "emails": ["contact@example.com"],
    "phones": ["+1-555-123-4567"],
    "names": ["John Doe"],
    "locations": ["New York", "USA"],
    "organizations": ["Example Inc."],
    "social_links": {
      "twitter": ["https://twitter.com/example"],
      "github": ["https://github.com/example"]
    },
    "images": ["https://example.com/image.jpg"],
    "links": ["https://example.com/about"],
    "status_code": 200,
    "timestamp": 1634567890.123456,
    "error": null
  }
}
```
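Once the output JSON is written, it is straightforward to post-process. This sketch filters out failed fetches and collects the unique email addresses, using the field names shown in the sample above (a trimmed record stands in for the real file):

```python
import json

# Trimmed sample in the documented output shape.
sample = """
{
  "https://example.com": {
    "emails": ["contact@example.com"],
    "status_code": 200,
    "error": null
  }
}
"""

data = json.loads(sample)
# Keep only records whose "error" field is null.
ok = {url: rec for url, rec in data.items() if rec.get("error") is None}
emails = sorted({e for rec in ok.values() for e in rec.get("emails", [])})
print(emails)
```

In practice you would replace the embedded sample with `json.load(open("metadata_output.json"))`.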

## 🤝 Contributing

I welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.

1. Fork the project
2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
4. Push to the branch (`git push origin feature/AmazingFeature`)
5. Open a Pull Request

Please make sure to update tests as appropriate.

## ⚠️ Ethical Considerations

- Use this tool responsibly and only on websites you have permission to scrape
- Respect `robots.txt` files and website terms of service
- Implement delays between requests to avoid overwhelming servers
- Consider caching results to avoid repeated requests to the same sites
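Respecting `robots.txt` can be automated with the standard library's `urllib.robotparser`. This sketch (not part of the package) checks paths against a hypothetical `robots.txt`; in practice you would fetch the file from the target site first:

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt body; normally fetched from https://<site>/robots.txt.
robots_txt = """\
User-agent: *
Disallow: /private
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

print(rp.can_fetch("*", "https://example.com/public"))     # allowed
print(rp.can_fetch("*", "https://example.com/private/x"))  # disallowed
```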

## 📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

## 📞 Support

If you have any questions or issues, please open an issue on GitHub or contact Sarah Marion.

## 🙏 Acknowledgments

- Built with Requests for HTTP requests
- Uses BeautifulSoup for HTML parsing
- Leverages spaCy for named entity recognition
- Inspired by various web scraping tools and techniques
