By Sarah Marion
A powerful Python tool for extracting comprehensive metadata from websites. This script goes beyond basic metadata to extract emails, phone numbers, social media links, named entities, and more.
- Features
- Installation
- Usage
- Project Structure
- Output Format
- Contributing
- Ethical Considerations
- License
- Acknowledgments
- Support
## Features

- Comprehensive Metadata Extraction: Title, meta tags, descriptions, etc.
- Contact Information: Email addresses and phone numbers
- Named Entity Recognition: People, locations, and organizations using spaCy
- Social Media Links: Detection of links to popular social platforms
- Image & Link Extraction: All images and hyperlinks on the page
- Robust Error Handling: Retry mechanism with exponential backoff
- Respectful Crawling: Configurable delays between requests
- Logging: Comprehensive logging to both file and console
- Configurable: Multiple options via command line arguments
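The retry mechanism with exponential backoff mentioned above can be pictured with a minimal sketch. This is an illustration of the technique, not the package's actual implementation; the `fetch_with_retries` function and the `flaky` fetch callable are hypothetical:

```python
import time

def fetch_with_retries(fetch, url, max_retries=2, base_delay=1.0):
    """Call fetch(url), retrying with exponential backoff on failure."""
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            if attempt == max_retries:
                raise  # out of retries: propagate the last error
            # back off: base_delay, 2*base_delay, 4*base_delay, ...
            time.sleep(base_delay * (2 ** attempt))

# Example: a fetch function that fails twice, then succeeds
calls = {"n": 0}
def flaky(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("transient failure")
    return f"ok: {url}"

result = fetch_with_retries(flaky, "https://example.com", max_retries=3, base_delay=0.01)
```

With this shape, transient failures are absorbed silently up to the retry budget, and only the final failure surfaces to the caller.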
## Installation

Install from PyPI:

```shell
pip install metadata-scraper
```

Or install from source:

- Clone the repository:

  ```shell
  git clone https://github.com/Sarah-Marion/metadata-scraper.git
  cd metadata-scraper
  ```

- Install dependencies:

  ```shell
  pip install -r requirements.txt
  python -m spacy download en_core_web_sm
  ```

## Usage

- Create a file named `urls.txt` with one URL per line
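  For example, `urls.txt` might contain (the URLs here are illustrative):

  ```
  https://example.com
  https://www.wikipedia.org
  ```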
- Run the script:
  ```shell
  python -m metadata_scraper.scraper
  ```

Command-line options:

```
usage: scraper.py [-h] [-i INPUT] [-o OUTPUT] [-t TIMEOUT] [-r RETRIES] [-d DELAY] [-v]

Advanced web metadata extraction tool

options:
  -h, --help            show this help message and exit
  -i INPUT, --input INPUT
                        Input file containing URLs (one per line) (default: urls.txt)
  -o OUTPUT, --output OUTPUT
                        Output JSON file (default: metadata_output.json)
  -t TIMEOUT, --timeout TIMEOUT
                        Request timeout in seconds (default: 15)
  -r RETRIES, --retries RETRIES
                        Number of retries for failed requests (default: 2)
  -d DELAY, --delay DELAY
                        Delay between requests in seconds (default: 1.0)
  -v, --verbose         Enable verbose output (default: False)
```

You can also use the scraper programmatically:

```python
from metadata_scraper import MetadataScraper

scraper = MetadataScraper(timeout=15, max_retries=2, delay=1.0)
metadata = scraper.extract_metadata("https://example.com")
print(metadata)
```

## Project Structure

```
metadata-scraper/
├── metadata_scraper/
│   ├── __init__.py
│   └── scraper.py
├── examples/
│   ├── urls.txt
│   └── sample_output.json
├── tests/
│   └── __init__.py
├── README.md
├── requirements.txt
├── setup.py
├── LICENSE
└── .gitignore
```
## Output Format

```json
{
  "https://example.com": {
    "url": "https://example.com",
    "title": "Example Domain",
    "meta_tags": {
      "description": "Example domain description",
      "viewport": "width=device-width, initial-scale=1.0"
    },
    "emails": ["contact@example.com"],
    "phones": ["+1-555-123-4567"],
    "names": ["John Doe"],
    "locations": ["New York", "USA"],
    "organizations": ["Example Inc."],
    "social_links": {
      "twitter": ["https://twitter.com/example"],
      "github": ["https://github.com/example"]
    },
    "images": ["https://example.com/image.jpg"],
    "links": ["https://example.com/about"],
    "status_code": 200,
    "timestamp": 1634567890.123456,
    "error": null
  }
}
```

## Contributing

I welcome contributions! Please feel free to submit a Pull Request. For major changes, please open an issue first to discuss what you would like to change.
- Fork the project
- Create your feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add some AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
Please make sure to update tests as appropriate.
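New tests can follow the usual pytest shape. As a sketch only: the `extract_emails` helper below is a hypothetical stand-in for the scraper's email extraction, not part of the package's public API, and the regex is a simplified illustration:

```python
import re

def extract_emails(text):
    """Toy stand-in for the scraper's email extraction logic."""
    return re.findall(r"[\w.+-]+@[\w-]+\.[\w.-]+", text)

def test_extract_emails_finds_contact_address():
    html = "<p>Reach us at contact@example.com or call +1-555-123-4567.</p>"
    assert extract_emails(html) == ["contact@example.com"]

test_extract_emails_finds_contact_address()
```

Keeping extraction helpers pure (plain text in, list out) like this makes them easy to unit-test without any network access.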
## Ethical Considerations

- Use this tool responsibly and only on websites you have permission to scrape
- Respect robots.txt files and website terms of service
- Implement delays between requests to avoid overwhelming servers
- Consider caching results to avoid repeated requests to the same sites
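The last two points can be combined in a small wrapper. A minimal sketch, assuming you supply your own `fetch` callable; the `PoliteFetcher` class and its parameters are illustrative, not part of this package:

```python
import time

class PoliteFetcher:
    """Cache results per URL and wait `delay` seconds between real fetches."""

    def __init__(self, fetch, delay=1.0):
        self.fetch = fetch
        self.delay = delay
        self.cache = {}
        self._last_fetch = 0.0

    def get(self, url):
        if url in self.cache:          # cache hit: no request, no delay
            return self.cache[url]
        wait = self.delay - (time.monotonic() - self._last_fetch)
        if wait > 0:
            time.sleep(wait)           # throttle consecutive real fetches
        result = self.fetch(url)
        self._last_fetch = time.monotonic()
        self.cache[url] = result
        return result

# Example: the second request for the same URL never triggers a fetch
calls = []
fetcher = PoliteFetcher(lambda url: calls.append(url) or f"<html for {url}>", delay=0.01)
first = fetcher.get("https://example.com")
second = fetcher.get("https://example.com")
```

Injecting the `fetch` callable keeps the throttling and caching logic independent of any particular HTTP library.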
## License

This project is licensed under the MIT License - see the LICENSE file for details.
## 🙏 Acknowledgments
- Built with Requests for HTTP requests
- Uses BeautifulSoup for HTML parsing
- Leverages spaCy for named entity recognition
- Inspired by various web scraping tools and techniques
## Support

If you have any questions or issues, please open an issue on GitHub or contact Sarah Marion.