AIWebScraper is a Python-based tool that uses Google Gemini AI and DrissionPage to extract structured JSON data from any website. Unlike traditional scrapers, it uses LLMs to understand content, allowing it to bypass anti-bot protections and parse dynamic HTML automatically.
## 🚀 Overview
This project solves the "Universal Scraper" problem by combining two technologies:
- DrissionPage: A browser automation tool used here to handle dynamic JavaScript rendering, study anti-bot behaviors, and deal with dynamic-loading patterns such as lazy loading.
- Google Gemini 1.5: An AI model used to semantically understand HTML and normalize it into structured JSON (schema enforcement), eliminating the need for hardcoded CSS selectors. A minimal sketch of this pipeline follows below.
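The sketch below shows one way these two pieces can be wired together; it is illustrative only, not the project's actual module layout, and the model name, prompt, and `scrape` function are assumptions:

```python
import json
import os

import google.generativeai as genai
from DrissionPage import ChromiumPage

# Read the key from the environment (see the .env setup under Installation).
genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-1.5-flash")

def scrape(url: str) -> dict:
    # 1) Render the page in a real Chromium instance so JavaScript executes.
    page = ChromiumPage()
    try:
        page.get(url)
        html = page.html
    finally:
        page.quit()

    # 2) Ask the LLM to normalize the raw HTML into structured JSON.
    prompt = (
        "Extract the product Title, Price, and Image URL from the HTML below. "
        "Reply with a single JSON object and nothing else.\n\n" + html[:100_000]
    )
    response = model.generate_content(prompt)
    # Real code should strip markdown fences if the model adds them.
    return json.loads(response.text)
```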
## 🛠️ Features
- `stealth_mode`: Implements browser fingerprint randomization for research.
- `smart_wait`: Uses DOM-based waiting strategies instead of fixed timers.
- `ai_parsing`: Extracts `Title`, `Price`, and `Image` automatically, without regex (see the schema sketch after this list).
- Privacy-First: API keys are managed via environment variables and never logged.
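One way the fixed output shape behind `ai_parsing` can be guaranteed is Gemini 1.5's JSON mode in the google-generativeai SDK. A minimal sketch; the `Product` schema class and prompt are illustrative, not the project's actual code:

```python
import typing_extensions as typing

import google.generativeai as genai

# The exact JSON shape we want back ("schema enforcement").
class Product(typing.TypedDict):
    Title: str
    Price: str
    Image: str

model = genai.GenerativeModel("gemini-1.5-flash")
response = model.generate_content(
    "Extract the product from this HTML: <html>...</html>",
    generation_config=genai.GenerationConfig(
        response_mime_type="application/json",  # force valid JSON output
        response_schema=Product,                # constrain it to this shape
    ),
)
print(response.text)  # {"Title": "...", "Price": "...", "Image": "..."}
```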
## ⚙️ Installation

1. Clone the repository:

   ```bash
   git clone https://github.com/Chaitya44/AIWebscrapper.git
   cd AIWebscrapper
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Configuration: Create a `.env` file in the root directory. Do not share this file.

   ```env
   # Required for AI processing
   GEMINI_API_KEY=your_google_api_key_here

   # Optional: Local password for the UI
   APP_PASSWORD=admin
   ```

4. Run the application:

   ```bash
   streamlit run app.py
   ```
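For context, a common pattern for reading the `.env` file at startup is python-dotenv. A minimal sketch, assuming that library is available; the actual app may load its configuration differently:

```python
import os

from dotenv import load_dotenv  # pip install python-dotenv

# Pull GEMINI_API_KEY / APP_PASSWORD from .env into the process environment.
load_dotenv()

api_key = os.getenv("GEMINI_API_KEY")
if not api_key:
    raise RuntimeError("GEMINI_API_KEY is not set; check your .env file.")
```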
## ⚠️ READ BEFORE USING
- Educational Purpose Only: This repository is strictly for educational and research purposes. It is designed to demonstrate how LLMs can parse HTML structures. It is not intended for large-scale data harvesting, commercial scraping, or copyright infringement.
- No Liability: The developer (Chaitya44) assumes no liability for how this tool is used. Any legal consequences arising from the use of this tool are the sole responsibility of the user.
- Respect Terms of Service: Users are strictly advised to review and adhere to the Terms of Service (ToS) and `robots.txt` of any website they interact with (see the sketch after this list).
- No Commercial Use: The code and any data derived from it must not be used for commercial purposes, resale, or competing products.
- Rate Limiting: This tool includes artificial delays (`time.sleep`) to prevent server overload. Users should not modify these safety mechanisms to aggressively target servers (DDoS-like behavior).
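To make the last two points concrete, the standard library can check `robots.txt` before any request, and a fixed delay keeps the crawl polite. A minimal sketch; the user-agent string and delay value are illustrative assumptions, not the project's actual settings:

```python
import time
import urllib.robotparser
from urllib.parse import urlsplit

USER_AGENT = "AIWebScraper-Edu"  # hypothetical identifier for this example

def allowed_by_robots(url: str) -> bool:
    # Fetch the site's robots.txt and ask whether this URL may be crawled.
    parts = urlsplit(url)
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"{parts.scheme}://{parts.netloc}/robots.txt")
    rp.read()
    return rp.can_fetch(USER_AGENT, url)

for url in ["https://example.com/products"]:
    if allowed_by_robots(url):
        ...  # fetch and parse the page here
        time.sleep(5)  # artificial delay between requests; do not remove
```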
## 🛡️ License
This project is licensed under the MIT License - see the LICENSE file for details.
Note: The MIT License grants permission for use, but does not override the Terms of Service of target websites. Use responsibly.
If you are a copyright holder or website administrator and wish to have specific scraping capabilities removed from this educational tool, please open an Issue.