Python_WebScraper

Web Scraper to Generate a Multimodal Dataset

Author: Prashant Mahto

This repository contains a specialized web scraper built with Selenium and the Gemini 2.5 Flash API. Built for research use, the tool navigates websites, captures both the DOM structure and the visual rendering of each page, and generates normalized JSON datasets. The datasets are formatted for fine-tuning Transformer-based vision-language models (such as LayoutLM) on complex UI analysis and element detection.

Features

Headless DOM Parsing: Uses Selenium to interact with web elements and capture metadata (CSS attributes, visibility, click-path depth) across full-page scrolling viewports.
Visual Spatial Recognition: Sends screenshots to the cloud-hosted Gemini 2.5 Flash API for in-the-wild document parsing, capturing text from heavily styled, overlapping, or obfuscated UI elements without requiring a local GPU for OCR.
LayoutLM Normalization: Automatically converts pixel coordinates from raw screenshots into the normalized 0-1000 bounding box scale required by Hugging Face LayoutLM models.
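
The normalization step can be sketched as below. This is a minimal illustration rather than the repository's actual code: the function name normalize_box, the clamping behaviour, and the assumption that input boxes arrive as pixel values in [xmin, ymin, xmax, ymax] order are all hypothetical.

```python
from typing import Sequence


def normalize_box(box_px: Sequence[float], width: int, height: int) -> list[int]:
    """Scale a pixel box [xmin, ymin, xmax, ymax] to the 0-1000 range.

    Hypothetical helper: the name and signature are assumptions made for
    illustration, not taken from the repository.
    """
    if width <= 0 or height <= 0:
        raise ValueError("image width and height must be positive")
    xmin, ymin, xmax, ymax = box_px

    def scale(value: float, extent: int) -> int:
        # Clamp so boxes that spill past the screenshot edge stay in range.
        return max(0, min(1000, int(1000 * value / extent)))

    return [
        scale(xmin, width),
        scale(ymin, height),
        scale(xmax, width),
        scale(ymax, height),
    ]
```

For a 1920x1080 screenshot, normalize_box([192, 108, 960, 540], 1920, 1080) gives [100, 100, 500, 500]. Dividing by the screenshot's own width and height keeps the result independent of the capture resolution.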

Prerequisites

Python 3.9 or higher
Google Chrome and ChromeDriver
Python packages: selenium, google-genai, pillow, python-dotenv
A Gemini API key, stored in the GEMINI_API_KEY environment variable

Getting the Gemini API Key

To use the cloud-based visual extraction, generate a free API key in Google AI Studio.

Navigate to Google AI Studio in your web browser.
Sign in with your Google Account.
In the left-hand navigation panel or the top-right corner, click "Get API key".
Click the "Create API key" button.
Select an existing Google Cloud project or let the studio create a new default project for you.
Copy the generated alphanumeric string. Keep this secure and do not share it.

Environment Setup (.env)

This project uses python-dotenv to securely manage credentials. You must create an environment file to store your API key locally so it is never hardcoded into the repository.

In the root directory of this project (at the same level as the src folder), create a new file named exactly .env.
Open the .env file in your code editor and add the following line, pasting your copied key after the equals sign (do not use quotation marks):
GEMINI_API_KEY=your_copied_api_key
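
Reading the key back at runtime might look like the following sketch. The helper names read_api_key and load_api_key are hypothetical and not taken from the repository; load_dotenv is the standard python-dotenv entry point, imported lazily here so the validation helper can be used on its own.

```python
import os
from typing import Mapping, Optional


def read_api_key(env: Optional[Mapping[str, str]] = None) -> str:
    """Return GEMINI_API_KEY from env, raising a clear error if it is missing.

    Hypothetical helper; the name is an assumption for illustration.
    """
    env = os.environ if env is None else env
    key = env.get("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to the .env file")
    return key


def load_api_key() -> str:
    """Load the .env file, then read the key from the environment."""
    # Imported lazily so read_api_key() stays usable without python-dotenv.
    from dotenv import load_dotenv

    # Copies KEY=value pairs from .env into os.environ; variables that are
    # already set are left untouched by default.
    load_dotenv()
    return read_api_key()
```

Failing early with an explicit message avoids a less obvious authentication error later, when the first request is sent to the API.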


Pipeline Architecture

Ingestion: Selenium opens the target URL and waits for dynamic JavaScript content and pop-ups to finish loading.
Snapshot Capture: A high-resolution, full-page screenshot of the body element is taken and saved locally.
Visual Extraction: The screenshot is passed to the Gemini API, which processes the image to extract tokens and raw bounding boxes natively in a [ymin, xmin, ymax, xmax] format.
Data Transformation: The Python orchestrator parses the JSON response, translates the coordinates to the [xmin, ymin, xmax, ymax] 0-1000 scale, and packages it alongside the Selenium CSS data into a master dataset.
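
The Data Transformation step could be sketched roughly as follows. Everything named here is an assumption for illustration: the function to_layoutlm_records, the response field names "text" and "box_2d", and the output keys "tokens" and "dom" are hypothetical, and the sketch assumes the boxes in the response are already on the 0-1000 scale, so only the axis order changes. Boxes in pixel units would need rescaling first.

```python
import json


def to_layoutlm_records(gemini_json: str, dom_elements: list[dict]) -> dict:
    """Reorder boxes to [xmin, ymin, xmax, ymax] and bundle them with DOM data.

    Hypothetical sketch: field names and output layout are assumptions,
    not the repository's actual schema.
    """
    tokens = []
    for item in json.loads(gemini_json):
        # The vision response lists y before x; LayoutLM expects x before y.
        ymin, xmin, ymax, xmax = item["box_2d"]
        tokens.append({"text": item["text"], "bbox": [xmin, ymin, xmax, ymax]})
    # Keep the Selenium metadata next to the visual tokens in one record.
    return {"tokens": tokens, "dom": dom_elements}
```

Given the response string '[{"text": "Sign in", "box_2d": [40, 820, 90, 960]}]', the token's bbox becomes [820, 40, 960, 90], and the DOM metadata list is passed through unchanged.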

Repository: PrashantMaht0/Python_WebScraper