Author: Prashant Mahto
This repository contains a specialized web scraper built with Selenium and the Gemini 2.5 Flash API. Developed for research, this tool is designed to navigate websites, capture both the DOM structure and visual rendering, and generate normalized JSON datasets. These datasets are specifically formatted to fine-tune Transformer-based Vision-Language Models (such as LayoutLM) for complex UI analysis and element detection.
- Headless DOM Parsing: Uses Selenium to interact with web elements and capture metadata (CSS attributes, visibility, click-path depth) across full-page scrolling viewports.
- Visual Spatial Recognition: Integrates the cloud-based Gemini 2.5 Flash API to perform in-the-wild document parsing, capturing text from heavily styled, overlapping, or obfuscated UI elements without requiring a local GPU for OCR.
- LayoutLM Normalization: Automatically converts pixel coordinates from raw screenshots into the normalized 0-1000 bounding box scale required by Hugging Face LayoutLM models.
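
The normalization step can be sketched as follows. This is a minimal illustration, not the repository's actual code: the function name `normalize_box` and the `(xmin, ymin, xmax, ymax)` pixel input order are assumptions made for the example.

```python
from typing import List, Sequence


def normalize_box(box_px: Sequence[float], image_width: int, image_height: int) -> List[int]:
    """Scale a pixel box (xmin, ymin, xmax, ymax) to the 0-1000 LayoutLM range.

    Hypothetical helper: the name and argument order are assumptions.
    """
    if image_width <= 0 or image_height <= 0:
        raise ValueError("image dimensions must be positive")

    xmin, ymin, xmax, ymax = box_px

    def scale(value: float, size: int) -> int:
        # Divide by the screenshot size, multiply by 1000, clamp to [0, 1000].
        return max(0, min(1000, int(1000 * value / size)))

    return [
        scale(xmin, image_width),
        scale(ymin, image_height),
        scale(xmax, image_width),
        scale(ymax, image_height),
    ]
```

For example, `normalize_box((100, 50, 300, 150), 1000, 500)` yields `[100, 100, 300, 300]`. Clamping is a defensive choice here, so that boxes slightly outside the screenshot still produce valid values.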
- Python 3.9 or higher
- Google Chrome and ChromeDriver
- Python packages: selenium, google-genai, pillow, python-dotenv
- GEMINI_API_KEY
To utilize the cloud vision extraction, you must generate a free API key from Google AI Studio.
- Navigate to Google AI Studio in your web browser.
- Sign in with your Google Account.
- On the left-hand navigation panel or the top-right corner, click on "Get API key".
- Click the "Create API key" button.
- Select an existing Google Cloud project or let the studio create a new default project for you.
- Copy the generated alphanumeric string. Keep this secure and do not share it.
This project uses python-dotenv to securely manage credentials. You must create an environment file to store your API key locally so it is never hardcoded into the repository.
- In the root directory of this project (at the same level as the src folder), create a new file named exactly .env.
- Open the .env file in your code editor and add the following line, pasting your copied key after the equals sign (do not use quotation marks): GEMINI_API_KEY=your_copied_api_key
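
Once the .env file exists, the key could be read at startup roughly as shown here. This is a sketch under assumptions: the helper name `get_gemini_api_key` is hypothetical, and it assumes the script runs from the project root so that `.env` resolves relative to the working directory.

```python
import os

try:
    from dotenv import load_dotenv  # provided by the python-dotenv package
except ImportError:
    # Keeps the sketch importable before the dependencies are installed.
    load_dotenv = None


def get_gemini_api_key(env_path: str = ".env") -> str:
    """Return GEMINI_API_KEY, loading it from env_path when possible.

    Hypothetical helper; the default path assumes the project root
    is the current working directory.
    """
    if load_dotenv is not None:
        # By default this does not overwrite variables already set
        # in the environment, and a missing file is ignored.
        load_dotenv(dotenv_path=env_path)

    key = os.getenv("GEMINI_API_KEY", "").strip()
    if not key:
        raise RuntimeError("GEMINI_API_KEY is not set; add it to the .env file")
    return key
```

Failing early with a clear message avoids a less obvious authentication error later, when the first API request is sent.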
- Ingestion: Selenium opens the target URL, allowing all dynamic JavaScript and pop-ups to fully load.
- Snapshot Capture: A high-resolution, full-page screenshot of the body element is taken and saved locally.
- Visual Extraction: The screenshot is passed to the Gemini API, which processes the image to extract tokens and raw bounding boxes natively in a [ymin, xmin, ymax, xmax] format.
- Data Transformation: The Python orchestrator parses the JSON response, translates the coordinates to the [xmin, ymin, xmax, ymax] 0-1000 scale, and packages it alongside the Selenium CSS data into a master dataset.
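
The final transformation step might look like the sketch below. The response schema is an assumption: it presumes the model was prompted to return a JSON list of objects with a `text` field and a `box_2d` field in [ymin, xmin, ymax, xmax] order, already on a 0-1000 scale (pixel values would need rescaling by the screenshot size first). The function names and the output keys (`url`, `tokens`, `bboxes`, `dom`) are likewise hypothetical.

```python
import json
from typing import Any, Dict, List, Sequence


def to_layoutlm_box(gemini_box: Sequence[float]) -> List[int]:
    """Reorder [ymin, xmin, ymax, xmax] into [xmin, ymin, xmax, ymax]."""
    ymin, xmin, ymax, xmax = gemini_box

    def clamp(value: float) -> int:
        # Round to an integer and keep the result inside 0-1000.
        return max(0, min(1000, int(round(value))))

    return [clamp(xmin), clamp(ymin), clamp(xmax), clamp(ymax)]


def build_dataset(
    response_text: str,
    selenium_records: List[Dict[str, Any]],
    url: str,
) -> Dict[str, Any]:
    """Merge parsed vision output with the Selenium metadata.

    Assumes response_text is a JSON list of {"text", "box_2d"} objects.
    """
    items = json.loads(response_text)
    return {
        "url": url,
        "tokens": [item["text"] for item in items],
        "bboxes": [to_layoutlm_box(item["box_2d"]) for item in items],
        "dom": selenium_records,  # CSS and visibility data captured by Selenium
    }
```

With the response `[{"text": "Login", "box_2d": [10, 20, 50, 200]}]`, the resulting `bboxes` entry is `[[20, 10, 200, 50]]`: the x and y pairs are swapped and the values are otherwise unchanged.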
