Crawler

What is the Crawler?

The Crawler is an automated data collection system that extracts professional profile information from career platforms using headless browser technology, converting HTML content into structured JSON data for analysis.

What has been done so far?

  • Implemented headless browsing with human-like interaction patterns
  • Configured proxy rotation and randomized request delays (2-5 seconds)
  • Developed an HTML-to-JSON conversion pipeline that parses the HTML with the cheerio library (see the sketch after this list)
  • Connected a PostgreSQL database schema for profile storage with parameterized queries
  • Currently using the Google Custom Search JSON API to make specific, filtered Google queries
  • Added "human-like" scroll behavior and page waiting to act like a real user (needs further testing to confirm it is necessary)
  • Docker containerization for database management
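
Below is a minimal sketch of the HTML-to-JSON step and the randomized 2-5 second delay mentioned above. The selectors, field names, and helper names (randomDelay, extractProfile) are illustrative assumptions, not the project's actual parsing logic.

```js
// Illustrative sketch only: selectors and field names are assumptions,
// not the project's actual parser.
const cheerio = require('cheerio');

// Wait a random 2-5 seconds between requests to mimic human pacing.
function randomDelay(minMs = 2000, maxMs = 5000) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Convert a downloaded profile page into structured JSON.
function extractProfile(html) {
  const $ = cheerio.load(html);
  return {
    name: $('h1').first().text().trim(),            // assumed selector
    headline: $('h2').first().text().trim(),        // assumed selector
    location: $('.profile-location').text().trim(), // assumed selector
  };
}

module.exports = { randomDelay, extractProfile };
```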

How to run the Crawler

The Crawler searches platforms, extracts profile data, and stores it in the profiles table.
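
For the storage step, here is a hedged sketch of a parameterized insert into the profiles table; the column names and the saveProfile helper are assumptions, since the actual schema is not documented on this page.

```js
// Illustrative sketch: column names are assumed, not the actual schema.
const { Pool } = require('pg');
require('dotenv').config();

// pg picks up PGHOST, PGPORT, PGUSER, PGPASSWORD and PGDATABASE
// from the environment automatically.
const pool = new Pool();

// Parameterized query keeps scraped content from being interpreted as SQL.
async function saveProfile(profile) {
  await pool.query(
    'INSERT INTO profiles (name, headline, location) VALUES ($1, $2, $3)',
    [profile.name, profile.headline, profile.location]
  );
}

module.exports = { saveProfile };
```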

  1. Navigate to the project root and install dependencies:
  • npm i
  • npm i pg
  • npm i cheerio
  • npm i dotenv
  2. Set up .env with environment variables:
  • Google Custom Search API key and Search Engine ID
  • Database connection configuration

Example .env file below:

  • API_KEY = example_api_key (Custom Search JSON API from Google)
  • SEARCH_ENGINE_ID = example_search_engine_id
  • PGHOST = example-db.cluster-....rds.amazonaws.com
  • PGPORT = 5432
  • PGUSER = example_pguser
  • PGPASSWORD = example_pgpassword
  • PGDATABASE = example_pgdb
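
To show how these variables are consumed, here is a minimal sketch of a Custom Search query that reads API_KEY and SEARCH_ENGINE_ID from the environment; the searchProfiles function name and query handling are assumptions.

```js
// Illustrative sketch: the function name and query handling are assumptions.
require('dotenv').config();

// Query the Google Custom Search JSON API and return result links
// for the crawler to download.
async function searchProfiles(query) {
  const url = new URL('https://www.googleapis.com/customsearch/v1');
  url.searchParams.set('key', process.env.API_KEY);
  url.searchParams.set('cx', process.env.SEARCH_ENGINE_ID);
  url.searchParams.set('q', query);

  const res = await fetch(url); // global fetch is available in Node 18+
  const data = await res.json();
  return (data.items || []).map((item) => item.link);
}

module.exports = { searchProfiles };
```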

Run the crawler:

  • Make sure Docker is installed and running
  • cd back to the project root
  • Run the command:
  docker compose up

Next Semester Goals:

  • Speed up queries and HTML downloads
  • Add more logic to bypass the LinkedIn authwall
  • Create a user-friendly UI accessible to multiple members of NAF
  • Reintegrate academy iteration to query for each academy
