Crawler

What is the Crawler?

The Crawler is an automated data collection system that extracts professional profile information from career platforms using headless browser technology, converting HTML content into structured JSON data for analysis.

What has been done so far?

  • Implemented headless browsing with human-like interaction patterns
  • Configured proxy rotation and randomized request delays (2-5 seconds)
  • Developed an HTML-to-JSON conversion pipeline that parses the HTML with the cheerio library (see the sketch after this list)
  • Connected a PostgreSQL database schema for profile storage with parameterized queries
  • Currently using the Google Custom Search JSON API to make specific, filtered Google queries
  • Added "human-like" scroll behavior and page waiting to act like a real user (needs further testing to confirm it is necessary)
  • Docker containerization for database management
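
Below is a minimal sketch of the HTML-to-JSON step and the randomized 2-5 second delay mentioned above. The selectors, field names, and helper names (randomDelay, extractProfile) are illustrative assumptions, not the project's actual parsing logic.

```js
// Illustrative sketch only: selectors and field names are assumptions,
// not the project's actual parser.
const cheerio = require('cheerio');

// Wait a random 2-5 seconds between requests to mimic human pacing.
function randomDelay(minMs = 2000, maxMs = 5000) {
  const ms = minMs + Math.random() * (maxMs - minMs);
  return new Promise((resolve) => setTimeout(resolve, ms));
}

// Convert a downloaded profile page into structured JSON.
function extractProfile(html) {
  const $ = cheerio.load(html);
  return {
    name: $('h1').first().text().trim(),            // assumed selector
    headline: $('h2').first().text().trim(),        // assumed selector
    location: $('.profile-location').text().trim(), // assumed selector
  };
}

module.exports = { randomDelay, extractProfile };
```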

How to run the Crawler

The Crawler searches platforms, extracts profile data, and stores it in the profiles table.
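
For the storage step, here is a hedged sketch of a parameterized insert into the profiles table; the column names and the saveProfile helper are assumptions, since the actual schema is not documented on this page.

```js
// Illustrative sketch: column names are assumed, not the actual schema.
const { Pool } = require('pg');
require('dotenv').config();

// pg picks up PGHOST, PGPORT, PGUSER, PGPASSWORD and PGDATABASE
// from the environment automatically.
const pool = new Pool();

// Parameterized query keeps scraped content from being interpreted as SQL.
async function saveProfile(profile) {
  await pool.query(
    'INSERT INTO profiles (name, headline, location) VALUES ($1, $2, $3)',
    [profile.name, profile.headline, profile.location]
  );
}

module.exports = { saveProfile };
```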

  1. Navigate to the project root and install dependencies:
  • npm i
  • npm i pg
  • npm i cheerio
  • npm i dotenv
  2. Set up .env with environment variables:
  • Google Custom Search API key and Search Engine ID
  • Database connection configuration

Example .env file below:

  • API_KEY = example_api_key (Custom Search JSON API from Google)
  • SEARCH_ENGINE_ID = example_search_engine_id
  • PGHOST = example-db.cluster-....rds.amazonaws.com
  • PGPORT = 5432
  • PGUSER = example_pguser
  • PGPASSWORD = example_pgpassword
  • PGDATABASE = example_pgdb
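
To show how these variables are consumed, here is a minimal sketch of a Custom Search query that reads API_KEY and SEARCH_ENGINE_ID from the environment; the searchProfiles function name and query handling are assumptions.

```js
// Illustrative sketch: the function name and query handling are assumptions.
require('dotenv').config();

// Query the Google Custom Search JSON API and return result links
// for the crawler to download.
async function searchProfiles(query) {
  const url = new URL('https://www.googleapis.com/customsearch/v1');
  url.searchParams.set('key', process.env.API_KEY);
  url.searchParams.set('cx', process.env.SEARCH_ENGINE_ID);
  url.searchParams.set('q', query);

  const res = await fetch(url); // global fetch is available in Node 18+
  const data = await res.json();
  return (data.items || []).map((item) => item.link);
}

module.exports = { searchProfiles };
```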

Run the crawler:

  • Make sure Docker is installed and running
  • cd back to the project root
  • Run the command:
  docker compose up

Next Semester Goals:

  • Speed up queries and HTML downloads
  • Add more logic to bypass the LinkedIn authwall
  • Create a user-friendly UI accessible to multiple members of NAF
  • Reintegrate academy iteration to query for each academy
