Crawler
The Crawler is an automated data collection system that extracts professional profile information from career platforms using headless browser technology, converting HTML content into structured JSON data for analysis.
- Implemented headless browsing with human-like interaction patterns (sketched below)
- Configured proxy rotation and randomized request delays (2-5 seconds)
- Developed an HTML-to-JSON conversion pipeline that parses the HTML with the cheerio library (sketched below)
- Connected a PostgreSQL database schema for profile storage using parameterized queries (sketched below)
- Currently using the Google Custom Search JSON API to make specific, filtered Google queries (sketched below)
- "Human-like" scroll behavior and page waiting to act like a real user (test further to see if this is needed)
- Docker containerization for database management
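None of the crawler's source is reproduced on this page, so the sketches below are illustrative only. First, the Custom Search step: the endpoint and the `key`, `cx`, and `q` parameters are the documented Custom Search JSON API interface, but the function name and the site-restricted query shown are assumptions.

```js
// Query the Google Custom Search JSON API (key, cx and q are the
// documented parameters; the site filter shown is illustrative).
async function searchProfiles(query) {
  const url = new URL('https://www.googleapis.com/customsearch/v1');
  url.searchParams.set('key', process.env.API_KEY);
  url.searchParams.set('cx', process.env.SEARCH_ENGINE_ID);
  url.searchParams.set('q', query); // e.g. 'site:linkedin.com/in "example"'

  const res = await fetch(url); // Node 18+ global fetch
  if (!res.ok) throw new Error(`Search failed: ${res.status}`);
  const data = await res.json();
  return (data.items ?? []).map((item) => item.link);
}
```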
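The fetch-and-scroll stage might look like the following. Puppeteer is an assumption here (the wiki doesn't name the headless driver), as are the scroll increments; the randomized 2-5 second delay matches the behavior described above.

```js
// Sketch only: assumes Puppeteer as the headless driver (not named in the wiki).
const puppeteer = require('puppeteer');

// Randomized 2-5 second delay, matching the crawler's stated behavior.
const randomDelay = () =>
  new Promise((resolve) => setTimeout(resolve, 2000 + Math.random() * 3000));

async function fetchProfileHtml(url) {
  const browser = await puppeteer.launch({
    headless: true,
    // Proxy rotation could be wired in via launch args, e.g.:
    // args: ['--proxy-server=http://my-rotating-proxy:8080'],
  });
  const page = await browser.newPage();
  await page.goto(url, { waitUntil: 'networkidle2' });
  await randomDelay();

  // "Human-like" scrolling: step down the page in small increments.
  for (let i = 0; i < 5; i++) {
    await page.evaluate(() => window.scrollBy(0, 400));
    await randomDelay();
  }

  const html = await page.content();
  await browser.close();
  return html;
}
```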
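The HTML-to-JSON step with cheerio, which the page does name, could be sketched like this; the selectors and output fields are placeholders, since the real profile schema isn't documented here.

```js
const cheerio = require('cheerio');

// Sketch only: selectors and field names are illustrative, not the real schema.
function htmlToProfileJson(html) {
  const $ = cheerio.load(html);
  return {
    name: $('h1').first().text().trim(),
    headline: $('h2').first().text().trim(),
    // Collect repeated entries (e.g. experience items) into an array.
    sections: $('section h3')
      .map((_, el) => $(el).text().trim())
      .get(),
  };
}
```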
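Storage with pg, again as a sketch: the `profiles` table and its columns are hypothetical, but the `$1`/`$2`/`$3` placeholders show the parameterized style the list above refers to.

```js
const { Pool } = require('pg');

// Connection settings come from the PG* environment variables (see Setup below).
const pool = new Pool();

// Sketch only: the "profiles" table and its columns are hypothetical.
async function saveProfile(profile) {
  await pool.query(
    'INSERT INTO profiles (name, headline, data) VALUES ($1, $2, $3)',
    [profile.name, profile.headline, JSON.stringify(profile)]
  );
}
```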
- Navigate to the project root and install dependencies:
  - `npm i`
  - `npm i pg cheerio dotenv`
- Setup `.env` with environment variables for the Google Custom Search API and the database connection:
  - `API_KEY=example_api_key` (Custom Search JSON API key from Google)
  - `SEARCH_ENGINE_ID=example_search_engine_id`
  - `PGHOST=example-db.cluster-....rds.amazonaws.com`
  - `PGPORT=5432`
  - `PGUSER=example_pguser`
  - `PGPASSWORD=example_pgpassword`
  - `PGDATABASE=example_pgdb`
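With the `.env` in place, one dotenv call loads everything at startup; node-postgres reads the `PG*` variables straight from the environment, so the pool needs no explicit config (a sketch, assuming the project relies on that behavior):

```js
require('dotenv').config(); // loads API_KEY, SEARCH_ENGINE_ID and the PG* variables

const { Pool } = require('pg');
// node-postgres falls back to PGHOST, PGPORT, PGUSER, PGPASSWORD and
// PGDATABASE from the environment, so no connection options are needed here.
const pool = new Pool();
```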
- Make sure Docker is installed and running
- `cd` back into the project root
- Run the command: `docker compose up`

Future improvements:
- Make queries and HTML downloads quicker
- More logic to bypass the LinkedIn authwall
- Create a friendly UI accessible to multiple members of naf
- Reintegrate the academy iteration to query for each academy