
python-hltv-scraper

License: MIT

A simple and open-source HLTV.org web scraper built with AsyncCamoufox and BeautifulSoup, written entirely in Python.


Table of Contents

  • Overview
  • Features
  • Installation
  • Usage
  • Project Structure
  • Configuration
  • Advanced Configuration
  • Contributing
  • Contact


Overview

This project provides a fast, asynchronous, and stealthy web scraper for HLTV.org, designed to overcome Cloudflare protections using Camoufox, asyncio, BeautifulSoup, and pandas. It supports scraping recent matches, teams, players, and detailed match data. This is also my first project ever! I would be honored if you could leave feedback and maybe also a star :)


Features

  • Fully asynchronous scraping for high speed and efficiency
  • Cloudflare bypass using Camoufox with TLS/JA3 fingerprinting
  • Proxy support with automatic rotation (optional)
  • Dynamic User-Agent rotation for stealth
  • Cookie and session management for persistent scraping
  • Modular scripts for scraping matches, teams, and players
  • Data saved in CSV format for easy analysis
  • Detailed logging and progress reporting

Installation

Clone the repository and install dependencies:

git clone https://github.com/ClassicalClemi/python-hltv-scraper.git
cd python-hltv-scraper
pip install -r requirements.txt

Make sure you have Python 3.x installed.

Install the Camoufox browser:

camoufox fetch
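
If the install worked, you can launch a Camoufox browser straight from Python. A minimal sketch, assuming the camoufox package's documented async API (the results URL is just a test target):

import asyncio

from camoufox.async_api import AsyncCamoufox

async def main():
    # Launch a stealth Firefox instance managed by Camoufox.
    async with AsyncCamoufox(headless=True) as browser:
        page = await browser.new_page()
        await page.goto("https://www.hltv.org/results")
        # If Cloudflare lets the request through, this prints the page title.
        print(await page.title())

asyncio.run(main())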

Usage

Each script can be configured via a config dictionary inside the file. Configure input/output files, scraping limits, and other options before running.

Run a script from the command line:

python async_get_team_data.py

Important: Before scraping detailed team or match data, run the corresponding URL scraper scripts (async_get_team_urls.py, async_get_recent_match_urls.py) to collect URLs.

Scraped data is saved by default in the data/ folder as CSV files. You can change paths in the config.
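
Because everything lands in plain CSV, results are easy to pull into pandas (already a dependency). A quick sketch; the filename is an example, substitute whatever your config's savefile_location points to:

import pandas as pd

# Load scraped team data (adjust the path to your savefile_location).
teams = pd.read_csv("data/team_data.csv")
print(teams.head())
print(teams.columns.tolist())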


Project Structure

  • async_get_recent_match_urls.py — Scrapes recent matches from https://www.hltv.org/results

  • async_get_match_data.py — Scrapes in-depth data from the scraped match URLs (a short example of working with this structure follows the list)

    Exact structure:

    match_info = {
        "team_1": team_1,
        "team_2": team_2,
        "score_team_1": score_team_1,
        "score_team_2": score_team_2,
        "winner": winner,
        "date": date,
        "hour": hour,
        "event": event,
        "mode": mode,
        "maps": maps,
    }
    # "maps" holds one entry per played map:
    maps = [
        {"map": "Dust2", "picked_by": "team_1", "winner": "team_1", "score": "16-14"},
        {"map": "Mirage", "picked_by": "team_2", "winner": "team_2", "score": "16-12"},
        {"map": "Inferno", "picked_by": "random", "winner": "team_1", "score": "16-10"},
    ]
    
  • async_get_team_urls.py — Scrapes all team URLs from https://www.hltv.org/ranking/teams

  • async_get_team_data.py — Scrapes in-depth data from scraped team URLs

    Exact structure:

    team_info = {
        # "team_url": url,
        "team_name": team_name,
        "team_region": team_region,
        "world_ranking": world_ranking,
        "valve_ranking": valve_ranking,
        "avg_player_age": average_age,
        "current_winstreak": current_winstreak,
        "winrate": winrate,
        "map_winrates": map_winrates,
        "coach_url": coach_url,
        "player_urls": player_urls,
    }
    map_winrates = {  # only the 6 best maps get scraped
        "Ancient": get_map_winrate(soup, "Ancient"),
        "Anubis": get_map_winrate(soup, "Anubis"),
        "Dust2": get_map_winrate(soup, "Dust2"),
        "Inferno": get_map_winrate(soup, "Inferno"),
        "Mirage": get_map_winrate(soup, "Mirage"),
        "Nuke": get_map_winrate(soup, "Nuke"),
        "Overpass": get_map_winrate(soup, "Overpass"),
        "Train": get_map_winrate(soup, "Train"),
        "Vertigo": get_map_winrate(soup, "Vertigo"),
    }
    
  • async_get_player_data.py — Scrapes in-depth data of every player in every scraped team

    Exact structure:

    player_info = {
        "name": name,
        "country": country,
        "team": team,
        "age": age,
        "overall": overall,
        "opening": opening,
        "round": rounds,
        "weapon": weapon_kills,
        "ct-side": {
            "firepower": ct_firepower,
            "entrying": ct_entrying,
            "trading": ct_trading,
            "opening": ct_opening,
            "clutching": ct_clutching,
            "sniping": ct_sniping,
            "utility": ct_utility,
        },
        "t-side": {
            "firepower": t_firepower,
            "entrying": t_entrying,
            "trading": t_trading,
            "opening": t_opening,
            "clutching": t_clutching,
            "sniping": t_sniping,
            "utility": t_utility,
        },
    }
    
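Because each match row carries a maps list like the one shown for async_get_match_data.py, per-map stats fall out with a few lines of plain Python. A sketch over the documented structure (team names and values here are hypothetical):

from collections import Counter

# A match_info dict as produced by async_get_match_data.py (example values).
match_info = {
    "team_1": "Team A",
    "team_2": "Team B",
    "winner": "Team A",
    "maps": [
        {"map": "Dust2", "picked_by": "team_1", "winner": "team_1", "score": "16-14"},
        {"map": "Mirage", "picked_by": "team_2", "winner": "team_2", "score": "16-12"},
        {"map": "Inferno", "picked_by": "random", "winner": "team_1", "score": "16-10"},
    ],
}

# Count map wins per team slot ("team_1" / "team_2").
map_wins = Counter(m["winner"] for m in match_info["maps"])
print(map_wins)  # Counter({'team_1': 2, 'team_2': 1})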

Configuration

Each script contains a config dictionary with options such as:

  • file_to_read — str — CSV file location to read URLs from

  • savefile_location — str — CSV file location to save scraped data

  • ???_amount — int — Number of items to scrape; the prefix depends on the script (e.g. team_amount)

  • headless — bool — Hide or show the browser(s) while scraping

  • screen — Screen — Min/max screen width/height

  • screen_amount — int — Number of browser windows to show (if headless = False)

  • session_amount — int — Number of parallel sessions (you will get rate limited if you set this too high)

  • session_timeout — int/list — Timeout after each session in seconds; a random range like [0.8, 1.2] is also possible

  • use_proxy — bool — Use proxies

  • use_proxy_once — bool — Each proxy is used by only one session

  • proxy_location — str — TXT file location to read proxies from (format: server:port:username:password, one per line)

  • user_agents_location — str — JSON file location to read user agents from

  • cookie_location — str — JSON file location to get cookies to apply

    Example structure:

    config = {
        "file_to_read": "rework/data/team_urls.csv",
        "savefile_location": "rework/data/team_data.csv",
        "team_amount": 100,  # -1 = all
        "headless": True,
        "screen": Screen(max_width=1920, max_height=1080),
        "screen_amount": 1,
        "session_amount": 5,
        "session_timeout": 1,
        "use_proxy": True,
        "use_proxy_once": True,
        "proxy_location": "rework/data/proxies.txt",
        "user_agents_location": "rework/data/user_agents.json",
        "cookie_location": "rework/data/autologin_cookie.json",
    }
    

All options are also commented inside the scripts for clarity.
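
One option worth spelling out is session_timeout, which accepts either a fixed number or a [min, max] pair. A plausible reading in plain Python (this is my interpretation of the option, not code copied from the scripts):

import asyncio
import random

session_timeout = [0.8, 1.2]  # an int or float works too

async def wait_between_sessions(timeout):
    # A [min, max] list means "sleep a random duration in that range".
    if isinstance(timeout, (list, tuple)):
        await asyncio.sleep(random.uniform(timeout[0], timeout[1]))
    else:
        await asyncio.sleep(timeout)

asyncio.run(wait_between_sessions(session_timeout))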


Advanced Configuration

Proxies

To avoid IP bans and improve anonymity, you can configure proxy support:

  • Proxies (as .txt file): one proxy per line in the format server:port:username:password
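
Playwright (which Camoufox wraps) expects a proxy dict rather than a raw line, so each line needs splitting. A minimal sketch, assuming the server:port:username:password format above; the helper name, default path, and http:// scheme are my assumptions:

def load_proxies(path="data/proxies.txt"):
    # Parse server:port:username:password lines into Playwright-style proxy dicts.
    proxies = []
    with open(path) as f:
        for line in f:
            line = line.strip()
            if not line:
                continue
            server, port, username, password = line.split(":")
            proxies.append({
                "server": f"http://{server}:{port}",  # assumes plain HTTP proxies
                "username": username,
                "password": password,
            })
    return proxies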

User Agents

Dynamic User-Agent rotation helps mimic real browsers:

  • User Agents (as .json file): a JSON list of user-agent entries, read from user_agents_location

I like to get user agents from https://www.useragents.me/#most-common-desktop-useragents-json-csv
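
Rotation itself is one random.choice once the file is loaded. A sketch assuming the JSON is a list of objects with a "ua" key, which matches the useragents.me export; adjust the key or path to your file:

import json
import random

with open("data/user_agents.json") as f:
    user_agents = json.load(f)

# useragents.me entries look like {"ua": "...", "pct": ...}; pick one at random.
random_ua = random.choice(user_agents)["ua"]
print(random_ua)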

Cookies & Sessions

Persistent cookies and session management improve Cloudflare bypass:

  • Cookies incl. autologin (as .json file): a JSON list of cookie objects, read from cookie_location

How to get the autologin cookie:

  • Open HLTV.org
  • Log in
  • Open the developer tools (F12)
  • Click the "Application" tab
  • Under Cookies, expand "https://www.hltv.org"
  • The table lists all your cookies, including "autologin"
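
Once you have the value, the cookie can be applied to a browser context before navigating. A sketch using Playwright's add_cookies, which the Camoufox context exposes; the cookie value is a placeholder you copy from devtools:

import asyncio

from camoufox.async_api import AsyncCamoufox

async def main():
    async with AsyncCamoufox(headless=True) as browser:
        context = await browser.new_context()
        # Apply the autologin cookie before any page loads.
        await context.add_cookies([{
            "name": "autologin",
            "value": "<your-autologin-value>",  # placeholder: copy from devtools
            "domain": ".hltv.org",
            "path": "/",
        }])
        page = await context.new_page()
        await page.goto("https://www.hltv.org")

asyncio.run(main())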

Contributing

Contributions, bug reports, and feature requests are welcome! Please open issues or pull requests on GitHub.


Contact

You can contact me on Discord: clxmi

