vmcrawl is a Mastodon-focused version reporting crawler.
It is written in Python, with a PostgreSQL database backend.
It performs periodic polling of known Mastodon instances to track version information, user counts, and security patch status.
- Python 3.13 or higher
- PostgreSQL database
- UV
For development or testing, you can quickly set up vmcrawl in the current directory:
git clone https://github.com/vmstio/vmcrawl.git
cd vmcrawl
uv sync
./vmcrawl.shFor production deployments, follow these steps to install vmcrawl as a system service:
# Clone application files
git clone https://github.com/vmstio/vmcrawl.git /opt/vmcrawl# Create vmcrawl user and set ownership
useradd -r -s /bin/bash -d /opt/vmcrawl vmcrawl
chown -R vmcrawl:vmcrawl /opt/vmcrawl# Switch to vmcrawl user
sudo -u vmcrawl -i
# Install uv package manager
curl -LsSf https://astral.sh/uv/install.sh | sh
# Exit and log back in to refresh the PATH
exit
sudo -u vmcrawl -i
# Create virtual environment and install dependencies
cd /opt/vmcrawl
uv sync
# Exit vmcrawl user
exitsudo -u vmcrawl vim /opt/vmcrawl/.envAdd your configuration:
VMCRAWL_POSTGRES_DATA="dbname"
VMCRAWL_POSTGRES_USER="username"
VMCRAWL_POSTGRES_PASS="password"
VMCRAWL_POSTGRES_HOST="localhost"
VMCRAWL_POSTGRES_PORT="5432"On your PostgreSQL server, execute the contents of database.sql to create tables and indexes.
# Make the shell script executable
chmod +x /opt/vmcrawl/vmcrawl.sh
# Copy service files to systemd
cp /opt/vmcrawl/vmcrawl.service /etc/systemd/system/
# Reload systemd
systemctl daemon-reload# Enable crawler service to start on boot
systemctl enable vmcrawl.service
# Start the crawler service
systemctl start vmcrawl.service
# Check crawler status
systemctl status vmcrawl.serviceYou can also run vmcrawl using Docker:
docker build -t vmcrawl .
docker run -d --name vmcrawl --env-file .env vmcrawlThe project includes a single main script with multiple subcommands:
| Command | Purpose |
|---|---|
crawler.py |
Default crawl mode - processes domains, collects version/user data, generates statistics |
crawler.py fetch |
Fetches new domains from federated instance peer lists |
crawler.py manage |
Interactive menu for database and version management |
Statistics Generation:
Statistics are automatically generated and recorded by the main crawler (crawler.py) during its crawling operations. Historical statistics tracking is integrated into the crawling workflow, eliminating the need for a separate statistics service.
To start using vmcrawl you will need to populate your database with instances to crawl. You can fetch a list of fediverse instances from an existing Mastodon instance:
./vmcrawl.sh fetchThe first time this is launched it will default to polling vmst.io for instances to crawl.
If you wish to override this you can target a specific instance:
./vmcrawl.sh fetch --target example.socialOnce you have established a set of known good Mastodon instances, you can use them to fetch new federated instances:
./vmcrawl.sh fetchThis will scan the top 10 instances in your database by total users.
You can change the limits or offset the domain list from the top:
./vmcrawl.sh fetch --limit 100 --offset 50You can use limit and offset together, or individually, but neither option can be combined with the target argument.
Unless you specifically target a server, vmcrawl.sh fetch will only attempt to fetch from instances with over 100 active users.
If a server fails to fetch with persistent errors, mastodon_domains.peers is set to false and it will not be polled in future non-targeted fetch runs.
You can also select a random sampling of servers to fetch from, instead of going by user count:
./vmcrawl.sh fetch --randomYou can combine random with the limit command, but not with target or offset.
After you have a list of instances to crawl, run the following command:
./vmcrawl.shSelecting 0 from the interactive menu will begin to process all of your fetched domains.
The crawler provides an interactive curses-based menu with color support and arrow key navigation when run in a terminal (TTY mode). In headless mode, it automatically uses text-based prompts.
You can customize the crawling process with the following options:
Process new domains:
0Uncrawled (recently fetched domains)
Change process direction:
1Standard (alphabetical)2Reverse (reverse alphabetical)3Random (default for headless runs)
Retry any (non-fatal) errors:
4Offline (all domains with errors)5Issues (known Mastodon instances with errors)
Retry fatal errors:
10Other (non-Mastodon platforms)
Retry errors by type:
20DNS (name resolution failures)21SSL (certificate errors)22TCP (connection errors)23Type (content type errors)24Size (response size errors)25API (API errors)26JSON (JSON parsing errors)27HTTP 2xx status codes28HTTP 3xx redirects29HTTP 4xx client errors30HTTP 5xx server errors31Hard Fail (HTTP 410/418/451/999)32Robots (robots.txt prohibited)
Retry known instances:
50Unpatched (instances not running latest patches)51Main (instances on development/main branch)52Active (instances with active monthly users)53All (all known instances)
Retry terminal error states:
Terminal states are domains that have failed repeatedly and been marked as permanently bad. These options retry them:
60DNS (bad_dns flag)61SSL (bad_ssl flag)62TCP (bad_tcp flag)63Type (bad_type flag)64File (bad_file flag)65API (bad_api flag)66JSON (bad_json flag)67HTTP 2xx (bad_http2xx flag)68HTTP 3xx (bad_http3xx flag)69HTTP 4xx (bad_http4xx flag)70HTTP 5xx (bad_http5xx flag)71Hard Fail (bad_hard flag - HTTP 410/418/451/999)72Robots (bad_robot flag - robots.txt prohibited)
Menu Navigation (TTY mode):
- Arrow keys or
j/k- Navigate options - Enter - Select option
qor Esc - Quit
The crawler uses a probe-first discovery flow per domain:
- Check
robots.txt - Probe
/.well-known/nodeinfoon the original domain with suppressed errors - If probe fails, run host-meta and webfinger discovery in parallel
- Retry nodeinfo on discovered backend domain
- Resolve NodeInfo data (direct return or follow NodeInfo 2.0 URL)
- Branch to Mastodon or non-Mastodon processing
host-meta(/.well-known/host-meta) andwebfinger(/.well-known/webfinger) run concurrently.- host-meta result is preferred when both methods succeed.
- if both fail, crawler falls back to original domain and replays the suppressed probe error.
- if fallback occurs without a captured probe-error object, crawler records
TCP+nodeinfo.
- Matrix servers (
m.serverin nodeinfo response) are marked asmatrixand skipped as non-error. - Some servers return full NodeInfo directly at the well-known endpoint; this is accepted without another fetch.
- For Mastodon instances:
- instance API is queried (
/api/v2/instance, fallback/api/v1/instance) to obtain canonical domain. - version and MAU are validated before writing to
mastodon_domains.
- instance API is queried (
- For non-Mastodon instances:
- software name is stored in
raw_domains.nodeinfo. - errors/reason are cleared because platform mismatch is not an error condition.
- software name is stored in
- Requests use an HTTP/2-enabled
httpxclient by default. - On
httpx.RemoteProtocolError(servers that break HTTP/2 sessions), the request is retried once with an HTTP/1.1-only client.
- Same terminal failure type: preserve existing
bad_*state and do not rewriteerrors/reason. - Different terminal failure type: clear previously preserved
bad_*state so status can transition.
By default, when the script is run headless it will do a random crawl of instances in the database.
To limit what is crawled in headless mode, use the following arguments:
--newwill function like option0, and only process new domains recently fetched.
You can target a specific domain to fetch or crawl with the target option:
./vmcrawl.sh --target vmst.ioYou can include multiple domains in a comma-separated list:
./vmcrawl.sh --target mas.to,infosec.exchangeYou can also process multiple domains using an external file, which contains each domain on a new line:
./vmcrawl.sh --file ~/domains.txtTo quickly inspect all known information for one domain (from both raw_domains and mastodon_domains) without running a crawl:
./vmcrawl.sh --db vmst.ioThe crawler provides comprehensive database and version management through an interactive manage menu:
./vmcrawl.sh manage| Option | Category | Action |
|---|---|---|
1 |
DNI | Fetch and import DNI list |
2 |
DNI | List all DNI domains |
3 |
DNI | Count DNI domains |
4 |
DNI | Add DNI domain manually |
5 |
DNI | Remove DNI domain |
6 |
Nightly | List all nightly versions |
7 |
Nightly | Add a new nightly version |
8 |
Nightly | Update nightly version end date |
9 |
Mastodon | Update latest Mastodon versions |
10 |
Mastodon | Show current version info |
11 |
Mastodon | Promote branch to release |
12 |
Mastodon | Mark branch as EOL |
13 |
Mastodon | Reorder release branches |
14 |
TLD Cache | Update TLD cache |
15 |
Search | Search domain details |
Manages the IFTAS DNI (Do Not Interact) list and manual DNI entries for trust and safety controls.
- Option 1: Fetch and import DNI list - Downloads the latest IFTAS DNI CSV source and imports new domains
- Option 2: List all DNI domains - Displays all domains currently in the DNI table with comment/force/timestamp
- Option 3: Count DNI domains - Shows total count of DNI domains
- Option 4: Add DNI domain manually - Prompts for
domain,comment, andforce(hardorsoft) - Option 5: Remove DNI domain - Prompts for a domain and removes it from DNI table
Manual DNI entries are useful for local policy overrides without waiting for the upstream feed.
Manages tracking of development/nightly versions used to identify instances running pre-release software (alpha, beta, rc versions).
Submenu Options:
1. List all nightly versions - Displays all tracked nightly/development versions with their date ranges
2. Add new nightly version - Interactive workflow to add a new pre-release version:
- Enter version string (e.g., 4.9.0-alpha.7)
- Enter start date (when version was released)
- Optionally enter end date (when version was superseded)
- Choose whether to auto-update the previous version's end date
3. Update end date - Modify the end date of an existing nightly version
When a new nightly version is added, the system can automatically set the end date of the previous nightly version to maintain a continuous timeline of development releases.
The crawler provides a comprehensive version management system for tracking Mastodon releases and their lifecycle from main development branch through release to end-of-life.
Version Tracking Overview:
The crawler maintains a unified release_versions table that tracks:
- Main branch: Current development version (n_level = -1)
- Release branches: Active stable releases (n_level = 0, 1, 2, ...)
- EOL branches: End-of-life releases no longer receiving updates
Version tracking is manual-only - all branch lifecycle management is controlled through the manage menu. The crawler automatically updates the latest version within each tracked branch but does not automatically promote or deprecate branches.
GitHub CLI Integration:
The crawler uses the gh CLI tool for GitHub API calls when available, providing:
- Better rate limits for authenticated users
- More reliable access to release data
- Fallback to HTTP API if
ghis not authenticated or available
To authenticate the GitHub CLI:
gh auth loginOption 9: Update Latest Mastodon Versions
Updates the latest version for all tracked branches (main, release, and EOL) by querying the Mastodon GitHub repository. This should be run periodically to keep version data current.
- Fetches up to 500 releases from GitHub to ensure old EOL versions are updated
- Updates only the
latestcolumn for existing branches - Does not create new branches or change branch status
Option 10: Show Current Version Information
Displays a detailed view of all tracked Mastodon versions:
Main Development Version:
Branch: 4.4 Status: main n_level: -1 Latest: 4.4.0+nightly-20250215
Active Release Branches:
Branch: 4.3 Status: release n_level: 0 Latest: 4.3.3
Branch: 4.2 Status: release n_level: 1 Latest: 4.2.15
Branch: 4.1 Status: release n_level: 2 Latest: 4.1.22
End-of-Life Branches:
Branch: 4.0 Status: eol n_level: 3 Latest: 4.0.15
Option 11: Promote Main Branch to Release
Promotes the current main development branch to a release branch and creates a new main branch for the next version.
Workflow:
- Shows current main branch (e.g., 4.4)
- Confirms promotion to release status
- Shifts all existing release branches down (increases n_level by 1)
- Converts main branch to release at n_level = 0
- Creates new main branch with incremented version (e.g., 4.5)
This should be used when a new stable version is released.
Option 12: Mark Branch as End-of-Life
Marks a release branch as EOL when it no longer receives updates.
Workflow:
- Shows list of current release branches
- Select branch to mark as EOL
- Branch status changes to 'eol' and is moved to the end of the tracking list
EOL branches continue to be tracked for statistics but are not considered "supported" releases.
Option 13: Reorder Release Branches
Manually adjusts the ordering (n_level) of release branches. This is useful if you need to reorganize the priority of tracked releases.
Workflow:
- Shows current release branches with current order
- Enter a full new order as a comma-separated branch list (e.g.
4.6,4.5,4.4) - All release branches are re-assigned
n_levelbased on the provided order
- Option 14: Update TLD cache - Fetches the latest IANA TLD list and refreshes the
tld_cachetable.
- Option 15: Search domain details - Prompts for a domain and displays all known columns from both
raw_domainsandmastodon_domains.
Version Lifecycle Workflow:
Typical version management workflow:
- New Development Begins: Main branch tracks next version (e.g., 4.5.0+nightly)
- Regular Updates: Run Option 9 periodically to update latest versions
- New Release: Use Option 11 to promote main to release when stable version ships
- EOL Declaration: Use Option 12 to mark old releases as EOL when support ends
- Monitoring: Use Option 10 to review current version status
The crawler uses HTTP/2 by default for all requests, with automatic fallback to HTTP/1.1 for servers that have protocol compatibility issues. The HTTP client is configured with:
- TLS 1.2+ minimum
- Certificate verification enabled
- Connection pooling for improved performance
- 10MB default response size limit
DNS lookups are cached in-memory with a 5-minute TTL to reduce repeated DNS queries. The cache:
- Holds up to 10,000 entries
- Uses thread-safe monkey-patching of
socket.getaddrinfo - Automatically evicts old entries
The crawler supports International Domain Names (IDN) including emoji domains (e.g., 🍕.ws) through IDNA library patching.
Domains can be automatically detected as aliases of canonical domains. When a domain is marked as an alias:
- All error state is cleared
- Future crawls skip the alias domain
- The canonical domain is tracked instead
When retrying domains with terminal error states (menu options 60-72):
- If the domain fails again with the same error type, the existing error count and state are preserved
- If the domain fails with a different error type, the previous terminal state is cleared to allow the status to transition
The crawler uses PostgreSQL advisory locks to prevent race conditions when multiple crawler instances run concurrently. This ensures safe cleanup operations and prevents duplicate work.
In TTY mode, the crawler displays:
- Real-time progress with color-coded status
- Per-domain elapsed time
- Slow domain highlighting (configurable threshold)
- Periodic heartbeat updates
Configure these in your .env file:
# Database connection
VMCRAWL_POSTGRES_DATA="dbname"
VMCRAWL_POSTGRES_USER="username"
VMCRAWL_POSTGRES_PASS="password"
VMCRAWL_POSTGRES_HOST="localhost"
VMCRAWL_POSTGRES_PORT="5432"Note: Version tracking is now managed through the database via the manage menu. The VMCRAWL_BACKPORTS environment variable is no longer used.
For remote database access via SSH tunnel:
VMCRAWL_SSH_HOST="ssh.example.com"
VMCRAWL_SSH_PORT="22"
VMCRAWL_SSH_USER="username"
VMCRAWL_SSH_KEY="~/.ssh/id_rsa"
VMCRAWL_SSH_KEY_PASS="passphrase" # Optional# Concurrent domain processing tasks (default: 2)
VMCRAWL_MAX_THREADS="4"
# HTTP request timeout in seconds (default: 5)
VMCRAWL_HTTP_TIMEOUT="10"
# Maximum HTTP redirects to follow (default: 2)
VMCRAWL_HTTP_REDIRECT="5"
# Maximum response size in bytes (default: 10485760 = 10MB)
VMCRAWL_MAX_RESPONSE_SIZE="20971520"The crawler uses Python's asyncio for concurrent I/O operations, with an asyncio.Semaphore limiting concurrent domain processing to the configured value.
DNS resolution results are cached in-memory for 5 minutes (300 seconds) to reduce repeated DNS lookups when crawling domains. The cache holds up to 10,000 entries and is automatically managed.
Configure DNS retry behavior:
# DNS retry attempts (default: 3)
VMCRAWL_DNS_RETRY_ATTEMPTS="5"
# Base retry delay in milliseconds (default: 80)
VMCRAWL_DNS_RETRY_BASE_DELAY_MS="100"
# Jitter range in milliseconds (default: 40)
VMCRAWL_DNS_RETRY_JITTER_MS="50"
# Maximum total backoff delay in milliseconds (default: 500)
VMCRAWL_DNS_RETRY_MAX_TOTAL_DELAY_MS="1000"# Error threshold before marking domain as bad (default: 24)
VMCRAWL_ERROR_BUFFER="30"# Default number of domains to fetch from (default: 10)
VMCRAWL_FETCH_LIMIT="20"
# Default offset for domain selection (default: 0)
VMCRAWL_FETCH_OFFSET="5"
# Minimum active users required to fetch peers from instance (default: 100)
VMCRAWL_FETCH_MIN_ACTIVE="50"# Progress update frequency in seconds (default: 5)
VMCRAWL_PROGRESS_HEARTBEAT_SECONDS="10"
# Threshold to mark domain as slow in seconds (default: 8)
VMCRAWL_SLOW_DOMAIN_SECONDS="15"# Version data refresh interval in seconds (default: 3600)
VMCRAWL_VERSION_REFRESH_INTERVAL="7200"# Filter data cache TTL in seconds (default: 300)
VMCRAWL_FILTER_CACHE_SECONDS="600"# API authentication key (optional, disables auth if not set)
VMCRAWL_API_KEY="your-secret-key-here"For production installations using systemd:
# Follow crawler logs in real-time
journalctl -u vmcrawl.service -f
# View recent crawler logs
journalctl -u vmcrawl.service -n 100
# View crawler logs since boot
journalctl -u vmcrawl.service -bCrawler Service:
# Stop service
systemctl stop vmcrawl.service
# Restart service
systemctl restart vmcrawl.service
# Disable service
systemctl disable vmcrawl.service- Check logs:
journalctl -u vmcrawl.service -n 50 - Verify permissions:
ls -la /opt/vmcrawl - Test script manually:
sudo -u vmcrawl /opt/vmcrawl/vmcrawl.sh
# Fix ownership
chown -R vmcrawl:vmcrawl /opt/vmcrawl
# Fix script permission
chmod +x /opt/vmcrawl/vmcrawl.shWe welcome contributions! Please read our contributing guidelines for more details.
This project is licensed under the MIT License. See the LICENSE file for more information.
For any questions or feedback, please open an issue on GitHub.