Pure API web crawling service with markdown generation, following the gnosis service standard.
Gnosis-Crawl is a focused, API-only crawling service that provides:
- Single URL crawling - Synchronous HTML + markdown extraction
- Markdown-only crawling - Optimized markdown extraction
- Batch processing - Asynchronous multi-URL crawling with job tracking
- User partitioned storage - Secure, isolated data storage
- Gnosis-auth integration - Standardized authentication
- POST /api/crawl - Crawl single URL (returns HTML + markdown)
- POST /api/markdown - Crawl single URL (markdown only)
- POST /api/batch - Start batch crawl job
- GET /api/jobs/{job_id} - Get job status and results
- GET /api/jobs - List user jobs
- GET /api/sessions/{session_id}/files - List files for a session
- GET /api/sessions/{session_id}/file - Get specific file from session
- GET /health - Health check
1. Clone and setup:

   ```bash
   git clone <repo>
   cd gnosis-crawl
   cp .env.example .env
   ```

2. Install dependencies:

   ```bash
   pip install -r requirements.txt
   ```

3. Run locally:

   ```bash
   uvicorn app.main:app --reload --host 0.0.0.0 --port 8080
   ```

4. Access the service:

   - API: http://localhost:8080
   - Docs: http://localhost:8080/docs
   - Health: http://localhost:8080/health
Local Docker:

```powershell
./deploy.ps1 -Target local
```

Google Cloud Run:

```powershell
./deploy.ps1 -Target cloudrun -Tag v1.0.0
```
For standalone deployments without gnosis-auth:
1. Copy Porter config:

   ```bash
   cp .env.porter .env
   ```

2. Deploy to your cluster:

   - The DISABLE_AUTH=true flag bypasses all authentication - all endpoints become publicly accessible
   - Recommended for internal/private clusters only

3. Build and deploy:

   ```bash
   docker build -t gnosis-crawl:latest .
   # Push to your registry and deploy via Porter/kubectl
   ```
For production deployments using Google Cloud Storage:
1. Create GCS bucket:

   ```bash
   gsutil mb gs://gnosis-crawl-storage
   ```

2. Set permissions:

   ```bash
   # Grant service account write access
   gsutil iam ch serviceAccount:YOUR-SA@PROJECT.iam.gserviceaccount.com:objectAdmin gs://gnosis-crawl-storage
   ```

3. Use cloud config:

   ```bash
   cp .env.cloud .env
   ```

4. Update environment variables:

   ```bash
   RUNNING_IN_CLOUD=true
   GCS_BUCKET_NAME=gnosis-crawl-storage
   GOOGLE_CLOUD_PROJECT=your-project-id  # Optional if running in GCP
   ```

5. Install GCS client:

   ```bash
   pip install google-cloud-storage
   ```
Note: When running in GCP (Cloud Run, GKE), authentication is automatic via service accounts. For local development, set GOOGLE_APPLICATION_CREDENTIALS to your service account key file.
Environment variables (see .env.example):
- HOST - Server host (default: 0.0.0.0)
- PORT - Server port (default: 8080)
- DEBUG - Debug mode (default: false)
- STORAGE_PATH - Local storage path (default: ./storage)
- RUNNING_IN_CLOUD - Enable GCS cloud storage (default: false)
- GCS_BUCKET_NAME - GCS bucket name (required if RUNNING_IN_CLOUD=true)
- GOOGLE_CLOUD_PROJECT - GCP project ID (optional, auto-detected in GCP)
- DISABLE_AUTH - Disable all authentication (default: false) ⚠️
- GNOSIS_AUTH_URL - Gnosis-auth service URL
- MAX_CONCURRENT_CRAWLS - Max concurrent crawls (default: 5)
- CRAWL_TIMEOUT - Crawl timeout in seconds (default: 30)
- ENABLE_JAVASCRIPT - Enable JS rendering (default: true)
- ENABLE_SCREENSHOTS - Enable screenshots (default: false)
All API endpoints require authentication via Bearer token. User email from the token is used for storage partitioning:
```bash
curl -H "Authorization: Bearer <token>" \
  -X POST http://localhost:8080/api/crawl \
  -H "Content-Type: application/json" \
  -d '{"url": "https://example.com"}'
```

When auth is disabled (Porter/Kubernetes deployments), use customer_id for storage partitioning:

```bash
curl -X POST http://localhost:8080/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "customer_id": "client-xyz-123"
  }'
```

Only use DISABLE_AUTH=true in trusted, internal environments.
All crawl endpoints support an optional customer_id field for flexible storage partitioning:
- Priority: customer_id (if provided) → authenticated user email → "anonymous"
- Use cases:
  - Unauthenticated API access (with DISABLE_AUTH=true)
  - Multi-tenant storage partitioning even with auth
  - Custom storage organization
- Storage path: storage/{hash(customer_id or user_email)}/{session_id}/
Example with customer_id override:
```bash
curl -H "Authorization: Bearer <token>" \
  -X POST http://localhost:8080/api/crawl \
  -H "Content-Type: application/json" \
  -d '{
    "url": "https://example.com",
    "customer_id": "custom-partition-id",
    "session_id": "my-session"
  }'
```

Session file access with customer_id:
```bash
# List session files
curl "http://localhost:8080/api/sessions/my-session/files?customer_id=client-xyz-123"

# Get specific file
curl "http://localhost:8080/api/sessions/my-session/file?path=results/abc123.json&customer_id=client-xyz-123"
```

```
gnosis-crawl/
├── app/                  # Application code
│   ├── main.py           # FastAPI app
│   ├── config.py         # Configuration
│   ├── auth.py           # Authentication
│   ├── models.py         # Data models
│   ├── routes.py         # API routes
│   ├── storage.py        # Storage service
│   └── crawler.py        # Crawling logic
├── tests/                # Test suite
├── storage/              # Local storage
├── Dockerfile            # Container config
├── docker-compose.yml    # Local deployment
├── deploy.ps1            # Deployment script
└── requirements.txt      # Dependencies
```
```
storage/
└── {customer_hash}/      # Customer partition (hash of customer_id or user_email)
    └── {session_id}/     # Session partition
        ├── metadata.json
        └── results/
            ├── {url_hash}.json
            └── ...
```
Customer Hash: 12-character SHA256 hash provides:
- Privacy (doesn't expose actual customer_id or email)
- Consistent bucketing per customer
- File system safety
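A sketch of the bucketing described above, assuming UTF-8 encoding and truncation of the hex digest (only "12-character SHA256" is stated; the exact scheme in app/storage.py may differ):

```python
import hashlib

def customer_hash(partition_key: str) -> str:
    """First 12 hex chars of SHA-256: stable per customer, filesystem-safe,
    and doesn't expose the raw customer_id or email."""
    return hashlib.sha256(partition_key.encode("utf-8")).hexdigest()[:12]

def session_path(partition_key: str, session_id: str) -> str:
    """Build the storage prefix shown in the tree above."""
    return f"storage/{customer_hash(partition_key)}/{session_id}/"
```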
- Local: ThreadPoolExecutor for development
- Cloud: Google Cloud Tasks for production
- Status: Derived from actual storage files
- Sessions: User-scoped job grouping
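For the local path, the model above can be sketched as a thread pool that writes one result file per URL, with status derived from the files on disk rather than from in-memory state. Function names and the file layout are illustrative, loosely following the storage structure above.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path

def run_batch_locally(urls, session_dir: Path, crawl_fn, max_workers: int = 5):
    """Crawl URLs concurrently and persist one JSON result per URL."""
    results_dir = session_dir / "results"
    results_dir.mkdir(parents=True, exist_ok=True)

    def worker(url: str) -> None:
        result = crawl_fn(url)  # stand-in for the real crawler
        name = f"{abs(hash(url)):x}.json"  # stand-in for the real url_hash
        (results_dir / name).write_text(json.dumps(result))

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        list(pool.map(worker, urls))

def job_status(session_dir: Path, total: int) -> str:
    """Derive job status from storage files, not in-memory state."""
    done = len(list((session_dir / "results").glob("*.json")))
    return "completed" if done >= total else f"running ({done}/{total})"
```

Deriving status from storage means a restarted worker (or the Cloud Tasks path) reports the same progress without any shared job table.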
- Directory structure
- FastAPI application
- Authentication integration
- Customer ID support (optional auth bypass)
- Storage service with customer partitioning
- API routes
- Docker configuration
- Deployment scripts
- Browser automation (Playwright)
- HTML extraction
- Markdown generation
- Batch processing
- Session management
- Test suite
- Error handling
- Monitoring
- Documentation
This service follows the gnosis deployment standard:
- Flat app structure - All code in the /app directory
- Environment-based config - .env pattern
- PowerShell deployment - deploy.ps1 script
- Docker-first - Containerized deployment
- Gnosis-auth integration - Standard authentication
Gnosis Project License