Gnosis-Crawl

A pure-API web crawling service with markdown generation, following the gnosis service standard.

Overview

Gnosis-Crawl is a focused, API-only crawling service that provides:

  • Single URL crawling - Synchronous HTML + markdown extraction
  • Markdown-only crawling - Optimized markdown extraction
  • Batch processing - Asynchronous multi-URL crawling with job tracking
  • User-partitioned storage - Secure, isolated data storage
  • Gnosis-auth integration - Standardized authentication

API Endpoints

Core Crawling

  • POST /api/crawl - Crawl single URL (returns HTML + markdown)
  • POST /api/markdown - Crawl single URL (markdown only)
  • POST /api/batch - Start batch crawl job (see the example after this section)

Job Management

  • GET /api/jobs/{job_id} - Get job status and results
  • GET /api/jobs - List user jobs

Session Management

  • GET /api/sessions/{session_id}/files - List files for a session
  • GET /api/sessions/{session_id}/file - Get specific file from session

System

  • GET /health - Health check
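
For example, a batch crawl can be started and then polled for results. A minimal sketch, assuming a request body with a urls array and a response carrying a job_id (these field names are not confirmed by this README):

# start a batch job (the "urls" field is an assumption)
curl -H "Authorization: Bearer <token>" \
     -X POST http://localhost:8080/api/batch \
     -H "Content-Type: application/json" \
     -d '{"urls": ["https://example.com", "https://example.org"]}'

# poll status and results using the returned job_id
curl -H "Authorization: Bearer <token>" \
     http://localhost:8080/api/jobs/<job_id>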

Quick Start

Local Development

  1. Clone and setup:

    git clone <repo>
    cd gnosis-crawl
    cp .env.example .env
  2. Install dependencies:

    pip install -r requirements.txt
  3. Run locally:

    uvicorn app.main:app --reload --host 0.0.0.0 --port 8080
  4. Access service:
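
    The service is now available at http://localhost:8080; verify it with the documented health endpoint:

    curl http://localhost:8080/health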

Docker Deployment

  1. Local Docker:

    ./deploy.ps1 -Target local
  2. Google Cloud Run:

    ./deploy.ps1 -Target cloudrun -Tag v1.0.0

Porter/Kubernetes Deployment (No Auth)

For standalone deployments without gnosis-auth:

  1. Copy Porter config:

    cp .env.porter .env
  2. Deploy to your cluster:

    • The DISABLE_AUTH=true flag bypasses all authentication
    • All endpoints become publicly accessible
    • Recommended for internal/private clusters only
  3. Build and deploy:

    docker build -t gnosis-crawl:latest .
    # Push to your registry and deploy via Porter/kubectl
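
    A minimal sketch of that step, assuming a hypothetical registry (registry.example.com) and a plain kubectl deployment named gnosis-crawl:

    # hypothetical registry; replace with your own
    docker tag gnosis-crawl:latest registry.example.com/gnosis-crawl:latest
    docker push registry.example.com/gnosis-crawl:latest

    # create the deployment and set the no-auth flag described above
    kubectl create deployment gnosis-crawl --image=registry.example.com/gnosis-crawl:latest --port=8080
    kubectl set env deployment/gnosis-crawl DISABLE_AUTH=true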

Cloud Storage (GCS) Setup

For production deployments using Google Cloud Storage:

  1. Create GCS bucket:

    gsutil mb gs://gnosis-crawl-storage
  2. Set permissions:

    # Grant service account write access
    gsutil iam ch serviceAccount:YOUR-SA@PROJECT.iam.gserviceaccount.com:objectAdmin gs://gnosis-crawl-storage
  3. Use cloud config:

    cp .env.cloud .env
  4. Update environment variables:

    RUNNING_IN_CLOUD=true
    GCS_BUCKET_NAME=gnosis-crawl-storage
    GOOGLE_CLOUD_PROJECT=your-project-id  # Optional if running in GCP
  5. Install GCS client:

    pip install google-cloud-storage

Note: When running in GCP (Cloud Run, GKE), authentication is automatic via service accounts. For local development, set GOOGLE_APPLICATION_CREDENTIALS to your service account key file.
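
For example (the key path is illustrative):

export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json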

Configuration

Environment variables (see .env.example); a sample .env follows the lists below:

Server

  • HOST - Server host (default: 0.0.0.0)
  • PORT - Server port (default: 8080)
  • DEBUG - Debug mode (default: false)

Storage

  • STORAGE_PATH - Local storage path (default: ./storage)
  • RUNNING_IN_CLOUD - Enable GCS cloud storage (default: false)
  • GCS_BUCKET_NAME - GCS bucket name (required if RUNNING_IN_CLOUD=true)
  • GOOGLE_CLOUD_PROJECT - GCP project ID (optional, auto-detected in GCP)

Authentication

  • DISABLE_AUTH - Disable all authentication (default: false) ⚠️
  • GNOSIS_AUTH_URL - Gnosis-auth service URL

Crawling

  • MAX_CONCURRENT_CRAWLS - Max concurrent crawls (default: 5)
  • CRAWL_TIMEOUT - Crawl timeout in seconds (default: 30)
  • ENABLE_JAVASCRIPT - Enable JS rendering (default: true)
  • ENABLE_SCREENSHOTS - Enable screenshots (default: false)
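
Taken together, a plausible .env for local development (values are the documented defaults; the auth URL is a placeholder):

HOST=0.0.0.0
PORT=8080
DEBUG=false
STORAGE_PATH=./storage
RUNNING_IN_CLOUD=false
DISABLE_AUTH=false
# placeholder; point at your gnosis-auth instance
GNOSIS_AUTH_URL=https://auth.example.com
MAX_CONCURRENT_CRAWLS=5
CRAWL_TIMEOUT=30
ENABLE_JAVASCRIPT=true
ENABLE_SCREENSHOTS=false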

Authentication

With gnosis-auth (default)

All API endpoints require authentication via a Bearer token. The user's email from the token is used for storage partitioning:

curl -H "Authorization: Bearer <token>" \
     -X POST http://localhost:8080/api/crawl \
     -H "Content-Type: application/json" \
     -d '{"url": "https://example.com"}'

Without authentication (DISABLE_AUTH=true)

When auth is disabled (Porter/Kubernetes deployments), use customer_id for storage partitioning:

curl -X POST http://localhost:8080/api/crawl \
     -H "Content-Type: application/json" \
     -d '{
       "url": "https://example.com",
       "customer_id": "client-xyz-123"
     }'

⚠️ Warning: Only use DISABLE_AUTH=true in trusted, internal environments.

Customer ID Support

All crawl endpoints support an optional customer_id field for flexible storage partitioning:

  • Priority: customer_id (if provided) → authenticated user email → "anonymous"
  • Use cases:
    • Unauthenticated API access (with DISABLE_AUTH=true)
    • Multi-tenant storage partitioning even with auth
    • Custom storage organization
  • Storage path: storage/{hash(customer_id or user_email)}/{session_id}/

Example with customer_id override:

curl -H "Authorization: Bearer <token>" \
     -X POST http://localhost:8080/api/crawl \
     -H "Content-Type: application/json" \
     -d '{
       "url": "https://example.com",
       "customer_id": "custom-partition-id",
       "session_id": "my-session"
     }'

Session file access with customer_id:

# List session files
curl "http://localhost:8080/api/sessions/my-session/files?customer_id=client-xyz-123"

# Get specific file
curl "http://localhost:8080/api/sessions/my-session/file?path=results/abc123.json&customer_id=client-xyz-123"

Architecture

Directory Structure

gnosis-crawl/
├── app/                 # Application code
│   ├── main.py         # FastAPI app
│   ├── config.py       # Configuration  
│   ├── auth.py         # Authentication
│   ├── models.py       # Data models
│   ├── routes.py       # API routes
│   ├── storage.py      # Storage service
│   └── crawler.py      # Crawling logic
├── tests/              # Test suite
├── storage/            # Local storage
├── Dockerfile          # Container config
├── docker-compose.yml  # Local deployment
├── deploy.ps1          # Deployment script
└── requirements.txt    # Dependencies

Storage Organization

storage/
└── {customer_hash}/        # Customer partition (hash of customer_id or user_email)
    └── {session_id}/       # Session partition
        ├── metadata.json
        └── results/
            ├── {url_hash}.json
            └── ...

Customer Hash: the 12-character SHA-256 hash (sketched below) provides:

  • Privacy (doesn't expose actual customer_id or email)
  • Consistent bucketing per customer
  • File system safety
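
The exact derivation isn't documented here, but under the assumption that the partition is the first 12 hex characters of SHA-256 over the raw identifier, it can be previewed from a shell:

# assumption: first 12 hex chars of SHA-256 over the raw customer_id
echo -n "client-xyz-123" | sha256sum | cut -c1-12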

Job System

  • Local: ThreadPoolExecutor for development
  • Cloud: Google Cloud Tasks for production
  • Status: Derived from actual storage files
  • Sessions: User-scoped job grouping

Development Status

Phase 1: Core Infrastructure ✅

  • Directory structure
  • FastAPI application
  • Authentication integration
  • Customer ID support (optional auth bypass)
  • Storage service with customer partitioning
  • API routes
  • Docker configuration
  • Deployment scripts

Phase 2: Crawling ✅

  • Browser automation (Playwright)
  • HTML extraction
  • Markdown generation
  • Batch processing
  • Session management

Phase 3: Testing & Production

  • Test suite
  • Error handling
  • Monitoring
  • Documentation

Contributing

This service follows the gnosis deployment standard:

  1. Flat app structure - All code in /app directory
  2. Environment-based config - .env pattern
  3. PowerShell deployment - deploy.ps1 script
  4. Docker-first - Containerized deployment
  5. Gnosis-auth integration - Standard authentication

License

Gnosis Project License
