Welcome to the DocEX developer guide! This document will help you get started with DocEX, understand its core concepts, and extend it with your own processors and integrations.
- Install DocEX (from PyPI or GitHub):

```bash
pip install docex
# or for the latest development version:
pip install git+https://github.com/tommyGPT2S/DocEX.git
```

- Initialize DocEX (run once per environment):

```bash
docex init  # Follow the prompts to set up config and database
```
```python
from docex import DocEX

# Create DocEX instance
docEX = DocEX()

# Create or get a basket
basket = docEX.basket('mybasket')

# Add a document
doc = basket.add('path/to/file.txt', metadata={'source': 'example'})

# List all baskets
for b in docEX.list_baskets():
    print(b.name)
```

- Get document details:

```python
print(doc.get_details())
```
- Get document content:

```python
text = doc.get_content(mode='text')
data = doc.get_content(mode='json')
bytes_data = doc.get_content(mode='bytes')
```
- Get and update metadata:

```python
meta = doc.get_metadata()

# Update metadata
from docex.services.metadata_service import MetadataService
MetadataService().update_metadata(doc.id, {'my_key': 'my_value'})
```
- Get document operations:

```python
print(doc.get_operations())
```
- List routes:

```python
for route in docEX.list_routes():
    print(route.name, route.protocol)
```
- Download a file:

```python
route = docEX.get_route('my_download_route')
result = route.download('remote_file.txt', 'local_file.txt')
print(result.message)
```
- Upload a document:

```python
upload_route = docEX.get_route('my_upload_route')
result = upload_route.upload_document(doc)
print(result.message)
```
- List available processors:

```bash
docex processor list
```
- Get a processor for a document:

```python
from docex.processors.factory import factory

processor_cls = factory.map_document_to_processor(doc)
if processor_cls:
    processor = processor_cls(config={})
    result = processor.process(doc)
    print(result.content)
else:
    print('No processor found for this document.')
```
- Create a new processor class:

```python
import io

from docex.processors.base import BaseProcessor, ProcessingResult
from docex.document import Document
from pdfminer.high_level import extract_text


class MyPDFTextProcessor(BaseProcessor):
    def can_process(self, document: Document) -> bool:
        return document.name.lower().endswith('.pdf')

    def process(self, document: Document) -> ProcessingResult:
        pdf_bytes = document.get_content(mode='bytes')
        text = extract_text(io.BytesIO(pdf_bytes))
        return ProcessingResult(success=True, content=text)
```
- Dynamically add a processor mapping rule:

Instead of editing the main package, you can patch the processor mapping at runtime:

```python
from docex.processors.factory import factory
from my_pdf_text_processor import MyPDFTextProcessor

def pdf_rule(document):
    if document.name.lower().endswith('.pdf'):
        return MyPDFTextProcessor
    return None

factory.mapper.rules.insert(0, pdf_rule)  # Highest priority
```

This allows you to use your custom processor for PDFs (or any custom logic) without modifying DocEX internals.
- Register your processor (optional):

```bash
docex processor register --name MyPDFTextProcessor --type content_processor --description "Extracts text from PDFs" --config '{}'
```
- Add a mapping rule (optional):

You can still edit `docex/processors/mapper.py` for static rules, but dynamic patching is recommended for custom/external processors.
- Always use the Document API for content and metadata access (never access storage directly).
- Use baskets to organize documents by business context.
- Use metadata to enrich and search documents.
- Add custom processors for your business logic and register them via the CLI.
- Keep mapping logic in `mapper.py` for easy extensibility.
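To make the "enrich and search with metadata" practice concrete, here is a conceptual sketch of metadata-based filtering using plain dictionaries. This is illustrative only and does not use DocEX's actual query API:

```python
# Conceptual sketch: filtering documents by metadata key/value pairs.
# In DocEX, each document carries a metadata dict; a metadata search
# conceptually reduces to matching criteria against those dicts.

docs = [
    {'name': 'inv1.pdf', 'metadata': {'customer_id': 'CUST-123', 'file_type': 'pdf'}},
    {'name': 'po7.txt', 'metadata': {'customer_id': 'CUST-456', 'file_type': 'txt'}},
]

def find_by_metadata(documents, **criteria):
    """Return documents whose metadata contains all given key/value pairs."""
    return [d for d in documents
            if all(d['metadata'].get(k) == v for k, v in criteria.items())]

matches = find_by_metadata(docs, customer_id='CUST-123')
print([d['name'] for d in matches])  # ['inv1.pdf']
```

Consistent keys (see the `MetadataKey` enum later in this guide) are what make this kind of filtering reliable across documents.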
DocEX supports multiple storage and database backends. You can configure these in your config file (usually `~/.docex/config.yaml`) or during `docex init`.
Edit your config file or use the CLI to set:
```yaml
database:
  type: postgres
  postgres:
    host: localhost
    port: 5432
    database: docex
    user: docex
    password: secret
    schema: docex
```

- Make sure the Postgres server is running and the user/database exist.
- Re-run `docex init` if you want to reinitialize the database.
Edit your config file:
```yaml
storage:
  default_type: s3
  s3:
    bucket: docex-bucket
    region: us-east-1
    # Optional: credentials (can also use environment variables or IAM roles)
    access_key: your-access-key
    secret_key: your-secret-key
    # Optional: S3 key prefix for organizing files
    prefix: docex/
    # Optional: retry configuration
    max_retries: 3
    retry_delay: 1.0
    # Optional: timeout configuration
    connect_timeout: 60
    read_timeout: 60
```

Credential Sources (in priority order):

- Config file credentials (highest priority)
- Environment variables (`AWS_ACCESS_KEY_ID`, `AWS_SECRET_ACCESS_KEY`, `AWS_SESSION_TOKEN`)
- IAM role / instance profile (for EC2/ECS)
- AWS profile from `~/.aws/credentials` (lowest priority)
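The priority order above can be pictured as a simple lookup. The sketch below is illustrative only; in practice, IAM roles and `~/.aws/credentials` profiles are resolved by the AWS SDK itself, not by code like this:

```python
import os

def resolve_s3_credentials(config):
    """Illustrative sketch of the credential priority order above:
    config-file credentials win, then environment variables; IAM roles
    and AWS profiles are left to the AWS SDK's own resolution chain."""
    s3 = config.get('s3', {})
    if s3.get('access_key') and s3.get('secret_key'):
        return s3['access_key'], s3['secret_key']
    env_key = os.environ.get('AWS_ACCESS_KEY_ID')
    env_secret = os.environ.get('AWS_SECRET_ACCESS_KEY')
    if env_key and env_secret:
        return env_key, env_secret
    return None  # Fall through to IAM role / AWS profile via the SDK

print(resolve_s3_credentials({'s3': {'access_key': 'AK', 'secret_key': 'SK'}}))
```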
Using Environment Variables:

```yaml
storage:
  default_type: s3
  s3:
    bucket: docex-bucket
    region: us-east-1
    # Credentials will be read from environment variables
```

```bash
export AWS_ACCESS_KEY_ID=your-access-key
export AWS_SECRET_ACCESS_KEY=your-secret-key
export AWS_DEFAULT_REGION=us-east-1
```

Using IAM Roles (EC2/ECS):
```yaml
storage:
  default_type: s3
  s3:
    bucket: docex-bucket
    region: us-east-1
    # No credentials needed - IAM role will be used automatically
```

Per-Basket S3 Configuration: You can configure per-basket storage by passing a storage config when creating a basket:
```python
basket = docEX.create_basket('mybasket', storage_config={
    'type': 's3',
    's3': {
        'bucket': 'my-bucket',
        'region': 'us-east-1',
        'prefix': 'mybasket/',  # Optional prefix for this basket
        'access_key': '...',    # Optional
        'secret_key': '...'     # Optional
    }
})
```

S3 Storage Features:
- Automatic retry on transient errors (500, 503, throttling, timeouts)
- Configurable retry attempts and delays
- Support for S3 key prefixes for organizing files
- Presigned URL generation for secure access
- Comprehensive error handling and logging
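DocEX handles retries internally; as a rough sketch of the pattern behind the `max_retries`/`retry_delay` settings above (a fixed delay between attempts, illustrative only and not DocEX's actual implementation):

```python
import time

class TransientError(Exception):
    """Stand-in for throttling / 5xx / timeout errors."""

def with_retries(operation, max_retries=3, retry_delay=1.0):
    """Retry an operation on transient errors, mirroring the
    max_retries / retry_delay config options shown above."""
    for attempt in range(max_retries + 1):
        try:
            return operation()
        except TransientError:
            if attempt == max_retries:
                raise  # Exhausted retries: surface the error
            time.sleep(retry_delay)

# Example: an operation that fails twice, then succeeds.
calls = {'n': 0}
def flaky_upload():
    calls['n'] += 1
    if calls['n'] < 3:
        raise TransientError('throttled')
    return 'ok'

print(with_retries(flaky_upload, max_retries=3, retry_delay=0.01))  # ok
```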
```yaml
storage:
  default_type: filesystem
  filesystem:
    base_path: /path/to/storage
```

DocEX provides a set of standard metadata keys in `docex/models/metadata_keys.py` via the `MetadataKey` enum. These help you use consistent, searchable metadata across your documents.
- File-related: `MetadataKey.ORIGINAL_PATH` → 'original_path', `MetadataKey.FILE_TYPE` → 'file_type', `MetadataKey.FILE_SIZE` → 'file_size', `MetadataKey.FILE_EXTENSION` → 'file_extension', `MetadataKey.ORIGINAL_FILE_TIMESTAMP` → 'original_file_timestamp'
- Processing: `MetadataKey.PROCESSING_STATUS` → 'processing_status', `MetadataKey.PROCESSING_ERROR` → 'processing_error'
- Business: `MetadataKey.RELATED_PO` → 'related_po', `MetadataKey.CUSTOMER_ID` → 'customer_id', `MetadataKey.INVOICE_NUMBER` → 'invoice_number'
- Security: `MetadataKey.ACCESS_LEVEL` → 'access_level'
- Audit: `MetadataKey.CREATED_BY` → 'created_by', `MetadataKey.CREATED_AT` → 'created_at'
```python
from docex.models.metadata_keys import MetadataKey
from docex.services.metadata_service import MetadataService

# Set standard metadata
MetadataService().update_metadata(doc.id, {
    MetadataKey.FILE_TYPE.value: 'pdf',
    MetadataKey.CUSTOMER_ID.value: 'CUST-123',
    MetadataKey.INVOICE_NUMBER.value: 'INV-2024-001',
})

# Get metadata
meta = doc.get_metadata()
print(meta[MetadataKey.FILE_TYPE.value])  # e.g., 'pdf'

# Use custom metadata keys
custom_key = MetadataKey.get_custom_key('my_custom_field')
MetadataService().update_metadata(doc.id, {custom_key: 'custom_value'})
```

DocEX supports user context for audit logging and operation tracking. The `UserContext` class provides a way to track user operations without implementing tenant-specific logic.
```python
from docex.context import UserContext
from docex import DocEX

# Create user context
user_context = UserContext(
    user_id="user123",
    user_email="user@example.com",
    roles=["admin"]
)

# Initialize DocEX with user context
docEX = DocEX(user_context=user_context)
```

The user context is used for:
- Audit logging of operations
- Operation tracking
- User-aware logging
DocEX supports two multi-tenancy models:
- Row-level isolation: all tenants share the same database/schema, with `tenant_id` columns providing logical isolation. This model is proposed but not yet implemented.
- Database-level isolation: each tenant has its own database (SQLite) or schema (PostgreSQL), providing physical data isolation. This model is fully implemented and ready for production use.
Configuration:
```yaml
# ~/.docex/config.yaml
security:
  multi_tenancy_model: database_level
  tenant_database_routing: true

database:
  type: postgresql
  postgres:
    host: localhost
    port: 5432
    database: docex
    user: postgres
    password: postgres
    schema_template: "tenant_{tenant_id}"  # Schema per tenant
```

Usage:
```python
from docex import DocEX
from docex.context import UserContext

# Tenant 1
user_context1 = UserContext(user_id="alice", tenant_id="tenant1")
docEX1 = DocEX(user_context=user_context1)
basket1 = docEX1.create_basket("invoices")  # Uses tenant1 schema

# Tenant 2 (isolated)
user_context2 = UserContext(user_id="bob", tenant_id="tenant2")
docEX2 = DocEX(user_context=user_context2)
basket2 = docEX2.create_basket("invoices")  # Uses tenant2 schema
```

Features:

- ✅ Automatic database/schema routing based on `UserContext.tenant_id`
- ✅ Connection pooling per tenant
- ✅ Automatic schema/database creation for new tenants
- ✅ Thread-safe connection management
- ✅ Support for SQLite (separate DB files) and PostgreSQL (separate schemas)
Benefits:
- Strongest data isolation (physical separation)
- Best for compliance (HIPAA, GDPR, SOX)
- Independent scaling per tenant
- No risk of cross-tenant data leaks
For applications using row-level isolation or custom tenant management, tenant logic should be handled at the upper layer:
- Database Configuration
  - Configure separate databases or schemas per tenant
  - Use connection pooling with tenant-specific credentials
  - Handle database routing at the application layer
- Storage Configuration
  - Configure separate storage paths per tenant
  - Manage storage quotas and access at the application layer
  - Handle storage path routing based on tenant context
- Access Control
  - Implement tenant-specific access control at the application layer
  - Use middleware or decorators for tenant validation
  - Handle user-tenant mapping in the application layer
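A minimal sketch of decorator-based tenant validation at the application layer. The `UserContext` here is a stand-in dataclass matching the fields used elsewhere in this guide; the decorator and error type are illustrative application code, not part of DocEX:

```python
from dataclasses import dataclass
from functools import wraps

@dataclass
class UserContext:
    """Stand-in mirroring the fields used in the DocEX examples above."""
    user_id: str
    tenant_id: str

class TenantAccessError(Exception):
    pass

def require_tenant(expected_tenant: str):
    """Reject calls whose user context belongs to another tenant."""
    def decorator(func):
        @wraps(func)
        def wrapper(user_context: UserContext, *args, **kwargs):
            if user_context.tenant_id != expected_tenant:
                raise TenantAccessError(
                    f"user {user_context.user_id} is not in tenant {expected_tenant}"
                )
            return func(user_context, *args, **kwargs)
        return wrapper
    return decorator

@require_tenant("tenant1")
def list_invoices(user_context: UserContext):
    # In a real application this would call into DocEX for the tenant
    return f"invoices for {user_context.tenant_id}"

print(list_invoices(UserContext(user_id="alice", tenant_id="tenant1")))
```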
Example of tenant management at the application layer:
```python
class TenantAwareDocEX:
    def __init__(self, tenant_id: str):
        self.tenant_id = tenant_id
        self.db_config = self._get_tenant_db_config()
        self.storage_config = self._get_tenant_storage_config()

    def _get_tenant_db_config(self):
        # Get tenant-specific database configuration
        return {
            "type": "postgres",
            "database": f"tenant_{self.tenant_id}",
            # ... other config
        }

    def _get_tenant_storage_config(self):
        # Get tenant-specific storage configuration
        return {
            "filesystem": {
                "path": f"/storage/tenant_{self.tenant_id}"
            }
        }

    def get_docex(self, user_context: UserContext):
        # Initialize DocEX with tenant-specific config
        DocEX.setup(
            database=self.db_config,
            storage=self.storage_config
        )
        return DocEX(user_context=user_context)
```
- Keep DocEX Focused
  - Use DocEX for document management only
  - Handle tenant logic at the application layer
  - Use user context for auditing and logging
- Configuration Management
  - Store tenant configurations separately
  - Use environment variables for sensitive data
  - Implement configuration validation
- Security
  - Validate tenant access at the application layer
  - Use proper authentication and authorization
  - Implement audit logging for all operations
- Performance
  - Use connection pooling for database access
  - Implement caching where appropriate
  - Monitor resource usage per tenant
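For the caching point, a simple memoization of per-tenant config lookups is often enough. This is an illustrative sketch; the lookup function is hypothetical, not a DocEX API:

```python
from functools import lru_cache

LOOKUPS = {'count': 0}  # Track how many real lookups happen

@lru_cache(maxsize=128)
def get_tenant_storage_path(tenant_id: str):
    """Hypothetical per-tenant config lookup, cached with lru_cache
    so repeated calls for the same tenant avoid re-reading config."""
    LOOKUPS['count'] += 1
    return f"/storage/tenant_{tenant_id}"

get_tenant_storage_path("tenant1")
get_tenant_storage_path("tenant1")  # Served from cache
print(LOOKUPS['count'])  # 1
```

Remember to bound the cache (`maxsize`) and invalidate it (`cache_clear()`) when tenant configuration changes.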
Happy coding with DocEX!
