refactor: introduce Pydantic settings for centralized configuration management #159

Closed
priyankeshh wants to merge 3 commits into marker-deploy from refactor/pydantic-settings

@priyankeshh
Contributor

🎯 Overview

This PR refactors the Extralit Server configuration system to use Pydantic Settings, providing type-safe, validated, and well-documented configuration management. This replaces scattered os.getenv() calls throughout the codebase with a centralized Settings class.

🚀 Motivation

Problems with the old approach:

  • ❌ Configuration scattered across multiple files using os.getenv()
  • ❌ No type safety or validation
  • ❌ Easy to have typos in environment variable names
  • ❌ No clear documentation of available settings
  • ❌ Configuration errors only discovered at runtime
  • ❌ Difficult to debug missing or invalid configuration

Benefits of the new approach:

  • ✅ Centralized configuration in config.py
  • ✅ Strong type safety with Pydantic validation
  • ✅ Configuration errors caught at startup with clear messages
  • ✅ IDE autocomplete and type hints for all settings
  • ✅ Comprehensive documentation for every setting
  • ✅ Built-in secret protection with SecretStr
  • ✅ Easy debugging with mask_secrets() method

📋 Changes Made

Core Configuration System

New Files:

  • ✨ config.py - Pydantic Settings class with 40+ documented configuration options
  • extralit-server/.env.example - Comprehensive configuration template with examples and comments

Enhanced Features:

  • 📝 Field-level documentation using Field() with descriptions
  • 🔒 Secret management with SecretStr type
  • ✅ Validators for conditional requirements (Modal mode, S3 configuration)
  • 🐛 mask_secrets() method for safe configuration debugging
  • 📊 Type hints for all configuration values

Code Refactoring

Migrated to use settings:

  • ocr_jobs.py - Marker integration
  • chat.py - Chat validation limits
  • migrate.py - User migration
  • marker_client.py - Modal client configuration
  • _helpers.py - Added clarifying comments
  • settings.py - Added documentation

Documentation

New Documentation:

  • 📘 extralit/docs/admin_guide/configuration.md (346 lines)

    • Configuration overview and quick start
    • Category-by-category setting reference
    • Validation error examples and solutions
    • Best practices for development and production
    • Security recommendations
    • Migration guide and troubleshooting
  • 📗 extralit/docs/community/developer/configuration-system.md (462 lines)

    • Architecture overview and design patterns
    • How to add new configuration options
    • Type safety benefits and testing patterns
    • Code examples and migration patterns
    • Security best practices
    • Performance considerations

🔧 Configuration Categories

The new settings system organizes configuration into logical groups:

  • Client API Configuration - API URL and keys
  • Authentication & Security - JWT secrets, user database
  • Database Configuration - SQLite/PostgreSQL connection strings
  • Redis Configuration - Cache and job queue settings
  • S3 Storage - Object storage for documents (validated)
  • Search Engine - Elasticsearch/OpenSearch configuration
  • Marker PDF Processing - Local vs Modal mode (validated)
  • Document Preprocessing - Layout analysis and page rotation
  • Chat & Message Validation - Length limits
  • External Services - MinIO, Weaviate, HuggingFace

✅ Validation Examples

Modal Configuration Validation:

# This will fail at startup with a clear error message
MARKER_RUN_MODE=modal
# Missing: MARKER_MODAL_BASE_URL (validator catches this!)

S3 Configuration Validation:

# This will fail - all three S3 fields required
EXTRALIT_S3_ENDPOINT=http://localhost:9000
# Missing: EXTRALIT_S3_ACCESS_KEY, EXTRALIT_S3_SECRET_KEY
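
One way the two conditional checks above could be written with Pydantic v2 is sketched here. The class name `SettingsSketch`, the `"local"` default, the error messages, and the assumption that S3 may be left entirely unset are illustrative and not taken from the PR's config.py. A plain `BaseModel` stands in for the PR's Settings class so the sketch does not read the real environment.

```python
from typing import Optional

from pydantic import BaseModel, SecretStr, ValidationError, model_validator


class SettingsSketch(BaseModel):
    # Field names mirror the environment variables quoted above.
    # The defaults are assumptions made for this sketch.
    MARKER_RUN_MODE: str = "local"
    MARKER_MODAL_BASE_URL: Optional[str] = None
    EXTRALIT_S3_ENDPOINT: Optional[str] = None
    EXTRALIT_S3_ACCESS_KEY: Optional[SecretStr] = None
    EXTRALIT_S3_SECRET_KEY: Optional[SecretStr] = None

    @model_validator(mode="after")
    def check_modal_url(self) -> "SettingsSketch":
        # Modal mode needs a base URL to call.
        if self.MARKER_RUN_MODE == "modal" and not self.MARKER_MODAL_BASE_URL:
            raise ValueError(
                "MARKER_MODAL_BASE_URL is required when MARKER_RUN_MODE=modal"
            )
        return self

    @model_validator(mode="after")
    def check_s3_complete(self) -> "SettingsSketch":
        # Assumed rule: S3 is either fully configured or not configured at all.
        s3 = {
            "EXTRALIT_S3_ENDPOINT": self.EXTRALIT_S3_ENDPOINT,
            "EXTRALIT_S3_ACCESS_KEY": self.EXTRALIT_S3_ACCESS_KEY,
            "EXTRALIT_S3_SECRET_KEY": self.EXTRALIT_S3_SECRET_KEY,
        }
        missing = [name for name, value in s3.items() if value is None]
        if 0 < len(missing) < len(s3):
            raise ValueError("Incomplete S3 configuration, missing: " + ", ".join(missing))
        return self


# Both failing examples from above raise at construction time.
for bad_config in (
    {"MARKER_RUN_MODE": "modal"},
    {"EXTRALIT_S3_ENDPOINT": "http://localhost:9000"},
):
    try:
        SettingsSketch(**bad_config)
    except ValidationError as exc:
        print(exc)
```

Because the checks run when the object is constructed, a module-level settings instance would surface these errors at import time, which is what "fail at startup" amounts to in practice.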

🔒 Security Improvements

  • Sensitive fields use SecretStr to prevent accidental logging
  • Added mask_secrets() method to safely export configuration
  • Documentation on proper secret handling patterns
  • Examples of generating secure secret keys
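
As a rough illustration of the first two points, the sketch below shows how `SecretStr` keeps a value out of `repr()` output and what a `mask_secrets()` helper might look like. The class name `SecretsSketch`, the field descriptions, the default endpoint, and the body and return type of `mask_secrets()` are assumptions; the PR's actual method may differ.

```python
from pydantic import BaseModel, Field, SecretStr


class SecretsSketch(BaseModel):
    EXTRALIT_S3_ENDPOINT: str = Field(
        "http://localhost:9000", description="S3 endpoint URL (assumed default)"
    )
    EXTRALIT_S3_SECRET_KEY: SecretStr = Field(..., description="S3 secret key")

    def mask_secrets(self) -> dict:
        """Return a dict that is safe to log: SecretStr values become a mask."""
        masked = {}
        for name in type(self).model_fields:
            value = getattr(self, name)
            masked[name] = "**********" if isinstance(value, SecretStr) else value
        return masked


cfg = SecretsSketch(EXTRALIT_S3_SECRET_KEY="minio-secret")

# repr() and str() of the model do not contain the raw secret.
print(cfg)

# The hypothetical helper returns plain values with secrets masked.
print(cfg.mask_secrets())

# The raw value is available only through an explicit call.
raw_key = cfg.EXTRALIT_S3_SECRET_KEY.get_secret_value()
```

The explicit `get_secret_value()` call is the design point: code that needs the secret has to ask for it, so an accidental `print(settings)` or log statement does not leak it.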

📊 Impact

Files Changed: 11 files
Lines Added: 1,440+ lines
Lines Removed: 18 lines
Net Addition: +1,422 lines

Configuration Options Documented: 40+
Validators Added: 2 (Modal URL, S3 completeness)
Documentation Pages: 2 (Admin + Developer guides)

🧪 Testing

The refactoring maintains 100% backward compatibility:

  • ✅ All existing environment variables continue to work
  • ✅ Default values preserved from original implementation
  • ✅ No breaking changes to existing deployments
  • ✅ Added validation catches configuration errors earlier

🚦 Migration Path

For Users:

  1. Continue using existing .env files - no changes required
  2. Optional: Copy .env.example for new settings documentation
  3. Review new validation - may catch previously silent configuration issues

For Developers:

# Old pattern
import os
timeout = int(os.getenv("MARKER_MODAL_TIMEOUT_SECS", "600"))

# New pattern
from extralit_server.config import settings
timeout = settings.MARKER_MODAL_TIMEOUT_SECS  # Already typed as int!
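
To make the difference between the two patterns concrete, here is a dependency-free stand-in for a centralized settings object. It is not the PR's Pydantic class: `TimeoutSettings` and `from_env` are hypothetical names, and only the variable name and the `600` default come from the snippet above. The point it illustrates is that parsing happens once, when settings are loaded, rather than at each call site.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class TimeoutSettings:
    MARKER_MODAL_TIMEOUT_SECS: int = 600

    @classmethod
    def from_env(cls, env: Mapping[str, str]) -> "TimeoutSettings":
        # Parse and validate once; callers then read a typed attribute.
        raw = env.get("MARKER_MODAL_TIMEOUT_SECS", "600")
        try:
            return cls(MARKER_MODAL_TIMEOUT_SECS=int(raw))
        except ValueError as exc:
            raise ValueError(
                f"MARKER_MODAL_TIMEOUT_SECS must be an integer, got {raw!r}"
            ) from exc


# Unset variable: the default applies.
defaults = TimeoutSettings.from_env({})

# Valid override: the string is converted to int at load time.
custom = TimeoutSettings.from_env({"MARKER_MODAL_TIMEOUT_SECS": "900"})

# Invalid value: the error appears when settings are loaded,
# not later inside whichever function first reads the timeout.
try:
    TimeoutSettings.from_env({"MARKER_MODAL_TIMEOUT_SECS": "ten minutes"})
except ValueError as exc:
    print(exc)
```

In a real application the mapping passed to `from_env` would typically be `os.environ`; explicit dictionaries are used here so the example behaves the same on any machine.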

📚 Related Documentation

  • Configuration Guide (Admin)
  • Configuration System (Developer)
  • .env.example Template

🔍 Review Focus Areas

  1. Validation Logic - Check that validators catch edge cases correctly
  2. Documentation - Verify examples and explanations are clear
  3. Backward Compatibility - Ensure existing deployments aren't broken
  4. Type Safety - Confirm type hints are accurate

📝 Notes

  • Some files intentionally kept using os.environ (e.g., OAuth integration, OpenCV runtime config)
  • Client library (extralit/src/extralit/**) remains independent from server config
  • Pre-commit hooks applied formatting and linting

✨ Future Enhancements

Potential follow-up improvements (out of scope for this PR):

  • Settings reload capability for dynamic configuration
  • CLI command to validate configuration
  • Integration tests for settings validation
  • Settings export command for debugging

🙏 Checklist

  • Code follows project style guidelines
  • Self-review completed
  • Comments added for complex logic
  • Documentation updated
  • No breaking changes
  • All validators tested manually
  • .env.example created with all options

- Add new settings to config.py for Marker, chat validation, and user migration
- Replace os.getenv calls with settings object in multiple modules
- Improve configuration management and type safety
- Add comprehensive field documentation and validators to Settings class
- Add field-level descriptions using Pydantic Field()
- Implement validators for conditional requirements (Modal, S3)
- Add mask_secrets() method for safe configuration debugging
- Refactor marker_client.py to use centralized settings
- Create .env.example template with all configuration options
- Add comprehensive configuration documentation for admins and developers
- Include usage examples, troubleshooting, and best practices
@JonnyTran JonnyTran changed the base branch from develop to marker-deploy October 7, 2025 05:11
@JonnyTran JonnyTran marked this pull request as ready for review October 7, 2025 05:12
@JonnyTran JonnyTran requested review from a team as code owners October 7, 2025 05:12
@JonnyTran
Member

extralit-server/src/extralit_server/config.py added too much duplicated code that was unneeded

@JonnyTran JonnyTran closed this Oct 7, 2025
@priyankeshh priyankeshh deleted the refactor/pydantic-settings branch October 8, 2025 21:34
