Stop arguing about file formats. This MCP server ingests any data file and makes it queryable instantly.
Upload CSV, JSON, Excel, Parquet, or Avro: the system auto-detects the format, routes it to the right parser, and loads it into a query engine.
No configuration. No format wars. No custom loaders.
Every data and AI team wastes time on:
- ❌ Debating which format to use (CSV vs Parquet vs JSON)
- ❌ Writing custom parsers for every data source
- ❌ "We don't support that format" tickets blocking progress
- ❌ Ad-hoc scripts breaking in production
- ❌ Friction between business users and engineers
This MCP server eliminates all of that.
- ✅ Auto-format detection - Upload any file, system figures out the format
- ✅ Smart routing - Polars for speed, Pandas for compatibility
- ✅ Instant SQL queries - DuckDB integration, query uploaded data immediately
- ✅ Zero configuration - Works out of the box
- ✅ REST API - Easy integration with any tool or LLM
- ✅ Production-ready - Health checks, error handling, proper logging
| Format | Extensions | Status |
|---|---|---|
| CSV | .csv, .tsv, .txt | ✅ Full support |
| JSON | .json, .jsonl | ✅ Full support |
| Excel | .xlsx, .xls | ✅ Full support |
| Parquet | .parquet | ✅ Full support |
| Avro | .avro | ✅ Full support |
- Docker installed (Get Docker)
- Docker Compose installed (included with Docker Desktop)
- That's it. No Python setup needed.
```bash
# 1. Navigate to project directory
cd mcp-data-server

# 2. Start the server
docker-compose up --build

# 3. Wait for this message:
# "Application startup complete"
```

The server is now running at: http://localhost:8000
Create a sample CSV:

```bash
cat > sample.csv << EOF
id,name,department,salary
1,Alice,Engineering,95000
2,Bob,Marketing,75000
3,Charlie,Engineering,105000
4,Diana,Sales,85000
EOF
```

Upload it:

```bash
curl -X POST "http://localhost:8000/upload" \
  -F "file=@sample.csv"
```

Response you'll get:
```json
{
  "success": true,
  "filename": "sample.csv",
  "data": {
    "table_name": "sample",
    "rows": 4,
    "columns": 4,
    "column_names": ["id", "name", "department", "salary"],
    "format": "csv",
    "preview": [
      {"id": 1, "name": "Alice", "department": "Engineering", "salary": 95000},
      {"id": 2, "name": "Bob", "department": "Marketing", "salary": 75000}
    ]
  }
}
```

Simple filter:
```bash
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT * FROM sample WHERE department = '\''Engineering'\''"}'
```

Aggregation:

```bash
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT department, AVG(salary) as avg_salary, COUNT(*) as count FROM sample GROUP BY department"}'
```

Response:
```json
{
  "success": true,
  "rows": 3,
  "data": [
    {"department": "Engineering", "avg_salary": 100000, "count": 2},
    {"department": "Marketing", "avg_salary": 75000, "count": 1},
    {"department": "Sales", "avg_salary": 85000, "count": 1}
  ]
}
```

```bash
curl http://localhost:8000/tables
```

Returns all uploaded datasets currently loaded.

```bash
curl http://localhost:8000/health
```

The project includes a test suite:
```bash
# Make test script executable
chmod +x test.sh

# Run all tests
./test.sh
```

What it tests:
- ✅ Server health check
- ✅ CSV upload and query
- ✅ JSON upload and query
- ✅ Filtering queries
- ✅ Aggregation queries
- ✅ Table listing
```bash
# Upload your Excel file
curl -X POST "http://localhost:8000/upload" \
  -F "file=@quarterly_report.xlsx"

# Query it immediately (no conversion needed!)
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT quarter, SUM(revenue) as total_revenue FROM quarterly_report GROUP BY quarter"}'
```

```bash
# Create JSON data (e.g., from an API response)
cat > products.json << EOF
[
  {"id": 1, "name": "Laptop", "price": 1200, "stock": 45},
  {"id": 2, "name": "Mouse", "price": 25, "stock": 150},
  {"id": 3, "name": "Keyboard", "price": 75, "stock": 89}
]
EOF

# Upload
curl -X POST "http://localhost:8000/upload" -F "file=@products.json"

# Find low-price items
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT name, price FROM products WHERE price < 100 ORDER BY price DESC"}'
```

```bash
# Upload customers data
curl -X POST "http://localhost:8000/upload" -F "file=@customers.csv"

# Upload orders data
curl -X POST "http://localhost:8000/upload" -F "file=@orders.csv"

# Join across datasets
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT c.name, COUNT(o.id) as order_count, SUM(o.amount) as total_spent FROM customers c LEFT JOIN orders o ON c.id = o.customer_id GROUP BY c.name ORDER BY total_spent DESC"}'
```

```bash
# Upload Parquet file (common in data engineering)
curl -X POST "http://localhost:8000/upload" \
  -F "file=@large_dataset.parquet"

# Query instantly - no conversion needed
curl -X POST "http://localhost:8000/query" \
  -H "Content-Type: application/json" \
  -d '{"sql": "SELECT category, COUNT(*) FROM large_dataset GROUP BY category LIMIT 10"}'
```

```python
import requests

def upload_file(filepath: str):
    with open(filepath, 'rb') as f:
        response = requests.post(
            'http://localhost:8000/upload',
            files={'file': f}
        )
    return response.json()

def query_data(sql: str):
    response = requests.post(
        'http://localhost:8000/query',
        json={'sql': sql}
    )
    return response.json()

# Use it
upload_file('sales_data.csv')
result = query_data("SELECT product, SUM(revenue) FROM sales_data GROUP BY product")
print(result)
```

```python
from langchain.tools import Tool
import requests

def query_database(sql_query: str) -> dict:
    response = requests.post(
        "http://localhost:8000/query",
        json={"sql": sql_query}
    )
    return response.json()

# Create tool for your agent
data_query_tool = Tool(
    name="QueryData",
    func=query_database,
    description="Query uploaded datasets using SQL. Input should be a valid SQL query string."
)

# Add to your agent's tools
# Now your LLM can query any uploaded dataset!
```

```python
import openai
import requests

functions = [
    {
        "name": "query_database",
        "description": "Execute SQL query on uploaded data files",
        "parameters": {
            "type": "object",
            "properties": {
                "sql": {
                    "type": "string",
                    "description": "SQL query to execute on the data"
                }
            },
            "required": ["sql"]
        }
    }
]

def execute_query(sql: str):
    return requests.post(
        "http://localhost:8000/query",
        json={"sql": sql}
    ).json()

# Your AI can now generate and execute queries on any uploaded data
```

Once the server is running, explore the API interactively:
- Swagger UI: http://localhost:8000/docs
- ReDoc: http://localhost:8000/redoc
Test endpoints directly in your browser!
Edit docker-compose.yml:

```yaml
ports:
  - "8001:8000"  # Change 8001 to your desired port
```

Then restart:

```bash
docker-compose down
docker-compose up --build
```

By default, data is stored in memory. To persist data across restarts:

Edit server.py and change this line:

```python
conn = duckdb.connect(':memory:')
```

To:

```python
conn = duckdb.connect('/app/data/database.db')
```

Restart the server. Your data will now persist!
Edit server.py, adding to the upload_file function:

```python
@app.post("/upload")
async def upload_file(file: UploadFile = File(...)):
    MAX_FILE_SIZE = 100 * 1024 * 1024  # 100MB
    content = await file.read()
    if len(content) > MAX_FILE_SIZE:
        raise HTTPException(status_code=413, detail="File too large")
    # ... rest of existing code
```

For production use, add API key authentication:

```python
from fastapi import Depends, Header, HTTPException

API_KEY = "your-secret-key"

async def verify_api_key(x_api_key: str = Header(...)):
    if x_api_key != API_KEY:
        raise HTTPException(status_code=403, detail="Invalid API key")

# Add to endpoints
@app.post("/upload", dependencies=[Depends(verify_api_key)])
async def upload_file(file: UploadFile = File(...)):
    # ... existing code
```

```
┌──────────────────┐
│   Upload File    │
└────────┬─────────┘
         │
         ▼
┌─────────────────────────┐
│   Auto-Detect Format    │ ← CSV, JSON, Excel, Parquet, Avro
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│    Route to Parser      │
├─────────────────────────┤
│  Try: Polars (fast)     │ ← 5-10x faster than Pandas
│  Fallback: Pandas       │ ← Handles edge cases
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│   Load into DuckDB      │ ← In-memory SQL engine
└────────┬────────────────┘
         │
         ▼
┌─────────────────────────┐
│   Query via REST API    │ ← SQL queries, instant results
└─────────────────────────┘
```
Key Design Decisions:
- Polars First: 5-10x faster for most operations
- Pandas Fallback: Broader compatibility for edge cases
- DuckDB: Fast analytical queries without database setup
- In-Memory: Zero configuration, instant queries
- REST API: Easy integration with any language or tool
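The Polars-first design boils down to a try-in-order pattern: attempt the fast parser, fall back to the compatible one. A generic sketch of that pattern (the stub loaders below stand in for the real Polars/Pandas readers, e.g. `pl.read_csv` and `pd.read_csv`; none of this is the server's actual code):

```python
def load_with_fallback(path, loaders):
    """Try each (name, loader) pair in order; return the first success.

    `loaders` stands in for e.g. [("polars", pl.read_csv), ("pandas", pd.read_csv)].
    """
    errors = []
    for name, loader in loaders:
        try:
            return name, loader(path)
        except Exception as exc:
            errors.append(f"{name}: {exc}")
    raise RuntimeError("All loaders failed: " + "; ".join(errors))

# Stub loaders that demonstrate the fallback behaviour:
def fast_loader(path):
    raise ValueError("edge case this parser can't handle")

def compat_loader(path):
    return {"path": path, "rows": 3}

engine, table = load_with_fallback(
    "sample.csv", [("polars", fast_loader), ("pandas", compat_loader)]
)
print(engine)  # pandas
```

The same shape works for any ordered chain of parsers, which is why adding a new format mostly means appending another loader.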
Error: Bind for 0.0.0.0:8000 failed: port is already allocated

Solution:

```bash
# Check what's using port 8000
lsof -i :8000

# Kill the process or change port in docker-compose.yml
```

Error: Permission denied when accessing uploads directory

Solution:

```bash
chmod -R 777 uploads/
```

Error: Container fails to start

Solution:

```bash
# Check logs for detailed error
docker-compose logs

# Rebuild from scratch
docker-compose down
docker-compose up --build --force-recreate
```

Error: Container crashes with large files

Solutions:
- Increase Docker memory: Docker Desktop → Settings → Resources → Memory
- Use persistent storage instead of in-memory (see Configuration section)
- Process large files in chunks

Error: Tests fail immediately after startup

Solution: The server takes time to start. Wait longer:

```bash
sleep 10 && ./test.sh
```

Error: Upload fails for an unsupported format

Solution: Check that the format is supported. For unsupported formats:
- Convert to CSV/JSON first
- Open an issue on GitHub for format support request
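For the "process large files in chunks" option, one client-side approach is to split a big CSV into smaller pieces before uploading each one. A sketch using only the standard library (the chunk size, the `_partN` naming, and the `split_csv` helper are arbitrary choices, not part of the server):

```python
import csv
from pathlib import Path

def split_csv(path: str, rows_per_chunk: int = 50_000) -> list[str]:
    """Split a CSV into chunk files, repeating the header in each one."""
    src = Path(path)
    chunk_paths = []
    with src.open(newline="") as f:
        reader = csv.reader(f)
        header = next(reader)
        chunk, index = [], 0
        for row in reader:
            chunk.append(row)
            if len(chunk) == rows_per_chunk:
                chunk_paths.append(_write_chunk(src, index, header, chunk))
                chunk, index = [], index + 1
        if chunk:  # flush the final, possibly short, chunk
            chunk_paths.append(_write_chunk(src, index, header, chunk))
    return chunk_paths

def _write_chunk(src: Path, index: int, header, rows) -> str:
    out = src.with_name(f"{src.stem}_part{index}.csv")
    with out.open("w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(header)
        writer.writerows(rows)
    return str(out)
```

Each chunk can then be POSTed to /upload as its own table and recombined with a SQL UNION ALL.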
```
mcp-data-server/
├── Dockerfile           # Container definition
├── docker-compose.yml   # Service orchestration
├── requirements.txt     # Python dependencies
├── server.py            # Main application
├── test.sh              # Automated tests
├── .gitignore           # Git exclusions
├── README.md            # This file
├── uploads/             # (created on first run)
└── data/                # (created on first run)
```
Dockerfile
Defines the Python 3.11 environment with all system dependencies.
docker-compose.yml
Single-service setup with health checks and volume mounts for data persistence.
requirements.txt
All Python dependencies with pinned versions:
- FastAPI: REST API framework
- Polars: Fast data processing
- Pandas: Data compatibility layer
- DuckDB: In-memory SQL engine
- Format libraries: openpyxl, pyarrow, xlrd
server.py
Main application with:
- Format auto-detection logic
- Smart routing (Polars → Pandas)
- DuckDB integration
- REST endpoints
test.sh
Automated test suite covering all features.
Before deploying to production:
- Authentication: Add API key or OAuth
- Rate Limiting: Prevent abuse
- Persistent Storage: Use DuckDB file storage
- File Validation: Check file types and sizes
- Monitoring: Add logging and metrics
- CORS: Configure for web clients
- SSL/TLS: Use HTTPS
- Backups: Regular data backups
- Load Balancing: For high traffic
- Environment Variables: For secrets management
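For the environment-variables item, the hard-coded API_KEY from the authentication snippet can instead be read at startup. A minimal sketch; the variable name MCP_API_KEY and the `load_api_key` helper are examples, not part of the project:

```python
import os

def load_api_key(var: str = "MCP_API_KEY") -> str:
    """Read the API key from the environment; fail fast if it's missing."""
    key = os.environ.get(var)
    if not key:
        raise RuntimeError(f"{var} is not set; refusing to start without a key")
    return key
```

Failing fast at startup beats discovering a missing secret on the first request.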
- Analysts upload their own data files
- Query without waiting for engineering
- No format conversion needed
- Upload datasets once
- LLM generates SQL queries
- Conversational data exploration
- Upload test data in any format
- Validate transformations
- Quick iteration
- Experiment with different data sources
- No database setup required
- Instant feedback
- Upload production exports
- Run validation queries
- Identify issues quickly
Found a bug? Want a feature?
- Check existing issues on GitHub
- Fork the repository
- Create a feature branch
- Make your changes
- Add tests if applicable
- Submit a pull request
MIT License - Use it, modify it, ship it to production.
See LICENSE file for full details.
- Issues: GitHub Issues
- API Docs: http://localhost:8000/docs (when running)
- Examples: See test.sh for working code samples
This MCP server is a foundation. Extend it:
- Add streaming data support
- Connect to cloud storage (S3, GCS, Azure Blob)
- Build a web UI for non-technical users
- Add more data sources (APIs, databases)
- Integrate with data warehouses
- Add data transformation capabilities
- Support for more file formats
- Implement caching for repeated queries
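As a starting point for the query-caching idea, a dictionary keyed by the SQL string is enough, provided it is invalidated on every upload so stale results are never served. A sketch (the `run_query` callable is a placeholder for the real DuckDB execution, not an existing API):

```python
class QueryCache:
    """Memoize query results by SQL text; clear whenever the data changes."""

    def __init__(self, run_query):
        self._run_query = run_query  # placeholder for the real DuckDB call
        self._cache = {}
        self.hits = 0

    def execute(self, sql: str):
        if sql in self._cache:
            self.hits += 1
        else:
            self._cache[sql] = self._run_query(sql)
        return self._cache[sql]

    def invalidate(self):
        """Call after every upload so repeated queries see fresh data."""
        self._cache.clear()

# Demonstrate with a stub query runner that records each real execution:
calls = []
cache = QueryCache(lambda sql: calls.append(sql) or [("Engineering", 2)])
cache.execute("SELECT 1")
cache.execute("SELECT 1")
print(len(calls), cache.hits)  # 1 1
```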