BMECatExplorer is an end‑to‑end, memory‑light pipeline to ingest and explore large BMECat 1.2 product catalogs. It stream‑converts XML to JSONL, imports into PostgreSQL, indexes to OpenSearch (BM25 and optional vector embeddings), and exposes a FastAPI backend plus a small HTMX/Tailwind UI.
Pipeline
- Stream‑convert large BMECat XML → JSONL (`main.py`)
- Import JSONL into PostgreSQL (`src/db`)
- Index products into OpenSearch with optional OpenAI embeddings (`src/search`)
- Serve search + hybrid RAG‑friendly endpoints via FastAPI (`src/api`)
- Browse/export results in the frontend (`frontend/`)
Highlights
- Streaming XML converter (`iterparse` + `clear()`) stays O(1) memory
- Faceted BM25 search + autocomplete
- Hybrid BM25 + vector search with RRF fusion (`POST /api/v1/search/hybrid`)
- Multi‑catalog namespaces via `catalog_id` (composite uniqueness in DB + index IDs)
- Normalized unit price (`price_unit_amount`) for correct price filters
- Optional embeddings (`OPENAI_API_KEY`) and provenance fields for RAG
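The streaming idea behind the converter (`iterparse` + `clear()`) can be sketched as follows. This is a minimal illustration, not the code in `main.py`: the function name `stream_to_jsonl`, the `ARTICLE` record tag, and the flat leaf-to-field mapping are assumptions made for the example.

```python
import io
import json
import xml.etree.ElementTree as ET


def stream_to_jsonl(xml_source, jsonl_sink, record_tag="ARTICLE"):
    """Write one JSON line per record element; return the number written.

    `record_tag` is an assumption for this sketch; the real converter may
    key on a different element and keep nested structure.
    """
    count = 0
    # "end" events fire once an element and all its descendants are parsed.
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag != record_tag:
            continue
        # Flatten leaf descendants into {tag: text}. Repeated tags overwrite
        # each other, which a real converter would have to handle.
        record = {
            leaf.tag: leaf.text.strip()
            for leaf in elem.iter()
            if len(leaf) == 0 and leaf.text and leaf.text.strip()
        }
        jsonl_sink.write(json.dumps(record, ensure_ascii=False) + "\n")
        count += 1
        # Drop the record's children, text and attributes so the parsed
        # subtree can be garbage-collected; only an empty shell remains.
        elem.clear()
    return count


sample = (
    b"<BMECAT><T_NEW_CATALOG>"
    b"<ARTICLE><SUPPLIER_AID>A1</SUPPLIER_AID>"
    b"<ARTICLE_DETAILS><DESCRIPTION_SHORT>Kabel</DESCRIPTION_SHORT></ARTICLE_DETAILS>"
    b"</ARTICLE>"
    b"<ARTICLE><SUPPLIER_AID>A2</SUPPLIER_AID></ARTICLE>"
    b"</T_NEW_CATALOG></BMECAT>"
)
out = io.StringIO()
written = stream_to_jsonl(io.BytesIO(sample), out)
```

Because each record is written and cleared before the next one is parsed, peak memory depends on the largest single record rather than on the size of the catalog file.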

    uv sync
    just up
    # Convert → import → index (safe to rerun; replaces the "default" catalog)
    just pipeline data/BME-cat_eClass_8.xml
    just serve
    just serve-frontend

- API docs: http://localhost:9019/docs
- Frontend: http://localhost:9018

You can also run the steps manually:

    just convert data/BME-cat_eClass_8.xml data/products.jsonl
    just import data/products.jsonl --replace-catalog
    just index

BMECat prices can refer to bundles: `PRICE_AMOUNT` applies to `PRICE_QUANTITY` units (often 100). The backend computes a normalized unit price:

    price_unit_amount = price_amount / price_quantity

- UI and API show both unit price and raw amount.
- `price_min`, `price_max`, and `price_band` filters operate on unit price.
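The unit price normalization can be illustrated with a short sketch. The helper name `unit_price` and the fallback to a quantity of 1 when `PRICE_QUANTITY` is missing or non-positive are assumptions for this example; the backend may handle those cases differently.

```python
from decimal import Decimal


def unit_price(price_amount, price_quantity=None):
    """Return price_amount / price_quantity as a Decimal.

    Assumption: a missing, zero or negative quantity is treated as 1.
    """
    amount = Decimal(str(price_amount))
    quantity = Decimal(str(price_quantity)) if price_quantity else Decimal(1)
    if quantity <= 0:
        quantity = Decimal(1)
    return amount / quantity


# A bundle price of 250.00 for 100 units is 2.50 per unit.
bundle = unit_price("250.00", 100)
single = unit_price("9.99")
```

Filtering on the raw amount would place the bundle above in a 200‑1000 band even though each unit costs 2.50, which is why the price filters operate on the normalized value.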

To keep multiple XML sources in one DB/index without ID collisions:

    just up
    just pipeline-catalog data/catalog_a.xml catalog_a
    just pipeline-catalog data/catalog_b.xml catalog_b

Search can be scoped with `catalog_id=catalog_a` (repeatable).

Upgrade note: OpenSearch document IDs are `catalog_id:supplier_aid`. If you have an existing index from an older version that used `supplier_aid` as `_id`, recreate and reindex (e.g., `just index`) to avoid duplicates.
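
To see why the composite ID avoids collisions, consider two catalogs that both contain a product with the same `supplier_aid`. The helper `document_id` below is hypothetical and only mirrors the `catalog_id:supplier_aid` format described above.

```python
def document_id(catalog_id, supplier_aid):
    """Build an index document ID in the catalog_id:supplier_aid format."""
    return f"{catalog_id}:{supplier_aid}"


# The same supplier_aid in two catalogs yields two distinct documents.
id_a = document_id("catalog_a", "4711")
id_b = document_id("catalog_b", "4711")
```

With the older scheme, both products would have been written to the `_id` `4711`, so the second import would overwrite the first.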

| Endpoint | Description |
|---|---|
| `GET /api/v1/search` | BM25 search with filters and facets |
| `GET /api/v1/search/autocomplete?q=` | Prefix suggestions |
| `GET /api/v1/products/{supplier_aid}` | Fetch a single product (use `?catalog_id=` if needed) |
| `GET /api/v1/facets` | Facet counts for UI |
| `POST /api/v1/search/hybrid` | BM25 / vector / hybrid RRF search |
| `POST /api/v1/search/batch` | Batch hybrid queries |
| `GET /api/v1/catalogs` | List available catalogs |
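
Reciprocal rank fusion (RRF), which the hybrid endpoint uses to combine BM25 and vector results, can be sketched in a few lines. This is a generic illustration of the technique, not the project's implementation: the function name `rrf_fuse`, the constant `k=60` (a commonly used default), and the alphabetical tie-break are assumptions.

```python
from collections import defaultdict


def rrf_fuse(rankings, k=60):
    """Fuse ranked lists of document IDs with reciprocal rank fusion.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID (an assumption).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]
vector_hits = ["b", "d", "a"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

RRF only needs ranks, not raw scores, so BM25 scores and vector similarities never have to be put on a common scale.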
- `q` – Full‑text query (descriptions, manufacturer, IDs)
- `manufacturer` – Manufacturer name filter (repeatable)
- `eclass_id` – Exact ECLASS ID filter (repeatable)
- `eclass_segment` – ECLASS segment/2‑digit prefix filter (repeatable)
- `order_unit` – Order unit filter (repeatable)
- `price_min` / `price_max` – Unit price range filter
- `price_band` – Predefined unit price bands (0‑10, 10‑50, 50‑200, 200‑1000, 1000+)
- `catalog_id` – Catalog namespace filter (repeatable)
- `exact_match` – Exact matches for EAN/IDs
- `page` / `size` – Pagination
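
Repeatable parameters are passed by repeating the key in the query string. A small sketch of building such a URL from Python follows; the helper `build_search_url` and the manufacturer name `ACME` are hypothetical, and only the endpoint path and parameter names come from the list above.

```python
from urllib.parse import urlencode


def build_search_url(base_url, **filters):
    """Build a /api/v1/search URL; list values become repeated parameters."""
    params = []
    for name, value in filters.items():
        if value is None:
            continue  # skip filters that were not set
        values = value if isinstance(value, (list, tuple)) else [value]
        params.extend((name, str(v)) for v in values)
    return f"{base_url}/api/v1/search?{urlencode(params)}"


url = build_search_url(
    "http://localhost:9019",
    q="Kabel",
    manufacturer=["Walraven GmbH", "ACME"],
    price_min=None,
    size=10,
)
```

`urlencode` encodes the space in `Walraven GmbH` as `+`, which is equivalent to `%20` inside a query string.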

Example:

    curl "http://localhost:9019/api/v1/search?q=Kabel&manufacturer=Walraven%20GmbH&catalog_id=default&size=10"

Run `just --list` for all tasks. Common ones:

| Command | Description |
|---|---|
| `just up` / `just down` | Start/stop PostgreSQL and OpenSearch |
| `just convert <in.xml> <out.jsonl>` | XML → JSONL |
| `just convert-with-header <in.xml> <out.jsonl> <header.json>` | Convert and save header |
| `just import <file.jsonl> [--catalog-id <id>] [--source-file <xml>] [--replace-catalog]` | Load JSONL into PostgreSQL |
| `just index` / `just index-embed` | Index DB rows to OpenSearch (embeddings optional) |
| `just index-catalog <catalog_id> <source.xml>` | Append a catalog to existing index |
| `just pipeline <xml>` | Convert → import → index (replaces default catalog) |
| `just pipeline-catalog <xml> <catalog_id>` | Pipeline under a catalog namespace |
| `just serve` / `just serve-frontend` | Run backend / frontend |
| `just test-unit` / `just test-integration` | Run unit / integration tests |
| `just lint` / `just format` | Ruff / Black |

Backend env vars (via `.env` or shell) follow `src/config.py`. Key ones:

| Variable | Default | Notes |
|---|---|---|
| `POSTGRES_*` | from `docker-compose.yml` | DB connection |
| `OPENSEARCH_*` | from `docker-compose.yml` | OpenSearch connection |
| `OPENAI_API_KEY` | unset | Required for `index-embed` and server‑side vector fallback |
| `OPENAI_EMBEDDING_MODEL` | `text-embedding-3-small` | |
| `OPENAI_EMBEDDING_DIMENSIONS` | `1536` | Must match index mapping |

Frontend uses `FRONTEND_API_BASE_URL` and related settings (see `frontend/config.py`).
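
The note that the embedding dimensions must match the index mapping can be made concrete with a small check. This sketch reads the variables with plain `os.environ` lookups and is not how `src/config.py` loads settings; `embedding_settings` and `check_vector` are hypothetical helpers.

```python
import os


def embedding_settings(env=None):
    """Read embedding settings, falling back to the documented defaults."""
    env = os.environ if env is None else env
    return {
        "model": env.get("OPENAI_EMBEDDING_MODEL", "text-embedding-3-small"),
        "dimensions": int(env.get("OPENAI_EMBEDDING_DIMENSIONS", "1536")),
        # Embeddings are optional: without a key, only BM25 is available.
        "enabled": bool(env.get("OPENAI_API_KEY")),
    }


def check_vector(vector, settings):
    """Raise if a vector's length differs from the configured dimensions."""
    if len(vector) != settings["dimensions"]:
        raise ValueError(
            f"expected {settings['dimensions']} dimensions, got {len(vector)}"
        )
    return vector


defaults = embedding_settings({})
```

Running such a check before indexing surfaces a dimension mismatch early, instead of leaving it to the search backend to reject the document.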

For production or long‑lived databases, prefer Alembic migrations over runtime `create_all`:

    uv run alembic upgrade head

Project structure:

    ├── main.py              # XML → JSONL converter
    ├── alembic/             # DB migrations
    ├── justfile             # Task runner commands
    ├── docker-compose.yml   # PostgreSQL + OpenSearch
    ├── src/
    │   ├── config.py        # Settings
    │   ├── db/              # SQLAlchemy models + importer
    │   ├── search/          # OpenSearch mapping/client/indexer
    │   └── api/             # FastAPI app + routes
    ├── frontend/            # HTMX/Tailwind web UI
    └── tests/               # unit/, integration/, smoke/