From 733f1a1be39b77fbbca60d790dc242af0e154be9 Mon Sep 17 00:00:00 2001 From: Rajiv Shah Date: Tue, 9 Sep 2025 07:54:31 -0400 Subject: [PATCH] adding metadata notebook --- 15-metadata-intro/README.md | 52 + 15-metadata-intro/data/KB_Template_DE.md | 3 + 15-metadata-intro/data/KB_Template_EU.md | 3 + 15-metadata-intro/data/KB_Template_US.md | 3 + 15-metadata-intro/data/POL_EU-v3.md | 3 + 15-metadata-intro/data/POL_US-v2.md | 3 + 15-metadata-intro/metadata_intro.ipynb | 1922 ++++++++++++++++++++++ README.md | 1 + 8 files changed, 1990 insertions(+) create mode 100644 15-metadata-intro/README.md create mode 100644 15-metadata-intro/data/KB_Template_DE.md create mode 100644 15-metadata-intro/data/KB_Template_EU.md create mode 100644 15-metadata-intro/data/KB_Template_US.md create mode 100644 15-metadata-intro/data/POL_EU-v3.md create mode 100644 15-metadata-intro/data/POL_US-v2.md create mode 100644 15-metadata-intro/metadata_intro.ipynb diff --git a/15-metadata-intro/README.md b/15-metadata-intro/README.md new file mode 100644 index 0000000..1d37ecb --- /dev/null +++ b/15-metadata-intro/README.md @@ -0,0 +1,52 @@ +# Metadata Management in Contextual AI + +A comprehensive notebook demonstrating how to add, configure, and utilize metadata within the Contextual AI Platform for enhanced document retrieval and filtering. Learn to organize documents with rich metadata, apply precise filters, and leverage metadata in reranking and generation. + +## 📋 Overview + +This example showcases how to implement advanced metadata management for RAG systems that can: + +1. **Configure Metadata at Ingest Time** with flexible field settings and configurations +2. **Update Metadata Post-Processing** for dynamic document organization +3. **Apply Precise Document Filters** using multiple operators and complex logic +4. **Enhance Retrieval Quality** through metadata-based filtering and reranking +5. 
**Leverage Metadata in Generation** for contextually aware responses + +## 🗂️ Project Structure + +``` +📁 Metadata Management/ +├── 📁 data/ # Sample policy documents +│ ├── 📄 POL_EU-v3.md # EU refund policy document +│ ├── 📄 POL_US-v2.md # US refund policy document +│ ├── 📄 KB_Template_US.md # US customer email template +│ ├── 📄 KB_Template_EU.md # EU customer email template +│ └── 📄 KB_Template_DE.md # German customer template +├── 📓 metadata_intro.ipynb # Main metadata notebook +└── 📄 README.md # This file +``` + +## 🚀 Quick Start + +### Prerequisites +- **API Key:** Contextual AI API key from your workspace dashboard +- **Python Environment:** Google Colab or Jupyter with internet access +- **Python Client:** Version 0.8.0 or higher required + +### Run on Google Colab +[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/15-metadata-intro/metadata_intro.ipynb) + + +## 📚 Related Examples + +- 🔗 **RAG Agent Monitoring**: [14-monitoring](../14-monitoring/) +- 🔗 **Retrieval Analysis**: [11-retrieval-analysis](../11-retrieval-analysis/) +- 🔗 **Policy Change Management**: [05-policy-changes](../05-policy-changes/) +- 🔗 **Agent Performance**: [06-improve-agent-performance](../06-improve-agent-performance/) + +## 📖 Additional Resources + +- **Contextual AI Documentation**: [docs.contextual.ai](https://docs.contextual.ai/) +- **Metadata API Reference**: [API Documentation](https://docs.contextual.ai/api-reference/datastores-documents/get-document-metadata) +- **Best Practices Guide**: [Metadata Management](https://docs.contextual.ai/user-guides/beginner-guide) + diff --git a/15-metadata-intro/data/KB_Template_DE.md b/15-metadata-intro/data/KB_Template_DE.md new file mode 100644 index 0000000..c4250e2 --- /dev/null +++ b/15-metadata-intro/data/KB_Template_DE.md @@ -0,0 +1,3 @@ +# 
KB — Customer email template (DE) + +Localized template for customer emails for Germany / Deutschland. Sehr geehrte Kundin, sehr geehrter Kunde, vielen Dank für Ihren Einkauf bei uns. Für Produkte, die in Deutschland erworben wurden, gilt eine gesetzliche Gewährleistung von 24 Monaten. Darüber hinaus können Sie innerhalb von 30 Tagen nach Lieferung eine Rückerstattung beantragen. \ No newline at end of file diff --git a/15-metadata-intro/data/KB_Template_EU.md b/15-metadata-intro/data/KB_Template_EU.md new file mode 100644 index 0000000..7ddb927 --- /dev/null +++ b/15-metadata-intro/data/KB_Template_EU.md @@ -0,0 +1,3 @@ +# KB — Customer email template (EU) + +Localized template for customer emails for European Union. Subject: Your refund and warranty information Dear Customer, Thank you for your purchase. For products purchased in the European Union: You may request a refund within 30 days of delivery. Digital goods are refundable only if they have not been used or downloaded. Warranty coverage is provided according to EU consumer law. \ No newline at end of file diff --git a/15-metadata-intro/data/KB_Template_US.md b/15-metadata-intro/data/KB_Template_US.md new file mode 100644 index 0000000..adc899e --- /dev/null +++ b/15-metadata-intro/data/KB_Template_US.md @@ -0,0 +1,3 @@ +# KB — Customer email template (US) + +Localized template for customer emails for United States Dear Customer, Thank you for shopping with us. For products purchased in the United States: Refunds are available for 30 days on physical goods. Refunds for digital goods are limited to 14 days and only if the product has not been activated. Subscriptions are generally non-refundable once started. 
\ No newline at end of file diff --git a/15-metadata-intro/data/POL_EU-v3.md b/15-metadata-intro/data/POL_EU-v3.md new file mode 100644 index 0000000..03c753b --- /dev/null +++ b/15-metadata-intro/data/POL_EU-v3.md @@ -0,0 +1,3 @@ +# Refund Policy EU v3 + +EU customers: refunds within 30 days. Digital goods only if unused. \ No newline at end of file diff --git a/15-metadata-intro/data/POL_US-v2.md b/15-metadata-intro/data/POL_US-v2.md new file mode 100644 index 0000000..cff94cb --- /dev/null +++ b/15-metadata-intro/data/POL_US-v2.md @@ -0,0 +1,3 @@ +# Refund Policy US v2 + +US customers: refunds within 14 days. Subscriptions non-refundable. \ No newline at end of file diff --git a/15-metadata-intro/metadata_intro.ipynb b/15-metadata-intro/metadata_intro.ipynb new file mode 100644 index 0000000..2c890a8 --- /dev/null +++ b/15-metadata-intro/metadata_intro.ipynb @@ -0,0 +1,1922 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": { + "id": "8iPJnDH0HhXB" + }, + "source": [ + "\"Image\n", + "\n", + "
\n", + "\n", + "# Metadata Management in Contextual AI\n", + "\n", + "This notebook provides a comprehensive guide to adding and utilizing metadata within the Contextual AI Platform for enhanced document retrieval and filtering. \n", + "\n", + "The notebook walks through manually adding metadata into the datastore. With this additional metadata, it is possible to have more control over document organization, retrieval filtering, and response generation. \n", + "\n", + "For more background on metadata management in the platform, see also our [documentation](https://docs.contextual.ai/api-reference/datastores-documents/get-document-metadata).\n", + "\n", + " \n", + " \n", + "## Table of Contents\n", + "\n", + "1. **Environment Setup** - API configuration and dependencies\n", + "2. **Datastore and Agent Initialization** - Creating datastores and configuring agents\n", + "3. **Metadata Configuration** - Setting metadata at ingest time and post-processing\n", + "4. **Retrieval with Metadata Filters** - Using metadata for precise document filtering\n", + "5. **Advanced Filtering Operations** - Complex queries with multiple operators\n", + "6. **Generation using Metadata** - Leveraging metadata in reranking and generation\n", + "\n", + "[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/ContextualAI/examples/blob/main/15-metadata-intro/metadata_intro.ipynb)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "czJ6eN8OHxhj" + }, + "source": [ + "## 1. Environment Setup\n", + "\n", + "### Prerequisites\n", + "\n", + "- **API Key**: Obtain from your Contextual AI workspace dashboard\n", + "- **Python Client**: Required version 0.8.0 or higher\n", + "- **Sample Data**: Policy documents and knowledge base templates\n", + "\n", + "**Security Note**: Store API keys in environment variables or secure key management systems." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 10338, + "status": "ok", + "timestamp": 1756996852453, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "PZpdSLxOiuaa", + "outputId": "5ba681b7-09ee-4356-a03e-2ad40a14b8f2" + }, + "outputs": [], + "source": [ + "!pip install \"contextual-client>=0.8.0\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 1818, + "status": "ok", + "timestamp": 1756996854247, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "cvR_1XM8EULJ" + }, + "outputs": [], + "source": [ + "# Import required libraries\n", + "import os\n", + "import requests\n", + "import json\n", + "from pathlib import Path\n", + "from typing import List, Optional, Dict\n", + "from IPython.display import display, JSON\n", + "import pandas as pd\n", + "from contextual import ContextualAI" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 510, + "status": "ok", + "timestamp": 1756996854760, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "4wJG66VTIQvO" + }, + "outputs": [], + "source": [ + "# Initialize Contextual AI client\n", + "# You can store the API key as an environment variable:\n", + "#os.environ[\"CONTEXTUAL_API_KEY\"] = \"key-YSVU\"\n", + "\n", + "client = ContextualAI(\n", + " api_key= os.environ[\"CONTEXTUAL_API_KEY\"]\n", + ")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "EM9SBlrcy3pn" + }, + "source": [ + "### Data Preparation\n", + "\n", + "Download sample policy documents and knowledge base templates for demonstration purposes." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 23, + "status": "ok", + "timestamp": 1756996904818, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "LgpwfHWsy3pn" + }, + "outputs": [], + "source": [ + "def fetch_file(filepath):\n", + " os.makedirs(os.path.dirname(filepath), exist_ok=True) if '/' in filepath else None\n", + " if not os.path.exists(filepath):\n", + " print(f\"Fetching {filepath}\")\n", + " response = requests.get(f\"https://raw.githubusercontent.com/ContextualAI/examples/main/15-metadata-intro/{filepath}\")\n", + " if response.ok:\n", + " with open(filepath, 'wb') as f:\n", + " f.write(response.content)\n", + " print(f\"Saved {filepath}\")\n", + " else:\n", + " print(f\"Failed to fetch {filepath}\")\n", + "\n", + "fetch_file('data/POL_EU-v3.md')\n", + "fetch_file('data/POL_US-v2.md')\n", + "fetch_file('data/KB_Template_DE.md')\n", + "fetch_file('data/KB_Template_EU.md')\n", + "fetch_file('data/KB_Template_US.md')" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "hg0s7fAJEULK", + "vscode": { + "languageId": "raw" + } + }, + "source": [ + "## 2. Datastore and Agent Initialization\n", + "\n", + "### 2.1 Datastore Creation\n", + "\n", + "Initialize a new datastore to serve as the repository for documents with metadata capabilities." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 544, + "status": "ok", + "timestamp": 1756996905364, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "HbjHQhqQEULK", + "outputId": "eb050a12-7472-4ef6-dce0-32d897314e64" + }, + "outputs": [], + "source": [ + "result = client.datastores.create(name=\"Metadata_Examples\")\n", + "datastore_id = result.id\n", + "print(f\"Datastore ID: {datastore_id}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vs7ip-aBy3po" + }, + "source": [ + "### 2.2 Agent Configuration\n", + "\n", + "Create an agent to enable querying capabilities on the datastore." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 128, + "status": "ok", + "timestamp": 1756996905493, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "HyH033Dvy3po", + "outputId": "2b46b374-f442-461e-d71d-eeeb6cda3c07" + }, + "outputs": [], + "source": [ + "app_response = client.agents.create(\n", + " name=\"Demo Metadata\",\n", + " datastore_ids=[datastore_id]\n", + ")\n", + "agent_id= app_response.id\n", + "print(f\"Agent ID created: {agent_id}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "i0u_eLrPy3po" + }, + "source": [ + "## 3. 
Metadata Configuration\n", + "\n", + "### Overview\n", + "\n", + "The Contextual AI platform provides flexible metadata management with two primary configuration methods:\n", + "\n", + "**Timing Options:**\n", + "- **At ingest time**: Set metadata during document upload\n", + "- **Post-processing**: Update metadata after document processing\n", + "\n", + "**Configuration Parameters:**\n", + "- `in_chunks` (default: True): Include metadata in chunk content for reranking\n", + "- `returned_in_response` (default: False): Include metadata in API responses\n", + "- `filterable` (default: True): Enable metadata-based filtering\n", + "\n", + "This section demonstrates various metadata configuration strategies.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NnMT9lSey3po" + }, + "source": [ + "### Metadata Structure\n", + "\n", + "Each metadata entry consists of:\n", + "- **Field**: The metadata key (e.g., 'language', 'region')\n", + "- **Value**: The corresponding value (e.g., 'EN', 'US')\n", + "\n", + "**Limitations:**\n", + "- Maximum of 15 metadata fields per workspace (contact support for extensions)\n", + "- Maximum 2KB per metadata field value\n", + "- Case sensitivity varies by operator (see Advanced Filtering Operations section)\n", + "\n", + "**Best Practices:**\n", + "- Metadata field names can be case sensitive, so use lowercase names consistently to avoid mismatches\n", + "- Common metadata field names you may reuse: author, category, company, custom_field_1, custom_field_2, date, description, document_id, document_title, document_type, filename, folder_path, industry, language, name, project_id, publication_date, region, source, status, subject, title, type, updated_at, upload_date, upload_timestamp, uploaded_by, url, user_id, user_name, version, year" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Vsp2tvaNy3po" + }, + "source": [ + "### Available Metadata Fields\n", + "\n", + "Sample metadata 
fields used in our demonstration documents:\n", + "- `date`: Document date\n", + "- `region`: Geographic region (US, EU, etc.)\n", + "- `language`: Language code (EN, DE, etc.)\n", + "- `entities`: Named entities in the document\n", + "- `version`: Document version\n", + "\n", + "You can also create your own metadata fields for your documents." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "GMqPFY9xy3po" + }, + "source": [ + "### 3.1 Setting Metadata at Ingest Time\n", + "\n", + "Demonstrate adding metadata during document upload with default configuration parameters." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 622, + "status": "ok", + "timestamp": 1756996906119, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "drHoWY7Xy3po" + }, + "outputs": [], + "source": [ + "# Serialize the metadata dictionary to a JSON string first:\n", + "metadata_string = json.dumps({\n", + " \"custom_metadata\": {\n", + " \"region\": \"EU\",\n", + " \"language\": \"EN\"\n", + " }\n", + "})\n", + "\n", + "# Then pass the JSON string when ingesting the document:\n", + "with open('data/POL_EU-v3.md', 'rb') as f:\n", + " ingestion_result = client.datastores.documents.ingest(\n", + " datastore_id,\n", + " file=f,\n", + " metadata=metadata_string # JSON string, not a dict\n", + " )\n", + " document_id = ingestion_result.id" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "kM1cz24cy3po" + }, + "source": [ + "Monitor document processing status. The document progresses through states: pending → processing → completed. Custom metadata becomes visible upon completion." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 357, + "status": "ok", + "timestamp": 1756997133787, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "YlSPThf3y3pp", + "outputId": "c5c9b24a-2975-4a6e-8fdd-34d564400de9" + }, + "outputs": [], + "source": [ + "metadata = client.datastores.documents.metadata(datastore_id = datastore_id, document_id = document_id)\n", + "print(\"Document metadata:\", metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "h_zcq0Jsy3pp" + }, + "source": [ + "Note the metadata embedded in the document response: `custom_metadata={'region': 'EU', 'language': 'EN'}`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Jk6xqfD6y3pp" + }, + "source": [ + "### Examining Metadata in Chunks\n", + "\n", + "Query the datastore to observe how metadata is incorporated into chunk content. For this query to run successfully, you should wait until the document has been fully processed and the status is complete." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 5599, + "status": "ok", + "timestamp": 1756997334707, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "fjB8TnrEy3pp", + "outputId": "4fbf5bd5-0cab-4a0f-c1ed-689501ef7397" + }, + "outputs": [], + "source": [ + "query = \"During\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": False,\n", + " \"top_k_retrieved_chunks\": 1,\n", + " \"lexical_alpha\": 0.1,\n", + " \"semantic_alpha\": 0.9,\n", + " },\n", + " include_retrieval_content_text=True,\n", + " retrievals_only=False\n", + " )\n", + "\n", + "query_result" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 52 + }, + "executionInfo": { + "elapsed": 9, + "status": "ok", + "timestamp": 1756997338582, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "LNfSw6way3pp", + "outputId": "bd13a3e3-4860-4f35-a9b4-506042722c79" + }, + "outputs": [], + "source": [ + "query_result.retrieval_contents[0].content_text" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "g_4znn7py3pp" + }, + "source": [ + "The retrieved chunk includes metadata appended to the content:\n", + "\n", + "```\n", + "'File Name: POL_EU-v3.html\\nDocument Title: Refund Policy EU v3\\nSection: Refund Policy EU v3\\n\\n\\nEU customers: refunds within 30 days. 
Digital goods only if unused.\\nMetadata: {\\n\\t\"region\": \"EU\",\\n\\t\"language\": \"EN\"\\n}'\n", + "```\n", + "\n", + "Metadata is automatically appended when `in_chunks` is set to True (default behavior)." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "P2OZTOwry3pp" + }, + "source": [ + "### 3.2 Exploring Advanced Metadata Configuration\n", + "\n", + "Let's add a new document, where we configure all the different metadata parameters to show how they work." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 901, + "status": "ok", + "timestamp": 1756997347603, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "VE6tT123y3pp" + }, + "outputs": [], + "source": [ + "metadata_string = json.dumps({\n", + " \"custom_metadata\": {\n", + " \"region\": \"US\",\n", + " \"language\": \"EN\",\n", + " \"version\": \"v2\"\n", + " },\n", + " \"custom_metadata_config\": {\n", + " \"region\": {\n", + " \"filterable\": True,\n", + " \"in_chunks\": True,\n", + " \"returned_in_response\": True\n", + " },\n", + " \"language\": {\n", + " \"filterable\": True,\n", + " \"in_chunks\": True,\n", + " \"returned_in_response\": True\n", + " },\n", + "\n", + " \"version\": {\n", + " \"filterable\": True,\n", + " \"in_chunks\": False,\n", + " \"returned_in_response\": False\n", + " }\n", + " },\n", + "})\n", + "\n", + "with open('data/POL_US-v2.md', 'rb') as f:\n", + " ingestion_result = client.datastores.documents.ingest(\n", + " datastore_id,\n", + " file=f,\n", + " metadata=metadata_string # Pass the string instead\n", + " )\n", + " document_id = ingestion_result.id" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lEsLX8pOy3pp" + }, + "source": [ + "Verify the complete metadata configuration:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + 
"executionInfo": { + "elapsed": 429, + "status": "ok", + "timestamp": 1756997692995, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "pPI4GgPOy3pp", + "outputId": "959b2d71-1170-49b8-ba2a-70e3b331e1ea" + }, + "outputs": [], + "source": [ + "metadata = client.datastores.documents.metadata(datastore_id = datastore_id, document_id = document_id)\n", + "print(\"Document metadata:\", metadata)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 3, + "status": "ok", + "timestamp": 1756997693000, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "gVeqUBHSy3pp", + "outputId": "d0f0ffeb-bc0f-4011-b373-1519d4912ae7" + }, + "outputs": [], + "source": [ + "metadata.custom_metadata" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "W2a7gpW3y3pp" + }, + "source": [ + "Document-level metadata: `{'region': 'US', 'version': 'v2', 'language': 'EN'}`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AXmwgUXxy3pp" + }, + "source": [ + "### Examining Metadata in Chunks\n", + "\n", + "Let's look into how the metadata is applied to chunks. We will apply metadata filters to retrieve specific chunks. The `override_configuration` parameter is used here for demonstration purposes to bypass reranking and filtering, showing all matching chunks." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 1330, + "status": "ok", + "timestamp": 1756997694330, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "GiAx17Q2y3pp", + "outputId": "c744879d-efd1-41c1-c996-e71924f4d99f" + }, + "outputs": [], + "source": [ + "query = \"During\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " documents_filters= {\n", + " \"operator\": \"AND\",\n", + " \"filters\": [\n", + " {\"field\": \"region\", \"operator\": \"equals\", \"value\": \"US\"}\n", + " ]\n", + " },\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": False,\n", + " \"top_k_retrieved_chunks\": 3,\n", + " \"lexical_alpha\": 0.1,\n", + " \"semantic_alpha\": 0.9,\n", + " },\n", + " retrievals_only=True,\n", + " include_retrieval_content_text=True,\n", + " )\n", + "\n", + "query_result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "MXGbMJ7Xy3pp" + }, + "source": [ + "Extract chunk-level metadata from the query response:" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 3, + "status": "ok", + "timestamp": 1756997694344, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "cNSeUhJry3pq", + "outputId": "daefab60-3531-4373-820a-36791f0d4e7b" + }, + "outputs": [], + "source": [ + "query_result.retrieval_contents[0].custom_metadata" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "vdeLaURBy3pq" + }, + "source": [ + "Chunk-level metadata: `{'language': 'EN', 'region': 'US'}`\n", + "\n", + "Note that `version` is 
excluded from the response metadata due to `returned_in_response: False` configuration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 52 + }, + "executionInfo": { + "elapsed": 6, + "status": "ok", + "timestamp": 1756997694350, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "ZLu3Wj8Oy3pq", + "outputId": "c46dbe67-404f-4362-afb0-ca6e32db9089" + }, + "outputs": [], + "source": [ + "query_result.retrieval_contents[0].content_text" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pjIiVtrdy3pq" + }, + "source": [ + "Chunk content with embedded metadata:\n", + "\n", + "```\n", + "File Name: POL_US-v2.html\n", + "Document Title: Refund Policy US v2\n", + "Section: Refund Policy US v2\n", + "\n", + "US customers: refunds within 14 days. Subscriptions non-refundable.\n", + "Metadata: {\n", + " \"region\": \"US\",\n", + " \"language\": \"EN\"\n", + "}\n", + "```\n", + "\n", + "The `version` field is absent from chunk content because `in_chunks: False` was specified." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "pdiMRM6fCpdu" + }, + "source": [ + "### 3.3 Updating Metadata\n", + "\n", + "Metadata can also be updated after ingestion time. Let's change the version number for a document." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 337, + "status": "ok", + "timestamp": 1756997721904, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "_bMLwwPdCz2d" + }, + "outputs": [], + "source": [ + "result = client.datastores.documents.set_metadata(\n", + " datastore_id=datastore_id,\n", + " document_id=document_id,\n", + " custom_metadata={\"version\": 'v4'}\n", + ")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 490, + "status": "ok", + "timestamp": 1756997734232, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "1UQfc8gaC2BN", + "outputId": "dbad7bf1-0b6f-4ae2-f8ba-07a771c23439" + }, + "outputs": [], + "source": [ + "metadata = client.datastores.documents.metadata(datastore_id=datastore_id,\n", + " document_id=document_id)\n", + "print(\"Document metadata:\", metadata.custom_metadata)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SyRujHLaDD2A" + }, + "source": [ + "The output now shows the metadata has been updated to: `{'version': 'v4'}`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "TJet_Mlky3pq" + }, + "source": [ + "### 3.4 Summary: Metadata Configuration Options\n", + "\n", + "**Configuration Timing:**\n", + "- At ingest time during document upload\n", + "- Post-processing via metadata updates\n", + "\n", + "**Configuration Parameters:**\n", + "- `in_chunks`: Controls metadata inclusion in chunk content\n", + "- `returned_in_response`: Controls metadata visibility in API responses\n", + "- `filterable`: Enables metadata-based query filtering" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "QL-dzY3Jy3pq" + }, + "source": [ + "## 4. 
Retrieval with Metadata Filters\n", + "\n", + "Metadata filtering enables precise control over document retrieval, improving query accuracy by narrowing the search scope to relevant documents based on metadata criteria.\n", + "\n", + "### 4.1 Ingesting Documents with Rich Metadata\n", + "\n", + "Prepare additional documents with comprehensive metadata for demonstration." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 599, + "status": "ok", + "timestamp": 1756997734830, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "0MbVNj5Ny3pu" + }, + "outputs": [], + "source": [ + "metadata_string = json.dumps({\n", + " \"custom_metadata\": {\n", + " \"region\": \"US\",\n", + " \"language\": \"EN\",\n", + " \"date\": \"2025-02-10\",\n", + " \"entities\": '[\"REG:US\", \"TEMP:CUSTOMER\", \"PROC:REFUND\", \"PROD:SUBSCRIPTION\"]'\n", + " }\n", + "})\n", + "\n", + "with open('data/KB_Template_US.md', 'rb') as f:\n", + " ingestion_result = client.datastores.documents.ingest(\n", + " datastore_id,\n", + " file=f,\n", + " metadata=metadata_string\n", + " )" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 3326, + "status": "ok", + "timestamp": 1756997738156, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "747mv-F8y3pu" + }, + "outputs": [], + "source": [ + "metadata_string = json.dumps({\n", + " \"custom_metadata\": {\n", + " \"region\": \"EU\",\n", + " \"language\": \"DE\",\n", + " \"date\": \"2022-02-10\",\n", + " \"entities\": '[\"REG:DE\", \"LANG:DE\", \"TEMP:CUSTOMER\", \"PROC:WARRANTY\"]'\n", + " }\n", + "})\n", + "\n", + "with open('data/KB_Template_DE.md', 'rb') as f:\n", + " ingestion_result = client.datastores.documents.ingest(\n", + " datastore_id,\n", + " file=f,\n", + " metadata=metadata_string\n", + " )" + ] + }, + 
{ + "cell_type": "code", + "execution_count": null, + "metadata": { + "executionInfo": { + "elapsed": 650, + "status": "ok", + "timestamp": 1756997738812, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "YZQLhL1Gy3pu" + }, + "outputs": [], + "source": [ + "metadata_string = json.dumps({\n", + " \"custom_metadata\": {\n", + " \"region\": \"EU\",\n", + " \"language\": \"EN\",\n", + " \"date\": \"2024-02-10\",\n", + " \"entities\": '[\"REG:EU\", \"TEMP:CUSTOMER\", \"PROC:REFUND\", \"PROD:DIGITAL\"]'\n", + " }\n", + "})\n", + "\n", + "with open('data/KB_Template_EU.md', 'rb') as f:\n", + " ingestion_result = client.datastores.documents.ingest(\n", + " datastore_id,\n", + " file=f,\n", + " metadata=metadata_string\n", + " )" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-D0XQduVy3pu" + }, + "source": [ + "Monitor document processing status. Ensure all documents reach 'completed' status before proceeding." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 308, + "status": "ok", + "timestamp": 1756998042933, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "wxvNQ1sEy3pu", + "outputId": "d6c3dfa3-0307-4dbf-986c-9f0bf4ca3e26" + }, + "outputs": [], + "source": [ + "# Retrieve all documents from the datastore\n", + "docs = client.datastores.documents.list(datastore_id=datastore_id)\n", + "doc_pairs = [(doc.id, doc.name, doc.status) for doc in docs.documents]\n", + "print(\"Document ID and Name pairs:\")\n", + "for doc_id, name, status in doc_pairs:\n", + " print(f\"ID: {doc_id}, Name: {name}, Status: {status}\")" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8snxcJa9y3pu" + }, + "source": [ + "### 4.2 Baseline Query Without Filters\n", + "\n", + "Execute a general query to demonstrate the 
challenge of ambiguous responses without metadata filtering." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 139 + }, + "executionInfo": { + "elapsed": 8041, + "status": "ok", + "timestamp": 1756998050975, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "ITRR4gHuy3pu", + "outputId": "775e190f-c37f-4265-969d-5542ed93792a" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " )\n", + "\n", + "query_result.message.content" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "URJU1vGey3pu" + }, + "source": [ + "The response mixes several region-specific policies, showing the ambiguity that arises when no metadata filter is applied. The system returns policies for both the US and EU regions, so the reader has to disambiguate manually." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "YIXffxK9y3pu" + }, + "source": [ + "### 4.3 Targeted Query with Metadata Filters\n", + "\n", + "Apply region and date filters to retrieve contextually relevant information for a US-based user requiring current policies." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 87 + }, + "executionInfo": { + "elapsed": 4046, + "status": "ok", + "timestamp": 1756998055018, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "xcN2ffUty3pu", + "outputId": "435c29b1-6feb-4da8-b87e-2af739804b12" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "value1 = \"US\"\n", + "value2 = \"2023-01-01\"\n", + "metadata_field1 = \"region\"\n", + "metadata_field2 = \"date\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " documents_filters= {\n", + " \"operator\": \"AND\",\n", + " \"filters\": [\n", + " { \"operator\": \"AND\",\n", + " \"filters\": [\n", + " {\"field\": metadata_field1, \"operator\": \"equals\", \"value\": value1},\n", + " {\"field\": metadata_field2, \"operator\": \"gt\", \"value\": value2}\n", + " ]\n", + " }\n", + " ]\n", + " }\n", + " )\n", + "\n", + "query_result.message.content" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "zKWb3RTDy3pu" + }, + "source": [ + "The filtered response provides US-specific information only, demonstrating improved precision through metadata filtering. The response correctly focuses on current US policies dated after January 2023." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "0ymPSQVSy3pu" + }, + "source": [ + "## 5. 
Advanced Filtering Operations\n", + "\n", + "### 5.1 Available Filter Operators\n", + "\n", + "| Operator | Description | Case Sensitivity | Example |\n", + "|----------|-------------|------------------|---------|\n", + "| `exists` | Checks if field has any value | N/A | `{\"field\": \"tags\", \"operator\": \"exists\"}` |\n", + "| `equals` | Exact match comparison | Case sensitive | `{\"field\": \"status\", \"operator\": \"equals\", \"value\": \"active\"}` |\n", + "| `notequals` | Negated exact match | Case sensitive | `{\"field\": \"status\", \"operator\": \"notequals\", \"value\": \"inactive\"}` |\n", + "| `wildcard` | Pattern matching with * | Lowercase only | `{\"field\": \"code\", \"operator\": \"wildcard\", \"value\": \"*123\"}` |\n", + "| `startswith` | Prefix matching | Case sensitive | `{\"field\": \"title\", \"operator\": \"startswith\", \"value\": \"HR-\"}` |\n", + "| `between` | Range comparison | N/A | `{\"field\": \"date\", \"operator\": \"between\", \"value\": [\"2023-01-01\", \"2023-12-31\"]}` |\n", + "| `containsany` | Array membership | Case sensitive | `{\"field\": \"tags\", \"operator\": \"containsany\", \"value\": [\"policy\", \"HR\"]}` |\n", + "| `gt`, `gte` | Greater than (or equal) | N/A | `{\"field\": \"timestamp\", \"operator\": \"gt\", \"value\": \"2023-12-31\"}` |\n", + "| `lt`, `lte` | Less than (or equal) | N/A | `{\"field\": \"timestamp\", \"operator\": \"lte\", \"value\": \"2024-01-01\"}` |" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "lkuptfPIy3pu" + }, + "source": [ + "### 5.2 Complex Query with OR Logic and Wildcard Matching\n", + "\n", + "Demonstrate combining multiple filter conditions using OR logic and wildcard pattern matching. 
Note that wildcard operators require lowercase values.\n" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 3067, + "status": "ok", + "timestamp": 1756998058096, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "QrAGjt2Jy3pv", + "outputId": "9791c7ca-13c6-4b65-800b-2b3bfd09cadf" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "\n", + "# Wildcard values must be lowercase; equals values are case-sensitive\n", + "metadata_field1 = \"region\"\n", + "operator1 = \"equals\"\n", + "value1 = \"US\"\n", + "metadata_field2 = \"entities\"\n", + "operator2 = \"wildcard\"\n", + "value2 = \"*prod:digital*\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " documents_filters= {\n", + " \"operator\": \"OR\",\n", + " \"filters\": [\n", + " {\"field\": metadata_field1, \"operator\": operator1, \"value\": value1},\n", + " {\"field\": metadata_field2, \"operator\": operator2, \"value\": value2},\n", + " ]\n", + " },\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": False,\n", + " \"top_k_retrieved_chunks\": 3,\n", + " \"lexical_alpha\": 0.1,\n", + " \"semantic_alpha\": 0.9,\n", + " },\n", + " retrievals_only=True\n", + " )\n", + "query_result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "7hCJri4sy3pv" + }, + "source": [ + "The query returns multiple retrievals, confirming that OR logic matches documents that meet either condition." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "fYSdh5b2y3pv" + }, + "source": [ + "### 5.3 Case Sensitivity in Filter Operations\n", + "\n", + "Demonstrate the impact of case sensitivity on filter results. 
The `equals` operator is case-sensitive." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 1067, + "status": "ok", + "timestamp": 1756998059168, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "kp9UHle9y3pv", + "outputId": "337cf493-b485-4949-aa1b-9ed5fd6fa108" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "\n", + "metadata_field1 = \"region\"\n", + "operator1 = \"equals\"\n", + "# Lowercase \"us\" is deliberate: equals is case-sensitive, so it will not match \"US\"\n", + "value1 = \"us\"\n", + "metadata_field2 = \"entities\"\n", + "operator2 = \"wildcard\"\n", + "value2 = \"*prod:digital*\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " documents_filters= {\n", + " # One OR group holding both conditions\n", + " \"operator\": \"OR\",\n", + " \"filters\": [\n", + " {\"field\": metadata_field1, \"operator\": operator1, \"value\": value1},\n", + " {\"field\": metadata_field2, \"operator\": operator2, \"value\": value2},\n", + " ]\n", + " },\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": False,\n", + " \"top_k_retrieved_chunks\": 2,\n", + " \"lexical_alpha\": 0.1,\n", + " \"semantic_alpha\": 0.9,\n", + " },\n", + " retrievals_only=True\n", + " )\n", + "\n", + "query_result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "CCmwdbwcy3pv" + }, + "source": [ + "Result: Single retrieval returned. The lowercase \"us\" fails to match the stored value \"US\" because the `equals` operator is case-sensitive, so only the wildcard condition matches.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "UeGfrhA_y3pv" + }, + "source": [ + "### 5.4 Wildcard Operator Case Requirements\n", + "\n", + "Test the wildcard operator with uppercase values to demonstrate its lowercase requirement." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 1395, + "status": "ok", + "timestamp": 1756998060564, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "167EYlb3y3pv", + "outputId": "27bb8421-532a-462a-ae16-677dd9da09d5" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "\n", + "metadata_field1 = \"region\"\n", + "operator1 = \"equals\"\n", + "value1 = \"us\"\n", + "metadata_field2 = \"entities\"\n", + "operator2 = \"wildcard\"\n", + "# Uppercase is deliberate: wildcard values must be lowercase, so this will not match\n", + "value2 = \"*PROD:DIGITAL*\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " documents_filters= {\n", + " # One OR group holding both conditions\n", + " \"operator\": \"OR\",\n", + " \"filters\": [\n", + " {\"field\": metadata_field1, \"operator\": operator1, \"value\": value1},\n", + " {\"field\": metadata_field2, \"operator\": operator2, \"value\": value2},\n", + " ]\n", + " },\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": False,\n", + " \"top_k_retrieved_chunks\": 2,\n", + " \"lexical_alpha\": 0.1,\n", + " \"semantic_alpha\": 0.9,\n", + " },\n", + " retrievals_only=True\n", + " )\n", + "\n", + "query_result" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "8w_RKuYly3pv" + }, + "source": [ + "Result: No retrievals returned.\n", + "\n", + "**Key Learning:**\n", + "- `wildcard` operator requires lowercase values\n", + "- `equals` operator is case sensitive\n", + "- Carefully validate case requirements for each operator to ensure successful filtering" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "l2zOR3Dgy3pv" + }, + "source": [ + "## 6. 
Generation using Metadata\n", + "\n", + "Metadata integration extends throughout the generation pipeline, enhancing:\n", + "- Document reranking\n", + "- Filter model behavior\n", + "- Generation model context" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4zUO4Pwhy3pv" + }, + "source": [ + "### 6.1 Instruction-Following Reranker" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "4EVC8Cfwy3pv" + }, + "source": [ + "Contextual AI's instruction-following reranker accepts custom prompts that can leverage metadata fields for enhanced relevance scoring.\n", + "\n", + "#### Baseline Reranking" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 1389, + "status": "ok", + "timestamp": 1756998061952, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "y4xxnLH9y3pv", + "outputId": "16117ec3-20cc-4436-c982-b7a85bd28478" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": True,\n", + " },\n", + " retrievals_only=True,\n", + " include_retrieval_content_text=True\n", + " )\n", + "\n", + "top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]\n", + "print(top_3_doc_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "sNgf2i2ly3pv" + }, + "source": [ + "Default ranking order: `['KB_Template_US', 'KB_Template_EU', 'KB_Template_DE']`" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "-9xmMPony3pv" + }, + "source": [ + "#### Custom Reranking Instructions\n", + "\n", + "Apply metadata-based prioritization 
through reranker instructions." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 1261, + "status": "ok", + "timestamp": 1756998063210, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "YNfmU329y3pv", + "outputId": "db629b03-1cff-4061-d8c4-fcd2860b80e4" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": True,\n", + " \"rerank_instructions\": \"Prioritize PROC:WARRANTY chunks\"\n", + " },\n", + " retrievals_only=True,\n", + " include_retrieval_content_text=True\n", + " )\n", + "\n", + "top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]\n", + "print(top_3_doc_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "SzsPIiqAy3pv" + }, + "source": [ + "Modified ranking order: `['KB_Template_EU', 'KB_Template_US', 'KB_Template_DE']`\n", + "\n", + "The instruction changes the ranking relative to the baseline. Note that `KB_Template_DE`, the only document ingested with `PROC:WARRANTY` in its metadata, still ranks third in this run, so check the order in your own output before relying on the instruction." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "nmc5bJFBy3pv" + }, + "source": [ + "#### Temporal Prioritization\n", + "\n", + "Demonstrate using date metadata for recency-based reranking." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 1388, + "status": "ok", + "timestamp": 1756998064602, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "XcTaO9kEy3pv", + "outputId": "31781443-6113-410c-fec0-a1145bdb86c2" + }, + "outputs": [], + "source": [ + "query = \"Customer email template\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": True,\n", + " },\n", + " retrievals_only=True,\n", + " include_retrieval_content_text=True\n", + " )\n", + "\n", + "top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]\n", + "print(top_3_doc_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Mhpy8u15y3pw" + }, + "source": [ + "Baseline order: `['KB_Template_EU', 'KB_Template_US', 'KB_Template_DE']`" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/" + }, + "executionInfo": { + "elapsed": 1458, + "status": "ok", + "timestamp": 1756998066061, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "PrKnl6kHy3pw", + "outputId": "383a35d3-3db7-4a73-8f15-a02ee92a2138" + }, + "outputs": [], + "source": [ + "query = \"Customer email template\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " override_configuration= {\n", + " \"enable_filter\": False,\n", + " \"enable_rerank\": True,\n", + " \"rerank_instructions\": \"Prioritize recent documents\"\n", + " },\n", + 
" retrievals_only=True,\n", + " include_retrieval_content_text=True\n", + " )\n", + "\n", + "top_3_doc_names = [retrieval.doc_name for retrieval in query_result.retrieval_contents[:3]]\n", + "print(top_3_doc_names)" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "NiyJ6_F_y3pw" + }, + "source": [ + "Recency-prioritized order: `['KB_Template_US', 'KB_Template_EU', 'KB_Template_DE']`\n", + "\n", + "The reranker successfully prioritizes documents based on date metadata, placing the most recent document (US, dated 2025-02-10) first." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "X9572Zzhy3pw" + }, + "source": [ + "### 6.2 Metadata in Generation" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "2vlHtnRfy3pw" + }, + "source": [ + "The generation model incorporates metadata from retrieved chunks into its responses. This example demonstrates how metadata enhances response generation." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 87 + }, + "executionInfo": { + "elapsed": 3735, + "status": "ok", + "timestamp": 1756998069795, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "q8EO1Hyby3pw", + "outputId": "5471fe96-72b9-4e26-e24b-c7efc0b71521" + }, + "outputs": [], + "source": [ + "query = \"What is the current refund window?\"\n", + "\n", + "metadata_field1 = \"region\"\n", + "value1 = \"US\"\n", + "\n", + "query_result = client.agents.query.create(\n", + " agent_id=agent_id,\n", + " messages=[{\n", + " \"content\": query,\n", + " \"role\": \"user\"\n", + " }],\n", + " documents_filters= {\n", + " \"operator\": \"AND\",\n", + " \"filters\": [\n", + " { \"operator\": \"AND\",\n", + " \"filters\": [\n", + " {\"field\": metadata_field1, \"operator\": \"equals\", \"value\": value1},\n", + " ]\n", + " }\n", + " ]\n", + " },\n", + " override_configuration= {\n", + " 
\"enable_filter\": False,\n", + " \"enable_rerank\": False,\n", + " \"top_k_retrieved_chunks\": 3,\n", + " \"lexical_alpha\": 0.1,\n", + " \"semantic_alpha\": 0.9,\n", + " },\n", + " include_retrieval_content_text=True\n", + " )\n", + "\n", + "query_result.message.content" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "usVRHu-qy3pw" + }, + "source": [ + "The generated response incorporates the date metadata (\"February 10, 2025\") from the retrieved chunk, demonstrating how metadata enriches the generation output with contextual information." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yhGEV4Kuy3pw" + }, + "source": [ + "#### Examining Metadata Source\n", + "\n", + "Verify how metadata is accessible to the generation model through chunk content." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "metadata": { + "colab": { + "base_uri": "https://localhost:8080/", + "height": 104 + }, + "executionInfo": { + "elapsed": 11, + "status": "ok", + "timestamp": 1756998069807, + "user": { + "displayName": "Rajiv Shah", + "userId": "11535354505093905855" + }, + "user_tz": 300 + }, + "id": "vxfOodx2y3pw", + "outputId": "7e004120-946f-4c5d-db27-c710c0bbee3b" + }, + "outputs": [], + "source": [ + "query_result.retrieval_contents[0].content_text" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "y1DdSbuNy3pw" + }, + "source": [ + "Retrieved chunk with embedded metadata:\n", + "\n", + "```\n", + "File Name: KB_Template_US.html\n", + "Document Title: KB β€” Customer email template (US)\n", + "Section: KB β€” Customer email template (US)\n", + "\n", + "Localized template for customer emails for United States Dear Customer, Thank you for shopping with us. For products purchased in the United States: Refunds are available for 30 days on physical goods. Refunds for digital goods are limited to 14 days and only if the product has not been activated. 
Subscriptions are generally non-refundable once started.\n", + "Metadata: {\n", + " \"region\": \"US\",\n", + " \"language\": \"EN\",\n", + " \"date\": \"2025-02-10\",\n", + " \"entities\": \"[\"REG:US\", \"TEMP:CUSTOMER\", \"PROC:REFUND\", \"PROD:SUBSCRIPTION\"]\"\n", + "}\n", + "```\n", + "\n", + "The metadata is embedded within the chunk content, making it available to the generation model for producing contextually aware responses.\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "yTmi2ogey3pw" + }, + "source": [ + "## Summary\n", + "\n", + "This notebook demonstrated how metadata is used across the Contextual AI platform:\n", + "\n", + "- **Configuration**: Set metadata at ingest time or update it after processing, with granular control\n", + "- **Retrieval**: Apply precise document filters using various operators, paying attention to case sensitivity\n", + "- **Generation**: Leverage metadata in reranking, filtering, and generation stages\n", + "\n", + "Metadata can improve retrieval accuracy and response relevance in production applications. This notebook added metadata manually; depending on your use case, you can automate metadata creation and ingestion into Contextual AI." + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "AwxX7SDsnB2e" + }, + "source": [ + "## Best Practices and Recommendations\n", + "\n", + "1. **Test Case Sensitivity**: Validate filter behavior with different case combinations\n", + "2. **Plan Metadata Schema**: Design metadata fields before large-scale ingestion\n", + "3. **Monitor Performance**: Track both retrieval quality and latency when filtering on metadata. Some operators, such as `wildcard`, can increase latency.\n", + "4. 
**Document Metadata Strategy**: Maintain clear documentation of metadata field meanings and usage\n", + "\n", + "For additional resources and examples, refer to the [Contextual AI documentation](https://docs.contextual.ai/).\n" + ] + }, + { + "cell_type": "markdown", + "metadata": { + "id": "Wej82uTSy3pw" + }, + "source": [] + } + ], + "metadata": { + "colab": { + "provenance": [] + }, + "kernelspec": { + "display_name": "py3.10", + "language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.10.16" + } + }, + "nbformat": 4, + "nbformat_minor": 0 +} diff --git a/README.md b/README.md index a7410f3..856cd9e 100644 --- a/README.md +++ b/README.md @@ -24,6 +24,7 @@ This repository contains practical examples and demonstrations of how to interac - 🔍 [Retrieval Analysis](11-retrieval-analysis/) - Notebooks for an end-to-end evaluation of RAG retrieval - 🧾 [Structured Data Extraction](12-legal-contract-extraction/) - Showing how to perform extraction across legal documents. - 👀 [Using Metrics API and Monitoring RAG](14-monitoring) - Showing how to monitor your RAG agent + - 🏷️ [Metadata Intro](15-metadata-intro/) - Example notebook showing how to work with metadata ## 🚀 Getting Started