diff --git a/docs/genai-use-cases/bedrock-text2sql-aurora-pgvector/text2sql-postgresql.md b/docs/genai-use-cases/bedrock-text2sql-aurora-pgvector/text2sql-postgresql.md new file mode 100644 index 000000000..54d958c98 --- /dev/null +++ b/docs/genai-use-cases/bedrock-text2sql-aurora-pgvector/text2sql-postgresql.md @@ -0,0 +1,981 @@ +

Text-to-SQL with Amazon Bedrock + PostgreSQL pgvector Aurora Serverless

+ +**Keywords:** `text-to-sql` `amazon-bedrock` `postgresql-pgvector` `aurora-serverless` `claude-sonnet` `vector-database` `semantic-search` `aws-rds-data-api` `natural-language-sql` `llm-database-integration` + +[![Open in GitHub](https://img.shields.io/badge/Open%20in-GitHub-blue?logo=github)](https://github.com/aws-samples/amazon-bedrock-samples) + +This notebook demonstrates the integration of **traditional relational database operations** with **vector search capabilities** in PostgreSQL, featuring automated query strategy selection based on user intent analysis. + +

🎯 Core Technical Demonstrations:

+ +#### 1. **Complex Schema Text-to-SQL Generation** + +- LLM-powered natural language to SQL conversion across multi-table schemas +- Handling hierarchical data structures, complex joins, and nested aggregations +- **Demonstrating schema comprehension for enterprise-scale database architectures** + +#### 2. **PostgreSQL pgvector Integration** + +- Native vector storage and similarity search within PostgreSQL +- Embedding-based semantic search on unstructured text data +- Demonstrating RDBMS + vector database convergence + +#### 3. **Automated Query Strategy Selection** + +- Foundation model analysis of query intent and optimal execution path determination +- Context-aware routing between structured SQL and semantic vector operations +- Unified interface abstracting query complexity from end users + +

πŸ—οΈ Database Schema Architecture

+ +ecommerce schema demonstrating complex relationships: + +``` +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ users β”‚ β”‚ categories β”‚ β”‚ products β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ user_id (PK) β”‚ β”‚ category_id (PK) β”‚ β”‚ product_id (PK) β”‚ +β”‚ email β”‚ β”‚ name β”‚ β”‚ sku β”‚ +β”‚ username β”‚ β”‚ slug β”‚ β”‚ name β”‚ +β”‚ first_name β”‚ β”‚ description β”‚ β”‚ description β”‚ +β”‚ last_name β”‚ β”‚ parent_category_idβ”‚ β”‚ category_id (FK)β”‚ +β”‚ city β”‚ β”‚ (FK to self) β”‚ β”‚ brand β”‚ +β”‚ state_province β”‚ β”‚ product_count β”‚ β”‚ price β”‚ +β”‚ total_orders β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ stock_quantity β”‚ +β”‚ total_spent β”‚ β”‚ β”‚ rating_average β”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ total_sales β”‚ + β”‚ β”‚ β”‚ revenue_generatedβ”‚ + β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ + β”‚ β”‚ β”‚ + β”‚ β”‚ β”‚ +β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” +β”‚ orders β”‚ β”‚ order_items β”‚ β”‚ reviews β”‚ +β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ +β”‚ order_id (PK) │────│ order_id (FK) β”‚ β”‚ review_id (PK) β”‚ +β”‚ order_number β”‚ β”‚ product_id (FK) │────│ product_id (FK) β”‚ +β”‚ user_id (FK) β”‚ β”‚ quantity β”‚ β”‚ user_id (FK) β”‚ +β”‚ order_status β”‚ β”‚ unit_price β”‚ β”‚ order_id (FK) β”‚ +β”‚ payment_status β”‚ β”‚ total_price β”‚ β”‚ rating β”‚ +β”‚ total_amount β”‚ 
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ title β”‚ +β”‚ shipping_method β”‚ β”‚ comment β”‚ +β”‚ created_at β”‚ β”‚ comment_embeddingβ”‚ +β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ (VECTOR) β”‚ + β”‚ pros β”‚ + β”‚ cons β”‚ + β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ +``` + +**Schema Complexity Features:** + +- **Self-referencing hierarchies**: Categories with parent/child relationships +- **Junction table patterns**: Many-to-many order-product relationships via order_items +- **Vector integration**: Native pgvector storage in reviews.comment_embedding +- **Multi-level foreign keys**: Reviews referencing users, products, and orders + +
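The self-referencing `categories` table is the part of this schema that most often trips up generated SQL. As a fixed reference point, a recursive CTE that flattens the hierarchy might look like the sketch below — a hand-written illustration using the column names from the diagram above (not model output); the string can later be passed to the `run_sql` helper defined in Step 4:

```python
# Hand-written sketch of a hierarchy query over the self-referencing
# categories table (column names taken from the schema diagram above).
CATEGORY_TREE_SQL = """
WITH RECURSIVE category_tree AS (
    SELECT category_id, name, parent_category_id, 1 AS depth
    FROM categories
    WHERE parent_category_id IS NULL          -- start from the root categories
    UNION ALL
    SELECT c.category_id, c.name, c.parent_category_id, ct.depth + 1
    FROM categories c
    JOIN category_tree ct ON c.parent_category_id = ct.category_id
)
SELECT category_id, name, depth
FROM category_tree
ORDER BY depth, name;
"""
```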

πŸ’‘ Technical Implementation:

+ +1. **Hybrid Database Architecture**: PostgreSQL with pgvector extension for unified structured + vector operations +2. **LLM Schema Comprehension**: Foundation model understanding of complex table relationships and optimal query generation +3. **Embedding-based Similarity**: Amazon Titan text embeddings for semantic content matching +4. **Automated Tool Selection**: Context analysis determining SQL vs vector search execution paths + +## Technical Prerequisites + +- AWS account with Bedrock and RDS permissions +- Understanding of vector embeddings and similarity search concepts +- Familiarity with PostgreSQL and complex SQL operations + +--- + +

πŸ“¦ STEP 1: Install Required Packages

+ +```python +# Install required Python packages for AWS and SQL parsing +!pip install --upgrade pip +!pip install boto3 sqlparse +``` + +

πŸ—οΈ STEP 2: Deploy AWS Infrastructure

+ +This step creates: + +- **VPC with 3 subnets** across availability zones +- **Aurora PostgreSQL Serverless v2 cluster** with HTTP endpoint enabled +- **Security groups** and networking configuration +- **Secrets Manager** for database credentials + +**Note**: This takes ~5-10 minutes to complete + +```python +# Deploy AWS infrastructure (VPC, Aurora PostgreSQL, Security Groups) +# This script creates all necessary AWS resources for our demo + +!python infra.py +``` + +

πŸ”§ STEP 3: Setup Database Connection

```python
# Import required libraries for AWS services and database operations
import json
import boto3
import logging
import sqlparse
from typing import Dict, Any, List, Union
from botocore.exceptions import ClientError
from botocore.config import Config

# Get current AWS region dynamically
session = boto3.Session()
AWS_REGION = session.region_name or 'us-west-2'  # fall back to us-west-2 if not set
print(f"Using AWS region: {AWS_REGION}")

# Set up logging to track our progress
logging.basicConfig(level=logging.INFO, format="%(asctime)s - %(levelname)s - %(message)s")
```

```python
# Database connection configuration
# **Update these values with the output printed by infra.py**
CLUSTER_ARN = ''
SECRET_ARN = ''
DATABASE_NAME = ''
# AWS_REGION was detected in the previous cell; override it here only if your
# cluster was created in a different region.

# Initialize the RDS Data API client. The Data API lets us execute SQL over
# HTTPS without managing direct database connections -- to learn more, visit
# https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html
rds_client = boto3.client('rds-data', region_name=AWS_REGION)
```

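Because the placeholders above start out empty, a quick guard can fail fast with a readable message before any Data API call produces a cryptic error. This is not part of the original notebook; `validate_config` is a helper name introduced here:

```python
def validate_config(cluster_arn: str, secret_arn: str, database: str) -> list:
    """Return the names of connection settings that are still blank."""
    settings = {
        "CLUSTER_ARN": cluster_arn,
        "SECRET_ARN": secret_arn,
        "DATABASE_NAME": database,
    }
    return [name for name, value in settings.items() if not value.strip()]

# Example with one blank setting -- in the notebook you would pass the
# CLUSTER_ARN / SECRET_ARN / DATABASE_NAME variables defined above.
missing = validate_config("", "arn:aws:secretsmanager:example", "ecommerce")
print(missing)  # ['CLUSTER_ARN']
```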

πŸ› οΈ STEP 4: Create Database Schema & Load Data

+ +We'll create a streamlined but complex ecommerce schema with 6 core tables that demonstrate: + +- **Hierarchical relationships** (categories with parent/child structure) +- **Many-to-many relationships** (orders ↔ products via junction table) +- **Vector integration** (reviews with embedding column for semantic search) +- **Analytics capabilities** (aggregated sales metrics and customer data) + +```python +def run_sql(query: str, database: str = None) -> dict: + """ + Execute SQL query using RDS Data API + This is our main function for running any SQL command + """ + try: + params = { + 'resourceArn': CLUSTER_ARN, + 'secretArn': SECRET_ARN, + 'sql': query + } + if database: + params['database'] = database + + response = rds_client.execute_statement(**params) + return response + except Exception as e: + print(f"SQL execution error: {e}") + return {"error": str(e)} +``` + +```python +# Enable pgvector extension for semantic search capabilities +# pgvector allows PostgreSQL to store and search vector embeddings +try: + result = run_sql('CREATE EXTENSION IF NOT EXISTS vector;', DATABASE_NAME) + print("βœ… pgvector extension enabled successfully") +except Exception as e: + print(f"Extension setup error: {e}") +``` + +```python +# Create tables by reading our schema file +# Parse SQL file into individual statements (RDS Data API requirement) +with open('ecommerce_schema.sql', 'r') as f: + schema_sql = f.read() + +statements = sqlparse.split(schema_sql) +statements = [stmt.strip() for stmt in statements if stmt.strip()] + +print(f"Creating {len(statements)} database tables...") +print("πŸ“Š Schema includes: users, categories, products, orders, order_items, reviews") +print("🧠 Vector integration: reviews.comment_embedding for semantic search") + +# Execute each CREATE TABLE statement +for i, statement in enumerate(statements, 1): + try: + run_sql(statement, DATABASE_NAME) + print(f" βœ… Table {i} created successfully") + except Exception as e: + print(f" ❌ Table {i} 
failed: {e}") + +print("βœ… Database schema creation completed!") +``` + +```python +# Insert sample data into our tables +with open('ecommerce_data.sql', 'r') as f: + data_sql = f.read() + +data_statements = sqlparse.split(data_sql) +data_statements = [stmt.strip() for stmt in data_statements if stmt.strip()] + +print(f"Inserting sample data with {len(data_statements)} statements...") +print("πŸ‘₯ 15 users across different US cities with spending history") +print("πŸ“¦ 16 products across 8 categories (Electronics β†’ Audio/Video, Smart Devices, etc.)") +print("πŸ›’ 10 orders with various statuses (delivered, shipped, processing, cancelled)") +print("⭐ 13 detailed product reviews perfect for semantic search") + +for i, statement in enumerate(data_statements, 1): + try: + result = run_sql(statement, DATABASE_NAME) + records_affected = result.get('numberOfRecordsUpdated', 0) + print(f" βœ… Dataset {i}: {records_affected} records inserted") + except Exception as e: + print(f" ❌ Dataset {i} failed: {e}") + +print("βœ… Sample data insertion completed!") +``` + +
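`sqlparse.split` is used above because a naive `str.split(';')` breaks statements that contain semicolons inside string literals — common in review text. A minimal quote-aware splitter illustrates the problem it solves (deliberately simplified: it ignores escaped quotes and comments, which is exactly why the notebook relies on sqlparse instead):

```python
def naive_split(sql: str) -> list:
    # Breaks on semicolons inside string literals
    return [s.strip() for s in sql.split(';') if s.strip()]

def split_statements(sql: str) -> list:
    """Split on semicolons, ignoring those inside single-quoted literals."""
    statements, current, in_string = [], [], False
    for ch in sql:
        if ch == "'":
            in_string = not in_string
            current.append(ch)
        elif ch == ';' and not in_string:
            statements.append(''.join(current).strip())
            current = []
        else:
            current.append(ch)
    tail = ''.join(current).strip()
    if tail:
        statements.append(tail)
    return statements

sql = "INSERT INTO reviews (comment) VALUES ('Great; would buy again'); SELECT 1"
print(len(naive_split(sql)))        # 3 -- the string literal was split in half
print(len(split_statements(sql)))   # 2 -- the literal survives intact
```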

🧠 STEP 5: Bedrock Setup

+ +```python +# Configure Bedrock client with extended timeouts for large requests +bedrock_config = Config( + connect_timeout=60*5, # 5 minutes + read_timeout=60*5, # 5 minutes +) + +bedrock = boto3.client( + service_name='bedrock-runtime', + region_name=AWS_REGION, + config=bedrock_config +) + +# Model IDs for our use +CLAUDE_MODEL = "us.anthropic.claude-3-7-sonnet-20250219-v1:0" # For text-to-SQL +EMBEDDING_MODEL = "amazon.titan-embed-text-v2:0" # For vector search + +print("βœ… Bedrock configured successfully") +``` + +```python +class DatabaseTools: + """Simple database helper for executing SQL queries""" + + def __init__(self): + self.rds_client = boto3.client("rds-data", region_name=AWS_REGION) + + def execute_sql(self, query: str) -> str: + """Execute SQL query and return results as JSON string""" + try: + response = self.rds_client.execute_statement( + resourceArn=CLUSTER_ARN, + secretArn=SECRET_ARN, + database=DATABASE_NAME, + sql=query, + includeResultMetadata=True, + ) + + # Handle empty results + if "records" not in response or not response["records"]: + return json.dumps([]) + + # Get column names and format results + columns = [field["name"] for field in response.get("columnMetadata", [])] + results = [] + + for record in response["records"]: + row_values = [] + for field in record: + # Extract value from different field types + if "stringValue" in field: + row_values.append(field["stringValue"]) + elif "longValue" in field: + row_values.append(field["longValue"]) + elif "doubleValue" in field: + row_values.append(field["doubleValue"]) + elif "booleanValue" in field: + row_values.append(field["booleanValue"]) + else: + row_values.append(None) + + results.append(dict(zip(columns, row_values))) + + return json.dumps(results, indent=2) + + except Exception as e: + return json.dumps({"error": f"Database error: {str(e)}"}) + +# Test database connection +db_tools = DatabaseTools() +result = db_tools.execute_sql("SELECT current_timestamp;") +print("βœ… 
Database connection test successful") +print("Current time:", json.loads(result)[0]["current_timestamp"]) +``` + +
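The value-extraction loop inside `DatabaseTools.execute_sql` can be isolated into a small, testable helper. `flatten_records` below is a name introduced here; it operates on the same response shape `execute_statement` returns, where each field is a dict keyed by its type (`stringValue`, `longValue`, `doubleValue`, `booleanValue`, or `isNull`):

```python
def flatten_records(response: dict) -> list:
    """Convert an RDS Data API execute_statement response into row dicts."""
    columns = [c["name"] for c in response.get("columnMetadata", [])]
    rows = []
    for record in response.get("records", []):
        values = []
        for field in record:
            if "isNull" in field:
                values.append(None)
            else:
                # Each field dict holds exactly one typed value
                values.append(next(iter(field.values())))
        rows.append(dict(zip(columns, values)))
    return rows

# A hand-built response in the Data API's shape, for illustration
fake = {
    "columnMetadata": [{"name": "product_id"}, {"name": "name"}],
    "records": [[{"longValue": 1}, {"stringValue": "Headphones"}]],
}
print(flatten_records(fake))  # [{'product_id': 1, 'name': 'Headphones'}]
```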

πŸ”’ STEP 6: Generate Vector Embeddings for Semantic Search

+ +**Hybrid RDBMS + Vector Database Implementation:** + +Vector embeddings convert textual content into high-dimensional numerical representations that capture semantic relationships. PostgreSQL's pgvector extension enables native vector operations within the relational database, eliminating the need for separate vector database infrastructure. + +**Technical Implementation:** + +- Amazon Titan Text Embeddings v2 (1024-dimensional vectors) +- PostgreSQL VECTOR data type with cosine similarity operations +- Semantic search on review content independent of exact keyword matching + +This approach demonstrates the convergence of traditional RDBMS and vector database capabilities in production systems. + +```python +def create_embedding(text: str) -> List[float]: + """ + Convert text into a vector embedding using Amazon Titan + Returns a list of 1024 numbers that represent the text's meaning + """ + payload = { + "inputText": text, + "embeddingTypes": ["float"] + } + + try: + response = bedrock.invoke_model( + modelId=EMBEDDING_MODEL, + body=json.dumps(payload), + accept="application/json", + contentType="application/json" + ) + + body = json.loads(response["body"].read()) + embeddings = body.get("embeddingsByType", {}).get("float", []) + return embeddings + + except Exception as e: + print(f"Embedding generation error: {e}") + return [] + +# Test embedding generation +test_text = "This battery lasts a long time" +test_embedding = create_embedding(test_text) +print(f"βœ… Generated embedding with {len(test_embedding)} dimensions") +print(f"Sample values: {test_embedding[:5]}...") # Show first 5 numbers +``` + +```python +def add_embeddings_to_reviews(): + """ + Generate embeddings for all review comments and store them in the database + This enables semantic search on review content + """ + + # Step 1: Find reviews that need embeddings + count_query = "SELECT COUNT(*) FROM reviews WHERE comment_embedding IS NULL" + count_result = db_tools.execute_sql(count_query) + 
total_missing = json.loads(count_result)[0]["count"] + + print(f"Found {total_missing} reviews needing embeddings") + + if total_missing == 0: + print("βœ… All reviews already have embeddings!") + return + + # Step 2: Get reviews without embeddings + select_query = """ + SELECT review_id, comment + FROM reviews + WHERE comment_embedding IS NULL + AND comment IS NOT NULL + ORDER BY review_id + """ + + result = db_tools.execute_sql(select_query) + reviews = json.loads(result) + + # Step 3: Generate embeddings for each review + for review in reviews: + review_id = review["review_id"] + comment = review["comment"] + + if not comment: + continue + + print(f" Processing review {review_id}...") + + # Generate embedding + embedding = create_embedding(comment) + if not embedding: + continue + + # Convert to PostgreSQL vector format + vector_str = "[" + ",".join(str(x) for x in embedding) + "]" + + # Update database with embedding + update_query = f""" + UPDATE reviews + SET comment_embedding = '{vector_str}'::vector + WHERE review_id = {review_id} + """ + + run_sql(update_query, DATABASE_NAME) + print(f" βœ… Added embedding for review {review_id}") + + print("βœ… All review embeddings generated successfully!") + +# Generate embeddings for all reviews +add_embeddings_to_reviews() +``` + +
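Two details above are worth pinning down: the text form pgvector accepts for a vector literal, and what cosine distance actually computes. A dependency-free sketch (the helper names are illustrative; `to_pgvector` mirrors the string formatting used in the cell above):

```python
import math

def to_pgvector(vec):
    """Format a Python list the way the notebook does: '[0.1,0.2,...]'."""
    return "[" + ",".join(str(x) for x in vec) + "]"

def cosine_distance(a, b):
    """What pgvector's <=> operator computes: 1 - cosine similarity."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

print(to_pgvector([0.25, -1.0]))                 # [0.25,-1.0]
print(cosine_distance([1.0, 0.0], [1.0, 0.0]))   # 0.0 -> identical direction
print(cosine_distance([1.0, 0.0], [0.0, 1.0]))   # 1.0 -> orthogonal
```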

πŸ€– STEP 7: Foundation Model Tool Selection System

**Query Strategy Determination:**

Claude Sonnet analyzes natural language queries and automatically determines the optimal execution strategy through tool-selection logic:

**πŸ“Š Structured Query Scenarios (SQL Tool Selection):**

- Aggregation operations: "What's the average order value by state?"
- Complex joins: "Show customers with repeat purchases in Electronics"
- Mathematical calculations: "Calculate profit margins by product category"
- Temporal analysis: "Find order trends over the last quarter"

**πŸ” Semantic Search Scenarios (Vector Tool Selection):**

- Content similarity: "Find reviews about build quality issues"
- Sentiment analysis: "Show complaints about customer service"
- Topic clustering: "What do users say about product durability?"
- Conceptual matching: independent of exact keyword presence

**🎯 Hybrid Query Execution:**

- Complex scenarios may trigger multiple tool invocations
- The foundation model orchestrates sequential or parallel execution
- Results are synthesized from both structured and semantic operations

**Technical Architecture:**

- Tool specification via JSON schema definitions
- Automated function calling based on intent classification
- Context-aware execution path optimization

```python
def semantic_search(search_text: str, limit: int = 5) -> str:
    """
    Find reviews similar to the search text using vector similarity
    Returns the most semantically similar reviews
    """
    try:
        # Generate embedding for search text
        search_embedding = create_embedding(search_text)
        if not search_embedding:
            return json.dumps({"error": "Could not generate embedding"})

        # Convert to PostgreSQL vector format
        vector_str = "[" + ",".join(str(x) for x in search_embedding) + "]"

        # Score with cosine distance (<=> operator). Titan v2 embeddings are
        # unit-normalized by default, so ordering by L2 distance (<->) yields
        # the same ranking as cosine distance.
        query = f"""
        SELECT 
            rating,
            title,
            comment,
            pros,
            cons,
            helpful_count,
            (1 - (comment_embedding <=> '{vector_str}'::vector)) as similarity_score
        FROM reviews
WHERE comment IS NOT NULL + AND comment_embedding IS NOT NULL + ORDER BY comment_embedding <-> '{vector_str}'::vector + LIMIT {limit} + """ + + result = db_tools.execute_sql(query) + return result + + except Exception as e: + return json.dumps({"error": f"Vector search error: {str(e)}"}) + +# Test vector search +test_search = semantic_search("battery problems", limit=3) +print("βœ… Vector search test successful") +print("Sample results:", json.loads(test_search)[0]["title"] if json.loads(test_search) else "No results") +``` + +```python +# Define the tools available to Claude +TOOLS = { + "tools": [ + { + "toolSpec": { + "name": "execute_sql", + "description": "Execute SQL queries for structured data analysis (counts, filters, joins, aggregations)", + "inputSchema": { + "json": { + "type": "object", + "properties": { + "query": { + "type": "string", + "description": "SQL query to execute against the ecommerce database" + } + }, + "required": ["query"] + } + } + } + }, + { + "toolSpec": { + "name": "vector_search", + "description": "Perform semantic similarity search on review content to find similar topics/themes", + "inputSchema": { + "json": { + "type": "object", + "properties": { + "text": { + "type": "string", + "description": "Text to search for semantically similar content in reviews" + } + }, + "required": ["text"] + } + } + } + } + ], + "toolChoice": {"auto": {}} +} + +print("βœ… AI tools configured - Claude can now choose between SQL and vector search!") +``` + +````python +SYSTEM_PROMPT = """ +# Advanced Text-to-SQL System Prompt with PostgreSQL Vector Search + + +You are an advanced database query optimization system specializing in hybrid SQL and vector search operations. +Your primary function is to analyze natural language queries, determine optimal execution strategies, and generate PostgreSQL queries that leverage both relational and vector capabilities. 
+ + + + + + +Customer profiles with pre-computed analytics for performance optimization + + +- user_id (SERIAL PRIMARY KEY) +- email, username (UNIQUE constraints for data integrity) +- first_name, last_name, phone_number, date_of_birth, gender +- city, state_province, country_code (geographic segmentation) +- account_status (operational flag) +- total_orders, total_spent (denormalized aggregates for fast analytics) +- created_at (temporal tracking) + +
+ + + +Hierarchical product taxonomy with recursive relationship support + + +- category_id (SERIAL PRIMARY KEY) +- name, slug (unique URL-safe identifier), description +- parent_category_id (SELF-REFERENTIAL FK enabling tree structures) +- is_active (soft delete support) +- product_count (denormalized for performance) +- created_at + +
+ + + +Product catalog with inventory tracking and performance metrics + + +- product_id (SERIAL PRIMARY KEY) +- sku (UNIQUE business identifier) +- name, slug, description, short_description +- category_id (FK to categories) +- brand, price, cost (profit margin calculation) +- weight_kg, stock_quantity (logistics data) +- is_active, is_featured (display control) +- warranty_months +- rating_average, rating_count (computed from reviews) +- total_sales, revenue_generated (business metrics) +- created_at, updated_at (audit trail) + +
+ + + +Transaction lifecycle management with comprehensive status tracking + + +- order_id (SERIAL PRIMARY KEY) +- order_number (UNIQUE human-readable identifier) +- user_id (FK to users) +- order_status: pending|processing|shipped|delivered|cancelled|refunded +- payment_status: pending|paid|failed|refunded +- shipping_address (full text for flexibility) +- financial_breakdown: + * subtotal (items before adjustments) + * tax_amount, shipping_cost, discount_amount + * total_amount (final charge) +- payment_method: credit_card|paypal|bank_transfer +- shipping_method: standard|express|overnight +- customer_notes, tracking_number +- shipped_at, delivered_at (fulfillment tracking) +- created_at, updated_at + +
+ + + +Junction table capturing point-in-time pricing and item-level details + + +- order_item_id (SERIAL PRIMARY KEY) +- order_id (FK CASCADE DELETE) +- product_id (FK) +- quantity, unit_price (historical pricing preservation) +- discount_amount, tax_amount (item-level adjustments) +- total_price (computed line total) +- created_at + +
+ + + +Customer feedback with vector embeddings for semantic analysis + + +- review_id (SERIAL PRIMARY KEY) +- product_id (FK CASCADE DELETE) +- user_id (FK) +- order_id (FK optional - links to purchase) +- rating (INTEGER CHECK 1-5) +- title (VARCHAR(200)) +- comment (TEXT - source for embeddings) +- comment_embedding (VECTOR(1024) - semantic representation) +- pros, cons (structured sentiment extraction) +- is_verified_purchase (trust signal) +- helpful_count (community validation) +- status: pending|approved|rejected +- created_at, updated_at + + +- comment_embedding enables semantic similarity search +- Cosine distance for finding related reviews +- Supports sentiment clustering and topic modeling + +
+
**execute_sql**: Use for structured queries, aggregations, joins, filtering by exact values

**vector_search**: Use for semantic similarity on review comments, finding related content

Examples:

- "total revenue by category" β†’ execute_sql
- "reviews similar to 'great battery life'" β†’ vector_search
- "average rating of products with positive reviews" β†’ both tools

SQL Example:

```sql
SELECT c.name, SUM(p.revenue_generated) as revenue
FROM categories c
JOIN products p ON c.category_id = p.category_id
GROUP BY c.name
ORDER BY revenue DESC;
```

Vector Example:

```sql
-- Note: query_embedding would be provided by the system as a VECTOR(1024)
SELECT r.comment, p.name,
       r.comment_embedding <=> query_embedding as similarity
FROM reviews r
JOIN products p ON r.product_id = p.product_id
ORDER BY similarity
LIMIT 5;
```

### SQL Query Best Practices

1. **Explicit JOINs**: Always use explicit JOIN syntax with ON conditions
2. **Table Aliases**: Use meaningful aliases (u for users, p for products)
3. **NULL Handling**: Account for optional fields with COALESCE or IS NULL
4. **Data Types**: Cast when necessary, especially for date operations
5. **Aggregation Rules**: Include all non-aggregate columns in GROUP BY
6. **Order Stability**: Add secondary ORDER BY for deterministic results
7. **Limit Appropriately**: Include LIMIT for top-N queries
8. **Comment Complex Logic**: Add -- comments for CTEs or complex conditions

### Vector Search Best Practices

1. **Distance Metrics**: Use cosine distance (<=>) for normalized embeddings
2. **Result Limits**: Always limit results (default 10-20 for readability)
3. **Threshold Filtering**: Consider similarity threshold for quality control
4. **Metadata Inclusion**: Join with products/users for context
5.
**Explain Similarity**: Include distance scores in results + + + + +### Response Structure + +``` +QUERY ANALYSIS: +- Intent: [Extracted user intent] +- Strategy: [Selected tool(s) and rationale] +- Key Operations: [Main database operations required] + +GENERATED QUERY: +[Actual SQL or vector search syntax] + +EXPECTED INSIGHTS: +- [Key patterns or metrics the query will reveal] +- [Business value of the results] +``` + + +""" + +```` + + +```python +def ask_ai(question: str) -> str: + """ + Send a question to Claude and handle tool execution + Claude will automatically choose between SQL and vector search + Handles multiple rounds of tool calls until completion + """ + + # Create the conversation + messages = [{"role": "user", "content": [{"text": question}]}] + + try: + # Continue conversation until Claude stops requesting tools + max_turns = 10 # Prevent infinite loops + turn_count = 0 + + while turn_count < max_turns: + turn_count += 1 + + # Send to Claude with tools + response = bedrock.converse( + modelId=CLAUDE_MODEL, + system=[{"text": SYSTEM_PROMPT}], + messages=messages, + toolConfig=TOOLS + ) + + assistant_message = response["output"]["message"] + messages.append(assistant_message) + + # Check if Claude wants to use tools + tool_uses = [content for content in assistant_message["content"] if "toolUse" in content] + + if tool_uses: + # Execute each tool Claude requested + for tool_use in tool_uses: + tool_name = tool_use["toolUse"]["name"] + tool_input = tool_use["toolUse"]["input"] + tool_id = tool_use["toolUse"]["toolUseId"] + + print(f"πŸ”§ Claude is using: {tool_name}") + + # Execute the appropriate tool + if tool_name == "execute_sql": + tool_result = db_tools.execute_sql(tool_input["query"]) + print(f"πŸ“Š SQL Query: {tool_input['query']}") + + elif tool_name == "vector_search": + tool_result = semantic_search(tool_input["text"]) + print(f"πŸ” Searching for: {tool_input['text']}") + + # Send tool result back to Claude + tool_message = { + "role": 
"user", + "content": [{ + "toolResult": { + "toolUseId": tool_id, + "content": [{"text": tool_result}] + } + }] + } + messages.append(tool_message) + + # Continue the loop to let Claude process results and potentially make more tool calls + continue + + else: + # No tools needed, extract and return the final response + final_content = assistant_message["content"] + text_response = next((c["text"] for c in final_content if "text" in c), "") + return text_response + + # If we hit max turns, return what we have + return "Response completed after maximum tool execution rounds." + + except Exception as e: + return f"Error: {str(e)}" + +print("βœ… Enhanced LLM assistant ready with multi-round tool execution support!") +```` + +

πŸš€ STEP 8: Technical Demonstrations

+ +

Demo 1: Complex Schema Text-to-SQL Generation

+**Objective:** Validate LLM comprehension of multi-table relationships and automated SQL generation for complex analytical queries. + +```python +# DEMO 1: Complex Schema Text-to-SQL Generation +print("=" * 70) +print("DEMO 1: Complex Schema Text-to-SQL Generation") +print("=" * 70) + +# Test multi-table join with hierarchical traversal and aggregation +question1 = "Show me the top 3 customers by total spending, including their order count and favorite product category" +print(f"Query: {question1}") +print("\nπŸ”§ Expected: Multi-table JOIN across users, orders, order_items, products, categories") +print("\nExecution:") +answer1 = ask_ai(question1) +print(answer1) +print("\n" + "="*70) +``` + +
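For reference, one plausible shape of the SQL the model could produce for this question — hand-written against the schema above, so actual model output will differ:

```python
# Hand-written reference query; the model's generated SQL will vary.
# mode() WITHIN GROUP picks the most frequent category per customer.
TOP_CUSTOMERS_SQL = """
SELECT u.username,
       u.total_spent,
       COUNT(DISTINCT o.order_id) AS order_count,
       MODE() WITHIN GROUP (ORDER BY c.name) AS favorite_category
FROM users u
JOIN orders o       ON o.user_id = u.user_id
JOIN order_items oi ON oi.order_id = o.order_id
JOIN products p     ON p.product_id = oi.product_id
JOIN categories c   ON c.category_id = p.category_id
GROUP BY u.user_id, u.username, u.total_spent
ORDER BY u.total_spent DESC
LIMIT 3;
"""
```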

Demo 2: PostgreSQL Vector Search Implementation

+**Objective:** Demonstrate native vector operations within PostgreSQL using pgvector for semantic similarity search on unstructured content. + +```python +# DEMO 2: PostgreSQL Vector Search Implementation +print("DEMO 2: PostgreSQL Vector Search Implementation") +print("=" * 70) + +question2 = "Find reviews about battery life issues and charging problems" +print(f"Query: {question2}") +print("\nπŸ”§ Expected: Vector similarity search using pgvector cosine distance") +print("πŸ“Š Operation: Embedding generation + semantic matching on reviews.comment_embedding") +print("🎯 Capability: Content similarity independent of exact keyword presence") +print("\nExecution:") +answer2 = ask_ai(question2) +print(answer2) +print("\n" + "="*70) +``` + +

Demo 3: Automated Query Strategy Selection

**Objective:** Demonstrate the model's ability to analyze query intent and select the optimal execution strategy between SQL and vector operations.

```python
# DEMO 3: Automated Query Strategy Selection
print("DEMO 3: Automated Query Strategy Selection")
print("=" * 70)

# Ambiguous query that could use either approach
question3 = "What are the main product quality issues customers mention in their reviews?"
print(f"Query: {question3}")
print("\nπŸ€” Strategy Options:")
print("   πŸ“Š SQL Approach: Aggregate review ratings and identify low-rated products")
print("   πŸ” Vector Approach: Semantic search for quality-related content themes")
print("   🎯 Hybrid Approach: Combine structured filtering with content analysis")
print("\nπŸ”§ Foundation Model Decision Process:")
print("\nExecution:")
answer3 = ask_ai(question3)
print(answer3)
print("\n" + "="*70)
```


πŸ’¬ Interactive Query Testing

+ +**Technical Validation Environment** + +Test the foundation model's query strategy selection across different analytical scenarios. The system will demonstrate automated tool selection based on query characteristics and optimal execution path determination. + +**Structured Query Test Cases:** + +**πŸ“Š Complex SQL Operations:** + +- "Calculate profit margins by hierarchical product category" +- "Identify customers with highest purchase frequency in Texas" +- "Analyze order value distribution across payment methods" + +**πŸ” Vector Similarity Operations:** + +- "Find reviews discussing build quality and manufacturing defects" +- "Locate customer feedback about shipping and logistics issues" +- "Identify content related to product longevity and durability concerns" +- "Search for mentions of value proposition and pricing feedback" + +**🎯 Complex Analytical Scenarios:** + +- "Which products receive the most quality-related complaints?" +- "Analyze sentiment patterns across different customer segments" +- "Find correlation between product price points and satisfaction themes" + +```python +# Interactive Query Testing Environment +print("πŸ”§ Foundation Model Query Strategy Testing") +print("Enter queries to validate automated tool selection logic. 
Type 'quit' to exit.") +print("\nπŸ“‹ Test Categories:") + +print("\nπŸ“Š Structured Data Operations:") +print("β€’ 'Which product categories have the highest profit margins?'") +print("β€’ 'Show customer geographic distribution by total spending'") +print("β€’ 'Analyze order completion rates by shipping method'") + +print("\nπŸ” Semantic Content Analysis:") +print("β€’ 'Find reviews about products being difficult to use or setup'") +print("β€’ 'Locate feedback about customer support experiences'") +print("β€’ 'Search for mentions of product packaging and presentation'") + +print("\n🎯 Hybrid Analysis Scenarios:") +print("β€’ 'Identify top-selling products with usability complaints'") +print("β€’ 'Find high-value customers who mention quality concerns'") + +print("-" * 70) + +while True: + question = input("\nπŸ” Query: ").strip() + + if question.lower() == 'quit': + print("βœ… Query testing session completed") + break + + if question: + print(f"\nπŸ“ Processing: {question}") + print("βš™οΈ Analyzing query intent and determining execution strategy...") + answer = ask_ai(question) + print(f"\nπŸ“Š Result: {answer}") + print("-" * 70) +``` + +

🧹 STEP 9: Cleanup (Optional)

+Run this to delete all AWS resources created by this demo and avoid ongoing charges.

```python
# Clean up AWS resources to avoid ongoing charges.
# This deletes the Aurora cluster and instance, the VPC, subnets,
# security groups, and the database credentials secret created by infra.py.

!python clean.py
```

---

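`clean.py` decides whether a resource is already gone by catching the describe call's error and matching on the exception text (e.g. `DBInstanceNotFound`, `DBClusterNotFoundFault`). That pattern can be factored into a small helper; the sketch below uses a hypothetical `resource_gone` name and stub describe functions in place of the real `rds.describe_db_clusters` / `rds.describe_db_instances` calls, so it is illustrative rather than part of `clean.py`:

```python
def resource_gone(describe_fn, not_found_markers=("NotFound",)):
    """Return True if the describe call raises a NotFound-style error,
    False if the resource still exists; re-raise anything else."""
    try:
        describe_fn()
        return False
    except Exception as e:
        if any(marker in str(e) for marker in not_found_markers):
            return True
        raise

# Stubs standing in for the real boto3 describe calls
def cluster_deleted():
    raise Exception("An error occurred (DBClusterNotFoundFault): cluster not found")

def instance_still_there():
    return {"DBInstances": [{"DBInstanceStatus": "deleting"}]}

print(resource_gone(cluster_deleted))       # cluster no longer exists
print(resource_gone(instance_still_there))  # instance still reported
```

Other errors (for example `AccessDenied`) propagate instead of being swallowed, which keeps genuine permission problems visible during cleanup.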

πŸ” Tags

+ +`#text-to-sql` `#amazon-bedrock` `#postgresql-pgvector` `#aurora-serverless` `#claude-sonnet` `#vector-database` `#semantic-search` `#aws-rds-data-api` `#natural-language-sql` `#llm-database-integration` `#enterprise-ai` `#hybrid-database` `#sql-generation` `#vector-similarity` `#bedrock-integration` diff --git a/genai-use-cases/bedrock-text2sql-aurora-pgvector/clean.py b/genai-use-cases/bedrock-text2sql-aurora-pgvector/clean.py new file mode 100644 index 000000000..6aa62d2a4 --- /dev/null +++ b/genai-use-cases/bedrock-text2sql-aurora-pgvector/clean.py @@ -0,0 +1,228 @@ +import boto3 +import json +import time +import sys + +# Load configuration +with open("config.json", "r") as f: + config = json.load(f) + +# Get current AWS region dynamically +session = boto3.Session() +aws_region = session.region_name or "us-west-2" # fallback to us-west-2 if not set +print(f"Using AWS region: {aws_region}") + +# Initialize clients with dynamic region +ec2 = boto3.client("ec2", region_name=aws_region) +rds = boto3.client("rds", region_name=aws_region) +secrets = boto3.client("secretsmanager", region_name=aws_region) + +print("Starting cleanup of AWS resources...") + +# Delete Aurora instance first +try: + print("Deleting Aurora instance...") + rds.delete_db_instance( + DBInstanceIdentifier=config["aurora"]["instance_identifier"], + SkipFinalSnapshot=True, + DeleteAutomatedBackups=True, + ) + + # Wait for instance to be deleted + print("Waiting for instance deletion...") + while True: + try: + response = rds.describe_db_instances( + DBInstanceIdentifier=config["aurora"]["instance_identifier"] + ) + status = response["DBInstances"][0]["DBInstanceStatus"] + print(f"Instance status: {status}") + + if status == "deleting": + time.sleep(30) + continue + else: + break + + except Exception as e: + if "DBInstanceNotFound" in str(e): + print("Aurora instance deleted") + break + else: + raise +except Exception as e: + print(f"Instance deletion error (might not exist): {e}") + +# Delete 
Aurora cluster +try: + print("Deleting Aurora cluster...") + rds.delete_db_cluster( + DBClusterIdentifier=config["aurora"]["cluster_identifier"], + SkipFinalSnapshot=True, + ) + + # Wait for cluster to be deleted + print("Waiting for cluster deletion...") + while True: + try: + response = rds.describe_db_clusters( + DBClusterIdentifier=config["aurora"]["cluster_identifier"] + ) + status = response["DBClusters"][0]["Status"] + print(f"Cluster status: {status}") + + if status == "deleting": + time.sleep(30) + continue + else: + break + + except Exception as e: + if "DBClusterNotFoundFault" in str(e): + print("Aurora cluster deleted") + break + else: + raise +except Exception as e: + print(f"Cluster deletion error (might not exist): {e}") + +# Delete DB subnet group (now that cluster is deleted) +try: + print("Deleting DB subnet group...") + rds.delete_db_subnet_group( + DBSubnetGroupName=config["resources"]["subnet_group_name"] + ) + print("DB subnet group deleted") +except Exception as e: + print(f"DB subnet group deletion error: {e}") + +# Find and delete the secret (skip RDS-managed secrets) +try: + print("Finding and deleting secrets...") + secrets_list = secrets.list_secrets() + for secret in secrets_list["SecretList"]: + # Only delete user-created secrets, not RDS-managed ones + if config["aurora"]["cluster_identifier"] in secret["Name"] and not secret[ + "Name" + ].startswith("rds!"): + try: + secrets.delete_secret( + SecretId=secret["ARN"], ForceDeleteWithoutRecovery=True + ) + print(f"Deleted secret: {secret['Name']}") + except Exception as e: + print(f"Could not delete secret {secret['Name']}: {e}") +except Exception as e: + print(f"Secret deletion error: {e}") + +# Get VPC details +try: + vpcs = ec2.describe_vpcs( + Filters=[{"Name": "cidr-block", "Values": [config["vpc"]["cidr_block"]]}] + )["Vpcs"] + + if not vpcs: + print("VPC not found - cleanup complete") + sys.exit(0) + + vpc_id = vpcs[0]["VpcId"] + print(f"Found VPC to cleanup: {vpc_id}") + + # 
Delete security groups (except default) - but first remove their rules + try: + sgs = ec2.describe_security_groups( + Filters=[{"Name": "vpc-id", "Values": [vpc_id]}] + )["SecurityGroups"] + + # First, remove all ingress and egress rules from custom security groups + for sg in sgs: + if sg["GroupName"] != "default": + try: + # Remove ingress rules + if sg["IpPermissions"]: + ec2.revoke_security_group_ingress( + GroupId=sg["GroupId"], IpPermissions=sg["IpPermissions"] + ) + # Remove egress rules + if sg["IpPermissionsEgress"]: + ec2.revoke_security_group_egress( + GroupId=sg["GroupId"], + IpPermissions=sg["IpPermissionsEgress"], + ) + except Exception as e: + print( + f"Error removing rules from security group {sg['GroupId']}: {e}" + ) + + # Now delete the security groups + for sg in sgs: + if sg["GroupName"] != "default": + try: + print(f"Deleting security group: {sg['GroupId']}") + ec2.delete_security_group(GroupId=sg["GroupId"]) + except Exception as e: + print(f"Security group deletion error for {sg['GroupId']}: {e}") + except Exception as e: + print(f"Security group deletion error: {e}") + + # Get subnets + subnets = ec2.describe_subnets(Filters=[{"Name": "vpc-id", "Values": [vpc_id]}])[ + "Subnets" + ] + + # Detach and delete Internet Gateway FIRST (before route tables) + try: + igws = ec2.describe_internet_gateways( + Filters=[{"Name": "attachment.vpc-id", "Values": [vpc_id]}] + )["InternetGateways"] + + for igw in igws: + igw_id = igw["InternetGatewayId"] + print(f"Detaching and deleting Internet Gateway: {igw_id}") + ec2.detach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id) + ec2.delete_internet_gateway(InternetGatewayId=igw_id) + except Exception as e: + print(f"Internet Gateway deletion error: {e}") + + # Delete route table associations and custom route tables (after IGW is detached) + try: + route_tables = ec2.describe_route_tables( + Filters=[{"Name": "vpc-id", "Values": [vpc_id]}] + )["RouteTables"] + + for rt in route_tables: + # Skip the 
main route table + if not any(assoc.get("Main", False) for assoc in rt["Associations"]): + try: + print(f"Deleting route table: {rt['RouteTableId']}") + ec2.delete_route_table(RouteTableId=rt["RouteTableId"]) + except Exception as e: + print(f"Route table deletion error for {rt['RouteTableId']}: {e}") + except Exception as e: + print(f"Route table deletion error: {e}") + + # Delete subnets + for subnet in subnets: + try: + print(f"Deleting subnet: {subnet['SubnetId']}") + ec2.delete_subnet(SubnetId=subnet["SubnetId"]) + except Exception as e: + print(f"Subnet deletion error for {subnet['SubnetId']}: {e}") + + # Wait for all resources to be fully cleaned up before attempting VPC deletion + print("Waiting for all resources to be fully cleaned up before VPC deletion...") + time.sleep(30) # Increased wait time + + # Delete VPC as the very last step + try: + print(f"Attempting to delete VPC: {vpc_id}") + ec2.delete_vpc(VpcId=vpc_id) + print("βœ… VPC deleted successfully") + except Exception as e: + print(f"❌ VPC deletion failed: {e}") + print( + " This is common on first run. Try running clean.py again to complete VPC deletion." 
+ ) + +except Exception as e: + print(f"Error during cleanup: {e}", flush=True) diff --git a/genai-use-cases/bedrock-text2sql-aurora-pgvector/config.json b/genai-use-cases/bedrock-text2sql-aurora-pgvector/config.json new file mode 100644 index 000000000..0ca9bc22d --- /dev/null +++ b/genai-use-cases/bedrock-text2sql-aurora-pgvector/config.json @@ -0,0 +1,18 @@ +{ + "vpc": { + "cidr_block": "10.11.0.0/16", + "subnet_cidrs": ["10.11.1.0/24", "10.11.2.0/24", "10.11.3.0/24"] + }, + "aurora": { + "cluster_identifier": "aurora-text2sql-cluster", + "instance_identifier": "aurora-text2sql-instance", + "database_name": "text2sql", + "master_username": "postgres", + "min_capacity": 1, + "max_capacity": 16 + }, + "resources": { + "security_group_name": "aurora-sg", + "subnet_group_name": "aurora-subnet-group" + } +} \ No newline at end of file diff --git a/genai-use-cases/bedrock-text2sql-aurora-pgvector/infra.py b/genai-use-cases/bedrock-text2sql-aurora-pgvector/infra.py new file mode 100644 index 000000000..ca3f79522 --- /dev/null +++ b/genai-use-cases/bedrock-text2sql-aurora-pgvector/infra.py @@ -0,0 +1,171 @@ +import boto3 +import json +import time +import sys + +# Load configuration +with open('config.json', 'r') as f: + config = json.load(f) + +# Get current AWS region dynamically +session = boto3.Session() +aws_region = session.region_name or 'us-west-2' # fallback to us-west-2 if not set +print(f"Using AWS region: {aws_region}") + +# Initialize clients with dynamic region +ec2 = boto3.client('ec2', region_name=aws_region) +rds = boto3.client('rds', region_name=aws_region) + +# Create VPC +vpc = ec2.create_vpc(CidrBlock=config['vpc']['cidr_block']) +vpc_id = vpc['Vpc']['VpcId'] +print(f"Created VPC: {vpc_id}") + +# Get availability zones +azs = ec2.describe_availability_zones()['AvailabilityZones'][:3] +az_names = [az['ZoneName'] for az in azs] + +# Create subnets in 3 AZs +subnet_ids = [] +for i, az in enumerate(az_names): + subnet = ec2.create_subnet( + VpcId=vpc_id, 
+ CidrBlock=config['vpc']['subnet_cidrs'][i], + AvailabilityZone=az + ) + subnet_ids.append(subnet['Subnet']['SubnetId']) + print(f"Created subnet {subnet['Subnet']['SubnetId']} in {az}") + +# Create Internet Gateway +igw = ec2.create_internet_gateway() +igw_id = igw['InternetGateway']['InternetGatewayId'] +ec2.attach_internet_gateway(InternetGatewayId=igw_id, VpcId=vpc_id) + +# Create route table and add route to IGW +rt = ec2.create_route_table(VpcId=vpc_id) +rt_id = rt['RouteTable']['RouteTableId'] +ec2.create_route(RouteTableId=rt_id, DestinationCidrBlock='0.0.0.0/0', GatewayId=igw_id) + +# Associate subnets with route table +for subnet_id in subnet_ids: + ec2.associate_route_table(RouteTableId=rt_id, SubnetId=subnet_id) + +# Create security group for Aurora +sg = ec2.create_security_group( + GroupName=config['resources']['security_group_name'], + Description='Aurora PostgreSQL security group', + VpcId=vpc_id +) +sg_id = sg['GroupId'] + +# Allow PostgreSQL access within VPC +ec2.authorize_security_group_ingress( + GroupId=sg_id, + IpPermissions=[{ + 'IpProtocol': 'tcp', + 'FromPort': 5432, + 'ToPort': 5432, + 'IpRanges': [{'CidrIp': config['vpc']['cidr_block']}] + }] +) + +# Create DB subnet group +rds.create_db_subnet_group( + DBSubnetGroupName=config['resources']['subnet_group_name'], + DBSubnetGroupDescription='Aurora subnet group', + SubnetIds=subnet_ids +) + +# Create Aurora PostgreSQL Serverless v2 cluster +cluster = rds.create_db_cluster( + DBClusterIdentifier=config['aurora']['cluster_identifier'], + Engine='aurora-postgresql', + EngineMode='provisioned', + DatabaseName=config['aurora']['database_name'], + MasterUsername=config['aurora']['master_username'], + ManageMasterUserPassword=True, + VpcSecurityGroupIds=[sg_id], + DBSubnetGroupName=config['resources']['subnet_group_name'], + StorageType='aurora', + EnableHttpEndpoint=True, + MonitoringInterval=0, + ServerlessV2ScalingConfiguration={ + 'MinCapacity': config['aurora']['min_capacity'], + 
'MaxCapacity': config['aurora']['max_capacity'] + } +) + +print(f"Created Aurora cluster: {cluster['DBCluster']['DBClusterIdentifier']}") + +# Wait for cluster to be available +print("Waiting for cluster to be available...") +while True: + try: + response = rds.describe_db_clusters(DBClusterIdentifier=config['aurora']['cluster_identifier']) + status = response['DBClusters'][0]['Status'] + print(f"Cluster status: {status}") + + if status == 'available': + break + elif status in ['failed', 'deleted', 'deleting']: + raise Exception(f"Cluster creation failed with status: {status}") + + time.sleep(30) + except Exception as e: + if 'DBClusterNotFoundFault' in str(e): + raise Exception("Cluster creation failed - cluster not found") + raise + +# Create Aurora instance +instance = rds.create_db_instance( + DBInstanceIdentifier=config['aurora']['instance_identifier'], + DBInstanceClass='db.serverless', + Engine='aurora-postgresql', + DBClusterIdentifier=config['aurora']['cluster_identifier'] +) + +print(f"Created Aurora instance: {instance['DBInstance']['DBInstanceIdentifier']}") + +# Wait for instance to be available +print("Waiting for instance to be available...") +while True: + try: + response = rds.describe_db_instances(DBInstanceIdentifier=config['aurora']['instance_identifier']) + status = response['DBInstances'][0]['DBInstanceStatus'] + print(f"Instance status: {status}", flush=True) + + if status == 'available': + print("βœ… Aurora instance is now available!", flush=True) + break + elif status in ['failed', 'deleted', 'deleting']: + raise Exception(f"Instance creation failed with status: {status}") + + print("⏳ Waiting 30 seconds before next status check...", flush=True) + time.sleep(30) + except Exception as e: + if 'DBInstanceNotFound' in str(e): + raise Exception("Instance creation failed - instance not found") + raise + +print("Infrastructure deployment completed!", flush=True) +print(f"VPC ID: {vpc_id}", flush=True) +print(f"Cluster: 
{config['aurora']['cluster_identifier']}", flush=True) +print(f"Database: {config['aurora']['database_name']}", flush=True) + +# Get the cluster ARN and secrets ARN for dynamic linking +cluster_arn = cluster['DBCluster']['DBClusterArn'] +secrets_arn = cluster['DBCluster']['MasterUserSecret']['SecretArn'] + +print("\n" + "="*80) +print("πŸ“‹ COPY THESE VALUES TO YOUR JUPYTER NOTEBOOK:") +print("="*80) +print("# Database connection configuration") +print("# Replace the hardcoded values in your notebook with these:") +print(f"CLUSTER_ARN = '{cluster_arn}'") +print(f"SECRET_ARN = '{secrets_arn}'") +print(f"DATABASE_NAME = '{config['aurora']['database_name']}'") +print(f"AWS_REGION = '{aws_region}'") +print("="*80) + +# Natural completion for Jupyter notebook compatibility +print("βœ… Infrastructure deployment completed successfully!", flush=True) \ No newline at end of file diff --git a/genai-use-cases/bedrock-text2sql-aurora-pgvector/text2sql-postgresql.ipynb b/genai-use-cases/bedrock-text2sql-aurora-pgvector/text2sql-postgresql.ipynb new file mode 100644 index 000000000..7f6d2a195 --- /dev/null +++ b/genai-use-cases/bedrock-text2sql-aurora-pgvector/text2sql-postgresql.ipynb @@ -0,0 +1,1162 @@ +{ + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Advanced Text-to-SQL with PostgreSQL Vector Search\n", + "\n", + "This notebook demonstrates the integration of **traditional relational database operations** with **vector search capabilities** in PostgreSQL, featuring automated query strategy selection based on user intent analysis.\n", + "\n", + "### 🎯 **Core Technical Demonstrations:**\n", + "\n", + "#### 1. **Complex Schema Text-to-SQL Generation**\n", + "- LLM-powered natural language to SQL conversion across multi-table schemas\n", + "- Handling hierarchical data structures, complex joins, and nested aggregations\n", + "- **Demonstrating schema comprehension for enterprise-scale database architectures**\n", + "\n", + "#### 2. 
**PostgreSQL pgvector Integration** \n", + "- Native vector storage and similarity search within PostgreSQL\n", + "- Embedding-based semantic search on unstructured text data\n", + "- Demonstrating RDBMS + vector database convergence\n", + "\n", + "#### 3. **Automated Query Strategy Selection**\n", + "- Foundation model analysis of query intent and optimal execution path determination\n", + "- Context-aware routing between structured SQL and semantic vector operations\n", + "- Unified interface abstracting query complexity from end users\n", + "\n", + "### πŸ—οΈ **Database Schema Architecture**\n", + "\n", + "ecommerce schema demonstrating complex relationships:\n", + "\n", + "```\n", + "β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ users β”‚ β”‚ categories β”‚ β”‚ products β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ user_id (PK) β”‚ β”‚ category_id (PK) β”‚ β”‚ product_id (PK) β”‚\n", + "β”‚ email β”‚ β”‚ name β”‚ β”‚ sku β”‚\n", + "β”‚ username β”‚ β”‚ slug β”‚ β”‚ name β”‚\n", + "β”‚ first_name β”‚ β”‚ description β”‚ β”‚ description β”‚\n", + "β”‚ last_name β”‚ β”‚ parent_category_idβ”‚ β”‚ category_id (FK)β”‚\n", + "β”‚ city β”‚ β”‚ (FK to self) β”‚ β”‚ brand β”‚\n", + "β”‚ state_province β”‚ β”‚ product_count β”‚ β”‚ price β”‚\n", + "β”‚ total_orders β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ stock_quantity β”‚\n", + "β”‚ total_spent β”‚ β”‚ β”‚ rating_average β”‚\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ β”‚ total_sales β”‚\n", + " β”‚ β”‚ β”‚ revenue_generatedβ”‚\n", + " β”‚ β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + " β”‚ β”‚ β”‚\n", + " β”‚ β”‚ β”‚\n", + 
"β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”\n", + "β”‚ orders β”‚ β”‚ order_items β”‚ β”‚ reviews β”‚\n", + "β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€ β”œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€\n", + "β”‚ order_id (PK) │────│ order_id (FK) β”‚ β”‚ review_id (PK) β”‚\n", + "β”‚ order_number β”‚ β”‚ product_id (FK) │────│ product_id (FK) β”‚\n", + "β”‚ user_id (FK) β”‚ β”‚ quantity β”‚ β”‚ user_id (FK) β”‚\n", + "β”‚ order_status β”‚ β”‚ unit_price β”‚ β”‚ order_id (FK) β”‚\n", + "β”‚ payment_status β”‚ β”‚ total_price β”‚ β”‚ rating β”‚\n", + "β”‚ total_amount β”‚ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ title β”‚\n", + "β”‚ shipping_method β”‚ β”‚ comment β”‚\n", + "β”‚ created_at β”‚ β”‚ comment_embeddingβ”‚\n", + "β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β”‚ (VECTOR) β”‚\n", + " β”‚ pros β”‚\n", + " β”‚ cons β”‚\n", + " β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜\n", + "```\n", + "\n", + "**Schema Complexity Features:**\n", + "- **Self-referencing hierarchies**: Categories with parent/child relationships\n", + "- **Junction table patterns**: Many-to-many order-product relationships via order_items\n", + "- **Vector integration**: Native pgvector storage in reviews.comment_embedding\n", + "- **Multi-level foreign keys**: Reviews referencing users, products, and orders\n", + "\n", + "### πŸ’‘ **Technical Implementation:**\n", + "\n", + "1. **Hybrid Database Architecture**: PostgreSQL with pgvector extension for unified structured + vector operations\n", + "2. **LLM Schema Comprehension**: Foundation model understanding of complex table relationships and optimal query generation\n", + "3. **Embedding-based Similarity**: Amazon Titan text embeddings for semantic content matching\n", + "4. 
**Automated Tool Selection**: Context analysis determining SQL vs vector search execution paths\n", + "\n", + "## Technical Prerequisites\n", + "- AWS account with Bedrock and RDS permissions\n", + "- Understanding of vector embeddings and similarity search concepts\n", + "- Familiarity with PostgreSQL and complex SQL operations\n", + "\n", + "---" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ“¦ STEP 1: Install Required Packages" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "0afb548c-60e7-4929-9ae2-26113df7e5b0", + "metadata": {}, + "outputs": [], + "source": [ + "# Install required Python packages for AWS and SQL parsing\n", + "!pip install --upgrade pip\n", + "!pip install boto3 sqlparse" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ—οΈ STEP 2: Deploy AWS Infrastructure\n", + "\n", + "This step creates:\n", + "- **VPC with 3 subnets** across availability zones\n", + "- **Aurora PostgreSQL Serverless v2 cluster** with HTTP endpoint enabled\n", + "- **Security groups** and networking configuration\n", + "- **Secrets Manager** for database credentials\n", + "\n", + "**Note**: This takes ~5-10 minutes to complete" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "4efd1122-9bee-4cde-a536-0d149b796f0f", + "metadata": {}, + "outputs": [], + "source": [ + "# Deploy AWS infrastructure (VPC, Aurora PostgreSQL, Security Groups)\n", + "# This script creates all necessary AWS resources for our demo\n", + "\n", + "!python infra.py" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ”§ STEP 3: Setup Database Connection" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "6a29e0a2-12ff-42f7-b1aa-8ffecf097633", + "metadata": {}, + "outputs": [], + "source": [ + "# Import required libraries for AWS services and database operations\n", + "import json\n", + "import boto3\n", + "import logging\n", + "import 
sqlparse\n", + "from typing import Dict, Any, List, Union\n", + "from botocore.exceptions import ClientError\n", + "from botocore.config import Config\n", + "\n", + "# Get current AWS region dynamically\n", + "session = boto3.Session()\n", + "AWS_REGION = session.region_name or 'us-west-2' # fallback to us-west-2 if not set\n", + "print(f\"Using AWS region: {AWS_REGION}\")\n", + "\n", + "# Setup logging to track our progress\n", + "logging.basicConfig(level=logging.INFO, format=\"%(asctime)s - %(levelname)s - %(message)s\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1bb84fc4-0b36-493d-a269-60959ece77a9", + "metadata": {}, + "outputs": [], + "source": [ + "# Database connection configuration\n", + "# **Update these values after running infra.py with the output provided**\n", + "CLUSTER_ARN = ''\n", + "SECRET_ARN = ''\n", + "DATABASE_NAME = ''\n", + "AWS_REGION = ''\n", + "\n", + "# Initialize RDS Data API client (allows SQL execution without direct connections, to learn more visit https://docs.aws.amazon.com/AmazonRDS/latest/AuroraUserGuide/data-api.html)\n", + "rds_client = boto3.client('rds-data', region_name=AWS_REGION)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ› οΈ STEP 4: Create Database Schema & Load Data\n", + "\n", + "We'll create a streamlined but complex ecommerce schema with 6 core tables that demonstrate:\n", + "- **Hierarchical relationships** (categories with parent/child structure)\n", + "- **Many-to-many relationships** (orders ↔ products via junction table) \n", + "- **Vector integration** (reviews with embedding column for semantic search)\n", + "- **Analytics capabilities** (aggregated sales metrics and customer data)" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "bd1d2ddd-b839-4fe0-9fb9-edf566807245", + "metadata": {}, + "outputs": [], + "source": [ + "def run_sql(query: str, database: str = None) -> dict:\n", + " \"\"\"\n", + " Execute SQL query using 
RDS Data API\n", + " This is our main function for running any SQL command\n", + " \"\"\"\n", + " try:\n", + " params = {\n", + " 'resourceArn': CLUSTER_ARN,\n", + " 'secretArn': SECRET_ARN,\n", + " 'sql': query\n", + " }\n", + " if database:\n", + " params['database'] = database\n", + " \n", + " response = rds_client.execute_statement(**params)\n", + " return response\n", + " except Exception as e:\n", + " print(f\"SQL execution error: {e}\")\n", + " return {\"error\": str(e)}" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "440ad441-0550-49c5-af88-15efec549268", + "metadata": {}, + "outputs": [], + "source": [ + "# Enable pgvector extension for semantic search capabilities\n", + "# pgvector allows PostgreSQL to store and search vector embeddings\n", + "try:\n", + " result = run_sql('CREATE EXTENSION IF NOT EXISTS vector;', DATABASE_NAME)\n", + " print(\"βœ… pgvector extension enabled successfully\")\n", + "except Exception as e:\n", + " print(f\"Extension setup error: {e}\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "2d52f3db-5f84-44d1-9292-470afdc863a1", + "metadata": {}, + "outputs": [], + "source": [ + "# Create tables by reading our schema file\n", + "# Parse SQL file into individual statements (RDS Data API requirement)\n", + "with open('ecommerce_schema.sql', 'r') as f:\n", + " schema_sql = f.read()\n", + "\n", + "statements = sqlparse.split(schema_sql)\n", + "statements = [stmt.strip() for stmt in statements if stmt.strip()]\n", + "\n", + "print(f\"Creating {len(statements)} database tables...\")\n", + "print(\"πŸ“Š Schema includes: users, categories, products, orders, order_items, reviews\")\n", + "print(\"🧠 Vector integration: reviews.comment_embedding for semantic search\")\n", + "\n", + "# Execute each CREATE TABLE statement\n", + "for i, statement in enumerate(statements, 1):\n", + " try:\n", + " run_sql(statement, DATABASE_NAME)\n", + " print(f\" βœ… Table {i} created successfully\")\n", + " 
except Exception as e:\n", + " print(f\" ❌ Table {i} failed: {e}\")\n", + "\n", + "print(\"βœ… Database schema creation completed!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "19105a1b-ae33-4619-9860-88d2c9a93be8", + "metadata": {}, + "outputs": [], + "source": [ + "# Insert sample data into our tables\n", + "with open('ecommerce_data.sql', 'r') as f:\n", + " data_sql = f.read()\n", + "\n", + "data_statements = sqlparse.split(data_sql)\n", + "data_statements = [stmt.strip() for stmt in data_statements if stmt.strip()]\n", + "\n", + "print(f\"Inserting sample data with {len(data_statements)} statements...\")\n", + "print(\"πŸ‘₯ 15 users across different US cities with spending history\")\n", + "print(\"πŸ“¦ 16 products across 8 categories (Electronics β†’ Audio/Video, Smart Devices, etc.)\")\n", + "print(\"πŸ›’ 10 orders with various statuses (delivered, shipped, processing, cancelled)\")\n", + "print(\"⭐ 13 detailed product reviews perfect for semantic search\")\n", + "\n", + "for i, statement in enumerate(data_statements, 1):\n", + " try:\n", + " result = run_sql(statement, DATABASE_NAME)\n", + " records_affected = result.get('numberOfRecordsUpdated', 0)\n", + " print(f\" βœ… Dataset {i}: {records_affected} records inserted\")\n", + " except Exception as e:\n", + " print(f\" ❌ Dataset {i} failed: {e}\")\n", + "\n", + "print(\"βœ… Sample data insertion completed!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🧠 STEP 5: Bedrock Setup" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cfcd2744-65c2-4bf4-87e5-ef21e781bdc5", + "metadata": {}, + "outputs": [], + "source": [ + "# Configure Bedrock client with extended timeouts for large requests\n", + "bedrock_config = Config(\n", + " connect_timeout=60*5, # 5 minutes\n", + " read_timeout=60*5, # 5 minutes\n", + ")\n", + "\n", + "bedrock = boto3.client(\n", + " service_name='bedrock-runtime', \n", + " region_name=AWS_REGION,\n", + " 
config=bedrock_config\n", + ")\n", + "\n", + "# Model IDs for our use\n", + "CLAUDE_MODEL = \"us.anthropic.claude-3-7-sonnet-20250219-v1:0\" # For text-to-SQL\n", + "EMBEDDING_MODEL = \"amazon.titan-embed-text-v2:0\" # For vector search\n", + "\n", + "print(\"βœ… Bedrock configured successfully\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "5352dc09-f513-4c27-b619-376dfcf49d60", + "metadata": {}, + "outputs": [], + "source": [ + "class DatabaseTools:\n", + " \"\"\"Simple database helper for executing SQL queries\"\"\"\n", + " \n", + " def __init__(self):\n", + " self.rds_client = boto3.client(\"rds-data\", region_name=AWS_REGION)\n", + " \n", + " def execute_sql(self, query: str) -> str:\n", + " \"\"\"Execute SQL query and return results as JSON string\"\"\"\n", + " try:\n", + " response = self.rds_client.execute_statement(\n", + " resourceArn=CLUSTER_ARN,\n", + " secretArn=SECRET_ARN,\n", + " database=DATABASE_NAME,\n", + " sql=query,\n", + " includeResultMetadata=True,\n", + " )\n", + " \n", + " # Handle empty results\n", + " if \"records\" not in response or not response[\"records\"]:\n", + " return json.dumps([])\n", + " \n", + " # Get column names and format results\n", + " columns = [field[\"name\"] for field in response.get(\"columnMetadata\", [])]\n", + " results = []\n", + " \n", + " for record in response[\"records\"]:\n", + " row_values = []\n", + " for field in record:\n", + " # Extract value from different field types\n", + " if \"stringValue\" in field:\n", + " row_values.append(field[\"stringValue\"])\n", + " elif \"longValue\" in field:\n", + " row_values.append(field[\"longValue\"])\n", + " elif \"doubleValue\" in field:\n", + " row_values.append(field[\"doubleValue\"])\n", + " elif \"booleanValue\" in field:\n", + " row_values.append(field[\"booleanValue\"])\n", + " else:\n", + " row_values.append(None)\n", + " \n", + " results.append(dict(zip(columns, row_values)))\n", + " \n", + " return json.dumps(results, 
indent=2)\n", + " \n", + " except Exception as e:\n", + " return json.dumps({\"error\": f\"Database error: {str(e)}\"})\n", + "\n", + "# Test database connection\n", + "db_tools = DatabaseTools()\n", + "result = db_tools.execute_sql(\"SELECT current_timestamp;\")\n", + "print(\"βœ… Database connection test successful\")\n", + "print(\"Current time:\", json.loads(result)[0][\"current_timestamp\"])" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ”’ STEP 6: Generate Vector Embeddings for Semantic Search\n", + "\n", + "**Hybrid RDBMS + Vector Database Implementation:**\n", + "\n", + "Vector embeddings convert textual content into high-dimensional numerical representations that capture semantic relationships. PostgreSQL's pgvector extension enables native vector operations within the relational database, eliminating the need for separate vector database infrastructure.\n", + "\n", + "**Technical Implementation:**\n", + "- Amazon Titan Text Embeddings v2 (1024-dimensional vectors)\n", + "- PostgreSQL VECTOR data type with cosine similarity operations\n", + "- Semantic search on review content independent of exact keyword matching\n", + "\n", + "This approach demonstrates the convergence of traditional RDBMS and vector database capabilities in production systems." 
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "18a417e1-8051-40b0-84c0-464f9f7923ae", + "metadata": {}, + "outputs": [], + "source": [ + "def create_embedding(text: str) -> List[float]:\n", + " \"\"\"\n", + " Convert text into a vector embedding using Amazon Titan\n", + " Returns a list of 1024 numbers that represent the text's meaning\n", + " \"\"\"\n", + " payload = {\n", + " \"inputText\": text,\n", + " \"embeddingTypes\": [\"float\"]\n", + " }\n", + " \n", + " try:\n", + " response = bedrock.invoke_model(\n", + " modelId=EMBEDDING_MODEL,\n", + " body=json.dumps(payload),\n", + " accept=\"application/json\",\n", + " contentType=\"application/json\"\n", + " )\n", + " \n", + " body = json.loads(response[\"body\"].read())\n", + " embeddings = body.get(\"embeddingsByType\", {}).get(\"float\", [])\n", + " return embeddings\n", + " \n", + " except Exception as e:\n", + " print(f\"Embedding generation error: {e}\")\n", + " return []\n", + "\n", + "# Test embedding generation\n", + "test_text = \"This battery lasts a long time\"\n", + "test_embedding = create_embedding(test_text)\n", + "print(f\"βœ… Generated embedding with {len(test_embedding)} dimensions\")\n", + "print(f\"Sample values: {test_embedding[:5]}...\") # Show first 5 numbers" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "a2a95f6d-6ea7-418d-9930-c50b854ca0c5", + "metadata": {}, + "outputs": [], + "source": [ + "def add_embeddings_to_reviews():\n", + " \"\"\"\n", + " Generate embeddings for all review comments and store them in the database\n", + " This enables semantic search on review content\n", + " \"\"\"\n", + " \n", + " # Step 1: Find reviews that need embeddings\n", + " count_query = \"SELECT COUNT(*) FROM reviews WHERE comment_embedding IS NULL\"\n", + " count_result = db_tools.execute_sql(count_query)\n", + " total_missing = json.loads(count_result)[0][\"count\"]\n", + " \n", + " print(f\"Found {total_missing} reviews needing embeddings\")\n", + " 
\n", + " if total_missing == 0:\n", + " print(\"βœ… All reviews already have embeddings!\")\n", + " return\n", + " \n", + " # Step 2: Get reviews without embeddings\n", + " select_query = \"\"\"\n", + " SELECT review_id, comment \n", + " FROM reviews \n", + " WHERE comment_embedding IS NULL \n", + " AND comment IS NOT NULL\n", + " ORDER BY review_id\n", + " \"\"\"\n", + " \n", + " result = db_tools.execute_sql(select_query)\n", + " reviews = json.loads(result)\n", + " \n", + " # Step 3: Generate embeddings for each review\n", + " for review in reviews:\n", + " review_id = review[\"review_id\"]\n", + " comment = review[\"comment\"]\n", + " \n", + " if not comment:\n", + " continue\n", + " \n", + " print(f\" Processing review {review_id}...\")\n", + " \n", + " # Generate embedding\n", + " embedding = create_embedding(comment)\n", + " if not embedding:\n", + " continue\n", + " \n", + " # Convert to PostgreSQL vector format\n", + " vector_str = \"[\" + \",\".join(str(x) for x in embedding) + \"]\"\n", + " \n", + " # Update database with embedding\n", + " update_query = f\"\"\"\n", + " UPDATE reviews \n", + " SET comment_embedding = '{vector_str}'::vector \n", + " WHERE review_id = {review_id}\n", + " \"\"\"\n", + " \n", + " run_sql(update_query, DATABASE_NAME)\n", + " print(f\" βœ… Added embedding for review {review_id}\")\n", + " \n", + " print(\"βœ… All review embeddings generated successfully!\")\n", + "\n", + "# Generate embeddings for all reviews\n", + "add_embeddings_to_reviews()" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ€– STEP 7: Foundation Model Tool Selection System\n", + "\n", + "**Query Strategy Determination:**\n", + "\n", + "Claude Sonnet analyzes natural language queries and automatically determines the optimal execution strategy through tool selection logic:\n", + "\n", + "**πŸ“Š Structured Query Scenarios (SQL Tool Selection):**\n", + "- Aggregation operations: \"What's the average order value by state?\"\n", + "- 
Complex joins: \"Show customers with repeat purchases in Electronics\"\n", + "- Mathematical calculations: \"Calculate profit margins by product category\"\n", + "- Temporal analysis: \"Find order trends over the last quarter\"\n", + "\n", + "**πŸ” Semantic Search Scenarios (Vector Tool Selection):**\n", + "- Content similarity: \"Find reviews about build quality issues\"\n", + "- Sentiment analysis: \"Show complaints about customer service\"\n", + "- Topic clustering: \"What do users say about product durability?\"\n", + "- Conceptual matching: Independent of exact keyword presence\n", + "\n", + "**🎯 Hybrid Query Execution:**\n", + "- Complex scenarios may trigger the use of multiple tools\n", + "- Foundation model orchestrates sequential or parallel execution\n", + "- Results synthesis from both structured and semantic operations\n", + "\n", + "**Technical Architecture:**\n", + "- Tool specification via JSON schema definitions\n", + "- Automated function calling based on intent classification\n", + "- Context-aware execution path optimization" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "cf4ee1a5-c963-48d3-bbb7-efbde659aac1", + "metadata": {}, + "outputs": [], + "source": [ + "def semantic_search(search_text: str, limit: int = 5) -> str:\n", + " \"\"\"\n", + " Find reviews similar to the search text using vector similarity\n", + " Returns the most semantically similar reviews\n", + " \"\"\"\n", + " try:\n", + " # Generate embedding for search text\n", + " search_embedding = create_embedding(search_text)\n", + " if not search_embedding:\n", + " return json.dumps({\"error\": \"Could not generate embedding\"})\n", + " \n", + " # Convert to PostgreSQL vector format\n", + " vector_str = \"[\" + \",\".join(str(x) for x in search_embedding) + \"]\"\n", + " \n", + " # Find similar reviews using cosine distance (pgvector's <=> operator; <-> is Euclidean distance)\n", + " query = f\"\"\"\n", + " SELECT \n", + " rating,\n", + " title,\n", + " comment,\n", + " pros,\n", + " cons,\n", + " helpful_count,\n", + " (1 - (comment_embedding <=> '{vector_str}'::vector)) as similarity_score\n", + " FROM reviews\n", + " WHERE comment IS NOT NULL \n", + " AND comment_embedding IS NOT NULL\n", + " ORDER BY comment_embedding <=> '{vector_str}'::vector\n", + " LIMIT {limit}\n", + " \"\"\"\n", + " \n", + " result = db_tools.execute_sql(query)\n", + " return result\n", + " \n", + " except Exception as e:\n", + " return json.dumps({\"error\": f\"Vector search error: {str(e)}\"})\n", + "\n", + "# Test vector search\n", + "test_search = semantic_search(\"battery problems\", limit=3)\n", + "print(\"βœ… Vector search test successful\")\n", + "print(\"Sample results:\", json.loads(test_search)[0][\"title\"] if json.loads(test_search) else \"No results\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "1a5c586e-b60f-4283-a0be-8a4876e22de0", + "metadata": {}, + "outputs": [], + "source": [ + "# Define the tools available to Claude\n", + "TOOLS = {\n", + " \"tools\": [\n", + " {\n", + " \"toolSpec\": {\n", + " \"name\": \"execute_sql\",\n", + " \"description\": \"Execute SQL queries for structured data analysis (counts, filters, joins, aggregations)\",\n", + " \"inputSchema\": {\n", + " \"json\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"query\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"SQL query to execute against the ecommerce database\"\n", + " }\n", + " },\n", + " \"required\": [\"query\"]\n", + " }\n", + " }\n", + " }\n", + " },\n", + " {\n", + " \"toolSpec\": {\n", + " \"name\": \"vector_search\",\n", + " \"description\": \"Perform semantic similarity search on review content to find similar topics/themes\",\n", + " \"inputSchema\": {\n", + " \"json\": {\n", + " \"type\": \"object\",\n", + " \"properties\": {\n", + " \"text\": {\n", + " \"type\": \"string\",\n", + " \"description\": \"Text to search for semantically similar content in reviews\"\n", + " }\n", + " },\n", + " \"required\": 
[\"text\"]\n", + " }\n", + " }\n", + " }\n", + " }\n", + " ],\n", + " \"toolChoice\": {\"auto\": {}}\n", + "}\n", + "\n", + "print(\"βœ… AI tools configured - Claude can now choose between SQL and vector search!\")" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "68119147", + "metadata": {}, + "outputs": [], + "source": [ + "SYSTEM_PROMPT = \"\"\"\n", + "# Advanced Text-to-SQL System Prompt with PostgreSQL Vector Search\n", + "\n", + "You are an advanced database query optimization system specializing in hybrid SQL and vector search operations. \n", + "Your primary function is to analyze natural language queries, determine optimal execution strategies, and generate PostgreSQL queries that leverage both relational and vector capabilities.\n", + "\n", + "## Database Schema\n", + "\n", + "### Table: users\n", + "\n", + "Customer profiles with pre-computed analytics for performance optimization\n", + "\n", + "- user_id (SERIAL PRIMARY KEY)\n", + "- email, username (UNIQUE constraints for data integrity)\n", + "- first_name, last_name, phone_number, date_of_birth, gender\n", + "- city, state_province, country_code (geographic segmentation)\n", + "- account_status (operational flag)\n", + "- total_orders, total_spent (denormalized aggregates for fast analytics)\n", + "- created_at (temporal tracking)\n", + "\n", + "
\n", + "### Table: categories\n", + "\n", + "Hierarchical product taxonomy with recursive relationship support\n", + "\n", + "- category_id (SERIAL PRIMARY KEY)\n", + "- name, slug (unique URL-safe identifier), description\n", + "- parent_category_id (SELF-REFERENTIAL FK enabling tree structures)\n", + "- is_active (soft delete support)\n", + "- product_count (denormalized for performance)\n", + "- created_at\n", + "\n", + "
\n", + "### Table: products\n", + "\n", + "Product catalog with inventory tracking and performance metrics\n", + "\n", + "- product_id (SERIAL PRIMARY KEY)\n", + "- sku (UNIQUE business identifier)\n", + "- name, slug, description, short_description\n", + "- category_id (FK to categories)\n", + "- brand, price, cost (profit margin calculation)\n", + "- weight_kg, stock_quantity (logistics data)\n", + "- is_active, is_featured (display control)\n", + "- warranty_months\n", + "- rating_average, rating_count (computed from reviews)\n", + "- total_sales, revenue_generated (business metrics)\n", + "- created_at, updated_at (audit trail)\n", + "\n", + "
\n", + "### Table: orders\n", + "\n", + "Transaction lifecycle management with comprehensive status tracking\n", + "\n", + "- order_id (SERIAL PRIMARY KEY)\n", + "- order_number (UNIQUE human-readable identifier)\n", + "- user_id (FK to users)\n", + "- order_status: pending|processing|shipped|delivered|cancelled|refunded\n", + "- payment_status: pending|paid|failed|refunded\n", + "- shipping_address (full text for flexibility)\n", + "- financial_breakdown:\n", + " * subtotal (items before adjustments)\n", + " * tax_amount, shipping_cost, discount_amount\n", + " * total_amount (final charge)\n", + "- payment_method: credit_card|paypal|bank_transfer\n", + "- shipping_method: standard|express|overnight\n", + "- customer_notes, tracking_number\n", + "- shipped_at, delivered_at (fulfillment tracking)\n", + "- created_at, updated_at\n", + "\n", + "
\n", + "### Table: order_items\n", + "\n", + "Junction table capturing point-in-time pricing and item-level details\n", + "\n", + "- order_item_id (SERIAL PRIMARY KEY)\n", + "- order_id (FK CASCADE DELETE)\n", + "- product_id (FK)\n", + "- quantity, unit_price (historical pricing preservation)\n", + "- discount_amount, tax_amount (item-level adjustments)\n", + "- total_price (computed line total)\n", + "- created_at\n", + "\n", + "
\n", + "### Table: reviews\n", + "\n", + "Customer feedback with vector embeddings for semantic analysis\n", + "\n", + "- review_id (SERIAL PRIMARY KEY)\n", + "- product_id (FK CASCADE DELETE)\n", + "- user_id (FK)\n", + "- order_id (FK optional - links to purchase)\n", + "- rating (INTEGER CHECK 1-5)\n", + "- title (VARCHAR(200))\n", + "- comment (TEXT - source for embeddings)\n", + "- comment_embedding (VECTOR(1024) - semantic representation)\n", + "- pros, cons (structured sentiment extraction)\n", + "- is_verified_purchase (trust signal)\n", + "- helpful_count (community validation)\n", + "- status: pending|approved|rejected\n", + "- created_at, updated_at\n", + "\n", + "**Vector search notes:**\n", + "- comment_embedding enables semantic similarity search\n", + "- Cosine distance for finding related reviews\n", + "- Supports sentiment clustering and topic modeling\n", + "\n", + "
\n", + "
\n", + "## Tool Selection Guidance\n", + "\n", + "**execute_sql**: Use for structured queries, aggregations, joins, filtering by exact values\n", + "\n", + "**vector_search**: Use for semantic similarity on review comments, finding related content\n", + "\n", + "Examples:\n", + "- \"total revenue by category\" β†’ execute_sql\n", + "- \"reviews similar to 'great battery life'\" β†’ vector_search\n", + "- \"average rating of products with positive reviews\" β†’ both tools\n", + "\n", + "## Query Examples\n", + "\n", + "SQL Example:\n", + "```sql\n", + "SELECT c.name, SUM(p.revenue_generated) as revenue\n", + "FROM categories c\n", + "JOIN products p ON c.category_id = p.category_id\n", + "GROUP BY c.name\n", + "ORDER BY revenue DESC;\n", + "```\n", + "\n", + "Vector Example:\n", + "```sql\n", + "-- Note: query_embedding would be provided by the system as a VECTOR(1024);\n", + "-- <=> returns cosine distance, so smaller values mean more similar\n", + "SELECT r.comment, p.name,\n", + " r.comment_embedding <=> query_embedding as distance\n", + "FROM reviews r\n", + "JOIN products p ON r.product_id = p.product_id\n", + "ORDER BY distance\n", + "LIMIT 5;\n", + "```\n", + "\n", + "### SQL Query Best Practices\n", + "\n", + "1. **Explicit JOINs**: Always use explicit JOIN syntax with ON conditions\n", + "2. **Table Aliases**: Use meaningful aliases (u for users, p for products)\n", + "3. **NULL Handling**: Account for optional fields with COALESCE or IS NULL\n", + "4. **Data Types**: Cast when necessary, especially for date operations\n", + "5. **Aggregation Rules**: Include all non-aggregate columns in GROUP BY\n", + "6. **Order Stability**: Add secondary ORDER BY for deterministic results\n", + "7. **Limit Appropriately**: Include LIMIT for top-N queries\n", + "8. **Comment Complex Logic**: Add -- comments for CTEs or complex conditions\n", + "\n", + "### Vector Search Best Practices\n", + "\n", + "1. **Distance Metrics**: Use cosine distance (<=>) for normalized embeddings\n", + "2. **Result Limits**: Always limit results (default 10-20 for readability)\n", + "3. 
**Threshold Filtering**: Consider similarity threshold for quality control\n", + "4. **Metadata Inclusion**: Join with products/users for context\n", + "5. **Explain Similarity**: Include distance scores in results\n", + "\n", + "\n", + "\n", + "### Response Structure\n", + "```\n", + "QUERY ANALYSIS:\n", + "- Intent: [Extracted user intent]\n", + "- Strategy: [Selected tool(s) and rationale]\n", + "- Key Operations: [Main database operations required]\n", + "\n", + "GENERATED QUERY:\n", + "[Actual SQL or vector search syntax]\n", + "\n", + "EXPECTED INSIGHTS:\n", + "- [Key patterns or metrics the query will reveal]\n", + "- [Business value of the results]\n", + "```\n", + "\n", + "\"\"\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "542b7cc1-da90-4d63-912b-2ba8b5f97488", + "metadata": {}, + "outputs": [], + "source": [ + "def ask_ai(question: str) -> str:\n", + " \"\"\"\n", + " Send a question to Claude and handle tool execution\n", + " Claude will automatically choose between SQL and vector search\n", + " Handles multiple rounds of tool calls until completion\n", + " \"\"\"\n", + " \n", + " # Create the conversation\n", + " messages = [{\"role\": \"user\", \"content\": [{\"text\": question}]}]\n", + " \n", + " try:\n", + " # Continue conversation until Claude stops requesting tools\n", + " max_turns = 10 # Prevent infinite loops\n", + " turn_count = 0\n", + " \n", + " while turn_count < max_turns:\n", + " turn_count += 1\n", + " \n", + " # Send to Claude with tools\n", + " response = bedrock.converse(\n", + " modelId=CLAUDE_MODEL,\n", + " system=[{\"text\": SYSTEM_PROMPT}],\n", + " messages=messages,\n", + " toolConfig=TOOLS\n", + " )\n", + " \n", + " assistant_message = response[\"output\"][\"message\"]\n", + " messages.append(assistant_message)\n", + " \n", + " # Check if Claude wants to use tools\n", + " tool_uses = [content for content in assistant_message[\"content\"] if \"toolUse\" in content]\n", + " \n", + " if tool_uses:\n", + 
" # Execute each tool Claude requested\n", + " for tool_use in tool_uses:\n", + " tool_name = tool_use[\"toolUse\"][\"name\"]\n", + " tool_input = tool_use[\"toolUse\"][\"input\"]\n", + " tool_id = tool_use[\"toolUse\"][\"toolUseId\"]\n", + " \n", + " print(f\"πŸ”§ Claude is using: {tool_name}\")\n", + " \n", + " # Execute the appropriate tool\n", + " if tool_name == \"execute_sql\":\n", + " tool_result = db_tools.execute_sql(tool_input[\"query\"])\n", + " print(f\"πŸ“Š SQL Query: {tool_input['query']}\")\n", + " \n", + " elif tool_name == \"vector_search\":\n", + " tool_result = semantic_search(tool_input[\"text\"])\n", + " print(f\"πŸ” Searching for: {tool_input['text']}\")\n", + " \n", + " # Send tool result back to Claude\n", + " tool_message = {\n", + " \"role\": \"user\",\n", + " \"content\": [{\n", + " \"toolResult\": {\n", + " \"toolUseId\": tool_id,\n", + " \"content\": [{\"text\": tool_result}]\n", + " }\n", + " }]\n", + " }\n", + " messages.append(tool_message)\n", + " \n", + " # Continue the loop to let Claude process results and potentially make more tool calls\n", + " continue\n", + " \n", + " else:\n", + " # No tools needed, extract and return the final response\n", + " final_content = assistant_message[\"content\"]\n", + " text_response = next((c[\"text\"] for c in final_content if \"text\" in c), \"\")\n", + " return text_response\n", + " \n", + " # If we hit max turns, return what we have\n", + " return \"Response completed after maximum tool execution rounds.\"\n", + " \n", + " except Exception as e:\n", + " return f\"Error: {str(e)}\"\n", + "\n", + "print(\"βœ… Enhanced LLM assistant ready with multi-round tool execution support!\")" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸš€ STEP 8: Technical Demonstrations\n", + "\n", + "### Demo 1: Complex Schema Text-to-SQL Generation\n", + "**Objective:** Validate LLM comprehension of multi-table relationships and automated SQL generation for complex analytical 
queries." + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "8ca62e41-6c03-49c8-a280-426d9de633c6", + "metadata": {}, + "outputs": [], + "source": [ + "# DEMO 1: Complex Schema Text-to-SQL Generation\n", + "print(\"=\" * 70)\n", + "print(\"DEMO 1: Complex Schema Text-to-SQL Generation\")\n", + "print(\"=\" * 70)\n", + "\n", + "# Test multi-table join with hierarchical traversal and aggregation\n", + "question1 = \"Show me the top 3 customers by total spending, including their order count and favorite product category\"\n", + "print(f\"Query: {question1}\")\n", + "print(\"\\nπŸ”§ Expected: Multi-table JOIN across users, orders, order_items, products, categories\")\n", + "print(\"\\nExecution:\")\n", + "answer1 = ask_ai(question1)\n", + "print(answer1)\n", + "print(\"\\n\" + \"=\"*70)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Demo 2: PostgreSQL Vector Search Implementation\n", + "**Objective:** Demonstrate native vector operations within PostgreSQL using pgvector for semantic similarity search on unstructured content." 
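, + "\n", + "For reference, the similarity query issued under the hood has roughly this shape (hand-written sketch; the 1024-number vector literal is abbreviated here):\n", + "\n", + "```sql\n", + "SELECT title, comment,\n", + "       comment_embedding <=> '[0.12, -0.03, ...]'::vector AS distance\n", + "FROM reviews\n", + "WHERE comment_embedding IS NOT NULL\n", + "ORDER BY distance\n", + "LIMIT 5;\n", + "```"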
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "47633120-d09e-47c6-a0ca-c8973eb70463", + "metadata": {}, + "outputs": [], + "source": [ + "# DEMO 2: PostgreSQL Vector Search Implementation\n", + "print(\"DEMO 2: PostgreSQL Vector Search Implementation\")\n", + "print(\"=\" * 70)\n", + "\n", + "question2 = \"Find reviews about battery life issues and charging problems\"\n", + "print(f\"Query: {question2}\")\n", + "print(\"\\nπŸ”§ Expected: Vector similarity search using pgvector cosine distance\")\n", + "print(\"πŸ“Š Operation: Embedding generation + semantic matching on reviews.comment_embedding\")\n", + "print(\"🎯 Capability: Content similarity independent of exact keyword presence\")\n", + "print(\"\\nExecution:\")\n", + "answer2 = ask_ai(question2)\n", + "print(answer2)\n", + "print(\"\\n\" + \"=\"*70)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Demo 3: Automated Query Strategy Selection\n", + "**Objective:** Demonstrate the foundation model's ability to analyze query intent and select the optimal execution strategy between SQL and vector operations."
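, + "\n", + "If the model opts for the structured route here, the generated SQL might look something like this (a hypothetical sketch based on the schema above; actual model output will vary):\n", + "\n", + "```sql\n", + "SELECT p.name, AVG(r.rating) AS avg_rating, COUNT(*) AS review_count\n", + "FROM reviews r\n", + "JOIN products p ON p.product_id = r.product_id\n", + "GROUP BY p.product_id, p.name\n", + "HAVING AVG(r.rating) < 3\n", + "ORDER BY avg_rating;\n", + "```"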
+ ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "12c86224-f7ff-40e6-92f7-9da4d520c895", + "metadata": {}, + "outputs": [], + "source": [ + "# DEMO 3: Automated Query Strategy Selection\n", + "print(\"DEMO 3: Automated Query Strategy Selection\")\n", + "print(\"=\" * 70)\n", + "\n", + "# Ambiguous query that could use either approach\n", + "question3 = \"What are the main product quality issues customers mention in their reviews?\"\n", + "print(f\"Query: {question3}\")\n", + "print(\"\\nπŸ€” Strategy Options:\")\n", + "print(\" πŸ“Š SQL Approach: Aggregate review ratings and identify low-rated products\")\n", + "print(\" πŸ” Vector Approach: Semantic search for quality-related content themes\") \n", + "print(\" 🎯 Hybrid Approach: Combine structured filtering with content analysis\")\n", + "print(\"\\nπŸ”§ Foundation Model Decision Process:\")\n", + "print(\"\\nExecution:\")\n", + "answer3 = ask_ai(question3)\n", + "print(answer3)\n", + "print(\"\\n\" + \"=\"*70)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## πŸ’¬ Interactive Query Testing\n", + "\n", + "**Technical Validation Environment**\n", + "\n", + "Test the foundation model's query strategy selection across different analytical scenarios. 
The system will demonstrate automated tool selection based on query characteristics and optimal execution path determination.\n", + "\n", + "**Structured Query Test Cases:**\n", + "\n", + "**πŸ“Š Complex SQL Operations:**\n", + "- \"Calculate profit margins by hierarchical product category\"\n", + "- \"Identify customers with highest purchase frequency in Texas\"\n", + "- \"Analyze order value distribution across payment methods\"\n", + "\n", + "**πŸ” Vector Similarity Operations:**\n", + "- \"Find reviews discussing build quality and manufacturing defects\"\n", + "- \"Locate customer feedback about shipping and logistics issues\"\n", + "- \"Identify content related to product longevity and durability concerns\"\n", + "- \"Search for mentions of value proposition and pricing feedback\"\n", + "\n", + "**🎯 Complex Analytical Scenarios:**\n", + "- \"Which products receive the most quality-related complaints?\"\n", + "- \"Analyze sentiment patterns across different customer segments\"\n", + "- \"Find correlation between product price points and satisfaction themes\"" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "92781fad-9afc-469d-aec7-c693da3b2fe6", + "metadata": {}, + "outputs": [], + "source": [ + "# Interactive Query Testing Environment\n", + "print(\"πŸ”§ Foundation Model Query Strategy Testing\")\n", + "print(\"Enter queries to validate automated tool selection logic. 
Type 'quit' to exit.\")\n", + "print(\"\\nπŸ“‹ Test Categories:\")\n", + "\n", + "print(\"\\nπŸ“Š Structured Data Operations:\")\n", + "print(\"β€’ 'Which product categories have the highest profit margins?'\")\n", + "print(\"β€’ 'Show customer geographic distribution by total spending'\")\n", + "print(\"β€’ 'Analyze order completion rates by shipping method'\")\n", + "\n", + "print(\"\\nπŸ” Semantic Content Analysis:\")\n", + "print(\"β€’ 'Find reviews about products being difficult to use or setup'\")\n", + "print(\"β€’ 'Locate feedback about customer support experiences'\")\n", + "print(\"β€’ 'Search for mentions of product packaging and presentation'\")\n", + "\n", + "print(\"\\n🎯 Hybrid Analysis Scenarios:\")\n", + "print(\"β€’ 'Identify top-selling products with usability complaints'\")\n", + "print(\"β€’ 'Find high-value customers who mention quality concerns'\")\n", + "\n", + "print(\"-\" * 70)\n", + "\n", + "while True:\n", + " question = input(\"\\nπŸ” Query: \").strip()\n", + " \n", + " if question.lower() == 'quit':\n", + " print(\"βœ… Query testing session completed\")\n", + " break\n", + " \n", + " if question:\n", + " print(f\"\\nπŸ“ Processing: {question}\")\n", + " print(\"βš™οΈ Analyzing query intent and determining execution strategy...\")\n", + " answer = ask_ai(question)\n", + " print(f\"\\nπŸ“Š Result: {answer}\")\n", + " print(\"-\" * 70)" + ] + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## 🧹 STEP 9: Cleanup (Optional)\n", + "Run this to delete all AWS resources and avoid charges" + ] + }, + { + "cell_type": "code", + "execution_count": null, + "id": "51e71657-53ff-403b-a3ee-dc28bdcf1188", + "metadata": {}, + "outputs": [], + "source": [ + "# Cleanup AWS resources to avoid ongoing charges\n", + "# This will delete the Aurora cluster, VPC, and all related resources\n", + "\n", + "# Primary method:\n", + "!python clean.py" + ] + } + ], + "metadata": { + "kernelspec": { + "display_name": "Python 3 (ipykernel)", + 
"language": "python", + "name": "python3" + }, + "language_info": { + "codemirror_mode": { + "name": "ipython", + "version": 3 + }, + "file_extension": ".py", + "mimetype": "text/x-python", + "name": "python", + "nbconvert_exporter": "python", + "pygments_lexer": "ipython3", + "version": "3.12.9" + } + }, + "nbformat": 4, + "nbformat_minor": 5 +}