feat(discovery): Improve category coverage detection with AI classification #258

@frankbria

Summary

The current category coverage detection in LeadAgent._get_category_coverage() uses simple keyword matching to determine which discovery topics have been covered. This approach has the limitations listed below.

Current Implementation

category_keywords = {
    "problem": ["problem", "solve", "issue", "challenge", "pain", "need"],
    "users": ["user", "customer", "audience", "target", "who", "people"],
    "features": ["feature", "function", "capability", "able to", "requirement"],
    "constraints": ["constraint", "limit", "budget", "timeline", "restriction"],
    "tech_stack": ["tech", "stack", "language", "framework", "database", "tool"],
}

Limitations

  1. False positives: The answer "I don't have any problem with the current solution" marks "problem" as covered, because the keyword appears even though the answer negates it
  2. False negatives: A detailed problem description that uses none of the keywords is missed
  3. Context ignorance: Matching ignores the semantic meaning of an answer
  4. Language sensitivity: The keyword lists only work for answers written in English
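
The first two limitations can be reproduced with a small sketch. It assumes the current method does case-insensitive substring matching against the lists above; the function name keyword_coverage and the second sample answer are hypothetical, and the real LeadAgent._get_category_coverage() may differ in detail.

```python
# Keyword lists copied from the current implementation shown above.
CATEGORY_KEYWORDS = {
    "problem": ["problem", "solve", "issue", "challenge", "pain", "need"],
    "users": ["user", "customer", "audience", "target", "who", "people"],
    "features": ["feature", "function", "capability", "able to", "requirement"],
    "constraints": ["constraint", "limit", "budget", "timeline", "restriction"],
    "tech_stack": ["tech", "stack", "language", "framework", "database", "tool"],
}


def keyword_coverage(answer: str) -> set[str]:
    """Return every category with at least one keyword found in the answer.

    Assumption: plain case-insensitive substring matching.
    """
    text = answer.lower()
    return {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


# Negated mention: "problem" is still reported as covered.
false_positive = keyword_coverage("I don't have any problem with the current solution")

# A real problem statement with no listed keyword: nothing is reported.
false_negative = keyword_coverage("Our staff waste hours re-typing invoices by hand")

print(false_positive)
print(false_negative)
```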

Proposed Improvements

Option 1: AI-Powered Classification (Recommended)

Use Claude to analyze each answer and determine which categories it addresses:

def _classify_answer_categories(self, answer: str, question: str) -> List[str]:
    """Use AI to classify which categories an answer addresses."""
    prompt = f"""Analyze this Q&A from a software project discovery session.

Question: {question}
Answer: {answer}

Which of these categories does the answer provide meaningful information about?
- problem: What problem the application solves
- users: Who the target users are
- features: What features are required
- constraints: Technical or business constraints
- tech_stack: Preferred technologies

Return only the category names that are meaningfully addressed, comma-separated.
If none apply, return "none".
"""
    # Call AI and parse response
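
The "parse response" step is not specified in this issue. One possible sketch, assuming the model follows the prompt and replies with comma-separated names or "none", is shown below. The helper name parse_category_response is hypothetical, and the call to the AI provider itself is deliberately left out because it depends on the project's client code.

```python
# Category names taken from the prompt above.
VALID_CATEGORIES = ("problem", "users", "features", "constraints", "tech_stack")


def parse_category_response(response: str) -> list[str]:
    """Turn a comma-separated model reply into a list of known category names.

    Unknown names (including "none") are dropped, duplicates are removed,
    and the order of first appearance is kept.
    """
    found: list[str] = []
    for token in response.split(","):
        # Tolerate stray whitespace, quotes, backticks, and trailing periods.
        name = token.strip().strip(".\"'`").lower()
        if name in VALID_CATEGORIES and name not in found:
            found.append(name)
    return found


print(parse_category_response("problem, Users"))
print(parse_category_response("none"))
```

Filtering against a fixed list means an unexpected reply degrades to "no categories covered" rather than raising, which may or may not be the desired failure mode.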

Benefits:

  • Semantic understanding of answers
  • Handles varied phrasings
  • Can detect partial coverage
  • Language-agnostic

Drawbacks:

  • Additional API calls (cost/latency)
  • Non-deterministic

Option 2: Enhanced Keyword Matching

Improve keyword lists and add:

  • Phrase matching ("target audience", "end users")
  • Negative keyword exclusion ("don't have a problem")
  • Synonym expansion
  • Answer length thresholds per category
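
A rough sketch of how phrase matching, negation exclusion, and a length threshold might fit together is shown below. Everything in it is an assumption for illustration: the function name is_covered, the shortened keyword and phrase lists, the negation words, and the three-word look-back window. Synonym expansion is not covered.

```python
import re

# Hypothetical lists for illustration; real lists would need tuning per category.
PROBLEM_KEYWORDS = ("problem", "issue", "challenge", "pain")
USER_PHRASES = ("target audience", "end users")
NEGATIONS = {"no", "not", "don't", "doesn't", "never", "without"}
NEGATION_WINDOW = 3  # how many words before a keyword to scan for a negation


def is_covered(answer, keywords=(), phrases=(), min_words=0):
    """Return True if the answer mentions a phrase or a non-negated keyword."""
    # Lowercase word tokens; straight apostrophes are kept so "don't" stays whole.
    words = re.findall(r"[a-z']+", answer.lower())
    if len(words) < min_words:
        return False  # too short to count as meaningful coverage
    normalized = " ".join(words)
    if any(phrase in normalized for phrase in phrases):
        return True
    for index, word in enumerate(words):
        # Whole-word match, allowing a simple plural form.
        if any(word in (keyword, keyword + "s") for keyword in keywords):
            preceding = words[max(0, index - NEGATION_WINDOW):index]
            if NEGATIONS.isdisjoint(preceding):
                return True
    return False


print(is_covered("I don't have any problem with the current solution", PROBLEM_KEYWORDS))
print(is_covered("The main problem is slow invoicing", PROBLEM_KEYWORDS))
print(is_covered("Our target audience is small clinics", phrases=USER_PHRASES))
```

A fixed look-back window is a blunt rule: it will miss negations that sit further from the keyword and can wrongly suppress answers such as "not only a problem but a blocker".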

Option 3: Hybrid Approach

Use keyword matching first, then AI classification only for ambiguous cases.
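
The issue does not define "ambiguous", so the sketch below picks one possible rule: an answer is ambiguous when no keyword matches at all or when it contains a negation word. The names hybrid_coverage, keyword_hits, and NEGATIONS are hypothetical, and the classify argument stands in for the AI call from Option 1.

```python
import re
from typing import Callable

# Keyword lists copied from the current implementation.
CATEGORY_KEYWORDS = {
    "problem": ["problem", "solve", "issue", "challenge", "pain", "need"],
    "users": ["user", "customer", "audience", "target", "who", "people"],
    "features": ["feature", "function", "capability", "able to", "requirement"],
    "constraints": ["constraint", "limit", "budget", "timeline", "restriction"],
    "tech_stack": ["tech", "stack", "language", "framework", "database", "tool"],
}
# Assumed negation cues; not part of the original implementation.
NEGATIONS = {"no", "not", "don't", "doesn't", "never", "without"}


def keyword_hits(answer: str) -> set[str]:
    """Cheap first pass: categories with any keyword present as a substring."""
    text = answer.lower()
    return {
        category
        for category, keywords in CATEGORY_KEYWORDS.items()
        if any(keyword in text for keyword in keywords)
    }


def hybrid_coverage(
    answer: str,
    question: str,
    classify: Callable[[str, str], list[str]],
) -> set[str]:
    """Use keywords when they look reliable, otherwise defer to the classifier."""
    hits = keyword_hits(answer)
    words = set(re.findall(r"[a-z']+", answer.lower()))
    ambiguous = not hits or not NEGATIONS.isdisjoint(words)
    if ambiguous:
        return set(classify(answer, question))
    return hits
```

With this rule the AI call is skipped for answers such as "The app targets small business customers", while the negated "I don't have any problem with the current solution" is routed to the classifier.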

Acceptance Criteria

  • Category detection identifies topic coverage from the semantic meaning of an answer
  • False positive rate is reduced (e.g., "no problem" does not mark "problem" as covered)
  • Tests are updated to verify improved accuracy
  • The classification approach is documented in docs/discovery-socratic-methodology.md

Labels: enhancement
