
feat(contrib): Add SemanticCacheProcessor for semantic similarity caching #46

Open
Karanjot786 wants to merge 6 commits into google-gemini:main from Karanjot786:feature/semantic-cache-processor

Conversation


@Karanjot786 Karanjot786 commented Jan 25, 2026

Summary

Adds SemanticCacheProcessor, a new contrib processor that caches LLM responses based on semantic similarity using vector embeddings. Unlike exact-match caching, this approach matches queries like "What is the capital of France?" and "Tell me France's capital city" to the same cached response.

Motivation

Current caching in genai-processors uses exact hash matching, which misses cache hits for semantically equivalent queries. This causes:

  • Unnecessary API calls for rephrased questions
  • Higher costs in production chatbots and assistants
  • Increased latency for users

Changes

New Files

  • genai_processors/contrib/semantic_cache.py - Main implementation
  • genai_processors/contrib/semantic_cache.md - Documentation
  • genai_processors/contrib/tests/semantic_cache_test.py - Test suite (38 tests)

Modified Files

  • genai_processors/contrib/README.md - Added to processor list

Features

| Feature | Description |
| --- | --- |
| Semantic matching | Uses Gemini text-embedding-004 + cosine similarity |
| Configurable threshold | Default 0.90, adjustable per use case |
| TTL expiration | Entries expire after a configurable duration |
| LRU eviction | Removes least-used entries when cache is full |
| Cache metadata | Adds hit/miss info to output part metadata |
| Extensible backends | VectorCacheBase ABC for custom implementations |

Usage

from genai_processors import processor
from genai_processors.contrib import semantic_cache
from genai_processors.core import genai_model

model = genai_model.GenaiModel(api_key=API_KEY, model_name="gemini-3-flash-preview")

cached_model = semantic_cache.SemanticCacheProcessor(
    wrapped_processor=model,
    api_key=API_KEY,
    similarity_threshold=0.90,
)

# First call - cache miss
result1 = processor.apply_sync(cached_model, ["What is the capital of France?"])

# Second call - cache hit (semantically similar)
result2 = processor.apply_sync(cached_model, ["Tell me France's capital"])

@gemini-code-assist

Summary of Changes

Hello @Karanjot786, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces a new SemanticCacheProcessor to the genai-processors library. Instead of relying on brittle exact-match caching, the processor uses vector embeddings and cosine similarity to identify and serve responses for semantically similar queries. This caching mechanism is designed to reduce redundant API calls, lower operational costs, and improve response times for users of LLM-powered applications.

Highlights

  • Semantic Caching Introduction: A new "SemanticCacheProcessor" has been added to the "contrib" module, enabling caching of LLM responses based on semantic similarity rather than exact string matching.
  • Embedding-based Matching: The processor utilizes vector embeddings (specifically Gemini's "text-embedding-004" model) and cosine similarity to identify semantically equivalent queries, improving cache hit rates for rephrased questions.
  • Performance and Cost Optimization: This feature aims to significantly reduce unnecessary LLM API calls, lower operational costs, and decrease latency for users by serving cached responses for similar queries.
  • Configurable Cache Behavior: The cache offers configurable parameters such as "similarity_threshold", Time-To-Live (TTL) for entries, and Least Recently Used (LRU) eviction policies, allowing fine-tuned control over caching behavior.
  • Extensible Cache Backends: An abstract base class "VectorCacheBase" is provided, allowing developers to implement custom cache backends (e.g., Redis, FAISS) beyond the default "InMemoryVectorCache".
  • PartProcessor Variant: A "SemanticCachePartProcessor" is also included, designed for caching responses of individual parts in high-concurrency scenarios.

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces a SemanticCacheProcessor to cache LLM responses based on semantic similarity, which is a great feature for reducing API calls and latency. The implementation is well-structured with a clear separation of concerns between the embedding client, cache storage, and the processor logic. The addition of comprehensive documentation and a thorough test suite is commendable.

My review includes a few suggestions for minor improvements, such as removing unused code, simplifying part creation for consistency, and correcting a potential typo in the documentation.

@Karanjot786 (Author)

Hi @kibergus, when you have a moment, could you please review this PR? Thanks!

hit_count: int = 0
metadata: dict[str, Any] = dataclasses.field(default_factory=dict)

def get_response_parts(self) -> list[content_api.ProcessorPart]:
Collaborator

You now have the to_dict and from_dict methods in ProcessorPart; you could use them directly. Not sure this is in the latest packaged version, but we plan to release the new one very soon, so best to use those.

Author

Thanks for pointing this out. I'll switch to ProcessorPart.to_dict() and ProcessorPart.from_dict() directly. Will update once the new release lands.

a = np.array(vec1, dtype=np.float32)
b = np.array(vec2, dtype=np.float32)

dot_product = np.dot(a, b)
Collaborator

nit: use it in the return statement directly; no need to compute it if the norm is zero.

Author

Good catch. I'll move the computation into the return statement and skip it entirely when norm is zero.
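Applying the suggestion, the helper might look like this. This is a sketch inferred from the excerpt above, not the PR's final code:

```python
import numpy as np


def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Cosine similarity between two vectors.

    The dot product is computed only in the return statement, so it is
    skipped entirely when either norm is zero.
    """
    a = np.array(vec1, dtype=np.float32)
    b = np.array(vec2, dtype=np.float32)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat as dissimilar
    return float(np.dot(a, b) / (norm_a * norm_b))
```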

return float(dot_product / (norm_a * norm_b))


def _serialize_part(part: content_api.ProcessorPart) -> dict[str, Any]:
Collaborator

no need for it: ProcessorPart.to_dict should work here (works for multi-modal data, btw).

Author

I'll remove _serialize_part and use ProcessorPart.to_dict() instead. Cleaner and handles multi-modal data too.

"""
# Extract text from all parts
text_parts = []
for part in content.all_parts:
Collaborator

Raise an exception if you have non-text content?

We could generate a cache miss whenever the prompt contains non-textual data.

Author

Makes sense. I'll add a check: if the prompt contains non-textual parts, we generate a cache miss. No silent failures.
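The agreed behaviour could be sketched like this. The helper name and the `.text` attribute are assumptions for illustration; the real ProcessorPart API may differ:

```python
from typing import Any, Sequence


def extract_text_or_fail(parts: Sequence[Any]) -> str:
    """Collect text from all parts, raising on non-text content.

    The caller catches the ValueError and treats it as a cache miss,
    falling through to the wrapped processor (no silent failures).
    Assumes each part exposes a `.text` attribute that is None or empty
    for non-text content.
    """
    texts = []
    for part in parts:
        text = getattr(part, "text", None)
        if not text:  # also rejects empty strings
            raise ValueError("Non-text content cannot be embedded; bypassing cache.")
        texts.append(text)
    return " ".join(texts)
```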

embedding: list[float],
threshold: float,
limit: int = 1,
) -> SimilaritySearchResult | None:
Collaborator

needs to be a list if limit > 1

Author

I'll change the return type to list[SimilaritySearchResult] so it properly supports limit > 1.
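A sketch of the revised signature, returning up to `limit` matches above `threshold`, best first. The flat dict `store` stands in for the PR's cache backend and is an assumption for illustration:

```python
import dataclasses

import numpy as np


@dataclasses.dataclass
class SimilaritySearchResult:
    key: str
    score: float


def find_similar(
    store: dict[str, list[float]],
    embedding: list[float],
    threshold: float,
    limit: int = 1,
) -> list[SimilaritySearchResult]:
    """Return up to `limit` entries whose cosine similarity to
    `embedding` is at least `threshold`, sorted best-first."""
    q = np.asarray(embedding, dtype=np.float32)
    q_norm = np.linalg.norm(q)
    results = []
    for key, vec in store.items():
        v = np.asarray(vec, dtype=np.float32)
        denom = q_norm * np.linalg.norm(v)
        score = float(np.dot(q, v) / denom) if denom else 0.0
        if score >= threshold:
            results.append(SimilaritySearchResult(key, score))
    results.sort(key=lambda r: r.score, reverse=True)
    return results[:limit]
```

An empty list then signals a cache miss, replacing the old `None` sentinel.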

streams.stream_content(input_parts)
):
yield part
return
Collaborator

not sure you'd need this - you could handle empty input_content inside the embed method, raising an exception, and the following block would apply.

Author

I'll move the empty input handling into the embed method itself and raise an exception there. Removes the need for the early return block.
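A minimal sketch of that change, with the empty-input check living inside `embed()` itself. The function body is a stand-in; a real implementation would call the embedding API:

```python
def embed(text: str) -> list[float]:
    """Illustrative embed() that rejects empty input, per the review.

    Raising here lets the caller's existing exception handling treat
    empty prompts as a cache miss, removing the need for a separate
    early-return block in call().
    """
    if not text.strip():
        raise ValueError("Cannot embed empty content.")
    # Dummy fixed-size vector standing in for a real embedding API call.
    return [0.0] * 4
```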


current_time = time.time()

for entry in self._entries.values():
Collaborator

you could put this into a separate thread with asyncio.to_thread(), as it might take a while; then the asyncio loop does not block on it.

Author

I'll wrap the cleanup loop in asyncio.to_thread() so the event loop stays unblocked during eviction.
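The shape of that change, sketched with simplified data structures (`entries` here maps keys to creation timestamps; the PR's real entries carry more state):

```python
import asyncio
import time


def evict_expired(entries: dict[str, float], ttl: float, now: float) -> int:
    """Blocking scan that removes entries older than `ttl` seconds.

    Returns the number of evicted entries. Collecting keys first avoids
    mutating the dict while iterating over it.
    """
    expired = [key for key, created in entries.items() if now - created > ttl]
    for key in expired:
        del entries[key]
    return len(expired)


async def cleanup(entries: dict[str, float], ttl: float) -> int:
    # Run the potentially slow scan in a worker thread so the asyncio
    # event loop is not blocked during eviction, per the review.
    return await asyncio.to_thread(evict_expired, entries, ttl, time.time())
```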

If a match is found above the similarity threshold, returns the cached
response instead of calling the wrapped processor.

Reduces API costs and latency when similar queries are frequently repeated.
Collaborator

note: this only works for turn-based processors, not for realtime ones.

Author

Will add a clear note in the docstring: this processor works for turn-based use cases only, not realtime ones.

@aelissee aelissee (Collaborator) left a comment

Hi, thanks for the proposal. I'd let Kibergus check it as well, but here's a first quick pass at it.

@Karanjot786 Karanjot786 requested a review from aelissee February 20, 2026 05:32
@Karanjot786 (Author)

@aelissee @kibergus Applied all the review feedback. Here is what changed:

Serialization:

  • Replaced custom _serialize_part with ProcessorPart.to_dict() for storage
  • Used ProcessorPart.from_dict(data=part_dict) for deserialization

Error handling:

  • embed() now raises ValueError for empty content and non-text parts
  • This triggers a cache miss in call(), which falls through to the wrapped processor
  • Removed the separate empty-input check from call()

API changes:

  • find_similar() returns list[SimilaritySearchResult] instead of SimilaritySearchResult | None
  • Supports limit parameter for multiple results

Performance:

  • Inlined dot product in cosine_similarity, returns 0.0 when norms are zero
  • Wrapped _evict_expired in asyncio.to_thread() to avoid blocking the event loop

Documentation:

  • Added turn-based-only limitation to class docstring and docs
  • Updated custom cache backend example with new find_similar signature
  • Expanded limitations section for non-text content handling
  • Removed _serialize_part from API reference

Tests:

  • Updated all assertions for list return type
  • Added tests for find_similar with limit > 1
  • Added test for non-text content bypass
  • Updated get_response_parts test to use to_dict/from_dict format
  • Fixed concurrency test race condition
  • All 41 tests pass with 0 warnings
