feat(contrib): Add SemanticCacheProcessor for semantic similarity caching #46
Karanjot786 wants to merge 6 commits into google-gemini:main
Conversation
Summary of Changes

Hello @Karanjot786, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed. This pull request introduces a new SemanticCacheProcessor to the genai-processors library, changing how LLM responses can be cached. Instead of relying on brittle exact-match caching, the new processor uses vector embeddings and cosine similarity to identify and serve responses for semantically similar queries. This caching mechanism is designed to cut down on redundant API calls, reduce operational costs, and improve response times for LLM-powered applications.

Highlights
Code Review
This pull request introduces a SemanticCacheProcessor to cache LLM responses based on semantic similarity, which is a great feature for reducing API calls and latency. The implementation is well-structured with a clear separation of concerns between the embedding client, cache storage, and the processor logic. The addition of comprehensive documentation and a thorough test suite is commendable.
My review includes a few suggestions for minor improvements, such as removing unused code, simplifying part creation for consistency, and correcting a potential typo in the documentation.
… SemanticCacheProcessor
Hi @kibergus, when you have a moment, could you please review this PR? Thanks!
```python
hit_count: int = 0
metadata: dict[str, Any] = dataclasses.field(default_factory=dict)

def get_response_parts(self) -> list[content_api.ProcessorPart]:
```
Reviewer: You now have the `to_dict` and `from_dict` methods on `ProcessorPart`; you could use them directly. I'm not sure they are in the latest packaged version, but we plan to release the new one very soon, so it's best to use those.

Author reply: Thanks for pointing this out. I'll switch to `ProcessorPart.to_dict()` and `ProcessorPart.from_dict()` directly. Will update once the new release lands.
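As an illustration of the round-trip the reviewer suggests, here is a minimal sketch with a stand-in `Part` dataclass. The real `ProcessorPart` ships its own `to_dict`/`from_dict`; this stand-in is an assumption for the sketch, not the library's class.

```python
import dataclasses
from typing import Any


@dataclasses.dataclass
class Part:
    """Stand-in for ProcessorPart, illustrating the dict round-trip only."""

    text: str
    metadata: dict[str, Any] = dataclasses.field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        return dataclasses.asdict(self)

    @classmethod
    def from_dict(cls, data: dict[str, Any]) -> "Part":
        return cls(**data)


# Cache entries can store plain dicts and rebuild parts on a cache hit:
entry = Part("Paris is the capital of France.", {"lang": "en"}).to_dict()
restored = Part.from_dict(entry)
```

With the library's own methods, the custom serialization helpers below become unnecessary.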
```python
a = np.array(vec1, dtype=np.float32)
b = np.array(vec2, dtype=np.float32)

dot_product = np.dot(a, b)
```
Reviewer: nit: use it in the return statement directly; no need to compute it if the norm is zero.

Author reply: Good catch. I'll move the computation into the return statement and skip it entirely when the norm is zero.
```python
return float(dot_product / (norm_a * norm_b))
```
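The suggested change could look like this self-contained sketch of the similarity helper. The function name and the `float32` dtype follow the snippet above; returning 0.0 on a zero-norm vector is an assumption about the intended behavior.

```python
import numpy as np


def cosine_similarity(vec1: list[float], vec2: list[float]) -> float:
    """Cosine similarity with the dot product moved into the return,
    so it is never computed when either norm is zero."""
    a = np.array(vec1, dtype=np.float32)
    b = np.array(vec2, dtype=np.float32)
    norm_a = np.linalg.norm(a)
    norm_b = np.linalg.norm(b)
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # degenerate vector: treat as no similarity
    return float(np.dot(a, b) / (norm_a * norm_b))
```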
```python
def _serialize_part(part: content_api.ProcessorPart) -> dict[str, Any]:
```
Reviewer: No need for it: `ProcessorPart.to_dict` should work here (it works for multi-modal data, by the way).

Author reply: I'll remove `_serialize_part` and use `ProcessorPart.to_dict()` instead. Cleaner, and it handles multi-modal data too.
| """ | ||
| # Extract text from all parts | ||
| text_parts = [] | ||
| for part in content.all_parts: |
Reviewer: Raise an exception if you have non-text content? We could generate a cache miss whenever the prompt contains non-textual data.

Author reply: Makes sense. I'll add a check: if the prompt contains non-textual parts, we generate a cache miss. No silent failures.
```python
embedding: list[float],
threshold: float,
limit: int = 1,
) -> SimilaritySearchResult | None:
```
Reviewer: This needs to be a list if limit > 1.

Author reply: I'll change the return type to `list[SimilaritySearchResult]` so it properly supports `limit > 1`.
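A sketch of the revised signature, with an in-memory stand-in for the cache store. The `SimilaritySearchResult` fields, the dict-based store, and the inline cosine helper are assumptions for illustration, not the PR's exact code.

```python
import dataclasses


@dataclasses.dataclass
class SimilaritySearchResult:
    key: str
    similarity: float


def find_similar(
    entries: dict[str, list[float]],
    embedding: list[float],
    threshold: float,
    limit: int = 1,
) -> list[SimilaritySearchResult]:
    """Return up to `limit` entries above `threshold`, best first.
    Always returns a (possibly empty) list, so limit > 1 is supported."""

    def cosine(a: list[float], b: list[float]) -> float:
        num = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return 0.0 if na == 0 or nb == 0 else num / (na * nb)

    results = [
        SimilaritySearchResult(key, cosine(vec, embedding))
        for key, vec in entries.items()
    ]
    results = [r for r in results if r.similarity >= threshold]
    results.sort(key=lambda r: r.similarity, reverse=True)
    return results[:limit]
```

Callers that previously handled `None` would now check for an empty list instead.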
```python
streams.stream_content(input_parts)
):
    yield part
return
```
Reviewer: Not sure you'd need this: you could handle empty `input_content` inside the `embed` method, raising an exception, and the following block would apply.

Author reply: I'll move the empty-input handling into the `embed` method itself and raise an exception there. That removes the need for the early-return block.
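A sketch of moving the empty-input check into `embed` itself. The class name, exception choice, and dummy embedding are illustrative, not the PR's implementation.

```python
class EmbeddingClient:
    """Sketch of the agreed change: embed rejects empty input itself,
    so callers no longer need a separate early-return branch."""

    def embed(self, text: str) -> list[float]:
        if not text.strip():
            raise ValueError("Cannot embed empty content.")
        # A real implementation would call an embedding model here;
        # a small deterministic vector keeps the sketch self-contained.
        return [float(ord(c) % 7) for c in text[:8]]
```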
```python
current_time = time.time()

for entry in self._entries.values():
```
Reviewer: You could run this in a separate thread with `asyncio.to_thread()`, as it might take a while; then the asyncio loop does not block on it.

Author reply: I'll wrap the cleanup loop in `asyncio.to_thread()` so the event loop stays unblocked during eviction.
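A sketch of offloading eviction with `asyncio.to_thread`, using an illustrative `TTLCache`; the PR's actual class and field names may differ.

```python
import asyncio
import time


class TTLCache:
    """Minimal TTL cache whose eviction scan runs in a worker thread,
    so the asyncio event loop does not block on a long sweep."""

    def __init__(self, ttl_seconds: float) -> None:
        self._ttl = ttl_seconds
        self._entries: dict[str, float] = {}  # key -> insertion time

    def put(self, key: str) -> None:
        self._entries[key] = time.time()

    def _evict_expired_sync(self) -> int:
        now = time.time()
        expired = [k for k, t in self._entries.items() if now - t > self._ttl]
        for k in expired:
            del self._entries[k]
        return len(expired)

    async def evict_expired(self) -> int:
        # Offload the potentially slow scan; the loop keeps serving meanwhile.
        return await asyncio.to_thread(self._evict_expired_sync)
```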
```python
If a match is found above the similarity threshold, returns the cached
response instead of calling the wrapped processor.

Reduces API costs and latency when similar queries are frequently repeated.
```
Reviewer: Note: this only works for turn-based processors, not for realtime ones.

Author reply: Will add a clear note in the docstring: this processor works for turn-based use cases only, not realtime ones.
aelissee left a comment: Hi, thanks for the proposal. I'd like Kibergus to check it as well, but here's a first quick pass at it.
@aelissee @kibergus Applied all the review feedback. Here is what changed:

- Serialization: switched to `ProcessorPart.to_dict()` / `ProcessorPart.from_dict()` and removed the custom `_serialize_part` helper.
- Error handling: prompts containing non-textual parts now produce a cache miss; `embed` raises an exception on empty input, removing the early-return block.
- API changes: `find_similar` now returns `list[SimilaritySearchResult]` to support `limit > 1`.
- Performance: the TTL eviction loop now runs in `asyncio.to_thread()` so the event loop stays unblocked.
- Documentation: added a note that the processor is for turn-based processors only, not realtime ones.
- Tests: updated the test suite to cover the changes above.
Summary

Adds `SemanticCacheProcessor`, a new contrib processor that caches LLM responses based on semantic similarity using vector embeddings. Unlike exact-match caching, this approach matches queries like "What is the capital of France?" and "Tell me France's capital city" to the same cached response.

Motivation

Current caching in genai-processors uses exact hash matching, which misses cache hits for semantically equivalent queries. This causes:

Changes

New Files

- `genai_processors/contrib/semantic_cache.py` - Main implementation
- `genai_processors/contrib/semantic_cache.md` - Documentation
- `genai_processors/contrib/tests/semantic_cache_test.py` - Test suite (38 tests)

Modified Files

- `genai_processors/contrib/README.md` - Added to processor list

Features

- `VectorCacheBase` ABC for custom implementations

Usage
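The Usage section is empty in this capture. As an illustration of the caching pattern only, here is a minimal standalone sketch; the real `SemanticCacheProcessor` wraps genai-processors processors, and the callable interface, `embed_fn`, and `threshold` parameter here are assumptions, not the merged API.

```python
from typing import Callable


class SemanticCache:
    """Toy semantic cache: reuse a stored response when a new query's
    embedding is close enough to a cached query's embedding."""

    def __init__(
        self,
        embed_fn: Callable[[str], list[float]],
        threshold: float = 0.9,
    ) -> None:
        self._embed = embed_fn
        self._threshold = threshold
        self._store: list[tuple[list[float], str]] = []

    def _cosine(self, a: list[float], b: list[float]) -> float:
        num = sum(x * y for x, y in zip(a, b))
        na = sum(x * x for x in a) ** 0.5
        nb = sum(y * y for y in b) ** 0.5
        return 0.0 if na == 0 or nb == 0 else num / (na * nb)

    def get_or_compute(self, query: str, compute: Callable[[str], str]) -> str:
        vec = self._embed(query)
        for cached_vec, response in self._store:
            if self._cosine(vec, cached_vec) >= self._threshold:
                return response  # semantic cache hit: no model call
        response = compute(query)  # cache miss: call the wrapped model
        self._store.append((vec, response))
        return response
```

With a real embedding model, "What is the capital of France?" and "Tell me France's capital city" map to nearby vectors, so the second query is served from cache.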