feature: Update WordEmbeddingModel class #62
Merged
pbadillatorrealba merged 2 commits into develop on Jul 22, 2025
Pull Request Overview
This PR introduces significant improvements to the WordEmbeddingModel class, focusing on enhanced type safety, better error handling, improved gensim compatibility, and comprehensive test coverage. The changes modernize the codebase with better type hints and more robust validation while maintaining backward compatibility.
Key changes include:
- Enhanced type safety with stricter parameter validation and modern type annotations using NDArray[np.float64]
- Improved gensim version compatibility through the GENSIM_V4_OR_GREATER constant
- Comprehensive batch update functionality with atomic operations and detailed validation
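As a rough illustration of the version-gating idea, a flag such as GENSIM_V4_OR_GREATER could be derived from the installed library's version string. The sketch below is an assumption about how such a constant might be computed, not the PR's actual definition. The helper name `is_v4_or_greater` is hypothetical, and the version strings are passed in by hand so the snippet runs without gensim installed.

```python
def is_v4_or_greater(version: str) -> bool:
    """Return True if a dotted version string has a major version >= 4."""
    # Only the leading numeric component matters for a major-version gate.
    major = int(version.split(".")[0])
    return major >= 4


# In a real module the input would come from the library itself, e.g.
#   GENSIM_V4_OR_GREATER = is_v4_or_greater(gensim.__version__)
# (assumed usage; the PR's actual definition is not shown on this page).
print(is_v4_or_greater("3.8.3"))  # False
print(is_v4_or_greater("4.3.2"))  # True
```

A flag of this kind lets a wrapper class choose, for example, between the pre-4.0 `vocab` attribute and the 4.0+ `key_to_index` mapping on gensim's KeyedVectors.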
Reviewed Changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| wefe/word_embedding_model.py | Major refactoring with enhanced type safety, improved error messages, new batch update method, and better gensim compatibility |
| wefe/preprocessing.py | Updated to handle the new KeyError-raising behavior of __getitem__ |
| tests/test_word_embedding_model.py | Comprehensive test updates with detailed docstrings and extensive validation for new functionality |
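To make the KeyError-raising behavior of __getitem__ and its effect on callers concrete, here is a minimal sketch. `ToyEmbeddingModel` and `lookup_words` are hypothetical stand-ins written for this illustration; they are not the WEFE classes or functions, and the choice of TypeError for non-string keys is an assumption.

```python
from typing import Dict, List, Tuple


class ToyEmbeddingModel:
    """Hypothetical stand-in for an embedding model; not the WEFE class."""

    def __init__(self, vectors: Dict[str, List[float]]) -> None:
        self._vectors = vectors

    def __getitem__(self, key: str) -> List[float]:
        # Assumed behavior: reject non-string keys, raise KeyError if absent.
        if not isinstance(key, str):
            raise TypeError(f"key should be a string, got {type(key).__name__}")
        if key not in self._vectors:
            raise KeyError(f"word '{key}' is not in the model vocabulary")
        return self._vectors[key]

    def __contains__(self, key: object) -> bool:
        return key in self._vectors

    def __len__(self) -> int:
        return len(self._vectors)


def lookup_words(
    model: ToyEmbeddingModel, words: List[str]
) -> Tuple[Dict[str, List[float]], List[str]]:
    """Split words into found embeddings and a list of missing words."""
    found: Dict[str, List[float]] = {}
    missing: List[str] = []
    for word in words:
        try:
            found[word] = model[word]
        except KeyError:
            # The caller now handles the exception instead of checking for None.
            missing.append(word)
    return found, missing


model = ToyEmbeddingModel({"cat": [0.1, 0.2], "dog": [0.3, 0.4]})
found, missing = lookup_words(model, ["cat", "fish"])
print(sorted(found), missing)  # ['cat'] ['fish']
```

Raising KeyError mirrors how built-in mappings behave, so calling code can use ordinary try/except or an `in` check rather than testing a return value for None.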
Comments suppressed due to low confidence (1)
wefe/word_embedding_model.py:205
- This line appears to be misplaced in the word_embedding_model.py file but belongs to preprocessing.py based on the diff context. This could indicate a merge error or incorrect file placement.
if self.vocab_prefix is not None:
This pull request introduces significant updates to the WordEmbeddingModel class and its associated tests, improving type safety, error handling, and documentation. The changes also improve compatibility with different versions of gensim and streamline the codebase for better readability and maintainability.
Enhancements to WordEmbeddingModel class:
Type Safety and Error Handling:
- Stricter validation of the wv, name, and vocab_prefix parameters in the constructor, with more descriptive error messages.
- Updated __getitem__ to raise KeyError for words not in the vocabulary and added type validation for the key parameter.
- Added __contains__ and __len__ methods for checking word existence and vocabulary size.
Compatibility Updates:
- Introduced GENSIM_V4_OR_GREATER to handle differences in gensim versions, ensuring compatibility with both pre-4.0 and 4.0+ versions.
Type Aliases:
- Replaced Union[np.ndarray, None] with NDArray[np.float64] for better type hinting and consistency.
Updates to Unit Tests:
Improved Test Coverage:
- Expanded tests for the special methods (__eq__, __contains__, __getitem__, __repr__).
Batch Update Tests:
- Added test_update_embeddings to validate batch updates with detailed checks for input types, sizes, and errors.
Code Simplification:
- Simplified get_embeddings_from_set in wefe/preprocessing.py to handle missing embeddings more concisely.
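The atomic batch update mentioned in the review can be pictured as a validate-then-apply pattern: every input is checked before any vector is written, so a bad entry leaves the model untouched. The function below is a hypothetical sketch over a plain dictionary, assuming the checks cover input types, sizes, and unknown words. It is not the WEFE method, and its name and signature are chosen only for illustration.

```python
from typing import Dict, List, Sequence


def update_embeddings(
    store: Dict[str, List[float]],
    words: Sequence[str],
    embeddings: Sequence[Sequence[float]],
) -> None:
    """Validate every (word, vector) pair first, then apply all updates."""
    if len(words) != len(embeddings):
        raise ValueError("words and embeddings must have the same length")
    # Expected dimensionality is taken from any existing vector.
    dim = len(next(iter(store.values()))) if store else None
    for word, vec in zip(words, embeddings):
        if not isinstance(word, str):
            raise TypeError(f"word should be a string, got {type(word).__name__}")
        if word not in store:
            raise KeyError(f"word '{word}' is not in the vocabulary")
        if len(vec) != dim:
            raise ValueError(f"embedding for '{word}' has size {len(vec)}, expected {dim}")
    # All checks passed: only now is the store mutated.
    for word, vec in zip(words, embeddings):
        store[word] = [float(x) for x in vec]


store = {"cat": [0.1, 0.2], "dog": [0.3, 0.4]}
try:
    # "fish" is unknown, so the whole batch is rejected.
    update_embeddings(store, ["cat", "fish"], [[1.0, 1.0], [2.0, 2.0]])
except KeyError:
    pass
print(store["cat"])  # [0.1, 0.2], unchanged after the failed batch

update_embeddings(store, ["cat", "dog"], [[1.0, 1.0], [2.0, 2.0]])
print(store["cat"])  # [1.0, 1.0]
```

Separating validation from mutation costs one extra pass over the inputs but avoids leaving the vocabulary half-updated when an error surfaces midway through the batch.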