fix(file): delete physical cache files when removing documents#807
fix(file): delete physical cache files when removing documents#807majiayu000 wants to merge 2 commits intoCinnamon:mainfrom
Conversation
Fixes Cinnamon#717 When deleting documents via the UI, the deletion now also removes: - Chunk cache files from KH_CHUNKS_OUTPUT_DIR - Markdown cache files from KH_MARKDOWN_OUTPUT_DIR This prevents storage bloat from orphaned cache files that were previously left behind after document deletion. Signed-off-by: majiayu000 <1835304752@qq.com>
There was a problem hiding this comment.
Pull request overview
This PR adds cache clean-up during document deletion in the file index UI, aiming to prevent chunk/markdown cache directories from accumulating stale files after a document is removed.
Changes:
- Add
FileIndexPage._delete_physical_files()to delete matching entries underKH_CHUNKS_OUTPUT_DIRandKH_MARKDOWN_OUTPUT_DIR. - Invoke
_delete_physical_files()fromdelete_event()after docstore/vectorstore deletion. - Add a new test module intended to cover cache deletion scenarios.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| libs/ktem/ktem/index/file/ui.py | Adds physical cache deletion logic and hooks it into the UI-driven delete flow. |
| libs/ktem/ktem_tests/test_file_deletion.py | Adds tests intended to validate cache deletion behaviors. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
libs/ktem/ktem/index/file/ui.py
Outdated
| for file_path in chunks_dir.iterdir(): | ||
| if file_stem in file_path.name: | ||
| try: | ||
| if file_path.is_file(): | ||
| file_path.unlink() | ||
| elif file_path.is_dir(): | ||
| shutil.rmtree(file_path) |
There was a problem hiding this comment.
The deletion match if file_stem in file_path.name is overly broad and can remove cache entries for other documents whose stems merely contain the target stem (e.g., deleting report would also match report_2024_0.md). Use a stricter match (e.g., name == f"{stem}.md" or name.startswith(f"{stem}_") / name.startswith(f"{stem}.")) and consider using glob() with a prefix pattern to avoid scanning the entire directory.
libs/ktem/ktem/index/file/ui.py
Outdated
| # Delete markdown cache files | ||
| try: | ||
| markdown_dir = Path(flowsettings.KH_MARKDOWN_OUTPUT_DIR) | ||
| if markdown_dir.exists(): | ||
| for file_path in markdown_dir.iterdir(): | ||
| if file_stem in file_path.name: |
There was a problem hiding this comment.
The chunk-cache and markdown-cache deletion blocks are nearly identical. Consider extracting a small helper (e.g., _delete_matching_paths(cache_dir: Path, stem: str)) to keep behavior consistent and reduce the chance of future fixes being applied to only one of the two directories.
| # Delete physical files associated with the document | ||
| self._delete_physical_files(file_name) |
There was a problem hiding this comment.
PR metadata says this "Fixes #717", but this code only deletes KH_CHUNKS_OUTPUT_DIR / KH_MARKDOWN_OUTPUT_DIR cache entries. The issue report also calls out persistent on-disk data for docstores/vectorstores (e.g., LanceDB/Chroma directories), which are not addressed here (their delete() implementations remove rows but do not remove underlying data files). Either broaden the deletion to cover those storage backends’ physical data (where safe), or adjust the PR/issue linkage so the issue isn’t closed prematurely.
| # Simulate deletion | ||
| file_stem = Path("test_document.pdf").stem | ||
| for file_path in chunks_dir.iterdir(): | ||
| if file_stem in file_path.name: | ||
| file_path.unlink() | ||
|
|
There was a problem hiding this comment.
These tests don’t exercise _delete_physical_files at all—they reimplement a simplified deletion loop inside the test. As a result they won’t catch regressions in the real method (e.g., error handling, directory deletion, matching rules, or flowsettings paths). Refactor the tests to call the real _delete_physical_files while patching flowsettings.KH_CHUNKS_OUTPUT_DIR / flowsettings.KH_MARKDOWN_OUTPUT_DIR to tmp_path directories, and assert the filesystem effects.
| # Simulating the check in _delete_physical_files | ||
| if chunks_dir.exists(): | ||
| pass # Would iterate but doesn't exist | ||
| if markdown_dir.exists(): | ||
| pass # Would iterate but doesn't exist | ||
|
|
||
| # No errors should be raised | ||
| assert True |
There was a problem hiding this comment.
test_nonexistent_cache_dirs_handled currently only asserts the directories don’t exist and ends with assert True; it doesn’t call the production code path. Once the test calls _delete_physical_files, this case should assert that invoking it with missing cache dirs does not raise and does not create/delete anything unexpectedly.
| # Simulating the check in _delete_physical_files | |
| if chunks_dir.exists(): | |
| pass # Would iterate but doesn't exist | |
| if markdown_dir.exists(): | |
| pass # Would iterate but doesn't exist | |
| # No errors should be raised | |
| assert True | |
| # No operation should create any new files or directories | |
| # If an error were raised, the test would fail; here we assert that | |
| # the temporary directory remains empty. | |
| assert list(tmp_path.iterdir()) == [] |
Signed-off-by: majiayu000 <1835304752@qq.com>
Summary
_delete_physical_filesmethod to clean up cache files on document deletionKH_CHUNKS_OUTPUT_DIRKH_MARKDOWN_OUTPUT_DIRTest plan
Fixes #717