Skip to content

fix(file): delete physical cache files when removing documents#807

Open
majiayu000 wants to merge 2 commits intoCinnamon:mainfrom
majiayu000:fix/delete-physical-files-on-document-removal
Open

fix(file): delete physical cache files when removing documents#807
majiayu000 wants to merge 2 commits intoCinnamon:mainfrom
majiayu000:fix/delete-physical-files-on-document-removal

Conversation

@majiayu000
Copy link

Summary

  • Add _delete_physical_files method to clean up cache files on document deletion
  • Delete chunk cache files from KH_CHUNKS_OUTPUT_DIR
  • Delete markdown cache files from KH_MARKDOWN_OUTPUT_DIR
  • Graceful error handling to prevent deletion failures from blocking the operation

Test plan

  • Added unit tests for cache file deletion scenarios
  • Tests cover: file deletion, directory deletion, empty filename, missing dirs

Fixes #717

Fixes Cinnamon#717

When deleting documents via the UI, the deletion now also removes:
- Chunk cache files from KH_CHUNKS_OUTPUT_DIR
- Markdown cache files from KH_MARKDOWN_OUTPUT_DIR

This prevents storage bloat from orphaned cache files that were
previously left behind after document deletion.

Signed-off-by: majiayu000 <1835304752@qq.com>
Copy link
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds cache clean-up during document deletion in the file index UI, aiming to prevent chunk/markdown cache directories from accumulating stale files after a document is removed.

Changes:

  • Add FileIndexPage._delete_physical_files() to delete matching entries under KH_CHUNKS_OUTPUT_DIR and KH_MARKDOWN_OUTPUT_DIR.
  • Invoke _delete_physical_files() from delete_event() after docstore/vectorstore deletion.
  • Add a new test module intended to cover cache deletion scenarios.

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 5 comments.

File Description
libs/ktem/ktem/index/file/ui.py Adds physical cache deletion logic and hooks it into the UI-driven delete flow.
libs/ktem/ktem_tests/test_file_deletion.py Adds tests intended to validate cache deletion behaviors.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +463 to +469
for file_path in chunks_dir.iterdir():
if file_stem in file_path.name:
try:
if file_path.is_file():
file_path.unlink()
elif file_path.is_dir():
shutil.rmtree(file_path)
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The deletion match if file_stem in file_path.name is overly broad and can remove cache entries for other documents whose stems merely contain the target stem (e.g., deleting report would also match report_2024_0.md). Use a stricter match (e.g., name == f"{stem}.md" or name.startswith(f"{stem}_") / name.startswith(f"{stem}.")) and consider using glob() with a prefix pattern to avoid scanning the entire directory.

Copilot uses AI. Check for mistakes.
Comment on lines +475 to +480
# Delete markdown cache files
try:
markdown_dir = Path(flowsettings.KH_MARKDOWN_OUTPUT_DIR)
if markdown_dir.exists():
for file_path in markdown_dir.iterdir():
if file_stem in file_path.name:
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The chunk-cache and markdown-cache deletion blocks are nearly identical. Consider extracting a small helper (e.g., _delete_matching_paths(cache_dir: Path, stem: str)) to keep behavior consistent and reduce the chance of future fixes being applied to only one of the two directories.

Copilot uses AI. Check for mistakes.
Comment on lines +521 to +522
# Delete physical files associated with the document
self._delete_physical_files(file_name)
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR metadata says this "Fixes #717", but this code only deletes KH_CHUNKS_OUTPUT_DIR / KH_MARKDOWN_OUTPUT_DIR cache entries. The issue report also calls out persistent on-disk data for docstores/vectorstores (e.g., LanceDB/Chroma directories), which are not addressed here (their delete() implementations remove rows but do not remove underlying data files). Either broaden the deletion to cover those storage backends’ physical data (where safe), or adjust the PR/issue linkage so the issue isn’t closed prematurely.

Copilot uses AI. Check for mistakes.
Comment on lines +27 to +32
# Simulate deletion
file_stem = Path("test_document.pdf").stem
for file_path in chunks_dir.iterdir():
if file_stem in file_path.name:
file_path.unlink()

Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests don’t exercise _delete_physical_files at all—they reimplement a simplified deletion loop inside the test. As a result they won’t catch regressions in the real method (e.g., error handling, directory deletion, matching rules, or flowsettings paths). Refactor the tests to call the real _delete_physical_files while patching flowsettings.KH_CHUNKS_OUTPUT_DIR / flowsettings.KH_MARKDOWN_OUTPUT_DIR to tmp_path directories, and assert the filesystem effects.

Copilot uses AI. Check for mistakes.
Comment on lines +113 to +120
# Simulating the check in _delete_physical_files
if chunks_dir.exists():
pass # Would iterate but doesn't exist
if markdown_dir.exists():
pass # Would iterate but doesn't exist

# No errors should be raised
assert True
Copy link

Copilot AI Mar 8, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

test_nonexistent_cache_dirs_handled currently only asserts the directories don’t exist and ends with assert True; it doesn’t call the production code path. Once the test calls _delete_physical_files, this case should assert that invoking it with missing cache dirs does not raise and does not create/delete anything unexpectedly.

Suggested change
# Simulating the check in _delete_physical_files
if chunks_dir.exists():
pass # Would iterate but doesn't exist
if markdown_dir.exists():
pass # Would iterate but doesn't exist
# No errors should be raised
assert True
# No operation should create any new files or directories
# If an error were raised, the test would fail; here we assert that
# the temporary directory remains empty.
assert list(tmp_path.iterdir()) == []

Copilot uses AI. Check for mistakes.
Signed-off-by: majiayu000 <1835304752@qq.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[BUG] Physical Files Not Deleted When Removing Documents via UI

2 participants