-
Notifications
You must be signed in to change notification settings - Fork 11
Add support for negative phrase filtering #10
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Merged
Merged
Changes from all commits
Commits
Show all changes
17 commits
Select commit
Hold shift + click to select a range
ffa5560
Add support for the ct report type
cynthia-d-lo 2abbf32
Update README
cynthia-d-lo 068187f
Add comment to config
cynthia-d-lo 66373d7
Add support for ct negative filtering
cynthia-d-lo 89739a8
Add pytest and cli support
cynthia-d-lo 308e183
Add report type cli support
cynthia-d-lo 9a43812
Remove breakpoint
cynthia-d-lo 4c811a9
Upgrade ubuntu in ci checks
cynthia-d-lo 3099376
Remove phrasification output folder overwrite, merge main
cynthia-d-lo 261d9f5
Merge main into clo/radfact_negative_filtering_support
cynthia-d-lo 07640cb
Rename rephrases to phrase list
cynthia-d-lo 3a31d66
Minor fixes
cynthia-d-lo 12d7e29
Update src/radfact/llm_utils/negative_filtering/processor.py
cynthia-d-lo e081497
Update src/radfact/llm_utils/negative_filtering/processor.py
cynthia-d-lo 47ec6af
Update documentation
cynthia-d-lo 118b9b7
Merge branch 'clo/radfact_negative_filtering_support' of github-perso…
cynthia-d-lo a0586eb
Lint
cynthia-d-lo File filter
Filter by extension
Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
There are no files selected for viewing
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| #@package __global__ | ||
|
|
||
| defaults: | ||
| - default | ||
| - override endpoints: azure_chat_openai | ||
| - _self_ | ||
|
|
||
| processing: | ||
| index_col: sentence_id |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Empty file.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,140 @@ | ||
| # ------------------------------------------------------------------------------------------ | ||
| # Copyright (c) Microsoft Corporation. All rights reserved. | ||
| # Licensed under the MIT License (MIT). See LICENSE in the repo root for license information. | ||
| # ------------------------------------------------------------------------------------------ | ||
|
|
||
| from collections import defaultdict | ||
| import json | ||
| from pathlib import Path | ||
|
|
||
| import pandas as pd | ||
| from radfact.llm_utils.prompt_tasks import NEGATIVE_FILTERING_PARSING_TASK, NegativeFilteringTaskOptions, ReportType | ||
| from omegaconf import DictConfig | ||
|
|
||
| from radfact.llm_utils.engine.engine import LLMEngine, get_subfolder | ||
| from radfact.llm_utils.processor.structured_processor import StructuredProcessor, parse_examples_from_json | ||
| from radfact.llm_utils.report_to_phrases.schema import ( | ||
| ParsedReport, | ||
| PhraseList, | ||
| PhraseListExample, | ||
| SentenceWithRephrases, | ||
| ) | ||
| from radfact.paths import OUTPUT_DIR | ||
|
|
||
| ORIG = "orig" | ||
| NEW = "new" | ||
|
|
||
|
|
||
| def get_negative_filtering_phrase_processor( | ||
| report_type: ReportType, log_dir: Path | None = None | ||
| ) -> StructuredProcessor[list[str], PhraseList]: | ||
| """Return a processor for filtering negative findings from a list of phrases. | ||
|
|
||
| :param report_type: The type of report, e.g., "ReportType.CXR" or "ReportType.CT". | ||
| :param log_dir: The directory to save logs. | ||
| :return: The processor for negative finding filtering. | ||
| """ | ||
| task = NegativeFilteringTaskOptions[report_type.name].value | ||
| system_prompt = task.system_message_path.read_text() | ||
| few_shot_examples = parse_examples_from_json(task.few_shot_examples_path, PhraseListExample) | ||
| processor = StructuredProcessor( | ||
| query_type=list[str], | ||
| result_type=PhraseList, | ||
| system_prompt=system_prompt, | ||
| format_query_fn=lambda x: json.dumps(x), | ||
| few_shot_examples=few_shot_examples, | ||
| log_dir=log_dir, | ||
| ) | ||
| return processor | ||
|
|
||
|
|
||
| def load_filtering_queries_from_parsed_reports( | ||
| reports: list[ParsedReport], | ||
| index_col: str, | ||
| ) -> pd.DataFrame: | ||
| """ | ||
| Load queries for filtering from a list of parsed reports. Queries consist of all the | ||
| newly parsed phrases from phrasification, along with metadata including the study ID | ||
| and original phrase. | ||
|
|
||
| :param reports: A list of ParsedReport objects. | ||
| :param index_col: The column containing the index | ||
| :return: A dataframe of queries. | ||
| """ | ||
| queries = [] | ||
| for report in reports: | ||
| for i, sentence in enumerate(report.sentence_list): | ||
| queries.append([f"{report.id}_{i}", sentence.orig, sentence.new]) | ||
| query_df = pd.DataFrame(queries, columns=[index_col, ORIG, NEW]) | ||
| return query_df | ||
|
|
||
|
|
||
| def get_negative_filtering_engine( | ||
| cfg: DictConfig, parsed_reports: list[ParsedReport], subfolder_prefix: str, report_type: ReportType | ||
| ) -> LLMEngine: | ||
| """ | ||
| Create the processing engine for filtering negative findings from parsed reports. | ||
|
|
||
| :param cfg: The configuration for the processing engine. | ||
| :param parsed_reports: A list of ParsedReport objects to filter. | ||
| :param subfolder_prefix: The prefix for the metric folder | ||
| :param report_type: The type of report, e.g., CT. | ||
| :return: The processing engine. | ||
| """ | ||
| OUTPUT_FOLDER = OUTPUT_DIR / NEGATIVE_FILTERING_PARSING_TASK | ||
| output_folder = get_subfolder(OUTPUT_FOLDER, subfolder_prefix) | ||
| final_output_folder = get_subfolder(OUTPUT_FOLDER, subfolder_prefix) | ||
| log_dir = get_subfolder(OUTPUT_FOLDER, "logs") | ||
|
|
||
| query_df = load_filtering_queries_from_parsed_reports(parsed_reports, cfg.processing.index_col) | ||
| negative_filtering_processor = get_negative_filtering_phrase_processor(report_type=report_type, log_dir=log_dir) | ||
|
|
||
| engine = LLMEngine( | ||
| cfg=cfg, | ||
| processor=negative_filtering_processor, | ||
| dataset_df=query_df, | ||
| row_to_query_fn=lambda row: row[NEW], | ||
| progress_output_folder=output_folder, | ||
| final_output_folder=final_output_folder, | ||
| ) | ||
| return engine | ||
|
|
||
|
|
||
| def process_filtered_reports(engine: LLMEngine, cfg: DictConfig) -> tuple[list[ParsedReport], int]: | ||
| """ | ||
| Process the filtered reports using the provided engine. | ||
|
|
||
| :param engine: The LLMEngine used for processing. | ||
| :param cfg: The configuration for negative filtering processing. | ||
| :return: A tuple containing a list of ParsedReport objects and the number of rewritten sentences. | ||
| """ | ||
| outputs = engine.return_raw_outputs | ||
| metadata = engine.return_dataset_subsets | ||
|
|
||
| parsed_report_dict = defaultdict(list) | ||
| num_rewritten_sentences = 0 | ||
|
|
||
| for k in outputs.keys(): | ||
| phrase_list = outputs[k] | ||
| metadata_df = metadata[k].df | ||
|
|
||
| for idx, row in metadata_df.iterrows(): | ||
| study_id = row[cfg.processing.index_col].rsplit("_", 1)[0] | ||
| orig = row[ORIG] | ||
| unfiltered_phrases = set(row[NEW]) | ||
| filtered_phrases = set(phrase_list[idx].phrases) | ||
|
|
||
| if not filtered_phrases.issubset(unfiltered_phrases): | ||
| rewritten_phrases = filtered_phrases - unfiltered_phrases | ||
| print( | ||
| f"New phrases {rewritten_phrases} not in original phrases {unfiltered_phrases}. Reverting back to original phrases." | ||
| ) | ||
| filtered_phrases = unfiltered_phrases | ||
| num_rewritten_sentences += 1 | ||
|
|
||
| parsed_report_dict[study_id].append(SentenceWithRephrases(orig=orig, new=list(filtered_phrases))) | ||
|
|
||
| parsed_reports = [ | ||
| ParsedReport(id=study_id, sentence_list=sentences) for study_id, sentences in parsed_report_dict.items() | ||
| ] | ||
| return parsed_reports, num_rewritten_sentences | ||
44 changes: 44 additions & 0 deletions
44
src/radfact/llm_utils/negative_filtering/prompts/ct/few_shot_examples.json
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,44 @@ | ||
| [ | ||
| { | ||
| "input": [ | ||
| "There is a relative opacity observed in the left mid-to-lower lung, possibly located in the lingula.", | ||
| "There is no evidence of pneumothorax.", | ||
| "The cardiac silhouette is unremarkable.", | ||
| "The mediastinal silhouette is unremarkable.", | ||
| "Mild recessions are observed in the upper lobe of the left lung." | ||
| ], | ||
| "output": { | ||
| "phrases": [ | ||
| "There is a relative opacity observed in the left mid-to-lower lung, possibly located in the lingula.", | ||
| "Mild recessions are observed in the upper lobe of the left lung." | ||
| ] | ||
| } | ||
| }, { | ||
| "input": [ | ||
| "The right lung is well aerated.", | ||
| "No signs of pulmonary edema.", | ||
| "No signs of focal consolidation.", | ||
| "The left side still shows mediastinal shifting and volume loss.", | ||
| "No signs of pleural effusions." | ||
| ], | ||
| "output": { | ||
| "phrases": [ | ||
| "The left side still shows mediastinal shifting and volume loss." | ||
| ] | ||
| } | ||
| }, { | ||
| "input": [ | ||
| "There is a moderate right pleural effusion.", | ||
| "There is no pneumothorax.", | ||
| "The heart size is within normal limits.", | ||
| "The radiograph shows linear opacities in the right middle lobe and left lower lobe, indicating atelectasis.", | ||
| "The mediastinal contours are unremarkable." | ||
| ], | ||
| "output": { | ||
| "phrases": [ | ||
| "There is a moderate right pleural effusion.", | ||
| "The radiograph shows linear opacities in the right middle lobe and left lower lobe, indicating atelectasis." | ||
| ] | ||
| } | ||
| } | ||
| ] |
13 changes: 13 additions & 0 deletions
13
src/radfact/llm_utils/negative_filtering/prompts/ct/system_message.txt
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,13 @@ | ||
| You are an AI radiology assistant. You are helping process reports from CT (computed tomography) scans. | ||
|
|
||
| You are given a list of phrases from a radiology report which refer to objects, findings, or anatomies visible in a CT scan, or the absence of such. | ||
|
|
||
| Your goal is to filter phrases that do not refer to positive radiology findings. | ||
|
|
||
| Rules: | ||
| - Remove statements describing the absence of pathology (e.g. "No pneumothorax", "No pleural effusion detected") | ||
| - Remove statements describing normal anatomical appearance, calibration, or function (e.g. "The liver is normal in size", "Upper abdominal organs are normal", "Thoracic esophageal calibration was normal", "The lungs are well aerated", "Lungs are clear") | ||
| - Remove statements describing unremarkable appearances (e.g. "Kidneys appear unremarkable", "The mediastinum is unremarkable") | ||
| - Keep statements referring to "mild" observations or conditions, as those are still considered positive radiology findings | ||
|
|
||
| The objective is to remove phrases which do not refer to positive radiology findings. |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Oops, something went wrong.
Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
Uh oh!
There was an error while loading. Please reload this page.