txt2phrases — Feature Enhancement Proposal #6

@Uditaagarwal1

Enhance txt2phrases to support more flexible input handling and compatibility with research workflows such as pygetpapers.

With this update, the library would automatically process research papers stored in varied directory structures, convert PDFs to text, and accept either a single file or a folder for batch processing.


Proposed Enhancements

1. pygetpapers Output Compatibility

  • Goal: Enable txt2phrases to automatically detect and process the directory structure generated by pygetpapers.
  • Why: The directory layout that pygetpapers produces differs from the input layout txt2phrases currently expects.
  • Expected Behavior:
    txt2phrases should recursively search nested folders to find and process .pdf or .txt files.
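
One way the folder search could look is sketched below. This is only an illustration: `find_input_files` and `SUPPORTED_SUFFIXES` are hypothetical names, not part of the current txt2phrases API, and the sketch makes no assumption about the pygetpapers layout beyond papers living somewhere under a root folder.

```python
from pathlib import Path

# Hypothetical constant: the file types the proposal wants to pick up.
SUPPORTED_SUFFIXES = {".pdf", ".txt"}


def find_input_files(root):
    """Return every .pdf/.txt file under root, at any nesting depth.

    Hypothetical helper, not an existing txt2phrases function.
    """
    root = Path(root)
    return sorted(
        path
        for path in root.rglob("*")  # walks all subfolders recursively
        if path.is_file() and path.suffix.lower() in SUPPORTED_SUFFIXES
    )
```

Matching on the lower-cased suffix means files such as `paper.PDF` are found too, and sorting the result keeps the processing order stable between runs.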

2. PDF → TXT Conversion Method

  • Goal: Add a built-in method to convert .pdf files into .txt for downstream keyword extraction.
  • Why: Users should be able to directly process PDF research papers without manual text extraction.
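
A minimal sketch of such a conversion method follows. It assumes the third-party `pypdf` package as the extraction backend, which is a choice made here for illustration and not something the proposal specifies. The function names `extract_text_with_pypdf` and `pdf_to_txt` are hypothetical.

```python
from pathlib import Path


def extract_text_with_pypdf(pdf_path):
    """Extract text page by page (assumes `pip install pypdf`)."""
    from pypdf import PdfReader  # imported lazily so the module loads without pypdf

    reader = PdfReader(str(pdf_path))
    # extract_text() can yield nothing for image-only pages, so fall back to "".
    return "\n".join((page.extract_text() or "") for page in reader.pages)


def pdf_to_txt(pdf_path, extract=extract_text_with_pypdf):
    """Write the text of pdf_path to a sibling .txt file and return its path.

    Hypothetical helper; `extract` can be swapped for another backend.
    """
    pdf_path = Path(pdf_path)
    txt_path = pdf_path.with_suffix(".txt")
    txt_path.write_text(extract(pdf_path), encoding="utf-8")
    return txt_path
```

Keeping the extractor as a parameter would let the library change PDF backends later without touching callers. Scanned PDFs without a text layer would still need OCR, which this sketch does not cover.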

3. File and Folder Input Support

  • Goal: Allow txt2phrases to work seamlessly with both single files and entire directories.

  • Why: This provides flexibility for users who want to analyze one document or batch-process an entire dataset.
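
The single-file and folder cases could share one entry point, roughly as sketched here. `resolve_inputs` is a hypothetical name, and the error handling shown is an assumption about the desired behavior, not something the proposal defines.

```python
from pathlib import Path

SUPPORTED_SUFFIXES = {".pdf", ".txt"}  # hypothetical constant


def resolve_inputs(path):
    """Turn a file or folder path into a list of files to process.

    Hypothetical helper, not an existing txt2phrases function.
    """
    path = Path(path)
    if path.is_file():
        # Single-file mode: accept only supported types.
        if path.suffix.lower() not in SUPPORTED_SUFFIXES:
            raise ValueError(f"Unsupported file type: {path.suffix}")
        return [path]
    if path.is_dir():
        # Batch mode: collect every supported file below the folder.
        return sorted(
            p for p in path.rglob("*")
            if p.is_file() and p.suffix.lower() in SUPPORTED_SUFFIXES
        )
    raise FileNotFoundError(f"No such file or directory: {path}")
```

Returning a list in both cases means the downstream keyword extraction can loop over the result without caring which mode the user chose.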
