Closed
167 commits
a65770f
Added file extension check for HTML/XML processing paths.
Thomas-Rowlands Mar 25, 2025
1c66ea0
lowered file extension checks to avoid potential bug with string comp…
Thomas-Rowlands Mar 25, 2025
427bb16
Refactored extension check logic so the string is only lowered once p…
Thomas-Rowlands Mar 25, 2025
bc9b229
Implemented parsing to validate input file types.
Thomas-Rowlands Mar 26, 2025
831c9e5
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Mar 26, 2025
a2fdfb4
Update autocorpus/Autocorpus.py
Thomas-Rowlands Apr 15, 2025
ef7e596
Initial word processing integration
Thomas-Rowlands Apr 15, 2025
4ad22d9
Merge branch 'main' into file_type_checking
Thomas-Rowlands Apr 16, 2025
a7a6392
Merge conflict resolution
Thomas-Rowlands Apr 16, 2025
68adf8b
manual correction of merge conflict
Thomas-Rowlands Apr 16, 2025
b8d1504
reintegrated file type checking changes
Thomas-Rowlands Apr 16, 2025
a827e51
Git merge correction
Thomas-Rowlands Apr 16, 2025
8fecb3b
Type hints and optimisation
Thomas-Rowlands Apr 19, 2025
ff2e0a8
Add codecov config files and upload coverage with CI
alexdewar May 16, 2025
1363286
Add codecov badge to readme
alexdewar May 16, 2025
bc1256e
Merge branch '222-process-pdf-documents' into 220-process-word-documents
Thomas-Rowlands May 19, 2025
ebddbd1
Implemented word document extraction
Thomas-Rowlands May 19, 2025
edc8016
Added the macos ci skip to PDF test.
Thomas-Rowlands May 20, 2025
c08a0bf
Added local output files produced from running tests to .gitignore.
Thomas-Rowlands May 20, 2025
e3960ce
Word test additions and old .doc document conversion
Thomas-Rowlands May 20, 2025
5b4f5c9
Merge branch 'main' into 220-process-word-documents
Thomas-Rowlands May 20, 2025
0a1eccc
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] May 20, 2025
50cecd9
Word extraction now functionally working.
Thomas-Rowlands May 20, 2025
84b5f19
Sample word document for testing
Thomas-Rowlands May 20, 2025
9061eb1
Merge branch '220-process-word-documents' of https://github.com/omics…
Thomas-Rowlands May 20, 2025
05d1a42
Added type cast for word bioc text building
Thomas-Rowlands May 20, 2025
33090c1
Updated expected PDF output with latest bioc key changes
Thomas-Rowlands May 21, 2025
3b8aca1
Type hints added
Thomas-Rowlands May 21, 2025
992fc90
Fix URL for codecov badge
alexdewar May 21, 2025
343400b
Merge remote-tracking branch 'origin/main' into codecov
alexdewar May 21, 2025
46d94cf
Merge pull request #247 from omicsNLP/codecov
AdrianDAlessandro May 21, 2025
6a0ae52
pyproject.toml: Move `pandas-stubs` and `lxml-stubs` to `dev` group
alexdewar May 21, 2025
a60b4a5
Make code robust to absence of `marker-pdf` package
alexdewar May 21, 2025
f145ce8
Move other PDF-related functionality to `pdf` module
alexdewar May 21, 2025
81c1594
Make `marker-pdf` an optional dependency
alexdewar May 21, 2025
a78349e
Update readme with instructions for enabling PDF support
alexdewar May 21, 2025
51e5f71
Update .github/actions/setup/action.yml
AdrianDAlessandro May 21, 2025
afe1a16
Suggest --all-extras in README for development
AdrianDAlessandro May 21, 2025
8f95271
Merge pull request #261 from omicsNLP/make-pdf-deps-optional
AdrianDAlessandro May 21, 2025
0bc2764
Fix running Auto-CORPus from command line
alexdewar May 22, 2025
a417a08
Updates based on suggested changes.
Thomas-Rowlands May 22, 2025
66ccd3d
Revert "Fix running Auto-CORPus from command line"
alexdewar May 22, 2025
4c1ae3d
Try fixing again, by converting `Path`s to `str`s
alexdewar May 22, 2025
9e47242
Merge pull request #262 from omicsNLP/fix-html-processing
AdrianDAlessandro May 22, 2025
0e2937b
Fix: Accidental reference to `marker` outside `pdf` module
alexdewar May 23, 2025
d79b3d0
Merge pull request #266 from omicsNLP/fix-optional-pdf-dep
AdrianDAlessandro May 23, 2025
b331e15
Reorganise HTML test data
alexdewar May 22, 2025
c0f8504
Dynamically load regression test data based on paths
alexdewar May 22, 2025
1718abf
Move HTML test data to 'public' subfolder
alexdewar May 22, 2025
de58323
Add placeholder test for private HTML data
alexdewar May 22, 2025
b4978fa
Handle test files without tables correctly
alexdewar May 22, 2025
e827ce2
Mark tests using known problematic files as xfail
alexdewar May 22, 2025
70245e9
Add private data repo as git submodule
alexdewar May 22, 2025
5d7f86f
Add `pytest-xdist` to dependencies and use to parallelise tests
alexdewar May 22, 2025
b4febd8
Remove redundant conversion to `str`
alexdewar May 23, 2025
336feba
Use custom PAT so GitHub can access private data
alexdewar May 23, 2025
e3f7806
Add instructions for downloading private data
alexdewar May 23, 2025
ee64f4c
Add readme for test data
alexdewar May 23, 2025
98896c3
Update private data repo
alexdewar May 23, 2025
c2d33b7
Review fixes/changes and altered supplementary tests to use temp dire…
Thomas-Rowlands May 23, 2025
373bcd5
Merge branch 'main' into 220-process-word-documents
Thomas-Rowlands May 23, 2025
58a984a
Post-merge cleanup. Removed duplicate dependency entry. Adjusted Word…
Thomas-Rowlands May 23, 2025
420d5a9
regenerated lock file
Thomas-Rowlands May 23, 2025
9c53e87
Bump mkdocstrings-python from 1.16.10 to 1.16.11
dependabot[bot] May 26, 2025
00cb1ed
Merge pull request #267 from omicsNLP/dependabot/pip/mkdocstrings-pyt…
github-actions[bot] May 26, 2025
beedd68
Bump ruff from 0.11.10 to 0.11.11
dependabot[bot] May 26, 2025
1efa635
Merge pull request #269 from omicsNLP/dependabot/pip/ruff-0.11.11
github-actions[bot] May 26, 2025
cc5ec3a
Bump pytest-mock from 3.14.0 to 3.14.1
dependabot[bot] May 26, 2025
57cc5ff
Merge pull request #270 from omicsNLP/dependabot/pip/pytest-mock-3.14.1
github-actions[bot] May 26, 2025
0cf9820
Bump marker-pdf from 1.6.2 to 1.7.3
dependabot[bot] May 26, 2025
b1bd686
Merge pull request #268 from omicsNLP/dependabot/pip/marker-pdf-1.7.3
github-actions[bot] May 26, 2025
fb77a28
[pre-commit.ci] pre-commit autoupdate
pre-commit-ci[bot] May 26, 2025
c6d8acc
Merge pull request #255 from omicsNLP/pre-commit-ci-update-config
github-actions[bot] May 26, 2025
461bb2f
Added linux and windows-specific dependencies for word processing
Thomas-Rowlands May 27, 2025
39d616c
Merge branch 'main' into 220-process-word-documents
Thomas-Rowlands May 27, 2025
4ac3c39
Windows word processing needs microsoft office (for now) so this will…
Thomas-Rowlands May 27, 2025
e8e8307
Merge branch '220-process-word-documents' of https://github.com/omics…
Thomas-Rowlands May 27, 2025
da33631
Added windows skip flag
Thomas-Rowlands May 27, 2025
c8c1973
mac runner requires Microsoft Word too
Thomas-Rowlands May 27, 2025
2153b8a
Correction for windows skip flag
Thomas-Rowlands May 27, 2025
826ead4
Attempt no. 999 to push a working skip_ci_windows flag.
Thomas-Rowlands May 27, 2025
de85be2
altered item marker logic
Thomas-Rowlands May 27, 2025
1e9a886
Move read_config out of Autocorpus class
AdrianDAlessandro May 20, 2025
bdc02be
Do not pass Autocorpus object to formatter
AdrianDAlessandro May 21, 2025
0c99d6c
Take all methods that don't use self out of the class
AdrianDAlessandro May 22, 2025
dad61fc
take extract_text out of AC class. Create data_structures.py
AdrianDAlessandro May 27, 2025
4f20697
Extract soup and tables. Use pathlib object
AdrianDAlessandro May 28, 2025
e17b419
Completely change entrypoint for AC class
AdrianDAlessandro May 28, 2025
5d12605
Use BioCJSONEncoder to make tests pass. Temporary until #272 is fixed
AdrianDAlessandro May 29, 2025
8277b78
Add more detail about test directory structure
alexdewar May 29, 2025
25d7568
Use definition of data path in `conftest.py`
alexdewar May 29, 2025
00059bd
Use BioCTableJSONEncoder to make tests pass
AdrianDAlessandro May 29, 2025
84abfda
Merge branch 'refactor-autocorpus-class' into file_type_checking
AdrianDAlessandro May 29, 2025
bfe24ef
Include PDF files in check_file_type and use a FileType Enum and a ma…
AdrianDAlessandro May 29, 2025
b51f260
Use check_file_type in _process_file
AdrianDAlessandro May 29, 2025
1308f92
Move file checker to file_type module and include test
AdrianDAlessandro May 29, 2025
3cbef8e
Correct docstrings
AdrianDAlessandro May 29, 2025
6fc4a22
Merge pull request #172 from omicsNLP/file_type_checking
AdrianDAlessandro May 29, 2025
d1dbb14
Apply suggestions from code review
AdrianDAlessandro May 29, 2025
abc6f06
Make changes suggested by @alexdewar
AdrianDAlessandro May 29, 2025
5059e07
Fix `is_abbreviation`
alexdewar May 14, 2025
94c48ea
Allow abbreviations separated by hyphens
alexdewar May 29, 2025
4db9f3a
Merge pull request #245 from omicsNLP/fix-is-abbreviation
AdrianDAlessandro May 29, 2025
452ceff
Merge branch 'main' into refactor-autocorpus-class
AdrianDAlessandro May 29, 2025
56e52b6
Update bug_report.md
AdrianDAlessandro May 29, 2025
3e281d9
Move the table extending and merging functions outof the AC class
AdrianDAlessandro May 29, 2025
72708e2
Move process_file out of AC class
AdrianDAlessandro May 29, 2025
75d6aa7
Remove process_html_article and convert Autocorpus to a dataclass
AdrianDAlessandro May 29, 2025
562dfa5
Move functions into new html and file_processins modules
AdrianDAlessandro May 29, 2025
e5def23
Merge branch 'main' into refactor-autocorpus-class
AdrianDAlessandro May 29, 2025
f6cebf4
Use the table file to extract linked_tables
AdrianDAlessandro May 29, 2025
21f5f31
implement some copilot suggestions
AdrianDAlessandro May 29, 2025
e24a3e7
Merge remote-tracking branch 'origin/main' into private-test-data
alexdewar May 30, 2025
bdd0b78
Update private test data for changes to abbreviation code
alexdewar May 30, 2025
1080dd0
get_table_json(): Make output deterministic
alexdewar May 30, 2025
b6ec7ce
Merge pull request #276 from omicsNLP/make-tables-output-deterministic
AdrianDAlessandro May 30, 2025
afb2add
Merge remote-tracking branch 'origin/main' into private-test-data
alexdewar May 30, 2025
4811643
Add Inputs and Outputs to documentation
AdrianDAlessandro May 29, 2025
8c7248b
Merge pull request #273 from omicsNLP/io-docs
AdrianDAlessandro May 30, 2025
73beb19
Merge branch 'main' into refactor-autocorpus-class
AdrianDAlessandro May 30, 2025
3de57dd
Update private test data for tables fix
alexdewar May 30, 2025
a38850a
Reinclude excluded private test files
alexdewar May 30, 2025
893c998
Merge branch 'main' into 220-process-word-documents
Thomas-Rowlands May 30, 2025
ee173ef
Merge branch 'main' into private-test-data
Thomas-Rowlands May 31, 2025
0ca473e
Merge pull request #265 from omicsNLP/private-test-data
Thomas-Rowlands Jun 1, 2025
ffeb689
Merge branch 'main' into 220-process-word-documents
Thomas-Rowlands Jun 1, 2025
e3e3b11
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 1, 2025
f6b31e2
Updated lock file
Thomas-Rowlands Jun 1, 2025
567772f
Merge branch '220-process-word-documents' of https://github.com/omics…
Thomas-Rowlands Jun 1, 2025
81d658c
Merge branch 'main' into refactor-autocorpus-class
Thomas-Rowlands Jun 1, 2025
c9e4b6b
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 1, 2025
ac35017
Update test_regression.py
Thomas-Rowlands Jun 1, 2025
4bda422
README.md: Use default config flag for example usages
alexdewar May 30, 2025
0848c2f
VS Code: Use default config flag for launch configs
alexdewar May 30, 2025
b0f60fb
Merge pull request #277 from omicsNLP/readme-use-default-config
AdrianDAlessandro Jun 2, 2025
0be8e24
Fix file path in test_file_type
AdrianDAlessandro Jun 2, 2025
e8a7de5
Use UNKNOWN instead of OTHER and check if file exists
AdrianDAlessandro Jun 2, 2025
b74d759
Do not use asserts for error checking
AdrianDAlessandro Jun 2, 2025
d1b4d74
Review suggestions for file_processing
AdrianDAlessandro Jun 2, 2025
606a14b
Update autocorpus/html.py
AdrianDAlessandro Jun 2, 2025
4330527
merge_tables stylistic suggested changes
AdrianDAlessandro Jun 2, 2025
5a0869b
Use _set_table_passage helper function in _merge_tables
AdrianDAlessandro Jun 2, 2025
094c244
Fix error in empty_tables processing
AdrianDAlessandro Jun 2, 2025
b530424
Merge branch 'main' into refactor-autocorpus-class
AdrianDAlessandro Jun 2, 2025
67dfbf1
Merge pull request #264 from omicsNLP/refactor-autocorpus-class
AdrianDAlessandro Jun 2, 2025
e55b012
Merge branch 'main' into 220-process-word-documents
Thomas-Rowlands Jun 2, 2025
7ee9cd9
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 2, 2025
ef28bd6
Post-merge fixes for Word regression test
Thomas-Rowlands Jun 2, 2025
f87fed4
[pre-commit.ci] pre-commit autoupdate
pre-commit-ci[bot] Jun 2, 2025
258c2c6
Merge pull request #284 from omicsNLP/pre-commit-ci-update-config
github-actions[bot] Jun 2, 2025
29287e5
Refactored for the new autocorpus class. Implemented dataclass_json d…
Thomas-Rowlands Jun 3, 2025
3a13f19
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2025
00ba1cf
Merge branch 'main' into 220-process-word-documents
Thomas-Rowlands Jun 3, 2025
2f314fa
Ruff fix
Thomas-Rowlands Jun 3, 2025
9ce48ed
Merge branch '220-process-word-documents' of https://github.com/omics…
Thomas-Rowlands Jun 3, 2025
1db882c
Fixes for mypy typing errors
Thomas-Rowlands Jun 3, 2025
acbf5b1
Fixed unbound values referenced
Thomas-Rowlands Jun 3, 2025
b23909e
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Jun 3, 2025
9407bbe
mypy fixes
Thomas-Rowlands Jun 3, 2025
df268f8
Merge branch '220-process-word-documents' of https://github.com/omics…
Thomas-Rowlands Jun 3, 2025
7eb4b83
Added new fields to expected output, matching the HTML bioc structure
Thomas-Rowlands Jun 3, 2025
7b48d20
Added missing sentence field
Thomas-Rowlands Jun 3, 2025
794372f
Added missing sentences field
Thomas-Rowlands Jun 3, 2025
c8a48a0
Merge pull request #260 from omicsNLP/220-process-word-documents
Thomas-Rowlands Jun 3, 2025
d7e6972
Use secret PAT on release workflow
AdrianDAlessandro Jun 3, 2025
387da46
Revert "Use secret PAT on release workflow"
AdrianDAlessandro Jun 3, 2025
bd28aad
Inherit secrets from release to test in CI
AdrianDAlessandro Jun 3, 2025
2 changes: 1 addition & 1 deletion .github/ISSUE_TEMPLATE/bug_report.md
Original file line number Diff line number Diff line change
Expand Up @@ -22,7 +22,7 @@ If applicable, add screenshots to help explain your problem.

## Context

-Please, complete the following to better understand the system you are using to run MUSE.
+Please, complete the following to better understand the system you are using to run Auto-CORPus.

- Operating system (eg. Windows 10):
- Auto-CORPus version (eg. 1.0.0):
Expand Down
2 changes: 1 addition & 1 deletion .github/actions/setup/action.yml
Expand Up @@ -19,4 +19,4 @@ runs:

- name: Install dependencies
shell: bash
-run: poetry install
+run: poetry install --all-extras
18 changes: 17 additions & 1 deletion .github/workflows/ci.yml
Expand Up @@ -20,11 +20,27 @@ jobs:
python-version: ['3.10', '3.11', '3.12', '3.13']
steps:
- uses: actions/checkout@v4
+with:
+# Use a custom PAT so the runners can access the private submodule
+token: ${{ secrets.PAT }}
+submodules: true
+- name: Install LibreOffice
+if: runner.os == 'Linux'
+run: sudo apt-get update && sudo apt-get install -y libreoffice
- uses: ./.github/actions/setup
with:
python-version: ${{ matrix.python-version }}
+- name: Install pywin32 on Windows
+if: runner.os == 'Windows'
+run: poetry add pywin32
- name: Run tests
-run: poetry run pytest --skip-ci-macos
+run: poetry run pytest --skip-ci-macos --skip-ci-windows
+- name: Upload coverage reports to Codecov
+if: ${{ matrix.os == 'ubuntu-latest' && matrix.python-version == '3.13' && github.event.pull_request.user.login != 'dependabot[bot]' && github.event.pull_request.user.login != 'pre-commit-ci[bot]' }}
+uses: codecov/codecov-action@v5
+with:
+fail_ci_if_error: true
+token: ${{ secrets.CODECOV_TOKEN }}
- name: Check docs build
run: poetry run mkdocs build
- name: Check types
Expand Down
1 change: 1 addition & 0 deletions .github/workflows/release.yml
Expand Up @@ -7,6 +7,7 @@ on:
jobs:
test:
uses: ./.github/workflows/ci.yml
+secrets: inherit

build-wheel:
needs: test
Expand Down
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
+[submodule "tests/data/private"]
+path = tests/data/private
+url = ../Auto-CORPus-private-test-data
6 changes: 3 additions & 3 deletions .pre-commit-config.yaml
Expand Up @@ -21,19 +21,19 @@ repos:
hooks:
- id: check-github-workflows
- repo: https://github.com/astral-sh/ruff-pre-commit
-rev: v0.11.9
+rev: v0.11.12
hooks:
- id: ruff
args: [--fix, --exit-non-zero-on-fix]
- id: ruff-format
- repo: https://github.com/pre-commit/mirrors-mypy
-rev: v1.15.0
+rev: v1.16.0
hooks:
- id: mypy
exclude: autocorpus/parse_xml.py
additional_dependencies: [types-beautifulsoup4, types-regex, lxml-stubs, types-tqdm, types-jsonschema]
- repo: https://github.com/igorshubovych/markdownlint-cli
-rev: v0.44.0
+rev: v0.45.0
hooks:
- id: markdownlint-fix
- repo: https://github.com/codespell-project/codespell
Expand Down
12 changes: 6 additions & 6 deletions .vscode/launch.json
Expand Up @@ -10,12 +10,12 @@
"request": "launch",
"module": "autocorpus",
"args": [
-"-c",
-"${workspaceFolder}/autocorpus/configs/config_pmc.json",
+"-b",
+"PMC",
"-t",
"output",
"-f",
-"${workspaceFolder}/tests/data/PMC/Current/PMC8885717.html"
+"${workspaceFolder}/tests/data/public/html/PMC/PMC8885717.html"
]
},
{
Expand All @@ -24,12 +24,12 @@
"request": "launch",
"module": "autocorpus",
"args": [
-"-c",
-"${workspaceFolder}/autocorpus/configs/config_pmc_pre_oct_2024.json",
+"-b",
+"LEGACY_PMC",
"-t",
"output",
"-f",
-"${workspaceFolder}/tests/data/PMC/Pre-Oct-2024/PMC8885717.html"
+"${workspaceFolder}/tests/data/public/html/LEGACY_PMC/PMC8885717.html"
]
}
]
Expand Down
32 changes: 26 additions & 6 deletions README.md
Expand Up @@ -5,6 +5,7 @@
[![License: GPL v3](https://img.shields.io/badge/License-GPLv3-blue.svg)](https://www.gnu.org/licenses/gpl-3.0)
[![pre-commit.ci status](https://results.pre-commit.ci/badge/github/omicsNLP/Auto-CORPus/main.svg)](https://results.pre-commit.ci/latest/github/omicsNLP/Auto-CORPus/main)
[![PyPI version](https://badge.fury.io/py/autocorpus.svg)](https://badge.fury.io/py/autocorpus)
+[![codecov](https://codecov.io/gh/omicsNLP/Auto-CORPus/graph/badge.svg?token=ZTKK4URM4A)](https://codecov.io/gh/omicsNLP/Auto-CORPus)

*Requires Python 3.10+* <!-- markdownlint-disable-line MD036 -->

Expand All @@ -21,24 +22,32 @@ The documentation for Auto-CORPus is available on our [GitHub Pages site].

## Installation

-Install with pip
+Install with pip:

```sh
pip install autocorpus
```
+
+If you want to be able to process PDF files (only available with Auto-CORPus >v1.1.0),
+you will need to install (large!) additional dependencies. To install Auto-CORPUS with
+PDF processing support, run:
+
+```sh
+pip install autocorpus[pdf]
+```
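
Review note: the commits "Make `marker-pdf` an optional dependency" and "Make code robust to absence of `marker-pdf` package" suggest the code checks for the extra at run time. The diff for that check is not shown here, so the following is only a minimal sketch of one common pattern. The helper names are hypothetical, and the import name `marker` is an assumption taken from the commit message "Accidental reference to `marker` outside `pdf` module".

```python
import importlib.util


def has_optional_dependency(module_name: str) -> bool:
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def require_pdf_support(module_name: str = "marker") -> None:
    """Raise a helpful error if the optional PDF dependency is missing.

    `module_name` defaults to "marker", which is an assumption about the
    import name of the `marker-pdf` package.
    """
    if not has_optional_dependency(module_name):
        raise ModuleNotFoundError(
            f"PDF support needs the optional '{module_name}' package; "
            "install it with: pip install autocorpus[pdf]"
        )
```

Checking with `find_spec` avoids paying the import cost of a large package just to learn whether it is installed.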

## Usage

-Run the below command for a single file example
+You can run Auto-CORPus on a single file like so:

```sh
-auto-corpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/html/file" -o JSON
+auto-corpus -b PMC -t "output" -f "path/to/html/file" -o JSON
```

-Run the main app for a directory of files example
+Auto-CORPus can also process whole directories:

```sh
-auto-corpus -c "autocorpus/configs/config_pmc.json" -t "output" -f "path/to/directory/of/html/files" -o JSON
+auto-corpus -b PMC -t "output" -f "path/to/directory/of/html/files" -o JSON
```

### Available arguments
Expand Down Expand Up @@ -118,12 +127,23 @@ To get started:

1. [Download and install Poetry](https://python-poetry.org/docs/#installation) following the instructions for your OS.
1. Clone this repository and make it your working directory
+1. (Optionally) download private test data for additional regression tests. This uses data which
+cannot be redistributed publicly (only available to members of the
+[omicsNLP](https://github.com/omicsNLP) organisation).
+
+```sh
+git submodule update --init
+```
+
1. Set up the virtual environment:

```sh
-poetry install
+poetry install --all-extras
```
+
+Note: The `--all-extras` flag is because of the additional dependencies required for
+analysing extra file types (PDF, Word, Excel, etc).

1. Activate the virtual environment (alternatively, ensure any Python-related command is preceded by `poetry run`):

```sh
Expand Down
5 changes: 2 additions & 3 deletions autocorpus/__main__.py
Expand Up @@ -7,8 +7,7 @@
from tqdm import tqdm

from . import add_file_logger, logger
-from .autocorpus import Autocorpus
-from .configs.default_config import DefaultConfig
+from .config import DefaultConfig, read_config
from .inputs import read_file_structure
from .run import run_autocorpus

Expand Down Expand Up @@ -66,7 +65,7 @@ def main():

# Load the config
if args.config:
-config = Autocorpus.read_config(args.config)
+config = read_config(args.config)
elif args.default_config:
try:
config = DefaultConfig[args.default_config].load_config()
Expand Down
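
Review note: the `DefaultConfig[args.default_config]` lookup inside a `try` reads like a by-name Enum lookup, which raises `KeyError` for an unknown name. The real `DefaultConfig` class is not part of this diff, so the sketch below uses a hypothetical stand-in. The member names come from the `-b PMC` and `-b LEGACY_PMC` arguments above, and the mapping to file names is an assumption inferred from the `launch.json` change.

```python
from enum import Enum


class DefaultConfigSketch(Enum):
    """Hypothetical stand-in for the project's DefaultConfig enum."""

    PMC = "config_pmc.json"
    LEGACY_PMC = "config_pmc_pre_oct_2024.json"


def resolve_default_config(name: str) -> DefaultConfigSketch:
    """Look up a default config by member name, with a readable error."""
    try:
        # Square-bracket access on an Enum class looks members up by name.
        return DefaultConfigSketch[name]
    except KeyError:
        valid = ", ".join(member.name for member in DefaultConfigSketch)
        raise ValueError(
            f"Unknown default config {name!r}; choose one of: {valid}"
        ) from None


selected = resolve_default_config("LEGACY_PMC")
```

Translating the `KeyError` into a message that lists the valid names keeps the command-line error actionable.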
56 changes: 20 additions & 36 deletions autocorpus/abbreviation.py
Expand Up @@ -35,43 +35,27 @@ def _remove_quotes(text: str) -> str:
return re2.sub(r'([(])[\'"\p{Pi}]|[\'"\p{Pf}]([);:])', r"\1\2", text)


-def _is_abbreviation(candidate: str) -> bool:
-r"""Check whether input string is an abbreviation.
+def _is_abbreviation(s: str) -> bool:
+"""Check whether input string is an abbreviation.

-Based on Schwartz&Hearst.
+To be classified as an abbreviation, a string must be composed exclusively of
+Unicode letters or digits, optionally separated by dots or hyphens. This sequence
+must repeat between two and ten times. We exclude strings that are *exclusively*
+composed of digits or lowercase letters.

-2 <= len(str) <= 10
-len(tokens) <= 2
-re.search(r'\p{L}', str)
-str[0].isalnum()
-
-and extra:
-if it matches (\p{L}\.?\s?){2,}
-it is a good candidate.
-
-Args:
-candidate: Candidate abbreviation
-
-Returns:
-True if this is a good candidate
+Adapted from Schwartz & Hearst.
"""
-viable = True
+# Disallow if exclusively composed of digits
+if re2.match(r"\p{N}+$", s):
+return False

-# Broken: See https://github.com/omicsNLP/Auto-CORPus/issues/144
-# if re2.match(r"(\p{L}\.?\s?){2,}", candidate.lstrip()):
-# viable = True
-if len(candidate) < 2 or len(candidate) > 10:
-viable = False
-if len(candidate.split()) > 2:
-viable = False
-if candidate.islower(): # customize function discard all lower case candidate
-viable = False
-if not re2.search(r"\p{L}", candidate): # \p{L} = All Unicode letter
-viable = False
-if not candidate[0].isalnum():
-viable = False
+# Disallow if exclusively composed of lowercase unicode chars
+if re2.match(r"\p{Ll}+$", s):
+return False

-return viable
+# Should be a repeating sequence of unicode chars or digits, optionally separated
+# by dots or hyphens. The sequence must repeat between 2 and 10 times.
+return bool(re2.match(r"([\p{L}\p{N}][\.\-]?){2,10}$", s))
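
Review note: the rule in the new docstring can be sketched with the standard library alone, which makes its edge cases easy to try out. This is an approximation for illustration, not the project's code. `is_abbreviation_sketch` is a hypothetical name, and it uses `unicodedata` general categories in place of the `\p{L}`, `\p{N}` and `\p{Ll}` classes in the pattern above.

```python
import unicodedata


def _is_letter_or_digit(ch: str) -> bool:
    # General categories starting with "L" are letters, "N" are numbers.
    return unicodedata.category(ch)[0] in ("L", "N")


def is_abbreviation_sketch(s: str) -> bool:
    """Approximate the new rule: 2 to 10 letter/digit units, each optionally
    followed by one dot or hyphen, excluding all-digit and all-lowercase input."""
    if s and all(unicodedata.category(c).startswith("N") for c in s):
        return False  # exclusively digits
    if s and all(unicodedata.category(c) == "Ll" for c in s):
        return False  # exclusively lowercase letters

    units = 0
    i = 0
    while i < len(s):
        if not _is_letter_or_digit(s[i]):
            return False  # separator without a preceding letter/digit
        units += 1
        i += 1
        if i < len(s) and s[i] in ".-":
            i += 1  # consume at most one separator per unit
    return 2 <= units <= 10


examples = ["BMI", "U.S.A.", "mRNA", "2024", "gene", "A--B"]
results = {text: is_abbreviation_sketch(text) for text in examples}
```

Under this reading, `"mRNA"` passes because it is not exclusively lowercase, while `"A--B"` fails because a unit may carry only one separator.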


def _get_definition(candidate: str, preceding: str) -> str:
Expand Down Expand Up @@ -398,7 +382,7 @@ def _extract_abbreviations(


def _biocify_abbreviations(
-abbreviations: _AbbreviationsDict, file_path: str
+abbreviations: _AbbreviationsDict, file_path: Path
) -> dict[str, Any]:
passages = []
for short, long in abbreviations.items():
Expand All @@ -416,16 +400,16 @@ def _biocify_abbreviations(
"key": "autocorpus_abbreviations.key",
"documents": [
{
-"id": Path(file_path).name.partition(".")[0],
-"inputfile": file_path,
+"id": file_path.name.partition(".")[0],
+"inputfile": str(file_path),
"passages": passages,
}
],
}
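
Review note: the document `id` above is derived with `name.partition(".")[0]`, which keeps everything before the first dot. A small sketch of how that differs from `Path.stem` for multi-suffix names follows. `document_id` is a hypothetical helper that mirrors the expression in the diff, and the second file name is made up for illustration.

```python
from pathlib import Path


def document_id(file_path: Path) -> str:
    """Mirror the diff's expression: text before the first dot of the file name."""
    return file_path.name.partition(".")[0]


html_path = Path("tests/data/public/html/PMC/PMC8885717.html")
archive_path = Path("PMC8885717.tar.gz")  # hypothetical multi-suffix name

html_id = document_id(html_path)
archive_id = document_id(archive_path)
archive_stem = archive_path.stem  # strips only the last suffix
```

For single-suffix files the two approaches agree. With several suffixes, `partition` still yields the bare identifier, whereas `stem` leaves `.tar` attached.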


def get_abbreviations(
-main_text: dict[str, Any], soup: BeautifulSoup, file_path: str
+main_text: dict[str, Any], soup: BeautifulSoup, file_path: Path
) -> dict[str, Any]:
"""Extract abbreviations from the input main text.

Expand Down
16 changes: 2 additions & 14 deletions autocorpus/ac_bioc/annotation.py
Expand Up @@ -3,7 +3,7 @@
from __future__ import annotations

import xml.etree.ElementTree as ET
-from dataclasses import dataclass, field
+from dataclasses import asdict, dataclass, field
from typing import Any

from .location import BioCLocation
Expand All @@ -20,19 +20,7 @@ class BioCAnnotation:
infons: dict[str, str] = field(default_factory=dict)
locations: list[BioCLocation] = field(default_factory=list)

-def to_dict(self):
-"""Convert the annotation to a dictionary representation.
-
-Returns:
-dict: A dictionary containing the annotation's id, text, offset, length, and infons.
-"""
-return {
-"id": self.id,
-"text": self.text,
-"offset": self.offset,
-"length": self.length,
-"infons": self.infons,
-}
+to_dict = asdict

def to_json(self) -> dict[str, Any]:
"""Convert the annotation to a JSON-serializable dictionary.
Expand Down
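
Review note: assigning `to_dict = asdict` in the class body works because a plain function stored on a class binds as a method, so `obj.to_dict()` calls `asdict(obj)`. Since `asdict` recurses into nested dataclasses and lists, the result would also include `locations`, which the removed hand-written `to_dict` did not return. The classes below are simplified, hypothetical stand-ins for the BioC classes, sketched only to show that behaviour.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class LocationSketch:
    """Hypothetical stand-in for BioCLocation."""

    offset: int = 0
    length: int = 0


@dataclass
class AnnotationSketch:
    """Hypothetical stand-in for BioCAnnotation."""

    id: str = ""
    text: str = ""
    locations: list[LocationSketch] = field(default_factory=list)

    # Unannotated, so not a dataclass field; it binds like a normal method.
    to_dict = asdict


annotation = AnnotationSketch(
    id="a1", text="BMI", locations=[LocationSketch(offset=3, length=3)]
)
as_dictionary = annotation.to_dict()
```

The nested `LocationSketch` comes back as a plain dictionary inside the `locations` list, which is what makes the per-class `to_dict` overrides removed elsewhere in this diff unnecessary.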
9 changes: 0 additions & 9 deletions autocorpus/ac_bioc/bioctable/cell.py
Expand Up @@ -9,12 +9,3 @@ class BioCTableCell:

cell_id: str = field(default_factory=str)
cell_text: str = field(default_factory=str)
-
-def to_dict(self) -> dict[str, str]:
-"""Convert the cell's attributes to a dictionary.
-
-Returns:
-dict[str, str]
-A dictionary containing the cell's ID and text content.
-"""
-return {"cell_id": self.cell_id, "cell_text": self.cell_text}
11 changes: 0 additions & 11 deletions autocorpus/ac_bioc/bioctable/collection.py
@@ -1,7 +1,6 @@
"""This module defines the BioCTableCollection class."""

from dataclasses import dataclass, field
-from typing import Any

from ...ac_bioc import BioCCollection, BioCDocument

Expand All @@ -11,13 +10,3 @@ class BioCTableCollection(BioCCollection):
"""A collection of BioCTableDocument objects extending BioCCollection."""

documents: list[BioCDocument] = field(default_factory=list)
-
-def to_dict(self) -> dict[str, Any]:
-"""Convert the BioCTableCollection to a dictionary representation.
-
-Returns:
-dict[str, Any]: A dictionary containing the collection's data, including its documents.
-"""
-base = super().to_dict()
-base["documents"] = [doc.to_dict() for doc in self.documents]
-return base
10 changes: 0 additions & 10 deletions autocorpus/ac_bioc/bioctable/document.py
Expand Up @@ -5,10 +5,8 @@
"""

from dataclasses import dataclass, field
-from typing import Any

from ...ac_bioc import BioCAnnotation, BioCDocument, BioCPassage
-from ...ac_bioc.bioctable.passage import BioCTablePassage


@dataclass
Expand All @@ -17,11 +15,3 @@ class BioCTableDocument(BioCDocument):

passages: list[BioCPassage] = field(default_factory=list)
annotations: list[BioCAnnotation] = field(default_factory=list)
-
-def to_dict(self) -> dict[str, Any]:
-"""Convert the BioCTableDocument to a dictionary representation."""
-base = super().to_dict()
-base["passages"] = [
-p.to_dict() for p in self.passages if isinstance(p, BioCTablePassage)
-]
-return base
1 change: 0 additions & 1 deletion autocorpus/ac_bioc/bioctable/json.py
Expand Up @@ -68,7 +68,6 @@ def default(self, o):
"source": o.source,
"date": o.date,
"key": o.key,
-"version": o.version,
"infons": o.infons,
"documents": [self.default(d) for d in o.documents],
}
Expand Down