275 commits
d0bd2e3
updated data index
stevenbird Mar 2, 2016
604771d
POS tagger model file trained on the Russian National Corpus; contrib…
stevenbird Jul 13, 2016
7d5849e
update stopwords, resolves #49, resolves #52
stevenbird Aug 20, 2016
9373d81
updated with new readme files, resolves #48
stevenbird Aug 20, 2016
42fac35
updated data index
stevenbird Aug 20, 2016
040985b
updated data index
stevenbird Aug 22, 2016
300619a
added Porter Stemmer test files, cf https://github.com/nltk/nltk/pull…
stevenbird Aug 22, 2016
3140aae
updated data index
stevenbird Oct 25, 2016
56b4ec8
added evaluation data from WMT15, cf https://github.com/nltk/nltk/iss…
stevenbird Oct 25, 2016
1d6ca44
Add Dolch word list
simonrichard Nov 4, 2016
f57983d
Update panlex_lite to latest version
MartyMacGyver Nov 24, 2016
8bdf367
Merge pull request #60 from MartyMacGyver/bugfix_59_panlex_update
stevenbird Dec 2, 2016
2856679
added Framenet v1.7, cf https://github.com/nltk/nltk/pull/1556
stevenbird Dec 30, 2016
09c414f
updated data index
stevenbird Dec 30, 2016
ce20ab8
added MWA subset of PPDB
alvations Jan 21, 2017
279a838
updated data index
alvations Jan 21, 2017
5bb8e4d
Revert the checksum and sizes for panlex
alvations Jan 21, 2017
3c0a508
Update all.xml
alvations Jan 21, 2017
0677e2f
Update all-corpora.xml
alvations Jan 21, 2017
8fa0021
Update index.xml
alvations Jan 21, 2017
46f0653
Update index.xml
alvations Jan 21, 2017
751a58b
Remove mwa_ppdb, should be in misc not in corpus
alvations Feb 15, 2017
a7dd62f
Merge pull request #63 from nltk/alvations-patch-mwa_ppdb
alvations Feb 15, 2017
77f2a86
update panlex
stevenbird Mar 2, 2017
7e54830
updated data index
stevenbird Mar 2, 2017
8b80375
Removed the empty sentences from bangla.pos
djokester Mar 19, 2017
079d2dc
Merge pull request #67 from djokester/gh-pages
stevenbird Apr 13, 2017
897da07
updated data index
stevenbird Apr 14, 2017
87a4f21
updated version of panlex_lite
stevenbird Apr 14, 2017
1164703
updated data index
stevenbird Apr 14, 2017
e93c1eb
Merge pull request #58 from simonrichard/gh-pages
stevenbird Apr 14, 2017
7b505ff
Fix index.xml typo breaking nltk.download
glowskir Apr 14, 2017
6536f8a
Merge pull request #70 from glowskir/patch-1
alvations Apr 14, 2017
6ea47f4
update checksums, resolves https://github.com/nltk/nltk/issues/1338
stevenbird May 5, 2017
6b4fe99
Create all-no-third-party.xml
alvations May 9, 2017
57a7d4e
Create third-party.xml
alvations May 9, 2017
2d2c99c
Update and rename all-no-third-party.xml to all-nltk.xml
alvations May 9, 2017
112f806
Create popular.xml
alvations May 9, 2017
73b24b0
Delete popular.xml
alvations May 9, 2017
8da427f
Create popular.xml
alvations May 9, 2017
9a02e23
updated data index
alvations May 9, 2017
09bbd58
changed the alias to popular
alvations May 9, 2017
d31afd8
dropping panlex lite from data index cf #41 #45 #57 #59
stevenbird May 10, 2017
5ae9e27
updated data index
stevenbird May 10, 2017
3c89178
Merge branch 'gh-pages' into alvations-patch-1
stevenbird May 10, 2017
7baa274
update panlex swadesh
stevenbird May 10, 2017
e633c7c
Merge pull request #75 from nltk/alvations-patch-1
stevenbird May 10, 2017
a104dfa
updated data index
stevenbird May 10, 2017
92ed37d
dropped panlex_lite; resolves #76
stevenbird May 11, 2017
ba0ceaa
updated data index
stevenbird May 11, 2017
6bbf131
Added stop words for the Romanian language
ihulub Jun 9, 2017
6a2dd86
Merge pull request #80 from ionuthulub/feature/ro-stopwords
stevenbird Jun 11, 2017
4920f7f
updated data index
stevenbird Jun 11, 2017
b0319ed
Merge pull request #2 from nltk/gh-pages
alvations Jul 14, 2017
a1646df
updated data index
Jul 14, 2017
ab036eb
updated perluniprops
Jul 14, 2017
bce5920
Merge pull request #82 from alvations/gh-pages
alvations Aug 14, 2017
30d6827
updated data index
alvations Aug 14, 2017
649ca88
updated Perluniprops with Symbol.txt
alvations Aug 14, 2017
1bfe565
Merge pull request #88 from alvations/gh-pages
alvations Aug 14, 2017
68adab2
FrameNet license
gdemelo Aug 15, 2017
ec56265
Add tests.xml
jacksonllee Oct 10, 2017
2a761ab
add arabic stopwords, resolves #95
stevenbird Oct 17, 2017
c737a4d
updated data index
stevenbird Oct 17, 2017
c22c72f
Merge pull request #89 from gdemelo/gh-pages
stevenbird Oct 22, 2017
c0a3730
Merge pull request #93 from jacksonllee/add-tests-collection
stevenbird Oct 22, 2017
c3ffe01
updated data index
stevenbird Dec 22, 2017
74f95d8
added stopwords for Greek, Nepali, Azerbaijani, resolves #83 #100 #103
stevenbird Dec 22, 2017
cd01278
add Irish, resolves #99
stevenbird Dec 22, 2017
1c06a14
updated data index
stevenbird Dec 22, 2017
3fe3266
added dutch to omw, resolves #55
stevenbird Dec 22, 2017
8a03886
updated data index
stevenbird Dec 22, 2017
6c9d932
removed stale models; resolves #104
stevenbird Dec 23, 2017
12ae9e1
updated data index
stevenbird Dec 23, 2017
3319915
Removing references to hmm_treebank_pos_tagger
theredpea Dec 23, 2017
b7908d5
Merge pull request #1 from theredpea/hmm_treebank_pos_tagger__remove_ref
theredpea Dec 23, 2017
ac20c46
Merge pull request #105 from theredpea/gh-pages
stevenbird Dec 24, 2017
f3dc1f3
remove hmm_treebank_pos_tagger from collections cf #105
stevenbird Dec 24, 2017
6c1f701
updated data index
stevenbird Mar 13, 2018
29abe1e
add Indonesian, resolves #112
stevenbird Mar 13, 2018
abdfe67
add verbnet 3.3, cf https://github.com/nltk/nltk/issues/2015
stevenbird May 14, 2018
d0704ba
updated data index
stevenbird Oct 20, 2018
2be8322
add averaged_perceptron_tagger_ru
stevenbird Oct 20, 2018
e7b01d8
Adds 2 last inaugural speeches
nimbusaeta Jun 14, 2019
3ff3b61
Merge pull request #135 from nimbusaeta/gh-pages
stevenbird Jun 16, 2019
ff5fbc0
updated data index
stevenbird Jun 16, 2019
4a9e992
update german stopwords, resolves #134
stevenbird Jun 16, 2019
c3b2098
updated data index
stevenbird Jun 16, 2019
d6d777e
update stopwords, resolves #134
stevenbird Jun 16, 2019
b63a469
Added punkt model for Russian, resolves #118
stevenbird Jul 4, 2019
1964502
updated data index
stevenbird Jul 4, 2019
069c105
updated collections, resolves #125
stevenbird Jul 4, 2019
0ad24d8
updated data index
stevenbird Jul 4, 2019
fdf42c5
update Slovene data to sloWNet version 3.1
stevenbird Jul 4, 2019
30a6bee
updated data index
stevenbird Jul 4, 2019
be98c03
added Tajik stopwords, resolves #132
stevenbird Jul 4, 2019
9366458
updated data index
stevenbird Jul 4, 2019
c556275
minor tweaks; resolves #110; resolves #130
stevenbird Jul 4, 2019
22dffba
updated data index
stevenbird Jul 4, 2019
ecd703a
removed junk files; resolves #101
stevenbird Jul 4, 2019
5dd9b2b
updated data index
stevenbird Jul 4, 2019
52ab8b2
added slovene stopwords, resolves #54
stevenbird Jul 4, 2019
0c040f5
updated data index
stevenbird Jul 4, 2019
b5cee3d
added missing French words; resolves #16
stevenbird Jul 4, 2019
52625dd
updated data index
stevenbird Jul 4, 2019
1d6de2d
updated data index
stevenbird Oct 10, 2019
3a486db
update Slovene stopwords, resolves #139
stevenbird Oct 10, 2019
444c4c5
updated data index
ekaf Oct 19, 2021
6046d97
Added wordnet31
ekaf Oct 19, 2021
724e2f2
Merge pull request #165 from ekaf/wordnet31
stevenbird Oct 19, 2021
50b57a4
updated data index
stevenbird Oct 19, 2021
92f9667
more stopwords for bengali and arabic, resolves #153, resolves #156
stevenbird Oct 19, 2021
fe79292
new index
stevenbird Oct 19, 2021
113b294
add wordnet31
stevenbird Oct 19, 2021
d26de39
updated data index
stevenbird Oct 19, 2021
caf6cb1
updated data index
stevenbird Oct 21, 2021
9e60372
repackaged stopwords, resolves #167
stevenbird Oct 21, 2021
05d7cd2
updated data index
stevenbird Oct 21, 2021
c58d09a
correct for missing file in https://github.com/nltk/nltk_data/commit/…
stevenbird Oct 21, 2021
9f73cac
Updated [0]VP(eva to [0] VP(eva, see nltk/nltk#2467
tomaarsen Oct 22, 2021
8243b94
Delete inaugural.zip
nimbusaeta Nov 1, 2021
b4e8646
Adds Biden's inaugural address
nimbusaeta Nov 1, 2021
f63b086
Add Open English Wordnet 2021
ekaf Nov 24, 2021
ee3fb3a
updated data index
ekaf Nov 24, 2021
4b92667
Upgrade omw to 1.4
ekaf Nov 29, 2021
28d106b
Add documentation files
ekaf Dec 1, 2021
de29219
Add documentation files
ekaf Dec 1, 2021
c238465
Merge pull request #170 from ekaf/wordnet2021
stevenbird Dec 2, 2021
af0f090
Merge pull request #172 from ekaf/wn31doc
stevenbird Dec 2, 2021
1b8d30b
Merge pull request #169 from nimbusaeta/gh-pages
stevenbird Dec 2, 2021
0d1918d
Remove problematic folders
ekaf Dec 3, 2021
86562f0
Run packaging script
ekaf Dec 4, 2021
d91e365
Avoid to duplicate first line
ekaf Dec 5, 2021
c38fae3
Resolve critical installation and usage issue of inaugural data (#174)
tomaarsen Dec 5, 2021
d5d2a8a
Merge branch 'gh-pages' of https://github.com/nltk/nltk_data into omw14
ekaf Dec 6, 2021
1516bda
Merge pull request #168 from tomaarsen/bugfix/sinica-treebank-format
stevenbird Dec 6, 2021
8d83662
Merge pull request #171 from ekaf/omw14
stevenbird Dec 7, 2021
23cfe3d
updated data index
ekaf Dec 8, 2021
e2e0798
Revert omw and add omw-1.4
ekaf Dec 8, 2021
3b50c80
Merge pull request #175 from ekaf/omw_compat
stevenbird Dec 9, 2021
f846fae
Add wordnet2021.xml
ekaf Dec 9, 2021
c80f434
updated data index
ekaf Dec 9, 2021
71d8b79
Add `omw-1.4.xml` to allow OMW 1.4 to be downloaded (#176)
tomaarsen Dec 10, 2021
aff0a80
Merge pull request #177 from ekaf/xml_2021
stevenbird Dec 12, 2021
ec5e674
Add Extended Open Multilingual WordNet (extended_omw)
ExplorerFreda Dec 22, 2021
697e82a
Add corpus name
ExplorerFreda Dec 23, 2021
97959bb
updated data index
tomaarsen Dec 28, 2021
3a3529e
updated data index
tomaarsen Dec 28, 2021
9c8d5df
Add script to automatically build critical collections
tomaarsen Dec 29, 2021
9c90384
Merge pull request #182 from tomaarsen/collections
stevenbird Jan 6, 2022
896ae47
Merge branch 'gh-pages' into gh-pages
stevenbird Jan 6, 2022
c49da96
Merge pull request #180 from ExplorerFreda/gh-pages
stevenbird Jan 6, 2022
8df2545
Also add collections updates in pkg_index
tomaarsen Feb 9, 2022
440f2cb
Commit all, i.e. collections and index
tomaarsen Feb 9, 2022
444941d
updated data index
tomaarsen Feb 9, 2022
6c753e8
updated data index
stevenbird Jul 4, 2022
fa6c72c
set wordnet corpora not to unzip, resolves #187
stevenbird Jul 4, 2022
aa54613
updated data index
stevenbird Jul 4, 2022
d7109c6
add metadata for universal tagset, resolves #189
stevenbird Jul 4, 2022
57e8df6
updated data index
stevenbird Jul 4, 2022
9075ab4
update universal tagset metadata
stevenbird Jul 4, 2022
40670a3
updated data index
stevenbird Jul 4, 2022
4d5ecb6
updated data index
stevenbird Jul 4, 2022
ebbf2fb
fix Gutenberg URL, resolves #184
stevenbird Jul 14, 2022
f398879
updated data index
stevenbird Jul 14, 2022
1d3c34b
added malayalam, resolves #144
stevenbird Jul 14, 2022
005569f
updated data index
stevenbird Jul 14, 2022
2921374
Add bcp47 data for handling language tags (#191)
ekaf Dec 7, 2022
3730674
updated data index
tomaarsen Dec 7, 2022
dc3dac9
Add Open English Wordnet 2022
ekaf Feb 1, 2023
9464717
Merge pull request #193 from ekaf/oewn22
stevenbird Feb 2, 2023
5db857e
updated data index
stevenbird Feb 2, 2023
7110e89
added eng tagger in json format
alvations Jul 5, 2024
409475f
added json format alternative to perceptron tagger russian
alvations Jul 5, 2024
179e751
fix package id
alvations Jul 5, 2024
4f31340
Merge pull request #208 from nltk/pickle-patch
alvations Jul 5, 2024
0118f69
made sure tagger unzip to directory
alvations Jul 5, 2024
c4efd02
updated data index
alvations Jul 5, 2024
08ab17b
Merge pull request #209 from nltk/pickle-patch
alvations Jul 5, 2024
ab3e390
updated data index
alvations Jul 5, 2024
c82bd70
patch the list vs set in classes.json
alvations Jul 5, 2024
6f34b81
Merge pull request #210 from nltk/pickle-patch
alvations Jul 5, 2024
568a56a
updated data index
alvations Jul 5, 2024
304a3c5
fixed positions of classes
alvations Jul 5, 2024
2e6fcf9
Merge pull request #211 from nltk/pickle-patch
alvations Jul 5, 2024
b740b14
updated data index
alvations Jul 5, 2024
a15abe3
added the repr(classes) for the russian tagger
alvations Jul 5, 2024
6651c03
Merge pull request #212 from nltk/pickle-patch
alvations Jul 5, 2024
9780f4d
updated data index
alvations Jul 5, 2024
fd5bcb6
added tagsets_json
alvations Jul 5, 2024
3b48d69
Merge pull request #213 from nltk/pickle-patch
alvations Jul 5, 2024
d52e584
updated data index
alvations Jul 5, 2024
c0d9931
added actual json files
alvations Jul 5, 2024
37a064b
Merge pull request #214 from nltk/pickle-patch
alvations Jul 5, 2024
b20bc32
PunktParameters stored as tab files
ekaf Jul 9, 2024
b676bcd
Add tab-formatted maxent_ne chunkers
ekaf Jul 11, 2024
a1f72de
Store maxent_treebank_pos_tagger as tab files
ekaf Jul 11, 2024
47cbacf
Merge pull request #215 from ekaf/punkt_tab
stevenbird Jul 26, 2024
ee4f769
Merge pull request #217 from ekaf/chunker-tab
stevenbird Jul 26, 2024
2eab675
Merge pull request #218 from ekaf/maxent_taggger_tab
stevenbird Jul 26, 2024
90e5249
updated data index
ekaf Jul 28, 2024
97d56c2
Add tagsets_json.xml
ekaf Jul 29, 2024
3dc5332
updated data index
ekaf Jul 29, 2024
cfe8291
Merge pull request #220 from ekaf/mkindex
stevenbird Jul 29, 2024
7b868cc
Add English Wordnet, 2024 edition
ekaf Nov 2, 2024
6cf1032
Add Belarusian and Albanian stopwords
stevenbird Feb 17, 2025
3a8fde4
Add Belarusian and Albanian stopwords
stevenbird Feb 17, 2025
5b11ace
Update data index
stevenbird Feb 17, 2025
4d742ec
Rebuild index.xml
stevenbird Feb 17, 2025
6249ecb
updated data index
stevenbird Feb 17, 2025
4f15a3d
Add Malayalam to punkt_tab
ekaf Feb 17, 2025
a0c378a
Merge remote-tracking branch 'upstream/gh-pages' into ewn
ekaf Feb 17, 2025
9cf54df
Merge branch 'ewn' of https://github.com/ekaf/nltk_data into ewn
ekaf Feb 17, 2025
329f929
Add license and webpage information to Averaged Perceptron Tagger pac…
Hiroshiba Feb 19, 2025
57ecbb6
updated data index
stevenbird Feb 20, 2025
17405d9
updated inaugural speech corpus, resolves #234
stevenbird Feb 20, 2025
47111d8
updated data index
stevenbird Feb 20, 2025
d2a9239
Merge pull request #226 from ekaf/punkt_malayalam
stevenbird Feb 20, 2025
94d6c91
updated data index
stevenbird Feb 20, 2025
22f0516
updated data index
stevenbird Feb 20, 2025
5651241
updated data index
stevenbird Feb 20, 2025
8bd455f
Merge pull request #225 from ekaf/ewn
stevenbird Mar 8, 2025
6f9d40e
Merge pull request #233 from Hiroshiba/averaged_perceptron_tagger-to-MIT
stevenbird Mar 8, 2025
64d6b8e
index english_wordnet; fix ru(s) metadata; rebuild data index
stevenbird Mar 10, 2025
66f9f16
added Tamil stopwords; resolves #199
stevenbird Mar 10, 2025
077204a
Add license files
ekaf Jun 17, 2025
29e019e
Address reviewer's concerns
ekaf Jun 17, 2025
2dc5019
Add remaining files
ekaf Jun 18, 2025
1bb484f
Remind to update DATASET-LICENSES.md when adding a package
ekaf Jun 18, 2025
b2f5e5f
Merge pull request #242 from ekaf/hotfix-241
stevenbird Jun 19, 2025
fe60468
Automate index.xml rebuild
ekaf Jun 25, 2025
6496f05
Test index.xml automation
ekaf Jun 25, 2025
c34f140
Merge pull request #245 from ekaf/test_index
stevenbird Jun 25, 2025
88ed9aa
Install nltk before building index
ekaf Jun 25, 2025
ad82c61
Fix format of mock_corpus.xml
ekaf Jun 25, 2025
52fded3
Restore the original index files
ekaf Jun 25, 2025
c664bf2
Merge pull request #246 from ekaf/ci-install-nltk
ekaf Jun 26, 2025
6251daa
Auto-build index.xml after package update
github-actions[bot] Jun 26, 2025
311d692
Trigger automatic index rebuild
ekaf Aug 5, 2025
f1bf6fe
Merge pull request #248 from ekaf/rebuild_index
ekaf Aug 8, 2025
427fc05
Auto-build index.xml after package update
github-actions[bot] Aug 8, 2025
4f393cd
Mock update, to trigger index rebuild
ekaf Oct 2, 2025
3e8fcd9
Merge pull request #252 from ekaf/mock_update
ekaf Oct 2, 2025
fdd3e9a
Auto-build index.xml after package update
github-actions[bot] Oct 2, 2025
fb14824
Uprade english_wordnet to Edition 2025+
ekaf Jan 4, 2026
dd2daa9
Merge pull request #254 from ekaf/ewn25
stevenbird Jan 9, 2026
1f4ee16
Auto-build index.xml after package update
github-actions[bot] Jan 9, 2026
98e1426
Add Uzbek stopwords, resolves #255
stevenbird Jan 9, 2026
f7a5adc
Merge branch 'gh-pages' of https://github.com/nltk/nltk_data into gh-…
stevenbird Jan 9, 2026
984c35e
Auto-build index.xml after package update
github-actions[bot] Jan 9, 2026
40 changes: 40 additions & 0 deletions .github/workflows/pkg_index.yml
@@ -0,0 +1,40 @@
name: Build and commit index.xml on package update

on:
  push:
    branches:
      - gh-pages
    paths:
      - 'packages/**'

jobs:
  build-index:
    runs-on: ubuntu-latest
    steps:
      - name: Checkout repository
        uses: actions/checkout@v3

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.x'

      - name: Install dependencies
        run: pip install nltk

      - name: Install make
        run: sudo apt-get update && sudo apt-get install -y make

      - name: Build index.xml
        run: make pkg_index

      - name: Configure git
        run: |
          git config user.name "github-actions[bot]"
          git config user.email "github-actions[bot]@users.noreply.github.com"

      - name: Commit and push index.xml
        run: |
          git add index.xml
          git commit -m "Auto-build index.xml after package update" || echo "No changes to commit"
          git push
81 changes: 81 additions & 0 deletions CONTRIBUTING.md
@@ -0,0 +1,81 @@
# Contributing to nltk_data

Thank you for your interest in contributing to [`nltk_data`](https://github.com/nltk/nltk_data)! This guide will help you add new data packages (corpora, taggers, models, etc.) and contribute improvements to existing ones.

## Adding a New Data Package

The `nltk_data` repository contains datasets and resources that can be downloaded by `nltk.downloader`. To add a new dataset or resource, please follow these steps:

### 1. Fork and Clone the Repository

First, fork the [`nltk_data`](https://github.com/nltk/nltk_data) repository to your own GitHub account. For help with forking, see the [GitHub documentation on forking a repository](https://docs.github.com/en/get-started/quickstart/fork-a-repo).

Then, clone your fork locally:

```bash
git clone https://github.com/<your-github-username>/nltk_data.git
cd nltk_data
```

### 2. Create a New Branch

Create a branch for your dataset:

```bash
git checkout -b add-my-dataset
```

### 3. Add Your Data Package

- Place your dataset in the appropriate directory under `packages/` (`corpora/`, `models/`, `tokenizers/`, etc.). If you are unsure, check the existing structure or open an issue for clarification.
- If your dataset has a license, include the license file in the same directory. If the license is unknown or separate from the repository, please add a note in a `README` or `LICENSE` file within the dataset’s folder, and document this in your pull request.

**Whenever you add a new data package, you must update [`DATASET-LICENSES.md`](DATASET-LICENSES.md) with the license information for your package.**

You only need to update [`LICENSE-OVERVIEW.md`](LICENSE-OVERVIEW.md) if you are making changes to the repository’s overall licensing structure or guidance.
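
Before opening a pull request, it can help to confirm that the new package really has an entry in `DATASET-LICENSES.md`. The following is a minimal sketch, assuming entries keep the `- <id> — <name>` form used in that file; the helper names `listed_ids` and `has_license_entry` and the sample text are hypothetical and not part of the repository.

```python
import re

# Entries in DATASET-LICENSES.md look like "- <id> — <name>".
ENTRY = re.compile(r"^- (\S+) — ", re.MULTILINE)


def listed_ids(licenses_markdown):
    """Return the set of package ids that have an entry in the given text."""
    return set(ENTRY.findall(licenses_markdown))


def has_license_entry(package_id, licenses_markdown):
    """True when package_id appears as a list entry in the license summary."""
    return package_id in listed_ids(licenses_markdown)


# Hypothetical excerpt in the same layout as DATASET-LICENSES.md.
sample = """\
## MIT License

- vader_lexicon — VADER Sentiment Lexicon

## Public Domain

- genesis — Genesis Corpus
"""

print(has_license_entry("genesis", sample))
print(has_license_entry("my_dataset", sample))
```

In practice the text would be read from the file itself, for example with `pathlib.Path("DATASET-LICENSES.md").read_text(encoding="utf-8")`. Entries that combine several ids on one line (such as `wordnet2021 / wordnet2022 / english_wordnet`) are not matched by this simple pattern and would need separate handling.
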

### 4. Update Index Files

- You do **not** need to manually update `index.xml`. This file is now rebuilt automatically by a GitHub Actions workflow after your changes are merged.
- Any local changes you make to `index.xml` will be ignored and overwritten by the workflow.
- Provide a short README or metadata file describing the package, its origin, and its license.
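
Once the index has been rebuilt, the package should be retrievable through `nltk.downloader`. The sketch below shows one way to check that from Python, assuming NLTK is installed and network access is available; the helper name `fetch_package` is hypothetical, while `nltk.download` and `nltk.data.find` are existing NLTK calls.

```python
def fetch_package(package_id, resource_path):
    """Download one nltk_data package and confirm NLTK can locate it.

    package_id    -- the package id from index.xml, e.g. "stopwords"
    resource_path -- where the package unpacks, e.g. "corpora/stopwords"
    """
    # Imported lazily so that defining this helper does not require NLTK.
    import nltk

    # nltk.download reports success as a boolean.
    if not nltk.download(package_id, quiet=True):
        return False
    try:
        nltk.data.find(resource_path)  # raises LookupError when missing
    except LookupError:
        return False
    return True
```

For an existing package, a call such as `fetch_package("stopwords", "corpora/stopwords")` exercises the same path a user would take; for a new package, substitute its own id and directory.
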

### 5. Commit and Push Your Changes

```bash
git add <your new files>
git commit -m "Add <name> dataset to nltk_data"
git push origin add-my-dataset
```

### 6. Create a Pull Request

Open a pull request from your branch to the `gh-pages` branch of `nltk/nltk_data` (the branch that the index workflow builds from). For help, see the [GitHub documentation on creating a pull request](https://docs.github.com/en/pull-requests/collaborating-with-pull-requests/proposing-changes-to-your-work-with-pull-requests/creating-a-pull-request).

In your pull request, please include:
- A description of the dataset and its purpose.
- Any relevant licensing information or restrictions.
- Instructions for any special installation or usage requirements.

### 7. Respond to Feedback

- Be responsive to comments and requested changes.
- If your dataset cannot be accepted (e.g., due to licensing issues), we will let you know in the pull request.

## General Guidelines

- **Licensing**: Please ensure you have the right to redistribute any data you submit, and document the license clearly. If the license is unknown, state this explicitly in your pull request.
- **No Large Files**: If your package is extremely large, consider hosting it elsewhere and providing an index/manifest, or open an issue to discuss options.
- **No Executable Files**: Only data, not code, should be included unless a script is essential for using the dataset.

## Additional Resources

- [GitHub Docs: Fork a repo](https://docs.github.com/en/get-started/quickstart/fork-a-repo)
- [GitHub Docs: Branches](https://docs.github.com/en/get-started/quickstart/github-glossary#branch)
- [GitHub Docs: Pull Requests](https://docs.github.com/en/pull-requests)

If you have questions or need help, please open an issue or join the [nltk-dev mailing list](https://groups.google.com/forum/#!forum/nltk-dev).

---

Thank you for helping improve NLTK’s data resources!
243 changes: 243 additions & 0 deletions DATASET-LICENSES.md
@@ -0,0 +1,243 @@
# DATASET-LICENSES.md

This document provides a grouped summary of licenses for all data packages present in the [`nltk_data`](https://github.com/nltk/nltk_data) repository, based on the current `index.xml` file. Each package is listed by its exact `id` and `name`, and grouped by license type as declared in the metadata.

> **Disclaimer:**
> This information is provided as a convenience to users and is not legal advice.
> **You must verify the license for each dataset with the original source if your use case is sensitive (especially for commercial or redistributive use).**
> Licenses or terms can change over time; this file may become outdated if not maintained.
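
The grouping described above can be reproduced from the index itself. The following is a rough sketch, assuming each package is a `<package>` element carrying `id`, `name`, and an optional `license` attribute; the sample XML is a hypothetical excerpt (`example_pkg` is invented for illustration), and the function name `group_by_license` is not part of the repository.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical excerpt shaped like index.xml; values are illustrative only.
SAMPLE_INDEX = """\
<nltk_data>
  <packages>
    <package id="vader_lexicon" name="VADER Sentiment Lexicon" license="MIT License"/>
    <package id="genesis" name="Genesis Corpus" license="Public Domain"/>
    <package id="punkt" name="Punkt Tokenizer Models"/>
    <package id="example_pkg" name="Example Package" license=""/>
  </packages>
</nltk_data>
"""


def group_by_license(index_xml):
    """Map each declared license string to a sorted list of (id, name) pairs."""
    groups = defaultdict(list)
    for pkg in ET.fromstring(index_xml).iter("package"):
        # A missing or blank attribute is filed under "Unclarified".
        license_text = (pkg.get("license") or "").strip() or "Unclarified"
        groups[license_text].append((pkg.get("id"), pkg.get("name")))
    return {lic: sorted(pkgs) for lic, pkgs in groups.items()}


for license_text, pkgs in sorted(group_by_license(SAMPLE_INDEX).items()):
    print(license_text)
    for pkg_id, name in pkgs:
        print(f"- {pkg_id} — {name}")
```

Grouping on the literal attribute text means that two spellings of the same license land in separate groups, so some manual consolidation would still be needed before the output matches the headings in this file.
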

---

## MIT License

- averaged_perceptron_tagger — Averaged Perceptron Tagger
- averaged_perceptron_tagger_eng — Averaged Perceptron Tagger (JSON)
- averaged_perceptron_tagger_ru — Averaged Perceptron Tagger (Russian)
- averaged_perceptron_tagger_rus — Averaged Perceptron Tagger (Russian)
- vader_lexicon — VADER Sentiment Lexicon

---

## Creative Commons Licenses

### Creative Commons Attribution 4.0 International

- opinion_lexicon — Opinion Lexicon
- product_reviews_1 — Product Reviews (5 Products)
- product_reviews_2 — Product Reviews (9 Products)
- pros_cons — Pros and Cons
- subjectivity — Subjectivity Dataset v1.0

### Creative Commons Attribution 3.0 Unported License

- framenet_v17 — FrameNet 1.7

### Creative Commons Attribution-NonCommercial-ShareAlike 3.0 United States

- universal_treebanks_v20 — Universal Treebanks Version 2.0

### Creative Commons Attribution 3.0 (unspecified region)

- sentiwordnet — SentiWordNet

### CC0 1.0 Universal

- panlex_swadesh — PanLex Swadesh Corpora

### CC BY-SA 3.0 (Wiktionary) & UBY 1.0 (UBY)

- extended_omw — Extended Open Multilingual WordNet

---

## GNU Licenses

### GNU General Public License

- pl196x — Polish language of the XX century sixties

### GNU Free Documentation License

- swadesh — Swadesh Wordlists
- gazetteers — Gazetteer Lists (note: for some files only; others may be public domain)

### GNU Lesser General Public License

- nonbreaking_prefixes — Non-Breaking Prefixes (Moses Decoder)

---

## Public Domain

- genesis — Genesis Corpus
- gutenberg — Project Gutenberg Selections
- inaugural — C-Span Inaugural Address Corpus
- shakespeare — Shakespeare XML Corpus Sample
- udhr — Universal Declaration of Human Rights Corpus
- udhr2 — Universal Declaration of Human Rights Corpus (Unicode Version)
- words — Word Lists

---

## “Distributed with Permission” / “May be used with Permission” / “Freely Redistributable”

> **Warning:**
> These are not standard open licenses. Terms may prohibit redistribution, modification, or commercial use.
> **You must consult the upstream source for the actual terms and whether permission applies to your use case.**

- alpino — Alpino Dutch Treebank
- indian — Indian Language POS-Tagged Corpus
- lin_thesaurus — Lin's Dependency Thesaurus
- mac_morpho — MAC-MORPHO: Brazilian Portuguese news text with part-of-speech tags
- paradigms — Paradigm Corpus
- nombank.1.0 — NomBank Corpus 1.0
- propbank — Proposition Bank Corpus 1.0
- senseval — SENSEVAL 2 Corpus: Sense Tagged Text
- verbnet — VerbNet Lexicon, Version 2.1
- verbnet3 — VerbNet Lexicon, Version 3.3
- maxent_treebank_pos_tagger — Treebank Part of Speech Tagger (Maximum entropy)
- maxent_treebank_pos_tagger_tab — Treebank Part of Speech Tagger (Maximum entropy)
- maxent_ne_chunker — ACE Named Entity Chunker (Maximum entropy)
- maxent_ne_chunker_tab — ACE Named Entity Chunker (Maximum entropy)
- pil — The Patient Information Leaflet (PIL) Corpus
- pe08 — Cross-Framework and Cross-Domain Parser Evaluation Shared Task
- kimmo — PC-KIMMO Data Files
- jeita — JEITA Public Morphologically Tagged Corpus
- knbc — KNB Corpus (Annotated blog corpus)

---

## “Non-commercial Use Only” / Educational Use

- brown — Brown Corpus
- brown_tei — Brown Corpus (TEI XML Version)
- framenet_v15 — FrameNet 1.5
- floresta — Portuguese Treebank
- masc_tagged — MASC Tagged Corpus
- nps_chat — NPS Chat

---

## “See LICENSE Files” (Aggregated/Mixed Licensing)

> **Warning:**
> These packages include files from multiple sources, each with their own license. See LICENSE files inside the package and verify terms for your use case.

- omw — Open Multilingual Wordnet
- omw-1.4 — Open Multilingual Wordnet

---

## Special Cases, Custom, or Unique Licenses

- bcp47 — BCP-47 Language Tags ("IETF Trust and Unicode Inc."; custom)
- wordnet — WordNet ("Permission to use, copy, modify and distribute this software and database and its documentation for any purpose and without fee or royalty")
- wordnet31 — Wordnet 3.1 (same as above)
- wordnet2021 / wordnet2022 / english_wordnet — Open English Wordnet (combines WordNet License and Creative Commons Attribution)
- twitter_samples — Twitter Samples ("Must be used subject to Twitter Developer Agreement")
- switchboard — Switchboard Corpus Sample ("Permission is granted for use of this material in accordance with the Open Content License")
- dependency_treebank — Dependency Parsed Treebank (fragment of Penn Treebank; non-commercial, no redistribution)
- ptb — Penn Treebank (stub for full corpus)
- treebank — Penn Treebank Sample (fragment; non-commercial, no redistribution)
- conll2000 — CONLL 2000 Chunking Corpus (research use only)
- conll2002 — CONLL 2002 Named Entity Recognition Corpus (see website)
- conll2007 — Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset, see website)
- ieer — NIST IE-ER DATA SAMPLE (see website)
- reuters — Reuters-21578 benchmark corpus, ApteMod version (Reuters Ltd. copyright)
- timit — TIMIT Corpus Sample (Creative Commons Attribution-NonCommercial-ShareAlike 3.0)

---

## Unclarified, Unknown, Ambiguous, or Citation-Only

The following packages have:
- No `license` attribute
- An empty or ambiguous value
- A citation request instead of a license
- Or otherwise ambiguous status

> **Warning:**
> These packages lack open, standard, or clearly documented licenses.
> Citation requests do **not** constitute a license.
> Despite long-standing and ongoing efforts (see [nltk_data issue #241](https://github.com/nltk/nltk_data/issues/241) and related discussions), clarification has not been possible for these cases.
> **If you need to use any of these for commercial or redistributive purposes, consult a qualified legal professional.**

- abc — Australian Broadcasting Commission 2006
- basque_grammars — Grammars for Basque
- biocreative_ppi — BioCreAtIvE (Critical Assessment of Information Extraction Systems in Biology)
- bllip_wsj_no_aux — BLLIP Parser: WSJ Model
- book_grammars — Grammars from NLTK Book
- cess_cat — CESS-CAT Treebank (citation requested, not a license)
- cess_esp — CESS-ESP Treebank (citation requested, not a license)
- chat80 — Chat-80 Data Files
- city_database — City Database
- cmudict — The Carnegie Mellon Pronouncing Dictionary (0.6)
- comparative_sentences — Comparative Sentence Dataset (ambiguous license)
- comtrans — ComTrans Corpus Sample
- dolch — Dolch Word List
- europarl_raw — Sample European Parliament Proceedings Parallel Corpus
- framenet_v15 — FrameNet 1.5 (non-commercial use only)
- gazetteers — Gazetteer Lists (mixed per-file)
- large_grammars — Large context-free and feature-based grammars
- machado — Machado de Assis -- Obra Completa ("Public Domain", verify at source)
- moses_sample — Moses Sample Models
- mwa_ppdb — Monolingual word aligner (subset of Paraphrase Database)
- names — Names Corpus, Version 1.3 (1994-03-29)
- nonbreaking_prefixes — Non-Breaking Prefixes (empty license field)
- punkt — Punkt Tokenizer Models (no license attribute)
- punkt_tab — Punkt Tokenizer Models (no license attribute)
- porter_test — Porter Stemmer Test Files
- ppattach — Prepositional Phrase Attachment Corpus
- problem_reports — Problem Report Corpus
- qc — Experimental Data for Question Classification
- rslp — RSLP Stemmer (Removedor de Sufixos da Lingua Portuguesa)
- rte — PASCAL RTE Challenges 1, 2, and 3
- sample_grammars — Sample Grammars
- semcor — SemCor 3.0
- sentence_polarity — Sentence Polarity Dataset v1.0 (ambiguous license)
- smultron — SMULTRON Corpus Sample
- snowball_data — Snowball Data
- spanish_grammars — Grammars for Spanish
- state_union — C-Span State of the Union Address Corpus
- stopwords — Stopwords Corpus
- tagsets — Help on Tagsets
- tagsets_json — Help on Tagsets (JSON)
- toolbox — Toolbox Sample Files
- unicode_samples — Unicode Samples
- webtext — Web Text Corpus
- wmt15_eval — Evaluation data from WMT15
- word2vec_sample — Word2Vec Sample
- wordnet_ic — WordNet-InfoContent
- ycoe — York-Toronto-Helsinki Parsed Corpus of Old English Prose

---

## Packages with Citation Requests Instead of Licenses

> **Note:**
> These packages specifically request citation for use, but do not provide a license. Citation requests are not a license.

- cess_cat — CESS-CAT Treebank
- cess_esp — CESS-ESP Treebank

---

## Packages Citing Source Website or “See Website” for Terms

> **Note:**
> These packages refer users to an external website for their licensing terms.

- conll2002 — CONLL 2002 Named Entity Recognition Corpus
- conll2007 — Dependency Treebanks from CoNLL 2007 (Catalan and Basque Subset)
- ieer — NIST IE-ER DATA SAMPLE
- reuters — The Reuters-21578 benchmark corpus, ApteMod version

---

## Maintenance

**If you add, update, or remove any data packages, update this file accordingly to ensure continued transparency and compliance.**
If you find omissions, errors, or outdated information, please open an issue or pull request.
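
One check that may help when maintaining this file is to list package ids that appear under more than one heading, since some of those repeats are deliberate cross-references and others may be leftovers. This is a minimal sketch, assuming the `- <id> — <name>` entry form and `#`-prefixed headings used here; the function names and the sample text are hypothetical.

```python
import re
from collections import defaultdict


def sections_per_id(licenses_markdown):
    """Map each package id to the list of headings under which it is listed."""
    found = defaultdict(list)
    heading = None
    for line in licenses_markdown.splitlines():
        if line.startswith("#"):
            # Remember the most recent heading, without its leading "#" marks.
            heading = line.lstrip("#").strip()
        else:
            match = re.match(r"- (\S+) — ", line)
            if match and heading is not None:
                found[match.group(1)].append(heading)
    return dict(found)


def listed_more_than_once(licenses_markdown):
    """Return {id: headings} for ids that appear under several headings."""
    return {
        pkg: heads
        for pkg, heads in sections_per_id(licenses_markdown).items()
        if len(heads) > 1
    }


# Hypothetical excerpt in the same layout as this file.
sample = """\
## Public Domain

- genesis — Genesis Corpus

## Unclarified, Unknown, Ambiguous, or Citation-Only

- cess_cat — CESS-CAT Treebank (citation requested, not a license)

## Packages with Citation Requests Instead of Licenses

- cess_cat — CESS-CAT Treebank
"""

print(listed_more_than_once(sample))
```

The result is only a list of candidates to review by hand; it does not decide which heading is the right one for a given package.
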

---