A program for counting the words in PDF files, using NLTK word tokenization.
Note that this program does not detect text in scanned (image-only) files.
To run it, follow the steps below:
- Install python3, pip, PyPDF2, and nltk.
- Clone the project Word_counter
- NLTK library to identify stopwords
- About stopwords Read more...
- NLTK library to word tokenize
- About word tokenize Read more...
- Sample input file
- Sample program output
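
As a rough illustration of how tokenizing, stop-word removal, and counting can fit together, here is a minimal sketch. The function names (`tokenize`, `english_stopwords`, `count_words`) are hypothetical and not taken from the project; lower-casing and dropping non-alphabetic tokens are assumptions about what counts as a word. The NLTK imports sit inside the functions, so the counting step can be tried without NLTK installed.

```python
from collections import Counter


def tokenize(text):
    """Split text into word tokens with NLTK (needs the 'punkt' data)."""
    from nltk.tokenize import word_tokenize  # lazy import; requires nltk
    return word_tokenize(text)


def english_stopwords():
    """Return NLTK's English stop words as a set (needs the 'stopwords' data)."""
    from nltk.corpus import stopwords  # lazy import; requires nltk
    return set(stopwords.words("english"))


def count_words(tokens, stop_words, count_word=30):
    """Return the count_word most frequent tokens, ignoring stop words.

    Assumption: tokens are lower-cased, and anything that is not purely
    alphabetic (punctuation, numbers) is dropped before counting.
    """
    words = [t.lower() for t in tokens if t.isalpha()]
    kept = [w for w in words if w not in stop_words]
    return Counter(kept).most_common(count_word)
```

With NLTK and its data installed, a call such as `count_words(tokenize(text), english_stopwords(), count_word)` would give a list of `(word, count)` pairs, most frequent first.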
The NLTK stop-word and tokenizer data packages are required.
To install them on your system, run the following code:

    import nltk
    nltk.download('stopwords')
    nltk.download('punkt')

To choose the input file, modify the filename variable:

    filename = 'Your_file.pdf'

To change the number of output words, modify the count_word variable:

    count_word = 30

- Create a CSV file
- Create a word cloud
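
For readers curious how the input file and the CSV output might be handled, the following is a small sketch, not the project's actual code. It assumes PyPDF2 2.x or later (which provides `PdfReader` and `page.extract_text()`); the helper names `extract_text` and `write_counts_csv` and the `word`/`count` column headers are made up for illustration.

```python
import csv


def extract_text(filename):
    """Concatenate the text of every page in a PDF.

    Assumes PyPDF2 2.x or later. Scanned pages have no text layer,
    so they contribute little or nothing here.
    """
    from PyPDF2 import PdfReader  # lazy import; requires PyPDF2
    reader = PdfReader(filename)
    return "\n".join(page.extract_text() or "" for page in reader.pages)


def write_counts_csv(counts, csv_path):
    """Write (word, count) pairs to a CSV file with a header row."""
    with open(csv_path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.writer(handle)
        writer.writerow(["word", "count"])  # hypothetical column names
        writer.writerows(counts)
```

Here `extract_text(filename)` would supply the text to be tokenized and counted, and `write_counts_csv` would save the resulting pairs, for example to `counts.csv`.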