Duplicate Question Detection is a machine learning and natural language processing (NLP) project that identifies whether two questions are semantically equivalent, even when they are phrased differently. Such a system helps reduce redundancy on question-and-answer platforms, forums, search engines, and customer support systems.
Given a pair of questions, predict whether they have the same meaning. The prediction is a binary label:
- 1 → Duplicate questions
- 0 → Non-duplicate questions
Project objectives:
- Apply NLP techniques to real-world text data
- Perform text preprocessing and feature extraction
- Train a supervised machine learning model
- Evaluate performance using standard classification metrics
The dataset contains question pairs with a binary label indicating duplication.
Columns:
- Question 1
- Question 2
- Label (0 or 1)
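As a rough illustration of this layout, the sketch below builds a tiny in-memory table with Pandas. The column names `question1`, `question2`, and `label` are assumptions chosen for the example, and the rows are made up; the real dataset may use different names and would typically be loaded from a file instead.

```python
import pandas as pd

# Hypothetical rows for illustration only; they are not taken from the real dataset.
# Column names (question1, question2, label) are assumed, not confirmed by the project.
pairs = pd.DataFrame(
    {
        "question1": [
            "How can I learn Python quickly?",
            "What is the capital of France?",
            "How do I reset my email password?",
            "What is machine learning?",
        ],
        "question2": [
            "What is the fastest way to learn Python?",
            "How tall is the Eiffel Tower?",
            "How can I change my email password?",
            "How do I bake sourdough bread?",
        ],
        "label": [1, 0, 1, 0],
    }
)

# With a real file, something like pd.read_csv("<path-to-dataset>.csv") would replace the above.

# Basic sanity checks: labels are binary and no question is missing.
assert set(pairs["label"].unique()) <= {0, 1}
assert not pairs[["question1", "question2"]].isna().any().any()

# Class balance is worth inspecting before training.
print(pairs["label"].value_counts())
```

Checking the label distribution early helps decide whether accuracy alone is a meaningful metric or whether precision, recall, and F1 deserve more weight.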
Tools and libraries:
- Python
- Jupyter Notebook
- NumPy
- Pandas
- Scikit-learn
- NLTK / spaCy
- Matplotlib
- Seaborn
Workflow:
- Load the dataset
- Text preprocessing
  - Lowercasing
  - Removing punctuation
  - Stopword removal
  - Tokenization
- Feature engineering (Bag of Words / TF-IDF)
- Model training
- Model evaluation
  - Accuracy
  - Precision
  - Recall
  - F1 Score
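The steps above can be sketched end to end with Scikit-learn. This is a minimal illustration under several assumptions, not the project's actual implementation: the question pairs are made-up toy data, stopwords come from Scikit-learn's built-in `ENGLISH_STOP_WORDS` rather than NLTK or spaCy, tokenization is a simple whitespace split, the two TF-IDF vectors of a pair are concatenated as features, and logistic regression stands in for whichever classifier the project uses.

```python
import string

from scipy.sparse import hstack
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split


def preprocess(text):
    """Lowercase, remove punctuation, tokenize on whitespace, drop stopwords."""
    text = text.lower()
    text = text.translate(str.maketrans("", "", string.punctuation))
    tokens = text.split()
    return " ".join(t for t in tokens if t not in ENGLISH_STOP_WORDS)


# Toy question pairs (hypothetical, for illustration only).
q1 = [
    "How can I learn Python quickly?",
    "What is the capital of France?",
    "How do I reset my email password?",
    "What is machine learning?",
    "How do I lose weight fast?",
    "Who wrote Hamlet?",
    "How can I improve my English speaking?",
    "What causes earthquakes?",
]
q2 = [
    "What is the fastest way to learn Python?",
    "How tall is the Eiffel Tower?",
    "How can I change my email password?",
    "How do I bake sourdough bread?",
    "What is the quickest way to lose weight?",
    "What is the population of India?",
    "How do I get better at speaking English?",
    "How do airplanes fly?",
]
y = [1, 0, 1, 0, 1, 0, 1, 0]  # 1 = duplicate, 0 = non-duplicate

q1_clean = [preprocess(q) for q in q1]
q2_clean = [preprocess(q) for q in q2]

# Split before fitting the vectorizer so test questions do not leak into the vocabulary.
q1_tr, q1_te, q2_tr, q2_te, y_tr, y_te = train_test_split(
    q1_clean, q2_clean, y, test_size=0.25, stratify=y, random_state=0
)

vectorizer = TfidfVectorizer()
vectorizer.fit(q1_tr + q2_tr)


def featurize(first, second):
    """Concatenate the TF-IDF vectors of both questions in each pair."""
    return hstack([vectorizer.transform(first), vectorizer.transform(second)]).tocsr()


model = LogisticRegression(max_iter=1000)
model.fit(featurize(q1_tr, q2_tr), y_tr)
y_pred = model.predict(featurize(q1_te, q2_te))

metrics = {
    "accuracy": accuracy_score(y_te, y_pred),
    "precision": precision_score(y_te, y_pred, zero_division=0),
    "recall": recall_score(y_te, y_pred, zero_division=0),
    "f1": f1_score(y_te, y_pred, zero_division=0),
}
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```

With only a handful of toy pairs the scores are not meaningful; the point is the shape of the pipeline. Concatenating the two vectors is the simplest option, and pairwise features such as the cosine similarity between the two TF-IDF vectors are a common alternative worth comparing on real data.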
- Clone the repository:
git clone <repository-url>