Machine learning–based NLP system to detect semantically duplicate questions using TF-IDF and Logistic Regression

rishig47-dev/Duplicate-Question-Detection

Duplicate Question Detection using Machine Learning

📌 Overview

Duplicate Question Detection is a Machine Learning and Natural Language Processing (NLP) project that identifies whether two given questions are semantically identical, even if they are phrased differently. This system helps reduce redundancy in question–answer platforms, forums, search engines, and customer support systems.


🎯 Problem Statement

Given a pair of questions, predict whether they have the same meaning.

  • 1 → Duplicate questions
  • 0 → Non-duplicate questions

🎯 Objectives

  • Apply NLP techniques to real-world text data
  • Perform text preprocessing and feature extraction
  • Train a supervised machine learning model
  • Evaluate performance using standard classification metrics

🗂 Dataset

The dataset contains question pairs with a binary label indicating duplication.

Columns:

  • Question 1
  • Question 2
  • Label (0 or 1)
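
A minimal sketch of loading data in this shape with Pandas. The column names `question1`, `question2`, and `is_duplicate`, the inline sample rows, and the `load_pairs` helper are assumptions made for illustration; the repository's actual file and column names may differ.

```python
import io

import pandas as pd

# Hypothetical CSV layout: column names and rows are made up for illustration.
SAMPLE_CSV = io.StringIO(
    "question1,question2,is_duplicate\n"
    "How do I learn Python?,What is the best way to learn Python?,1\n"
    "How do I learn Python?,How do I cook rice?,0\n"
)


def load_pairs(source):
    """Read question pairs and drop rows with a missing question or label."""
    df = pd.read_csv(source)
    df = df.dropna(subset=["question1", "question2", "is_duplicate"])
    df["is_duplicate"] = df["is_duplicate"].astype(int)
    return df


pairs = load_pairs(SAMPLE_CSV)
print(pairs.shape)
print(pairs["is_duplicate"].value_counts().to_dict())
```

In practice `load_pairs` would be given a file path instead of the in-memory sample.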

⚙️ Technologies Used

  • Python
  • Jupyter Notebook
  • NumPy
  • Pandas
  • Scikit-learn
  • NLTK / spaCy
  • Matplotlib
  • Seaborn

🔄 Project Workflow

  1. Load the dataset
  2. Text preprocessing
    • Lowercasing
    • Removing punctuation
    • Stopword removal
    • Tokenization
  3. Feature engineering (Bag of Words / TF-IDF)
  4. Model training
  5. Model evaluation
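
The workflow above could be sketched roughly as follows with scikit-learn. This is an illustration under stated assumptions, not the repository's code: the toy question pairs, the short stopword set, the `preprocess` and `predict_duplicate` helpers, and the choice to concatenate the two questions' TF-IDF vectors into one feature row are all hypothetical.

```python
import string

from scipy.sparse import hstack
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny illustrative stopword set; NLTK and spaCy ship fuller lists.
STOPWORDS = {"a", "an", "the", "is", "are", "do", "does", "i", "to", "of", "in"}
PUNCT_TABLE = str.maketrans("", "", string.punctuation)


def preprocess(text):
    """Lowercase, strip punctuation, tokenize on whitespace, drop stopwords."""
    tokens = text.lower().translate(PUNCT_TABLE).split()
    return " ".join(t for t in tokens if t not in STOPWORDS)


# Toy training pairs, made up for illustration (label 1 = duplicate).
train = [
    ("How do I learn Python?", "What is the best way to learn Python?", 1),
    ("How can I lose weight?", "What are ways to lose weight?", 1),
    ("How do I learn Python?", "How do I cook rice?", 0),
    ("What is the capital of France?", "How can I lose weight?", 0),
]
q1 = [preprocess(a) for a, _, _ in train]
q2 = [preprocess(b) for _, b, _ in train]
y = [label for _, _, label in train]

# Fit one shared vocabulary, then place each question's vector side by side.
vectorizer = TfidfVectorizer()
vectorizer.fit(q1 + q2)
X = hstack([vectorizer.transform(q1), vectorizer.transform(q2)]).tocsr()

model = LogisticRegression(max_iter=1000)
model.fit(X, y)


def predict_duplicate(a, b):
    """Return 1 if the model labels the pair as duplicate, else 0."""
    features = hstack(
        [vectorizer.transform([preprocess(a)]), vectorizer.transform([preprocess(b)])]
    ).tocsr()
    return int(model.predict(features)[0])


print(predict_duplicate("How do I learn Python?", "Best way to learn Python?"))
```

Four training pairs are far too few for a meaningful model; the sketch only shows how the steps connect. Concatenated vectors are one possible feature design, and pairwise features such as the cosine similarity between the two TF-IDF vectors are a common alternative.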

📊 Evaluation Metrics

  • Accuracy
  • Precision
  • Recall
  • F1 Score
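
For reference, the four metrics can be computed directly from the confusion counts, as in the dependency-free sketch below. The `classification_metrics` helper and the label lists are made up for illustration; scikit-learn's `accuracy_score`, `precision_score`, `recall_score`, and `f1_score` provide the same quantities.

```python
def classification_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for binary labels (1 = duplicate)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)  # duplicates found
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)  # non-duplicates kept apart
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)  # false alarms
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)  # missed duplicates

    accuracy = (tp + tn) / len(pairs)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Made-up labels and predictions, for illustration only.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 0, 1, 1, 1]

metrics = classification_metrics(y_true, y_pred)
print(metrics)
```

Accuracy alone can mislead when duplicates are much rarer than non-duplicates, which is one reason to report precision, recall, and F1 alongside it.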

▶️ How to Run

  1. Clone the repository:
    git clone <repository-url>
