📬 Spam Filter using Probabilistic Classification

A Python-based spam filter that classifies SMS messages as spam or not spam using probabilistic reasoning, word frequency analysis, and text preprocessing. The model is trained directly from a labeled CSV file using basic conditional probability logic (similar to Naive Bayes).

📌 Overview

This project demonstrates a custom-built spam classifier trained from real SMS data. It processes a dataset, builds frequency dictionaries for spam and non-spam messages, and uses those statistics to classify new input messages using fundamental probabilistic principles without relying on external machine learning libraries.

Key Features:

Custom probabilistic classification algorithm
Text preprocessing and tokenization
Word frequency analysis
Real-time message classification
Educational implementation of Naive Bayes concepts

📁 Project Structure

Spam-Filter/
├── 📄 Spam_filter.py     # Main script for building the model and classifying messages
├── 📄 SMS_list.csv       # Input dataset (tab-separated) with labeled SMS messages
└── 📄 README.md          # This documentation file

📊 Dataset Requirements

CSV Format: `SMS_list.csv`

The dataset must be a tab-separated text file (saved with a .csv extension) with two columns, in this order: Label and Message.

Example format:

spam	WINNER!! You have won a free ticket to Bahamas!
notspam	Are we still on for lunch today?
spam	Congratulations! You've won £1000 cash prize!
notspam	Can you pick up some groceries on your way home?

Important Notes:

Use spam or notspam as labels
Do not include a header row in the file
Use tab separation between label and message
Place this file in the same directory as Spam_filter.py
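
As a rough illustration of the format above, the following sketch reads tab-separated lines with the standard `csv` module. The function name `load_messages` is hypothetical and is not taken from Spam_filter.py. Quoting is disabled on the assumption that quote characters inside messages should be treated as ordinary text.

```python
import csv


def load_messages(lines):
    """Yield (label, message) pairs from tab-separated lines.

    `lines` is any iterable of strings, such as an open file object.
    """
    # QUOTE_NONE keeps quote characters inside messages as literal text.
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) != 2:
            continue  # skip blank or malformed lines
        label, message = row
        if label not in ("spam", "notspam"):
            continue  # skip unexpected labels, e.g. an accidental header row
        yield label, message


# Possible usage (assumes SMS_list.csv is UTF-8 encoded):
#   with open("SMS_list.csv", encoding="utf-8", newline="") as f:
#       pairs = list(load_messages(f))
```

Skipping rows that do not match the expected shape is a design choice made for this sketch; a stricter loader could raise an error instead.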

🧹 Features & Workflow

🔤 Text Preprocessing

Cleaning: Removes digits, punctuation, and special symbols using regex
Normalization: Converts all text to uppercase for consistency
Tokenization: Splits messages into individual words
Stopword Removal: Filters out high-frequency, non-informative words
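
The four steps above could be sketched as follows. This is an illustration, not the code in Spam_filter.py: the function name `preprocess`, the exact regular expression, and the short `STOPWORDS` set are all assumptions made for the example.

```python
import re

# Hypothetical stopword list for illustration; the script's own list may differ.
STOPWORDS = {"A", "AN", "THE", "IS", "ARE", "TO", "OF", "AND",
             "YOU", "HAVE", "IN", "ON", "FOR", "IT"}


def preprocess(message):
    """Return the uppercase tokens of `message` with stopwords removed."""
    # Cleaning: replace everything except ASCII letters and whitespace with a space.
    cleaned = re.sub(r"[^A-Za-z\s]", " ", message)
    # Normalization and tokenization: uppercase, then split on whitespace.
    tokens = cleaned.upper().split()
    # Stopword removal: drop high-frequency, non-informative words.
    return [t for t in tokens if t not in STOPWORDS]
```

With this stopword set, `preprocess("WINNER!! You have won a free ticket to Bahamas!")` returns `['WINNER', 'WON', 'FREE', 'TICKET', 'BAHAMAS']`.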

🎯 Training Process

Dictionary Building: Creates two frequency dictionaries (spam_word and n_spam_word)
Word Counting: Tracks occurrence of each word in spam vs. non-spam messages
Probability Calculation: Computes conditional probabilities for classification
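
A minimal sketch of the dictionary-building step is shown here. The dictionary names `spam_word` and `n_spam_word` come from the description above; the function name `build_counts` and its input shape (pairs of label and already-preprocessed tokens) are assumptions for illustration.

```python
def build_counts(labelled_tokens):
    """Count word occurrences per class.

    `labelled_tokens` is an iterable of (label, tokens) pairs, where label is
    "spam" or "notspam" and tokens is a list of preprocessed words.
    """
    spam_word = {}    # word -> occurrences in spam messages
    n_spam_word = {}  # word -> occurrences in non-spam messages
    for label, tokens in labelled_tokens:
        target = spam_word if label == "spam" else n_spam_word
        for word in tokens:
            target[word] = target.get(word, 0) + 1
    return spam_word, n_spam_word
```

Plain dictionaries are used rather than `collections.Counter` only to keep the counting logic explicit.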

🔍 Classification Algorithm

Input Processing: Cleans and tokenizes new messages using the same preprocessing
Probability Computation: Calculates likelihood of each word belonging to spam/non-spam
Decision Making: Uses probabilistic reasoning to classify the entire message
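
To make the per-word probability step concrete, the sketch below estimates each likelihood as the word's share of all word occurrences in a class. That estimator is an assumption: the README does not state the exact formula used by Spam_filter.py, and the function name `word_likelihoods` is hypothetical.

```python
def word_likelihoods(tokens, spam_word, n_spam_word):
    """Return (word, P(word|spam), P(word|not spam)) for each token."""
    spam_total = sum(spam_word.values())
    n_spam_total = sum(n_spam_word.values())
    rows = []
    for word in tokens:
        # Relative frequency of the word within each class; 0.0 if the class is empty.
        p_spam = spam_word.get(word, 0) / spam_total if spam_total else 0.0
        p_n_spam = n_spam_word.get(word, 0) / n_spam_total if n_spam_total else 0.0
        rows.append((word, p_spam, p_n_spam))
    return rows
```

Returning the per-word values, rather than only a final label, keeps every probability visible for inspection.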

🚀 Getting Started

Prerequisites

Python 3.6+ (no additional libraries required)
Dataset: Properly formatted SMS_list.csv file

Installation & Usage

Clone the repository:

git clone https://github.com/AlphaPruned/Spam-Filter.git
cd Spam-Filter

Prepare your dataset:
- Ensure SMS_list.csv is in the same directory as Spam_filter.py
- Verify the tab-separated format with no headers

Run the spam filter:
```
python Spam_filter.py
```

Test with messages:

Enter the message: You've been selected to win a free phone!
m is spam

🔍 Example Usage

Sample Interactions:

Spam Detection:

Enter the message: WINNER! You have won a lottery of $1000!
m is spam

Non-Spam Detection:

Enter the message: Your package has arrived. Collect it today.
m is not a spam

Marketing Spam:

Enter the message: FREE! Click here to claim your reward now!
m is spam

Normal Conversation:

Enter the message: Let's meet for coffee tomorrow at 3 PM
m is not a spam

⚙️ Algorithm Details

Probabilistic Classification Process:

Training Phase:
- Parse labeled dataset
- Build word frequency dictionaries for each class
- Calculate prior probabilities for spam/non-spam

Prediction Phase:
- Preprocess input message
- For each word, calculate: P(word|spam) and P(word|not_spam)
- Apply Bayes' theorem to determine final classification

Decision Rule:

If P(spam|message) > P(not_spam|message):
    Classify as SPAM
Else:
    Classify as NOT SPAM
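
The decision rule can be sketched in Python as below. This is an unsmoothed Naive Bayes comparison written for illustration; the function name `classify`, its parameters, and the relative-frequency likelihood estimate are assumptions, and the script's actual logic may differ. It also assumes both dictionaries are non-empty and that at least one training message exists.

```python
def classify(tokens, spam_word, n_spam_word, n_spam_msgs, n_ham_msgs):
    """Return "spam" or "not spam" for a list of preprocessed tokens."""
    total_msgs = n_spam_msgs + n_ham_msgs
    # Prior probabilities: the share of each class in the training data.
    score_spam = n_spam_msgs / total_msgs
    score_ham = n_ham_msgs / total_msgs

    spam_total = sum(spam_word.values())
    ham_total = sum(n_spam_word.values())
    for word in tokens:
        # Naive independence assumption: multiply the per-word likelihoods.
        score_spam *= spam_word.get(word, 0) / spam_total
        score_ham *= n_spam_word.get(word, 0) / ham_total

    # The shared denominator P(message) cancels, so comparing these two
    # numerators is equivalent to comparing the posteriors.
    return "spam" if score_spam > score_ham else "not spam"
```

Ties fall through to "not spam", matching the `Else` branch of the rule above.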

🔬 Implementation Notes

Strengths:

Educational Value: Clear implementation of probabilistic classification concepts
No Dependencies: Uses only Python standard library
Customizable: Easy to modify preprocessing and classification logic
Transparent: All probability calculations are explicit and traceable

Limitations:

No Smoothing: May face zero probability issues with unseen words
Underflow Risk: Multiplying many small probabilities can underflow to zero for very long messages
Simple Preprocessing: Basic text cleaning without advanced NLP techniques
No Cross-Validation: Limited evaluation methodology
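
The first two limitations are easy to reproduce in isolation. The snippet below uses made-up likelihood values purely to show the floating-point behaviour; it is not taken from Spam_filter.py.

```python
import math

# Zero probability: one unseen word (likelihood 0) wipes out the whole product.
likelihoods = [0.2, 0.1, 0.0]  # made-up values; the last word was never seen
product = 1.0
for p in likelihoods:
    product *= p
# product is now 0.0, however strongly the other words point to a class.

# Underflow: 200 words with likelihood 0.01 each would give 1e-400,
# which is smaller than any value a Python float can represent.
tiny = 1.0
for _ in range(200):
    tiny *= 0.01
# tiny has collapsed to exactly 0.0.

# Summing logarithms keeps the same information in a representable range.
log_score = sum(math.log(0.01) for _ in range(200))
```

Once both class scores reach 0.0, the comparison between them carries no information.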

Potential Improvements:

Add Laplace smoothing for unseen words
Implement logarithmic probabilities to prevent underflow
Include more sophisticated text preprocessing
Add model evaluation metrics (precision, recall, F1-score)
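
The first two improvements could look roughly like this. It is a sketch under stated assumptions, not a patch for Spam_filter.py: the function names, the `alpha` parameter, and the choice of vocabulary (the union of words in both dictionaries) are all hypothetical.

```python
import math


def log_score(tokens, counts, vocab_size, log_prior, alpha=1.0):
    """Log of (prior * product of smoothed likelihoods) for one class."""
    total = sum(counts.values())
    score = log_prior
    for word in tokens:
        # Laplace (add-alpha) smoothing: unseen words get a small non-zero probability.
        p = (counts.get(word, 0) + alpha) / (total + alpha * vocab_size)
        # Adding logs replaces multiplying probabilities and avoids underflow.
        score += math.log(p)
    return score


def classify_smoothed(tokens, spam_word, n_spam_word, n_spam_msgs, n_ham_msgs):
    """Return "spam" or "not spam" using smoothed log-probabilities."""
    vocab_size = len(set(spam_word) | set(n_spam_word))
    total_msgs = n_spam_msgs + n_ham_msgs
    spam = log_score(tokens, spam_word, vocab_size, math.log(n_spam_msgs / total_msgs))
    ham = log_score(tokens, n_spam_word, vocab_size, math.log(n_ham_msgs / total_msgs))
    return "spam" if spam > ham else "not spam"
```

With `alpha=1.0` this is classic add-one smoothing; smaller values give unseen words less weight.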

📈 Performance Considerations

Speed: Classification needs only one dictionary lookup per word for each class, so it is fast
Memory: Word counts are stored in two Python dictionaries, one per class
Scalability: Memory use grows with vocabulary size, and classification time grows with message length
Accuracy: Depends heavily on training data quality and size
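
One way to put a number on accuracy is to hold back some labeled messages, classify them, and compare the predictions with the true labels. The helper below is a hypothetical sketch using only the standard library; it is not part of Spam_filter.py, and the label strings are assumed to match the dataset's `spam` / `notspam` convention.

```python
def precision_recall_f1(actual, predicted, positive="spam"):
    """Compute precision, recall and F1 for the `positive` label."""
    pairs = list(zip(actual, predicted))
    tp = sum(1 for a, p in pairs if a == positive and p == positive)  # true positives
    fp = sum(1 for a, p in pairs if a != positive and p == positive)  # false positives
    fn = sum(1 for a, p in pairs if a == positive and p != positive)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1
```

Precision reflects how many flagged messages were really spam, and recall reflects how much of the spam was caught; for a spam filter, low precision means legitimate messages are being flagged.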

👤 Author

Arnav Rajesh Kadu

GitHub: @AlphaPruned

🤝 Contributing

Contributions are welcome! Here are some ways you can help:

Bug Fixes: Report and fix any issues
Feature Enhancements: Add smoothing, logging, or evaluation metrics
Documentation: Improve code comments and documentation
Dataset: Contribute additional training data

How to Contribute:

Fork the repository
Create a feature branch (git checkout -b feature/improvement)
Commit your changes (git commit -am 'Add improvement')
Push to the branch (git push origin feature/improvement)
Open a Pull Request

📬 Contact

For questions, suggestions, or collaboration opportunities:

Open an issue on GitHub
Contact through GitHub profile

🙏 Acknowledgments

Thanks to the SMS Spam Collection dataset contributors
Inspired by classical machine learning approaches to text classification
