DeepFocusCrawler

A simple web crawler that discovers, analyzes, and archives web content based on keyword searches with depth-based recursive link extraction.



⚠️ Important Notice

This project is a legacy codebase from 2013 and may contain outdated code, deprecated libraries, and unresolved bugs. The project is being revived and modernized, and we actively welcome bug reports and contributions to fix issues and bring it up to date.

Known Limitations

  1. French Naming Convention: Class names, variables, methods, and comments are written in French. If you'd like to help modernize the codebase, renaming everything to English is a major contribution we'd greatly appreciate!

    Example:

    // Current (French)
    public void lancerrecherche() { }
    private String chaineDeRecherche;
    
    // Needed (English)
    public void startSearch() { }
    private String searchKeyword;
  2. Outdated Code: Various parts may not be compatible with modern Java versions (11+, 17+, 21+)

  3. Deprecated Libraries: htmlparser and Google API integration may need updates

  4. Missing Error Handling: Some edge cases are not properly handled

How You Can Help

We welcome contributions in these areas:

  • Bug Reports: Open an Issue to report bugs
  • 🔧 Code Contributions:
    • Submit a Pull Request to fix issues
    • Internationalize the code by renaming French identifiers to English
    • Update deprecated libraries
    • Add proper error handling
  • Suggestions: Share ideas for modernization and improvements
  • Testing: Test on modern Java versions and report compatibility issues

Your feedback and contributions help make this project better for everyone!



Overview

DeepFocusCrawler is a Java-based web crawler built for targeted content discovery and analysis. It combines Google's Custom Search API with recursive HTML parsing to locate, extract, and archive web pages containing specific keywords. The crawler offers a Swing GUI, real-time progress monitoring, and queue management with configurable crawling depth.

Key Characteristics

  • Intelligent Search: Leverages Google Custom Search API for initial URL discovery
  • Recursive Extraction: Automatically extracts and processes nested links with customizable depth limits
  • Multi-language Support: Supports 45+ languages for targeted searches
  • Real-time Monitoring: Live progress tracking with UI updates
  • Respect for Web Standards: robots.txt compliance with caching (request delays are a planned improvement; see Areas for Improvement)
  • Efficient Queue Management: Smart duplicate detection and queue processing

Features

Core Crawling Features

  • Google-Powered Search: Integrates with Google Custom Search Engine API to discover initial URLs
  • Recursive Link Extraction: Automatically discovers and processes nested links up to configurable depth levels
  • Multi-language Support: Search in Afrikaans, Arabic, Armenian, Chinese, English, French, German, Spanish, and 36 other languages
  • Depth-Based Processing: Control recursion depth (1-100 levels) to limit crawling scope
  • Keyword Matching: Intelligently filters pages based on search keyword presence
  • Content Persistence: Automatically saves matching content to disk with organized file naming

Technical Features

  • Asynchronous Execution: Crawling runs in the background so the GUI stays responsive; downloads themselves are sequential (see Scalability)
  • robots.txt Compliance: Respects website crawling restrictions with caching
  • Real-time Progress Bar: Visual feedback on crawling completion percentage
  • Error Handling: Robust exception handling and graceful degradation
  • Duplicate Detection: Prevents processing of the same URL multiple times
  • Detailed Logging: Comprehensive error messages and debugging information
  • User-Friendly GUI: Intuitive interface built with Java Swing

Architecture

System Overview

┌──────────────────────────────────────────────────────────┐
│                    DeepFocusCrawler                      │
│                                                          │
│  ┌────────────────────────────────────────────────────┐  │
│  │           GUI Layer (Fenetre.java)                 │  │
│  │  - Search Interface                                │  │
│  │  - Progress Visualization                          │  │
│  │  - Results Display                                 │  │
│  └────────────────────────┬───────────────────────────┘  │
│                           │                              │
│  ┌────────────────────────▼───────────────────────────┐  │
│  │      Observer/Observable Pattern Bus               │  │
│  │  (Event-driven communication layer)                │  │
│  └────────────────────────┬───────────────────────────┘  │
│                           │                              │
│  ┌────────────────────────▼───────────────────────────┐  │
│  │    Business Logic Layer (Manager.java)             │  │
│  │  - Search Orchestration                            │  │
│  │  - Queue Management                                │  │
│  │  - URL Validation                                  │  │
│  └────────────────────────┬───────────────────────────┘  │
│                           │                              │
│        ┌──────────────────┼──────────────────┐           │
│        │                  │                  │           │
│  ┌─────▼────┐    ┌────────▼────────┐    ┌────▼─────┐     │
│  │GoogleAPI │    │    Extracteur   │    │Sauvegarde│     │
│  │(Search)  │    │(Parse & Extract)│    │(Persist) │     │
│  └──────────┘    └─────────────────┘    └──────────┘     │
│                                                          │
└──────────────────────────────────────────────────────────┘

Component Interaction Flow

User Input (Search Query)
        ↓
   [Manager]
        ↓
[GoogleRecherche] → Search Results → URL Queue
        ↓
[Extracteur] → Parse HTML → Extract Links & Text
        ↓
[Sauvegarde] → Filter & Save Matching Content
        ↓
[UI Update] → Display Results & Progress

πŸ“ Project Structure

DeepFocusCrawler/
│
├── com/crawl/
│   │
│   ├── vue/
│   │   ├── Main.java                 # Application entry point
│   │   └── Fenetre.java              # Main GUI window
│   │
│   ├── manager/
│   │   ├── Manager.java              # Central crawler coordinator
│   │   ├── Langue.java               # Language code mapper
│   │   └── languages                 # Language configuration file
│   │
│   ├── downloader/
│   │   ├── Extracteur.java           # HTML parser & link extractor
│   │   ├── Sauvegarde.java           # Content persistence handler
│   │   └── Noeud.java                # Data structure for queue nodes
│   │
│   ├── interfaces/
│   │   └── GoogleRecherche.java      # Google Custom Search API wrapper
│   │
│   └── observer/
│       ├── Observer.java             # Observer interface
│       └── Observable.java           # Observable interface
│
├── README.md                          # This file
├── LICENSE                            # MIT License
├── .gitignore                         # Git ignore rules
└── libraries/
    └── htmlparser.jar                # HTML parsing library

🔧 Component Details

1. Fenetre.java (GUI Layer)

Provides the user interface with:

  • Search keyword input field
  • Language selection dropdown (45+ languages)
  • Depth level selector (1-100)
  • Download location picker
  • Search/Stop buttons
  • Real-time progress bar
  • Results display table
  • Seed URLs table

Implements: Observer pattern to receive crawler updates


2. Manager.java (Business Logic & Orchestration)

Central coordinator handling:

  • Search initialization with Google API
  • URL queue management
  • Depth validation
  • Page type verification (HTML/Text only)
  • Observer notification for UI updates
  • Queue processing workflow

Key Methods:

  • lancerrecherche(): Initiates Google search and populates queue
  • parcourirLaQueue(): Processes all URLs in the queue (see the sketch below)
  • PageHtmlText(): Validates page MIME types using regex

Implements: Both Observer and Observable patterns
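
Taken together, these methods amount to a depth-limited breadth-first traversal over a URL queue. The sketch below illustrates that workflow; it is not the actual Manager.java code, and the Noeud fields, the visited set, and the LinkExtractor callback are assumptions standing in for the real wiring.

// Illustrative sketch of the queue workflow; Noeud fields and the
// LinkExtractor callback are assumptions, not the actual Manager code.
import java.net.URL;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class QueueSketch {
    interface LinkExtractor { List<URL> extraireLiens(Noeud n) throws Exception; }

    static class Noeud {                        // stand-in for downloader/Noeud.java
        final URL url;
        final int profondeur;                   // depth at which this URL was found
        Noeud(URL url, int profondeur) { this.url = url; this.profondeur = profondeur; }
    }

    private final Queue<Noeud> queue = new ArrayDeque<>();  // seeded by the Google search
    private final Set<String> visited = new HashSet<>();    // duplicate detection

    void parcourirLaQueue(LinkExtractor extracteur, int profondeurMax) throws Exception {
        while (!queue.isEmpty()) {
            Noeud noeud = queue.poll();
            if (noeud.profondeur > profondeurMax) continue;    // enforce depth limit
            if (!visited.add(noeud.url.toString())) continue;  // skip already-seen URLs
            // (PageHtmlText() would filter non-HTML content types before this step)
            for (URL lien : extracteur.extraireLiens(noeud)) {
                queue.add(new Noeud(lien, noeud.profondeur + 1));
            }
        }
    }
}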


3. Extracteur.java (HTML Parsing & Link Extraction)

Handles page processing:

  • Downloads pages via Parser
  • Extracts text content using StringBean
  • Extracts page title from <title> tags
  • Recursively extracts all hyperlinks
  • Filters invalid links (JavaScript, anchors, etc.)
  • Resolves relative URLs to absolute URLs
  • Reads and caches robots.txt restrictions
  • Handles URL encoding and special characters

Key Methods:

  • extraireLiens(): Extract all hyperlinks from page
  • extraireTexte(): Extract plain text content
  • extraireTitre(): Extract page title
  • siRobotAutorise(): Check robots.txt compliance (with caching; see the sketch below)
  • chainesCorrespondentes(): Keyword matching (case-insensitive)

Dependencies:

  • org.htmlparser library for HTML parsing
  • Custom Noeud class for queue nodes
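
The robots.txt caching and keyword matching mentioned above can be pictured as follows. This is an illustrative sketch, not the repository's Extracteur.java: the cache layout and the naive Disallow parsing (which ignores User-agent sections) are assumptions.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative only: method names follow this README, internals are assumptions.
public class RobotsSketch {
    // One parsed robots.txt per host, so each host is fetched at most once
    private final Map<String, List<String>> disallowCache = new HashMap<>();

    boolean siRobotAutorise(URL page) {
        List<String> disallowed = disallowCache.computeIfAbsent(page.getHost(), host -> {
            List<String> rules = new ArrayList<>();
            try (BufferedReader in = new BufferedReader(new InputStreamReader(
                    new URL(page.getProtocol() + "://" + host + "/robots.txt").openStream(),
                    StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    line = line.trim();
                    if (line.toLowerCase().startsWith("disallow:")) {
                        rules.add(line.substring("disallow:".length()).trim());
                    }
                }
            } catch (Exception e) {
                // No robots.txt (or unreachable): treat as unrestricted
            }
            return rules;
        });
        for (String prefix : disallowed) {
            if (!prefix.isEmpty() && page.getPath().startsWith(prefix)) return false;
        }
        return true;
    }

    // Case-insensitive keyword matching, as chainesCorrespondentes() is described
    static boolean chainesCorrespondentes(String texte, String motCle) {
        return texte.toLowerCase().contains(motCle.toLowerCase());
    }
}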

4. Sauvegarde.java (Content Persistence)

Manages file storage:

  • Creates output directories
  • Sanitizes filenames (removes invalid characters)
  • Handles duplicate titles with numbering
  • Serializes content to disk as .txt files
  • Notifies observers of saved results

File Naming Convention:

emplacement/1 - PageTitle.txt
emplacement/2 - AnotherPage.txt
emplacement/3 - PageTitle2.txt    # Duplicate handling
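
A compact way to get exactly that naming behavior, including the numeric suffix for repeated titles, is sketched below. This illustrates the scheme; it is not the actual Sauvegarde.java.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch of the naming scheme above; internals are assumptions.
public class SauvegardeSketch {
    private int compteur = 0;                              // running file number
    private final Map<String, Integer> titresVus = new HashMap<>();

    Path serialize(String texte, String titre, Path emplacement) throws IOException {
        Files.createDirectories(emplacement);
        // Strip characters that are invalid in file names on common platforms
        String propre = titre.replaceAll("[\\\\/:*?\"<>|]", "_").trim();
        int vu = titresVus.merge(propre, 1, Integer::sum);
        if (vu > 1) propre += vu;                          // "PageTitle" -> "PageTitle2"
        Path fichier = emplacement.resolve(++compteur + " - " + propre + ".txt");
        Files.write(fichier, texte.getBytes(StandardCharsets.UTF_8));
        return fichier;
    }
}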

5. GoogleRecherche.java (Search API)

Integrates with Google Custom Search:

  • Performs paginated searches (up to 100 results)
  • Supports language-specific queries
  • Extracts URLs from JSON responses
  • Handles HTTP connections with GET requests

API Details:

  • Base URL: https://www.googleapis.com/customsearch/v1
  • Pagination: 10 requests × 10 results = 100 total results
  • Language support: Via lr parameter
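
In outline, that pagination maps to ten GET requests with start offsets 1, 11, ..., 91. The sketch below is an approximation, not the repository's GoogleRecherche.java: the key/cx values are placeholders, and the regex pull of "link" fields stands in for proper JSON parsing.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Illustrative sketch of the paginated Custom Search calls described above.
public class GoogleRechercheSketch {
    private static final Pattern LINK = Pattern.compile("\"link\"\\s*:\\s*\"([^\"]+)\"");

    static List<String> search(String apiKey, String cx, String query, String lang)
            throws Exception {
        List<String> urls = new ArrayList<>();
        for (int start = 1; start <= 91; start += 10) {    // 10 pages x 10 results = 100
            String u = "https://www.googleapis.com/customsearch/v1"
                    + "?key=" + apiKey + "&cx=" + cx
                    + "&q=" + URLEncoder.encode(query, "UTF-8")
                    + "&lr=lang_" + lang                   // language restriction, e.g. "en"
                    + "&start=" + start;
            HttpURLConnection conn = (HttpURLConnection) new URL(u).openConnection();
            conn.setRequestMethod("GET");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(conn.getInputStream(), StandardCharsets.UTF_8))) {
                StringBuilder body = new StringBuilder();
                String line;
                while ((line = in.readLine()) != null) body.append(line);
                Matcher m = LINK.matcher(body);            // crude stand-in for JSON parsing
                while (m.find()) urls.add(m.group(1));
            }
        }
        return urls;
    }
}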

6. Langue.java (Language Support)

Provides language code mapping:

  • Loads language list from languages configuration file
  • Maps language names to ISO codes
  • Supports 45+ languages globally

Example Mappings:

English → en
Français → fr
العربية → ar
中文 → zh
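
Given the languages file format documented under Configuration (code,language_name), the loading logic is likely close to the sketch below; method and variable names here are illustrative, not copied from Langue.java.

import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Illustrative loader for the "languages" file (format: code,language_name).
public class LangueSketch {
    static Map<String, String> load(String chemin) throws IOException {
        Map<String, String> nameToCode = new HashMap<>();
        for (String ligne : Files.readAllLines(Paths.get(chemin), StandardCharsets.UTF_8)) {
            String[] champs = ligne.split(",", 2);                   // e.g. "fr,French"
            if (champs.length == 2) {
                nameToCode.put(champs[1].trim(), champs[0].trim());  // "French" -> "fr"
            }
        }
        return nameToCode;
    }
}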

7. Observer/Observable Pattern

Event-driven architecture for loose coupling:

  • Observable: Notifies observers of state changes
  • Observer: Receives notifications and updates

Communication Flow:

Manager (Observable) → Notify → Fenetre (Observer)
Extracteur (Observer) → Receive URL → Manager
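
Because the project ships its own Observer/Observable interfaces (rather than the deprecated java.util.Observer/Observable), their shape is presumably close to the sketch below; the exact method signatures in observer/*.java may differ.

// Observer.java (illustrative)
public interface Observer {
    void update(String message);            // receive a notification from an Observable
}

// Observable.java (illustrative)
public interface Observable {
    void addObserver(Observer o);
    void removeObserver(Observer o);
    void notifyObservers(String message);   // push state changes to all registered observers
}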

Installation

Prerequisites

  • Java 8 or higher
  • Maven or direct JAR compilation
  • Internet connection (for Google API)

Step 1: Clone the Repository

git clone https://github.com/khaledkadri/DeepFocusCrawler.git
cd DeepFocusCrawler

Step 2: Compile the Project

Option A: Using Maven (note: the repository does not currently ship a pom.xml, so you may need to create one)

mvn clean compile
mvn exec:java -Dexec.mainClass="com.crawl.vue.Main"

Option B: Using javac

# Unix/macOS (on Windows, use ; instead of : as the classpath separator)
find src -name "*.java" > sources.txt
javac -cp libraries/htmlparser.jar @sources.txt
java -cp libraries/htmlparser.jar:src com.crawl.vue.Main

Step 3: Run the Application

java -cp .:libraries/htmlparser.jar com.crawl.vue.Main

A GUI window should appear with the search interface.


Usage Guide

Basic Search

  1. Enter Search Keyword: Type the word or phrase you want to find
  2. Select Language: Choose the language for search filtering
  3. Set Depth Level: Choose how many link levels to crawl (1-5 recommended)
  4. Choose Location: Select where to save downloaded content
  5. Click "Recherche": Start the crawl

GUI Components Explained

Component          Purpose
-----------------  ----------------------------
Search Field       Enter keyword to find
Language Dropdown  Filter results by language
Depth Slider       Set recursion depth (1-100)
Location Field     Choose output directory
Search Button      Start crawling
Stop Button        Halt current operation
Progress Bar       Visual crawling progress
Results Table      Display found pages
Seed URLs Table    Show starting URLs

Example: Finding Python Tutorials

1. Keyword: "python tutorial"
2. Language: English
3. Depth: 3
4. Location: /downloads/python_content
5. Click "Recherche"

Result: The crawler will:

  • Search for "python tutorial" with Google
  • Download matching pages
  • Extract links and crawl 2 more levels deep
  • Save all matching content to /downloads/python_content

βš™οΈ Configuration

Environment Variables

# Optional: Set Google API key as environment variable
export GOOGLE_API_KEY="your_api_key_here"
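
Picking the key up from Java is straightforward; the guard below is a suggested pattern, as the legacy code may hard-code the key instead.

// Suggested pattern for reading the key at runtime
String apiKey = System.getenv("GOOGLE_API_KEY");   // null if the variable is unset
if (apiKey == null || apiKey.isEmpty()) {
    throw new IllegalStateException("GOOGLE_API_KEY environment variable is not set");
}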

Configuration Files

languages (Language Codes)

Format: code,language_name

en,English
fr,French
ar,Arabic
es,Spanish
zh,Chinese

Search Parameters

All parameters are configurable through the GUI:

  • Minimum Depth: 1
  • Maximum Depth: 100
  • Default Depth: 1 (seeds only)
  • Maximum Results: 100 per search
  • Default Language: Arabic

📚 API Documentation

Fenetre.java (GUI)

// Create and display the main window
Fenetre window = new Fenetre();
window.setVisible(true);
window.setLocationRelativeTo(null);

Manager.java (Core Crawler)

// Initialize crawler
Manager crawler = new Manager();
crawler.init();

// Configure search
crawler.setMot("python tutorial");
crawler.setLangue("English");
crawler.setProfondeur(3);
crawler.setEmplacement("/output/path");

// Start crawling
crawler.lancerrecherche();
crawler.parcourirLaQueue();

Extracteur.java (Parser)

// Extract links from a node
ArrayList<URL> links = extracteur.extraireLiens(node);

// Extract page title
String title = extracteur.extraireTitre();

// Extract text content
String text = extracteur.extraireTexte();

Sauvegarde.java (Storage)

// Save content to disk
sauvegarde.serialize(textContent, pageTitle, pageURL);

French to English Translation Guide

French              English         Location
------------------  --------------  ---------------
lancerrecherche()   startSearch()   Manager.java
chaineDeRecherche   searchKeyword   Manager.java
parcourirLaQueue()  processQueue()  Manager.java
extraireLiens()     extractLinks()  Extracteur.java
extraireTexte()     extractText()   Extracteur.java
extraireTitre()     extractTitle()  Extracteur.java
profondeur          depth           Multiple files
emplacement         location        Multiple files
listObserver        observerList    Multiple files

Complete renaming is encouraged!


Ethical Crawling

Important Considerations

This crawler respects ethical web crawling practices:

Implemented Features

  • robots.txt Compliance: Checks and respects robots.txt restrictions
  • Depth Limiting: Prevents infinite recursion with configurable depth
  • Duplicate Detection: Avoids processing the same URL twice
  • Content Type Filtering: Only processes HTML and text files
  • Caching: Caches robots.txt files to reduce server load

Recommendations for Production Use

To use this crawler responsibly:

  1. Add Request Delays:
Thread.sleep(2000 + new Random().nextInt(3000)); // 2-5 seconds
  2. Set a Proper User-Agent:
conn.setRequestProperty("User-Agent", 
    "DeepFocusCrawler/1.0 (+https://github.com/khaledkadri/DeepFocusCrawler)");
  3. Handle HTTP Errors:
if (conn.getResponseCode() == 429) {
    // Too Many Requests - wait before retrying
}
  4. Respect Crawl-Delay (see the sketch below):
// Read and implement Crawl-delay from robots.txt
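
For that last item, a minimal parse could look like the helper below. This is a sketch that assumes the robots.txt body has already been fetched into a string, and it ignores the User-agent scoping a complete implementation would need.

// Hypothetical helper: extract Crawl-delay (in seconds) from a robots.txt body.
static long crawlDelayMillis(String robotsTxt, long defaultMillis) {
    for (String ligne : robotsTxt.split("\\R")) {
        ligne = ligne.trim().toLowerCase();
        if (ligne.startsWith("crawl-delay:")) {
            try {
                return (long) (Double.parseDouble(
                        ligne.substring("crawl-delay:".length()).trim()) * 1000);
            } catch (NumberFormatException ignored) { }
        }
    }
    return defaultMillis;   // fall back to the crawler's own delay
}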

Legal Notice

This tool is intended for educational and research purposes only. Users are responsible for:

  • Complying with website Terms of Service
  • Respecting copyright and intellectual property rights
  • Following local laws regarding data collection
  • GDPR and privacy regulation compliance
  • Obtaining necessary permissions before crawling

Contributing

Contributions are welcome! Please follow these steps:

  1. Fork the repository
  2. Create a feature branch (git checkout -b feature/AmazingFeature)
  3. Commit your changes (git commit -m 'Add AmazingFeature')
  4. Push to the branch (git push origin feature/AmazingFeature)
  5. Open a Pull Request

Code Style

  • Follow Java naming conventions (camelCase for variables/methods)
  • Use meaningful variable names
  • Add JavaDoc comments for public methods
  • Include error handling for all I/O operations

Areas for Improvement

  • Add request delays (2-5 seconds) between requests
  • Implement proper Crawl-delay parsing from robots.txt
  • Add HTTP status code error handling (429, 503, etc.)
  • Support for Sitemap.xml parsing
  • Meta robots tag support
  • Connection timeout configuration
  • Exponential backoff for retries (see the sketch after this list)
  • Database support for large-scale crawling
  • Distributed crawling support
  • Add multithreading
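
For contributors picking up the exponential-backoff item, one conventional shape is sketched below; the class and method names are hypothetical, not part of the current codebase.

import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;

// Hypothetical sketch: retry rate-limited requests with exponential backoff
public class BackoffSketch {
    static HttpURLConnection openWithBackoff(URL url)
            throws IOException, InterruptedException {
        long delayMs = 1000;
        for (int attempt = 0; attempt < 5; attempt++) {
            HttpURLConnection conn = (HttpURLConnection) url.openConnection();
            int code = conn.getResponseCode();
            if (code != 429 && code != 503) return conn;  // not rate-limited: hand back
            conn.disconnect();
            Thread.sleep(delayMs);
            delayMs *= 2;                                 // 1s, 2s, 4s, 8s, 16s
        }
        throw new IOException("Giving up after repeated 429/503 responses");
    }
}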

Performance Metrics

Typical Performance

Metric                 Value
---------------------  -------------------
Initial Google Search  < 2 seconds
Page Download          1-5 seconds
HTML Parsing           < 1 second per page
Link Extraction        < 1 second per page
Storage I/O            < 500 ms per file

Resource Usage

  • Memory: 50-200 MB (depends on page sizes)
  • CPU: Low (single thread with waiting)
  • Network: 1-10 Mbps (depends on content)
  • Disk: Variable (1KB-10MB per page)

Scalability

  • Current Limit: ~1000 URLs per crawl session
  • Depth Support: Up to 100 levels
  • Concurrent Downloads: 1 (sequential processing)
  • Language Support: 45+ languages

Troubleshooting

Common Issues

Issue: NullPointerException in Extracteur

Cause: Parser initialization failed

Solution:

if (parser == null) {
    System.err.println("Parser initialization failed");
    return "";
}

Issue: "Extension incorrecte" Error

Cause: Page is not HTML or plain text

Solution: This is normal. The crawler skips non-HTML files.

Issue: Few Results Found

Possible Causes:

  • Keyword too specific
  • Depth level too low (increase to 2-3)
  • Language filter too restrictive
  • Pages returned by Google don't actually contain the exact keyword

Issue: API Rate Limit Exceeded

Solution:

// Add delay between requests
Thread.sleep(3000);

Issue: robots.txt Restrictions Blocking Crawl

Cause: Website forbids crawling in robots.txt

Solution:

  • Respect the restrictions (ethical crawling)
  • Contact website owner for permission
  • Use different search parameters

License

This project is licensed under the MIT License - see the LICENSE file for complete details.

Third-Party Licenses

  • htmlparser - Licensed under LGPL
  • Java Swing - Licensed under Oracle Binary Code License

Copyright

Copyright (c) 2013 Khaled Kadri
Licensed under MIT License
https://github.com/khaledkadri/DeepFocusCrawler

Contact & Support

Getting Help


📈 Statistics

  • Total Classes: 8
  • Total Methods: 40+
  • Lines of Code: ~2000
  • Comments: ~500
  • Supported Languages: 45+
  • Max Crawl Depth: 100

πŸ™ Acknowledgments

  • Google Custom Search API
  • htmlparser library community
  • Java Swing framework
  • All contributors and testers

Status: Active Development


Happy Crawling! 🕷️
