A simple web crawler that discovers, analyzes, and archives web content based on keyword searches with depth-based recursive link extraction.
This project is a legacy codebase from 2013 and may contain outdated code, deprecated libraries, and unresolved bugs. The project is being revived and modernized, and we actively welcome bug reports and contributions to fix issues and bring it up to date.
- French Naming Convention: Class names, variables, methods, and comments are written in French. If you'd like to help modernize the codebase, renaming everything to English is a major contribution we'd greatly appreciate! Example:

  ```java
  // Current (French)
  public void lancerrecherche() { }
  private String chaineDeRecherche;

  // Needed (English)
  public void startSearch() { }
  private String searchKeyword;
  ```

- Outdated Code: Various parts may not be compatible with modern Java versions (11+, 17+, 21+)
- Deprecated Libraries: htmlparser and Google API integration may need updates
- Missing Error Handling: Some edge cases are not properly handled
We welcome contributions in these areas:
- Bug Reports: Open an Issue to report bugs
- Code Contributions:
  - Submit a Pull Request to fix issues
  - Internationalize the code by renaming French identifiers to English
  - Update deprecated libraries
  - Add proper error handling
- Suggestions: Share ideas for modernization and improvements
- Testing: Test on modern Java versions and report compatibility issues
Your feedback and contributions help make this project better for everyone!
- Overview
- Features
- Architecture
- Project Structure
- Installation
- Usage
- Configuration
- API Documentation
- Ethical Crawling
- Contributing
- License
DeepFocusCrawler is a sophisticated Java-based web crawler designed for targeted content discovery and analysis. It combines Google's Custom Search API with recursive HTML parsing to locate, extract, and archive web pages containing specific keywords. The crawler features a user-friendly GUI, real-time progress monitoring, and intelligent queue management with configurable crawling depth.
- Intelligent Search: Leverages Google Custom Search API for initial URL discovery
- Recursive Extraction: Automatically extracts and processes nested links with customizable depth limits
- Multi-language Support: Supports 45+ languages for targeted searches
- Real-time Monitoring: Live progress tracking with UI updates
- Respect for Web Standards: robots.txt compliance and configurable request delays
- Efficient Queue Management: Smart duplicate detection and queue processing
- Google-Powered Search: Integrates with Google Custom Search Engine API to discover initial URLs
- Recursive Link Extraction: Automatically discovers and processes nested links up to configurable depth levels
- Multi-language Support: Search in Afrikaans, Arabic, Armenian, Chinese, English, French, German, Spanish, and 36 other languages
- Depth-Based Processing: Control recursion depth (1-100 levels) to limit crawling scope
- Keyword Matching: Intelligently filters pages based on search keyword presence
- Content Persistence: Automatically saves matching content to disk with organized file naming
- Asynchronous Execution: Crawling runs off the UI thread so the interface stays responsive (downloads themselves are processed sequentially; see Performance)
- robots.txt Compliance: Respects website crawling restrictions with caching
- Real-time Progress Bar: Visual feedback on crawling completion percentage
- Error Handling: Robust exception handling and graceful degradation
- Duplicate Detection: Prevents processing of the same URL multiple times
- Detailed Logging: Comprehensive error messages and debugging information
- User-Friendly GUI: Intuitive interface built with Java Swing
```
                DeepFocusCrawler

┌─────────────────────────────────────────────┐
│ GUI Layer (Fenetre.java)                    │
│   - Search Interface                        │
│   - Progress Visualization                  │
│   - Results Display                         │
└──────────────────────┬──────────────────────┘
                       │
┌──────────────────────┴──────────────────────┐
│ Observer/Observable Pattern Bus             │
│   (Event-driven communication layer)        │
└──────────────────────┬──────────────────────┘
                       │
┌──────────────────────┴──────────────────────┐
│ Business Logic Layer (Manager.java)         │
│   - Search Orchestration                    │
│   - Queue Management                        │
│   - URL Validation                          │
└──────────────────────┬──────────────────────┘
                       │
        ┌──────────────┼──────────────────┐
        │              │                  │
  ┌─────┴─────┐ ┌──────┴────────────┐ ┌───┴────────┐
  │ GoogleAPI │ │ Extracteur        │ │ Sauvegarde │
  │ (Search)  │ │ (Parse & Extract) │ │ (Persist)  │
  └───────────┘ └───────────────────┘ └────────────┘
```
```
User Input (Search Query)
          ↓
      [Manager]
          ↓
[GoogleRecherche] → Search Results → URL Queue
          ↓
   [Extracteur]   → Parse HTML → Extract Links & Text
          ↓
   [Sauvegarde]   → Filter & Save Matching Content
          ↓
   [UI Update]    → Display Results & Progress
```
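Conceptually, this pipeline is a breadth-first queue walk with a depth cutoff and a visited set. The sketch below shows that idea only; the class and method names are illustrative and are not the project's actual `Manager`/`Noeud` API.

```java
import java.net.URL;
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Queue;
import java.util.Set;

public class CrawlLoopSketch {

    // Hypothetical stand-in for the crawler's queue node: a URL plus its depth.
    static class Node {
        final URL url;
        final int depth;
        Node(URL url, int depth) { this.url = url; this.depth = depth; }
    }

    // Process the queue breadth-first: skip already-visited URLs and stop
    // following links once maxDepth is reached.
    static void crawl(Queue<Node> queue, int maxDepth) {
        Set<String> visited = new HashSet<>();
        while (!queue.isEmpty()) {
            Node node = queue.poll();
            if (!visited.add(node.url.toString())) {
                continue;                                // duplicate detection
            }
            // ... download the page, match the keyword, save if it matches ...
            if (node.depth < maxDepth) {
                for (URL link : extractLinks(node.url)) {
                    queue.add(new Node(link, node.depth + 1));
                }
            }
        }
    }

    // Placeholder: the real project delegates link extraction to its HTML parser.
    static List<URL> extractLinks(URL page) {
        return new ArrayList<URL>();
    }
}
```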
```
DeepFocusCrawler/
│
├── com/crawl/
│   │
│   ├── vue/
│   │   ├── Main.java             # Application entry point
│   │   └── Fenetre.java          # Main GUI window
│   │
│   ├── manager/
│   │   ├── Manager.java          # Central crawler coordinator
│   │   ├── Langue.java           # Language code mapper
│   │   └── languages             # Language configuration file
│   │
│   ├── downloader/
│   │   ├── Extracteur.java       # HTML parser & link extractor
│   │   ├── Sauvegarde.java       # Content persistence handler
│   │   └── Noeud.java            # Data structure for queue nodes
│   │
│   ├── interfaces/
│   │   └── GoogleRecherche.java  # Google Custom Search API wrapper
│   │
│   └── observer/
│       ├── Observer.java         # Observer interface
│       └── Observable.java       # Observable interface
│
├── README.md                     # This file
├── LICENSE                       # MIT License
├── .gitignore                    # Git ignore rules
└── libraries/
    └── htmlparser.jar            # HTML parsing library
```
Fenetre.java provides the user interface with:
- Search keyword input field
- Language selection dropdown (45+ languages)
- Depth level selector (1-100)
- Download location picker
- Search/Stop buttons
- Real-time progress bar
- Results display table
- Seed URLs table
Implements: Observer pattern to receive crawler updates
Manager.java is the central coordinator, handling:
- Search initialization with Google API
- URL queue management
- Depth validation
- Page type verification (HTML/Text only)
- Observer notification for UI updates
- Queue processing workflow
Key Methods:
- `lancerrecherche()`: initiates the Google search and populates the queue
- `parcourirLaQueue()`: processes all URLs in the queue
- `PageHtmlText()`: validates page MIME types using a regular expression
Implements: Both Observer and Observable patterns
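The page-type check boils down to inspecting the response's Content-Type header before the body is downloaded. Below is a hedged sketch of such a check; the helper is hypothetical and does not reproduce the project's actual `PageHtmlText()` regex.

```java
import java.io.IOException;
import java.net.HttpURLConnection;
import java.net.URL;
import java.util.regex.Pattern;

public class ContentTypeCheck {

    // Accept only HTML or plain-text responses, e.g. "text/html; charset=UTF-8".
    private static final Pattern HTML_OR_TEXT =
            Pattern.compile("^text/(html|plain)\\b.*", Pattern.CASE_INSENSITIVE);

    static boolean isHtmlOrText(URL url) throws IOException {
        HttpURLConnection conn = (HttpURLConnection) url.openConnection();
        conn.setRequestMethod("HEAD");           // ask for headers only, no body
        String contentType = conn.getContentType();
        conn.disconnect();
        return contentType != null && HTML_OR_TEXT.matcher(contentType).matches();
    }
}
```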
Extracteur.java handles page processing:
- Downloads pages via Parser
- Extracts text content using StringBean
- Extracts the page title from `<title>` tags
- Recursively extracts all hyperlinks
- Filters invalid links (JavaScript, anchors, etc.)
- Resolves relative URLs to absolute URLs
- Reads and caches robots.txt restrictions
- Handles URL encoding and special characters
Key Methods:
- `extraireLiens()`: extracts all hyperlinks from a page
- `extraireTexte()`: extracts plain text content
- `extraireTitre()`: extracts the page title
- `siRobotAutorise()`: checks robots.txt compliance (with caching)
- `chainesCorrespondentes()`: keyword matching (case-insensitive)
Dependencies:
- `org.htmlparser` library for HTML parsing
- Custom `Noeud` class for queue nodes
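To illustrate the link filtering and relative-URL resolution steps, here is a rough sketch using only the standard `java.net` API; the helper names are made up for this example and are not the extractor's real methods.

```java
import java.net.MalformedURLException;
import java.net.URL;

public class LinkResolver {

    // Skip javascript:, mailto: and in-page anchors before resolving.
    static boolean isCrawlableHref(String href) {
        String h = href.trim().toLowerCase();
        return !h.isEmpty()
                && !h.startsWith("javascript:")
                && !h.startsWith("mailto:")
                && !h.startsWith("#");
    }

    // Resolve a (possibly relative) href against the page it was found on.
    static URL resolve(URL pageUrl, String href) throws MalformedURLException {
        return new URL(pageUrl, href);   // handles "../path", "/path", "page.html"
    }
}
```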
Sauvegarde.java manages file storage:
- Creates output directories
- Sanitizes filenames (removes invalid characters)
- Handles duplicate titles with numbering
- Serializes content to disk as `.txt` files
- Notifies observers of saved results
File Naming Convention:
```
emplacement/1 - PageTitle.txt
emplacement/2 - AnotherPage.txt
emplacement/3 - PageTitle2.txt   # Duplicate handling
```
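As an illustration of that naming scheme, here is a minimal sketch that sanitizes titles and numbers duplicates; the counter and title map are assumptions for this example, not the class's actual fields.

```java
import java.util.HashMap;
import java.util.Map;

public class FileNamer {

    private int counter = 0;
    private final Map<String, Integer> seenTitles = new HashMap<>();

    // Build "<index> - <Title>.txt", stripping characters that are invalid in
    // file names and numbering repeated titles ("PageTitle", "PageTitle2", ...).
    String nextFileName(String rawTitle) {
        String title = rawTitle.replaceAll("[\\\\/:*?\"<>|]", "").trim();
        int seen = seenTitles.merge(title, 1, Integer::sum);
        if (seen > 1) {
            title = title + seen;
        }
        counter++;
        return counter + " - " + title + ".txt";
    }
}
```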
GoogleRecherche.java integrates with the Google Custom Search API:
- Performs paginated searches (up to 100 results)
- Supports language-specific queries
- Extracts URLs from JSON responses
- Handles HTTP connections with GET requests
API Details:
- Base URL: `https://www.googleapis.com/customsearch/v1`
- Pagination: 10 requests × 10 results = 100 total results
- Language support: via the `lr` parameter
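Pagination against the Custom Search endpoint works by stepping the `start` parameter in increments of 10. The sketch below shows how the ten request URLs could be built; the API key, engine id (`cx`) handling and JSON parsing are omitted, the parameter names follow Google's public documentation, and the helper itself is hypothetical.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

public class SearchUrlBuilder {

    private static final String BASE = "https://www.googleapis.com/customsearch/v1";

    // Build the 10 paginated request URLs (start = 1, 11, ..., 91) needed to
    // collect up to 100 results for one query, restricted to one language.
    static List<String> buildRequestUrls(String apiKey, String engineId,
                                         String query, String langCode)
            throws UnsupportedEncodingException {
        String q = URLEncoder.encode(query, "UTF-8");
        List<String> urls = new ArrayList<String>();
        for (int start = 1; start <= 91; start += 10) {
            urls.add(BASE + "?key=" + apiKey
                    + "&cx=" + engineId
                    + "&q=" + q
                    + "&lr=lang_" + langCode
                    + "&start=" + start);
        }
        return urls;
    }
}
```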
Langue.java provides language code mapping:
- Loads the language list from the `languages` configuration file
- Maps language names to ISO codes
- Supports 45+ languages globally
Example Mappings:

```
English  → en
Français → fr
العربية → ar
中文 → zh
```
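Loading such a mapping amounts to one pass over the configuration file, with one `code,language_name` pair per line as shown in the Configuration section. A minimal sketch follows; the file path and map orientation are assumptions.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.LinkedHashMap;
import java.util.Map;

public class LanguageTable {

    // Parse lines such as "fr,French" into a display-name -> ISO-code map.
    static Map<String, String> load(String path) throws IOException {
        Map<String, String> nameToCode = new LinkedHashMap<>();
        for (String line : Files.readAllLines(Paths.get(path), StandardCharsets.UTF_8)) {
            String[] parts = line.split(",", 2);
            if (parts.length == 2) {
                nameToCode.put(parts[1].trim(), parts[0].trim());
            }
        }
        return nameToCode;
    }
}
```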
The observer package implements an event-driven architecture for loose coupling:
- Observable: notifies observers of state changes
- Observer: receives notifications and updates

Communication Flow:

```
Manager (Observable) → Notify → Fenetre (Observer)
Extracteur (Observer) → Receive URL → Manager
```
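In code, the two interfaces need little more than a registration method and a notify hook. Here is a minimal sketch of the pattern as it is typically written; the method signatures are illustrative, not the project's actual ones.

```java
import java.util.ArrayList;
import java.util.List;

// Receives update notifications (implemented by the GUI, for example).
interface Observer {
    void update(String message);
}

// Maintains a list of observers and pushes state changes to them.
interface Observable {
    void addObserver(Observer o);
    void notifyObservers(String message);
}

// Tiny concrete example of the pattern wired together.
class CrawlerEvents implements Observable {
    private final List<Observer> observers = new ArrayList<>();

    @Override
    public void addObserver(Observer o) {
        observers.add(o);
    }

    @Override
    public void notifyObservers(String message) {
        for (Observer o : observers) {
            o.update(message);
        }
    }
}
```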
Prerequisites:
- Java 8 or higher
- Maven, or direct compilation with `javac` against the bundled `htmlparser.jar`
- Internet connection (for Google API)
Clone the repository:

```bash
git clone https://github.com/khaledkadri/DeepFocusCrawler.git
cd DeepFocusCrawler
```

Build and run with Maven:

```bash
mvn clean compile
mvn exec:java -Dexec.mainClass="com.crawl.vue.Main"
```

Or compile and run directly with the bundled parser on the classpath:

```bash
javac -cp libraries/htmlparser.jar src/com/crawl/**/*.java
java -cp libraries/htmlparser.jar:src com.crawl.vue.Main
```

From a directory containing the compiled classes, the equivalent is:

```bash
java -cp .:libraries/htmlparser.jar com.crawl.vue.Main
```

A GUI window should appear with the search interface.
- Enter Search Keyword: Type the word or phrase you want to find
- Select Language: Choose the language for search filtering
- Set Depth Level: Choose how many link levels to crawl (1-5 recommended)
- Choose Location: Select where to save downloaded content
- Click "Recherche": Start the crawl
| Component | Purpose |
|---|---|
| Search Field | Enter keyword to find |
| Language Dropdown | Filter results by language |
| Depth Slider | Set recursion depth (1-100) |
| Location Field | Choose output directory |
| Search Button | Start crawling |
| Stop Button | Halt current operation |
| Progress Bar | Visual crawling progress |
| Results Table | Display found pages |
| Seed URLs Table | Show starting URLs |
1. Keyword: "python tutorial"
2. Language: English
3. Depth: 3
4. Location: /downloads/python_content
5. Click "Recherche"
Result: The crawler will:
- Search for "python tutorial" with Google
- Download matching pages
- Extract links and crawl 2 more levels deep
- Save all matching content to `/downloads/python_content`
```bash
# Optional: Set Google API key as environment variable
export GOOGLE_API_KEY="your_api_key_here"
```

The `languages` configuration file lists one entry per line in the format `code,language_name`:

```
en,English
fr,French
ar,Arabic
es,Spanish
zh,Chinese
```
All parameters are configurable through the GUI:
- Minimum Depth: 1
- Maximum Depth: 100
- Default Depth: 1 (seeds only)
- Maximum Results: 100 per search
- Default Language: Arabic
Launching the GUI:

```java
// Create and display the main window
Fenetre window = new Fenetre();
window.setLocationRelativeTo(null);
window.setVisible(true);
```

Running a crawl programmatically:

```java
// Initialize crawler
Manager crawler = new Manager();
crawler.init();

// Configure search
crawler.setMot("python tutorial");
crawler.setLangue("English");
crawler.setProfondeur(3);
crawler.setEmplacement("/output/path");

// Start crawling
crawler.lancerrecherche();
crawler.parcourirLaQueue();
```

Extracting content from a page:

```java
// Extract links from a node
ArrayList<URL> links = extracteur.extraireLiens(node);

// Extract page title
String title = extracteur.extraireTitre();

// Extract text content
String text = extracteur.extraireTexte();
```

Saving content:

```java
// Save content to disk
sauvegarde.serialize(textContent, pageTitle, pageURL);
```

| French | English | Location |
|---|---|---|
| `lancerrecherche()` | `startSearch()` | Manager.java |
| `chaineDeRecherche` | `searchKeyword` | Manager.java |
| `parcourirLaQueue()` | `processQueue()` | Manager.java |
| `extraireLiens()` | `extractLinks()` | Extracteur.java |
| `extraireTexte()` | `extractText()` | Extracteur.java |
| `extraireTitre()` | `extractTitle()` | Extracteur.java |
| `profondeur` | `depth` | Multiple files |
| `emplacement` | `location` | Multiple files |
| `listObserver` | `observerList` | Multiple files |
Complete renaming is encouraged!
This crawler respects ethical web crawling practices:
- robots.txt Compliance: Checks and respects `robots.txt` restrictions
- Depth Limiting: Prevents infinite recursion with configurable depth
- Duplicate Detection: Avoids processing the same URL twice
- Content Type Filtering: Only processes HTML and text files
- Caching: Caches robots.txt files to reduce server load
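A cached robots.txt check can be as simple as fetching `/robots.txt` once per host and remembering its `Disallow` prefixes. The sketch below is deliberately simplified (it ignores user-agent groups, `Allow` rules, and wildcards, all of which a complete parser must handle) and does not mirror the project's `siRobotAutorise()` implementation.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class RobotsCache {

    private final Map<String, List<String>> disallowByHost = new HashMap<>();

    // True if no cached Disallow rule is a prefix of the URL's path.
    boolean isAllowed(URL url) {
        List<String> rules = disallowByHost.computeIfAbsent(url.getHost(), this::fetchDisallows);
        String path = url.getPath().isEmpty() ? "/" : url.getPath();
        return rules.stream().noneMatch(path::startsWith);
    }

    // Fetch robots.txt once per host; on failure, assume nothing is disallowed.
    private List<String> fetchDisallows(String host) {
        List<String> rules = new ArrayList<>();
        try {
            URL robots = new URL("https://" + host + "/robots.txt");
            try (BufferedReader in = new BufferedReader(
                    new InputStreamReader(robots.openStream(), StandardCharsets.UTF_8))) {
                String line;
                while ((line = in.readLine()) != null) {
                    if (line.toLowerCase().startsWith("disallow:")) {
                        String rule = line.substring("disallow:".length()).trim();
                        if (!rule.isEmpty()) {
                            rules.add(rule);
                        }
                    }
                }
            }
        } catch (IOException e) {
            // Unreachable robots.txt: treat as permissive in this sketch.
        }
        return rules;
    }
}
```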
To use this crawler responsibly:
- Add Request Delays:

  ```java
  Thread.sleep(2000 + new Random().nextInt(3000)); // 2-5 seconds
  ```

- Set Proper User-Agent:

  ```java
  conn.setRequestProperty("User-Agent",
      "DeepFocusCrawler/1.0 (+https://github.com/khaledkadri/DeepFocusCrawler)");
  ```

- Handle HTTP Errors:

  ```java
  if (conn.getResponseCode() == 429) {
      // Too Many Requests - wait before retrying
  }
  ```

- Respect Crawl-Delay:

  ```java
  // Read and implement Crawl-delay from robots.txt
  ```

This tool is intended for educational and research purposes only. Users are responsible for:
- Complying with website Terms of Service
- Respecting copyright and intellectual property rights
- Following local laws regarding data collection
- GDPR and privacy regulation compliance
- Obtaining necessary permissions before crawling
Contributions are welcome! Please follow these steps:
- Fork the repository
- Create a feature branch (`git checkout -b feature/AmazingFeature`)
- Commit your changes (`git commit -m 'Add AmazingFeature'`)
- Push to the branch (`git push origin feature/AmazingFeature`)
- Open a Pull Request
- Follow Java naming conventions (camelCase for variables/methods)
- Use meaningful variable names
- Add JavaDoc comments for public methods
- Include error handling for all I/O operations
- Add request delays (2-5 seconds) between requests
- Implement proper Crawl-delay parsing from robots.txt
- Add HTTP status code error handling (429, 503, etc.)
- Support for Sitemap.xml parsing
- Meta robots tag support
- Connection timeout configuration
- Exponential backoff for retries
- Database support for large-scale crawling
- Distributed crawling support
- Add multithreading
| Metric | Value |
|---|---|
| Initial Google Search | < 2 seconds |
| Page Download | 1-5 seconds |
| HTML Parsing | < 1 second per page |
| Link Extraction | < 1 second per page |
| Storage I/O | < 500ms per file |
- Memory: 50-200 MB (depends on page sizes)
- CPU: Low (single thread with waiting)
- Network: 1-10 Mbps (depends on content)
- Disk: Variable (1KB-10MB per page)
- Current Limit: ~1000 URLs per crawl session
- Depth Support: Up to 100 levels
- Concurrent Downloads: 1 (sequential processing)
- Language Support: 45+ languages
Cause: Parser initialization failed
Solution:

```java
if (parser == null) {
    System.err.println("Parser initialization failed");
    return "";
}
```

Cause: Page is not HTML or plain text
Solution: This is normal. The crawler skips non-HTML files.
Possible Causes:
- Keyword too specific
- Depth level too low (increase to 2-3)
- Language filter too restrictive
- Website doesn't match Google search results
Solution:

```java
// Add delay between requests
Thread.sleep(3000);
```

Cause: Website forbids crawling in robots.txt
Solution:
- Respect the restrictions (ethical crawling)
- Contact website owner for permission
- Use different search parameters
This project is licensed under the MIT License - see the LICENSE file for complete details.
- htmlparser - Licensed under LGPL
- Java Swing - Licensed under Oracle Binary Code License
Copyright (c) 2013 Khaled Kadri
Licensed under MIT License
https://github.com/khaledkadri/DeepFocusCrawler
- Author: Khaled Kadri
- GitHub: @khaledkadri
- Project: DeepFocusCrawler
- Check the Issues page
- Start a Discussion
- Review the Troubleshooting section
- Total Classes: 8
- Total Methods: 40+
- Lines of Code: ~2000
- Comments: ~500
- Supported Languages: 45+
- Max Crawl Depth: 100
- Google Custom Search API
- htmlparser library community
- Java Swing framework
- All contributors and testers
Status: Active Development
Happy Crawling! 🕷️