Better regex for finding proxies + deduplication of proxies based on exit IP (#765)
gabearro wants to merge 6 commits into monosans:main
Conversation
@monosans could you trigger this workflow please

@monosans do you understand what is not passing in the newly failed rustfmt? It doesn't seem like there are actually syntax errors, unless I'm blind.
Pull Request Overview
This PR improves proxy discovery and deduplication by enhancing the regex pattern for proxy detection and implementing deduplication based on exit IP addresses. The changes address issues with proxy identification and prevent duplicate proxies that use different endpoints but share the same exit server.
- Enhanced regex pattern to support more flexible proxy URI formats with expanded character sets for usernames/passwords
- Added deduplication logic to remove proxies that share the same exit IP address within the same protocol
- Minor code organization improvements
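The exit-IP deduplication described above can be sketched as a self-contained program. Note that `ProxyType` and `Proxy` here are simplified stand-ins for illustration, not the project's actual type definitions:

```rust
use std::collections::HashSet;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum ProxyType {
    Http,
    Socks5,
}

struct Proxy {
    protocol: ProxyType,
    exit_ip: Option<String>,
}

/// Keep only the first proxy seen for each (protocol, exit IP) pair;
/// proxies whose exit IP is unknown are always kept.
fn dedup_by_exit_ip(proxies: Vec<Proxy>) -> Vec<Proxy> {
    let mut seen: HashSet<(ProxyType, String)> = HashSet::new();
    let mut deduped = Vec::new();
    for p in proxies {
        if let Some(ip) = &p.exit_ip {
            // insert() returns false if the key was already present
            if !seen.insert((p.protocol, ip.clone())) {
                continue;
            }
        }
        deduped.push(p);
    }
    deduped
}

fn main() {
    let proxies = vec![
        Proxy { protocol: ProxyType::Http, exit_ip: Some("1.2.3.4".into()) },
        Proxy { protocol: ProxyType::Http, exit_ip: Some("1.2.3.4".into()) }, // same exit server: dropped
        Proxy { protocol: ProxyType::Socks5, exit_ip: Some("1.2.3.4".into()) }, // different protocol: kept
        Proxy { protocol: ProxyType::Http, exit_ip: None }, // unknown exit IP: kept
    ];
    println!("{}", dedup_by_exit_ip(proxies).len()); // prints 3
}
```

Keying on `(protocol, exit IP)` rather than the IP alone means an HTTP proxy and a SOCKS5 proxy sharing an exit server are both retained, matching the "within the same protocol" behavior described above.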
Reviewed Changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| src/parsers.rs | Updated proxy regex pattern to support more characters in credentials and removed trailing whitespace |
| src/output.rs | Added exit IP-based deduplication logic and reorganized imports |
```diff
 pub static PROXY_REGEX: LazyLock<fancy_regex::Regex> = LazyLock::new(|| {
-    let pattern = r"(?:^|[^0-9A-Za-z])(?:(?P<protocol>https?|socks[45]):\/\/)?(?:(?P<username>[0-9A-Za-z]{1,64}):(?P<password>[0-9A-Za-z]{1,64})@)?(?P<host>[A-Za-z][\-\.A-Za-z]{0,251}[A-Za-z]|[A-Za-z]|(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3}):(?P<port>[0-9]|[1-9][0-9]{1,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])(?=[^0-9A-Za-z]|$)";
+    let pattern = r"(?:^|[^0-9A-Za-z])(?:(?P<protocol>https?|socks[45]):\/\/)?(?:(?P<username>[0-9A-Za-z._~\-]{1,256}):(?P<password>[0-9A-Za-z._~\-]{1,256})@)?(?P<host>[A-Za-z][\-\.A-Za-z]{0,251}[A-Za-z]|[A-Za-z]|(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])(?:\.(?:[0-9]|[1-9][0-9]|1[0-9]{2}|2[0-4][0-9]|25[0-5])){3}):(?P<port>[0-9]|[1-9][0-9]{1,3}|[1-5][0-9]{4}|6[0-4][0-9]{3}|65[0-4][0-9]{2}|655[0-2][0-9]|6553[0-5])(?=[^0-9A-Za-z]|$)";
```
In regex character classes, a hyphen must be escaped or placed at the beginning/end of the class to avoid being interpreted as a range operator. The class `[0-9A-Za-z._~\-]` escapes the hyphen, so it is already valid; it could equivalently be written as `[0-9A-Za-z._~-]`, with the unescaped hyphen positioned at the end of the class.
```rust
let mut seen: std::collections::HashSet<(ProxyType, String)> =
    std::collections::HashSet::new();
```
[nitpick] The deduplication logic spells out the full `std::collections::HashSet` path at each use instead of importing it. For consistency with the existing codebase, which imports `HashMap` at the top of the file, consider adding a `use std::collections::HashSet;` import.
```rust
for p in proxies {
    if let Some(ip) = &p.exit_ip {
        let key = (p.protocol, ip.clone());
        if !seen.insert(key) {
            continue;
        }
    }
    deduped.push(p);
}
```
The deduplication logic clones the IP string for each proxy when creating the key. To avoid the allocations, one could change the set's type to `HashSet<(ProxyType, &str)>` and use `ip.as_str()` instead of `ip.clone()` in the key. Note, however, that the borrow would conflict with moving `p` into `deduped` while `seen` still holds a reference into it, so that suggestion does not compile as written; the clone, or a `Copy` representation of the IP, may be simpler.
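One allocation-free alternative is to parse exit IPs into `std::net::IpAddr`, which is `Copy`, so the key can be built without cloning a `String`. A sketch under that assumption (the simplified types here are illustrative, not the project's actual ones):

```rust
use std::collections::HashSet;
use std::net::IpAddr;

#[derive(Clone, Copy, PartialEq, Eq, Hash)]
enum ProxyType {
    Http,
    Socks5,
}

struct Proxy {
    protocol: ProxyType,
    exit_ip: Option<IpAddr>,
}

// Because IpAddr is Copy, building the key performs no heap allocation,
// and no borrow of `p` outlives the move into `deduped`.
fn dedup_by_exit_ip(proxies: Vec<Proxy>) -> Vec<Proxy> {
    let mut seen: HashSet<(ProxyType, IpAddr)> = HashSet::new();
    let mut deduped = Vec::new();
    for p in proxies {
        if let Some(ip) = p.exit_ip {
            if !seen.insert((p.protocol, ip)) {
                continue;
            }
        }
        deduped.push(p);
    }
    deduped
}

fn main() {
    let a: IpAddr = "1.2.3.4".parse().unwrap();
    let proxies = vec![
        Proxy { protocol: ProxyType::Http, exit_ip: Some(a) },
        Proxy { protocol: ProxyType::Http, exit_ip: Some(a) }, // duplicate exit: dropped
        Proxy { protocol: ProxyType::Socks5, exit_ip: Some(a) }, // other protocol: kept
    ];
    println!("{}", dedup_by_exit_ip(proxies).len()); // prints 2
}
```

This also gets some validation for free: a scraped "exit IP" that fails to parse as an address would be rejected upstream instead of being carried around as an arbitrary string.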
The imports are not sorted, and one comment line exceeds 80 characters.
Changes:
- Deduplication based on the exit IP: proxy servers often have different endpoint IPs but actually exit through the same server.
- Improved the regex because the scraper/checker was failing to identify my proxies.