-
-
Notifications
You must be signed in to change notification settings - Fork 353
Resolve relative srcset URLs
#509
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
normalize_attributes always does a copy when the list is non-empty, so there's no need to copy again.
for more information, see https://pre-commit.ci
|
Hi Tom! I don't want to add to feedparser's custom HTML sanitizing code, as feedparser needs to contract its responsibilities down to just parsing feeds to become maintainable again. Instead, I'd prefer to see a PR that eliminates feedparser's HTML sanitizing code and replaces it with a Python library that is dedicated to that purpose. You may remember #257, where I thought that bleach might be a package that could achieve this goal. Could html-sanitizer be an alternative? |
|
I am skeptical of html-sanitizer — it depends on lxml which is ultimately backed by libxml2 and libxslt. Those projects are in a rough spot right now so I'd hesitate to move in that direction. (Also, it depends on beautifulsoup4. Perhaps that library has improved on its molasses-slow predecessors but I'd benchmark before adopting it.) On my own project I'm still using html5lib + lxml for sanitization and manipulation, but I've had my eye out for alternatives. I haven't seen anything else as flexible yet. For sanitization alone, I've used nh3 in the past. This is backed by the Rust ammonia and html5ever libraries, so less likely to blow your hand off than C libxml2. Last I checked (admittedly, a few years ago) html5ever also passed more of the html5lib-tests than html5lib itself, so more compliant. Caveat: html5ever is part of the Servo project, which while revived in 2023 has had fallow periods. It still seems safer for feedparser's users than lxml. I think that the main risk in adopting For this PR, I didn't want to touch sanitization at all (again, I don't use it!) but it is applied by default to the test vectors, so it would drop the |
|
@kurtmckee I wonder if JustHTML could be a viable alternative? As the maintainer I would be happy to work with you and improve the API if there's something missing. |
|
I'd be happy to help with any HTML parser transition, but I don't understand how to address the objection to the incidental sanitizer change here. At its core, this PR is about URL resolution. Is that in-scope for the feedparser project going forward? |
I stumbled across a feed with an
<img src="..." srcset="...">where thesrcwas busted, but thesrcsetis good, so here we are: this teaches the relative URL resolver how to resolve relative URLs within thesrcsetattribute.I also added
srcsetto the list of allowed attributes in the sanitizer so that the test would work. There's a caveat to this: the width descriptor form ofsrcsetrequires thesizesattribute too, but that attribute isn't allowed by the sanitizer (and it contains CSS media queries, so that'd be nontrivial to add). A browser will ignore thesrcsetin that case, so I think the end result is status quo.I wasn't sure how many tests to add, or exactly where to put them. Please advise!