Manually resolve t.co Card URL if guesswork fails #981
aidanharris wants to merge 1 commit into JustAnotherArchivist:master
In some cases the t.co URL cannot be translated and a warning is printed. We can try to follow the link ourselves.
        u = self._head(card.url)
        assert u.status_code >= 300 and u.status_code < 400
        card.url = u.headers["location"]
    except:
Bare excepts aren't a great idea (they catch everything, including Ctrl-C). Do you just want to catch AssertionError? If so, why not just use an if statement?
> do you just want to catch AssertionError?
AssertionError or any exception thrown from requests. (I think it's better to continue with the t.co URL and log the warning than to crash.)
In that case I'd either specify the exceptions or catch Exception (rather than the default BaseException, which catches EVERYTHING, including Ctrl-C):
    except (AssertionError, WhateverElseYouWantToHandle, ...):
        ...
or:
    except Exception:
        ...
I'd recommend the former so you don't stifle valid errors.
It might also be good to logging.debug the actual exception.
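A minimal sketch of what the suggestions above could look like together, not the PR's actual code: an if statement instead of the assert, an explicit tuple of handled exceptions instead of a bare except, and the exception logged at debug level. The function name `resolve_redirect`, the `head` callable (standing in for the scraper's own HEAD helper), and the `handled` parameter are assumptions made for illustration.

```python
import logging
from types import SimpleNamespace

_logger = logging.getLogger(__name__)


def resolve_redirect(url, head, handled=(OSError,)):
    """Return the redirect target of url, or None if it cannot be resolved.

    head is a hypothetical callable taking a URL and returning an object
    with status_code and headers attributes.
    """
    try:
        response = head(url)
    except handled as e:
        # Only the listed exception types are swallowed; KeyboardInterrupt
        # and unrelated bugs still propagate to the caller.
        _logger.debug('HEAD request for %s failed: %r', url, e)
        return None
    # A plain if instead of assert: the check also survives `python -O`.
    if not 300 <= response.status_code < 400:
        _logger.debug('Unexpected status %s for %s', response.status_code, url)
        return None
    return response.headers.get('location')


# Usage with a fake HEAD helper, so no network access is needed.
def fake_head(url):
    return SimpleNamespace(status_code=301, headers={'location': 'https://example.com/'})


print(resolve_redirect('https://t.co/abc', fake_head))
```

With requests, `handled` could be `(requests.RequestException,)`; the caller would keep the t.co URL and log the warning whenever `None` comes back.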
else:
    _logger.warning(f'Could not translate t.co card URL on tweet {tweetId}')
    try:
        u = self._head(card.url)
Won't this slow down the scraper significantly?
> Won't this slow down the scraper significantly?
It would, but I don't think it'd matter much. HEAD requests are very fast. Would it be better if this were opt-in?
Most tweets I scraped could have the t.co URL translated fine; I didn't hit the warning often.
This is probably the big thing to figure out here first. (I'll have other remarks later.)
It should definitely be configurable. I'm not sure whether it should be on or off by default, though I'm leaning towards off (i.e. opt-in).
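One way the opt-in behaviour might be wired up, as a rough sketch only: the extra request is made solely when a flag is set, and the flag defaults to off. The class `CardURLTranslator`, the `resolve_unknown` flag, and the `resolver` callable are hypothetical names, not part of the scraper.

```python
import logging

_logger = logging.getLogger(__name__)


class CardURLTranslator:
    """Translate t.co card URLs, optionally falling back to a network lookup."""

    def __init__(self, resolver, resolve_unknown=False):
        # resolver: hypothetical callable mapping a URL to its target or None.
        # resolve_unknown: opt-in switch; off by default so scraping speed
        # is unchanged unless the user asks for the extra requests.
        self._resolver = resolver
        self._resolve_unknown = resolve_unknown

    def translate(self, url, known):
        # known: mapping of t.co URLs to expanded URLs taken from the tweet data.
        if url in known:
            return known[url]
        if self._resolve_unknown:
            resolved = self._resolver(url)
            if resolved:
                return resolved
        _logger.warning('Could not translate t.co card URL %s', url)
        return url


# Usage: default (off) keeps the t.co URL, opt-in follows the link.
lookup = lambda url: 'https://example.com/'
print(CardURLTranslator(lookup).translate('https://t.co/abc', {}))
print(CardURLTranslator(lookup, resolve_unknown=True).translate('https://t.co/abc', {}))
```

Keeping the default off means the fallback request only costs time for users who explicitly enable it.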