fix goodreads parsing #3

Open
DerBeutlin wants to merge 1 commit into goderich:master from DerBeutlin:fix/goodreads

@DerBeutlin

Parsing a book from Goodreads fails for me with the error "<URL> not understood".

I used git bisect to identify 2dab795 as the first bad commit and reverted part of it, which fixed the problem for me.

I also tested LibraryThing with my version and it seems to work as well.

Unfortunately, the unit tests are broken for me, so I couldn't verify the functionality.

Feel free to close if inappropriate.

Revert partly Add LibraryThing series scraping capability (2dab795)
@goderich
Owner

Hi @DerBeutlin ! Thanks for the PR!

I'm surprised Goodreads works for you at all. I haven't been able to use it for a while. (That's actually why I added the LibraryThing and OpenLibrary scrapers.) They seem to have deliberately changed their website to make it harder to scrape. Half the time I get no response from enlive-fetch at all, and the other half I get unparseable junk. I just tried again, using both Elisp and Python, and couldn't get it to work. The HTML I get using enlive/requests is not what I see in the browser.
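
One rough way to check whether a fetched page is the "junk" variant rather than the page the browser shows is to test whether the expected metadata is present at all. The following Python sketch does this on an HTML string that has already been fetched, using only the standard library. The meta keys "og:title" and "books:author" are assumptions chosen for illustration, not confirmed Goodreads markup, and missing_fields is a hypothetical helper, not part of this package.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect the <title> text and <meta> name/property -> content pairs."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.meta = {}
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "title":
            self._in_title = True
        elif tag == "meta":
            # Metadata may be keyed by either "property" or "name".
            key = attrs.get("property") or attrs.get("name")
            if key and attrs.get("content"):
                self.meta[key] = attrs["content"]

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data


def missing_fields(html, expected=("og:title", "books:author")):
    """Return the expected meta keys (hypothetical names) absent from html."""
    parser = MetaCollector()
    parser.feed(html)
    return [key for key in expected if key not in parser.meta]
```

If missing_fields returns every expected key for a page that shows full details in the browser, the server most likely returned different HTML to the script than to the browser.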

The code that you reverted was changed to allow a whole series to be parsed from LibraryThing at once, using just the series link. At this point it seems like a choice between Goodreads and LibraryThing, but Goodreads doesn't work for me.

Unless of course you can see a better way to handle scraping a series. Mine is pretty hacky, I'll admit, but hey, it works.

@goderich
Owner

Hi @DerBeutlin ! I've recently refactored a lot of Goodreads parsing stuff. Does the new version work for you, or do you get the same error? (Or maybe a different one?)

@DerBeutlin
Author

I no longer get an error. However, apart from the title, nothing gets parsed: the author field, for example, is empty, and no additional details are parsed. I'm not sure if this is expected.

@goderich
Owner

Not really, no. The problem is, Goodreads is very difficult to parse. They appear to either change the structure of their site periodically, or else use other shenanigans that make reliable scraping frustrating.
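
A common way to cope with a site whose structure changes is to try several extraction patterns in order and take the first one that matches. The Python sketch below illustrates the idea for an author field. Both patterns, the class name "author-name", and the helper first_match are hypothetical and do not reflect the actual Goodreads markup or this package's code. Regex matching on HTML is fragile, so this is only meant to show the fallback logic.

```python
import re

# Hypothetical patterns, ordered from most to least preferred.
AUTHOR_PATTERNS = [
    re.compile(r'<meta\s+property="books:author"\s+content="([^"]+)"'),
    re.compile(r'<span\s+class="author-name"[^>]*>([^<]+)</span>'),
]


def first_match(html, patterns):
    """Return the first captured group from the first matching pattern, or None."""
    for pattern in patterns:
        match = pattern.search(html)
        if match:
            return match.group(1).strip()
    return None
```

A None result signals that every known layout failed, which is easier to report to the user than a silently empty author field.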

I'm considering dropping Goodreads support from my fork entirely because trying to keep up seems an exercise in futility at this point.


2 participants