
FIX Thanks to @imirkin #2

Open
Pelirrojo wants to merge 1 commit into hechmik:master from Machine-Learning-Labs:master

Conversation

@Pelirrojo

Thanks @imirkin, with your fix the lambda function is working fine.
I've just run it with Python 3.7 (MiniConda, and it runs smoothly). @huertaj2

@rodri270

I'm having an issue where the scraper only pulls down the first 200 games and formats things a little oddly. Also, it's not pulling down any of the regional sales, only total sales.

Example output:

Rank,Name,Genre,Platform,Publisher,Developer,Vgchartz_Score,Critic_Score,User_Score,Total_Shipped,Total_Sales,NA_Sales,PAL_Sales,JP_Sales,Other_Sales,Release_Date,Last_Update
1,Tetris,,Series,Nintendo,Alexey Pajitnov,,,,496400000.0,,,,,,1989-07-31,2020-02-27 00:00:00
2,Pokemon,,Series,Nintendo,Game Freak,,,,402220000.0,,,,,,1998-09-28,2020-02-03 00:00:00
3,Call of Duty,,Series,Activision,Infinity Ward,,,,400000000.0,,,,,,2003-10-29,2020-02-03 00:00:00
4,Super Mario,,Series,Nintendo,Nintendo,,,,391450000.0,,,,,,1983-07-20,2020-02-20 00:00:00
5,Grand Theft Auto,,Series,Rockstar Games,Rockstar North,,,,370000000.0,,,,,,1998-03-27,2020-02-03 00:00:00
6,FIFA,,Series,EA Sports,Extended Play Productions (1991-1997),,,,325000000.0,,,,,,1993-12-15,2020-02-03 00:00:00

@Pelirrojo
Author

Hi @rodri270

I'm speaking from memory, but I seem to recall there is a JSON configuration file where you can set the request size and paging (number of results × number of pages) and the fields that are retrieved. Have a look at: https://github.com/hechmik/vgchartzScrape/blob/master/cfg/resources.json

The output format is CSV, the same as the original software, but since the code uses pandas it should be easy to export it in another format.
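Since the data already lives in a pandas DataFrame, re-exporting it in another format is a one-liner. A minimal sketch (the sample rows are copied from the output above; file names and the reduced column set are illustrative, not the scraper's actual code):

```python
import pandas as pd

# Tiny sample in the scraper's CSV schema (values from the output above).
df = pd.DataFrame(
    [
        {"Rank": 1, "Name": "Tetris", "Publisher": "Nintendo", "Total_Shipped": 496400000.0},
        {"Rank": 2, "Name": "Pokemon", "Publisher": "Nintendo", "Total_Shipped": 402220000.0},
    ]
)

# pandas can re-export the same data in other formats:
df.to_csv("vgsales.csv", index=False)         # the original CSV output
df.to_json("vgsales.json", orient="records")  # JSON instead
```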

@rodri270

I appreciate the heads up, thanks a lot. I was able to find that and make the changes from there. Now the issue I'm having is that I'm getting "Unexpected error: (<class 'urllib.error.HTTPError'>, <HTTPError 429: 'Too Many Requests'>, <traceback object at 0x118b2d500>)", so I just gotta keep testing. Thanks again for the fast reply!

@Pelirrojo
Author

429 is an HTTP status code ("Too Many Requests"): by running very large batches of requests over time we are probably affecting the server's capacity and/or performance, which is why it returns this warning.

To work around it, you can introduce some artificial latency between requests inside the loop, so you don't saturate the server and don't receive the error.

Another option is to introduce this delay/sleep only when the error is caught, in the exception handler, momentarily stopping traffic when the problem occurs.

If you end up implementing either option, you can work on this same repository and include it in this same PR.
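Both ideas can be sketched in a few lines with the same `urllib` module that appears in the error message. This is only an illustration (the function name and retry parameters are my own, not from the repo):

```python
import time
import urllib.error
import urllib.request


def fetch_with_backoff(url, max_retries=5, base_delay=2.0):
    """Fetch a URL, sleeping and retrying when the server answers 429."""
    for attempt in range(max_retries):
        try:
            with urllib.request.urlopen(url) as resp:
                return resp.read()
        except urllib.error.HTTPError as err:
            if err.code != 429:
                raise  # a different HTTP error: let it propagate
            # Exponential backoff: 2 s, 4 s, 8 s, ...
            delay = base_delay * (2 ** attempt)
            print(f"HTTP 429, sleeping {delay:.0f}s before retrying")
            time.sleep(delay)
    raise RuntimeError(f"Still rate-limited after {max_retries} retries")


# For the first option (fixed latency inside the scraping loop),
# a plain time.sleep(1.0) between consecutive page requests is enough.
```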
