
Small exercise: scrape information from an Amazon product page


wizioo/amazon-scraper


Amazon Scraper

Topic

You must write a web scraper to extract information from an Amazon product page.

Instructions

Remarks

For this exercise, I used cURL and DOMDocument. I was not familiar with DOMDocument, but I chose it because its way of searching a document is close to jQuery's. I had never written a web scraper before, although I knew the general principle, so I had to do some research online. I found out that the Amazon Product API would have been the better choice: it is certainly easier to use and far less likely to change than the page markup. The API is also the legal route, as Amazon fights "brute force" scraping. But since I was asked for a web scraper, I built one with cURL.

Details

  • Product Manufacturer: I used the product brand, because "manufacturer" only appeared in the product information for some products, so the brand seemed the more reliable field to fetch.
  • Product images: Not all images are loaded by default, so I fetch the image thumbnails and remove their thumbnail pattern (SS40.). I used a regular expression to find them, since all product images seemed to be hosted under https://images-na.ssl-images-amazon.com/images/I/. I could also have used the DOM to narrow the search. (I lost a lot of time before realizing that not all images are loaded by default...) :(
  • Prices: I look in the "prices" element block for struck-through text to get the regular price. The special price is fetched by element ID.
  • Product Infos: I take the rows of the table in the "Product Information" section, which I think is what you are expecting. As a result, "Warranty & support" and "Feedback" are not included even though they appear in that section. The values should still be cleaned, especially the customer review. I didn't do that, but I would use a regular expression to extract the number of stars and/or reviews.
  • OffersList: This is an array of Offer objects. I used XPath to fetch the information I needed.
    • Offer Condition: I keep the exact title of the condition (not only USED or NEW), since the condition can sometimes be "refurbished" (and it was faster for me).
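
To illustrate two of the regular-expression steps described above (turning thumbnail URLs into full-size image URLs, and the review cleanup I did not implement), here is a minimal sketch. It is written in Python for brevity and is not the code of this repository, which uses cURL and DOMDocument. The exact thumbnail marker (assumed here to look like `._SS40_.`), the review wording ("4.5 out of 5 stars", "1,234 customer reviews"), and the function names are assumptions made for the example.

```python
import re

# Image host mentioned in the notes above.
IMAGE_BASE = "https://images-na.ssl-images-amazon.com/images/I/"

# Assumption: a thumbnail URL carries a size marker such as "._SS40_."
# between the image ID and the file extension.
THUMB_URL = re.compile(re.escape(IMAGE_BASE) + r"[\w%+-]+\._?SS\d+_?\.\w+")
THUMB_MARKER = re.compile(r"\._?SS\d+_?\.")


def full_size_image_urls(html):
    """Find thumbnail URLs in raw HTML and strip their size marker."""
    urls = []
    for thumb in THUMB_URL.findall(html):
        full = THUMB_MARKER.sub(".", thumb, count=1)
        if full not in urls:  # keep first occurrence, preserve order
            urls.append(full)
    return urls


def parse_review_summary(text):
    """Extract (stars, review_count) from an assumed review summary string.

    Either value is None when the corresponding wording is not found.
    """
    stars = re.search(r"(\d+(?:\.\d+)?) out of 5 stars", text)
    count = re.search(r"(\d[\d,]*) customer reviews?", text)
    return (
        float(stars.group(1)) if stars else None,
        int(count.group(1).replace(",", "")) if count else None,
    )


if __name__ == "__main__":
    sample = (
        '<img src="' + IMAGE_BASE + '41abc._SS40_.jpg">'
        '<img src="' + IMAGE_BASE + '51def._SS40_.jpg">'
    )
    print(full_size_image_urls(sample))
    print(parse_review_summary("4.5 out of 5 stars 1,234 customer reviews"))
```

Matching on the raw HTML keeps the sketch short, but restricting the search to the image block through the DOM first, as noted above, would reduce the risk of picking up images from unrelated parts of the page.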
