How To Scrape Amazon Best Seller Products Using Python and BeautifulSoup?
Scrape product information for the most rated and popular products from Amazon’s best seller listings. Fetch product details such as bestseller rank, ratings, review counts, product name, pricing, images, and many other data points from Amazon.com.
Introduction
Today, we will learn how to scrape Amazon Best Seller products using Python and BeautifulSoup in a simple, approachable way.
The objective of this blog is to help you start solving real-world problems while keeping things as simple as possible, so that you can understand the approach and get practical results quickly.
First, make sure Python 3 is installed. If it is not, download and install Python 3 before continuing.
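You can confirm which version is installed from a terminal:
python3 --version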
Installation
You can install BeautifulSoup with:
pip3 install beautifulsoup4
To fetch the data, parse it as HTML/XML, and apply CSS selectors, we will also need the requests, lxml, and soupsieve libraries. Install them with:
pip3 install requests soupsieve lxml
After they are installed, open a text editor and add the imports:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests
Now, visit the Amazon Best Sellers listing page and look at the information you can extract from it:
Code
Now, let us get back to our script. We will fetch the page while pretending to be a regular browser, by sending a browser-like User-Agent header:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.amazon.in/gp/bestsellers/garden/ref=zg_bs_nav_0/258-0752277-9771203'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')
You can save this code as scrapeAmazonBS.py.
If you execute the script:
python3 scrapeAmazonBS.py
You’ll be able to see the entire HTML page.
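Before moving on, it is worth checking that the request actually succeeded, since Amazon often returns an error page when it suspects a bot. A minimal check, appended to the script above, might look like this:

# Continuation of scrapeAmazonBS.py: confirm the fetch worked before parsing further.
print(response.status_code)  # expect 200; a 503 usually means Amazon blocked the request
if soup.title:
    print(soup.title.get_text().strip())  # the page title of the best-sellers page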
Let us use CSS selectors to get the information we are looking for. To do so, return to Chrome and open the inspect tool.
All the individual product information is contained in elements with the class ‘zg-item-immersion’. With the CSS selector ‘.zg-item-immersion’ we can extract each product block. Here is how the code looks:
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.amazon.in/gp/bestsellers/garden/ref=zg_bs_nav_0/258-0752277-9771203'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.zg-item-immersion'):
    try:
        print('----------------------------------------')
        print(item)
    except Exception as e:
        # raise e
        print('')
This will print the full HTML for each of the product elements.
We can now select the sub-elements within these rows that hold the data we need.
# -*- coding: utf-8 -*-
from bs4 import BeautifulSoup
import requests

headers = {'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_2) AppleWebKit/601.3.9 (KHTML, like Gecko) Version/9.0.2 Safari/601.3.9'}
url = 'https://www.amazon.in/gp/bestsellers/garden/ref=zg_bs_nav_0/258-0752277-9771203'
response = requests.get(url, headers=headers)
soup = BeautifulSoup(response.content, 'lxml')

for item in soup.select('.zg-item-immersion'):
    try:
        print('----------------------------------------')
        print(item.select('.p13n-sc-truncate')[0].get_text().strip())  # product name
        print(item.select('.p13n-sc-price')[0].get_text().strip())     # price
        print(item.select('.a-icon-row i')[0].get_text().strip())      # star rating
        print(item.select('.a-icon-row a')[1].get_text().strip())      # review count
        print(item.select('.a-icon-row a')[1]['href'])                 # product link
        print(item.select('img')[0]['src'])                            # image URL
    except Exception as e:
        # raise e
        print('')
If you run it, the script prints these details for each product: the name, price, star rating, review count, product URL, and image URL.
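If you would rather store the results than print them, here is a minimal sketch that collects each product into a dictionary and writes everything to a CSV file. It reuses the soup object from the script above; the file name and field names are our own choices, not part of the original script:

import csv

products = []
for item in soup.select('.zg-item-immersion'):
    try:
        products.append({
            'name': item.select('.p13n-sc-truncate')[0].get_text().strip(),
            'price': item.select('.p13n-sc-price')[0].get_text().strip(),
            'rating': item.select('.a-icon-row i')[0].get_text().strip(),
            'reviews': item.select('.a-icon-row a')[1].get_text().strip(),
            'url': item.select('.a-icon-row a')[1]['href'],
            'image': item.select('img')[0]['src'],
        })
    except Exception:
        # Skip items that are missing one of the fields.
        continue

with open('bestsellers.csv', 'w', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price', 'rating', 'reviews', 'url', 'image'])
    writer.writeheader()
    writer.writerows(products)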
If you want to use this in production and scale to hundreds of requests, you’ll discover that Amazon quickly blocks your IP address. In that case, using a rotating proxy service to cycle through IPs is more or less a must. You can route your requests through a pool of thousands of residential proxies using a service like Proxies API.
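With the requests library, routing a call through a proxy only takes a proxies argument. The proxy URL below is a hypothetical placeholder; substitute whatever endpoint your proxy provider gives you:

# A hypothetical proxy endpoint -- replace with your provider's real URL and credentials.
proxies = {
    'http': 'http://user:password@proxy.example.com:8080',
    'https': 'http://user:password@proxy.example.com:8080',
}
response = requests.get(url, headers=headers, proxies=proxies)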
If you want to crawl faster but would rather not build the infrastructure yourself, you can contact our experts at 3i Data Scraping to crawl thousands of URLs quickly using Python and BeautifulSoup.
Mention your requirements and ask for a quote!
What Will We Do Next?
- Our representative will contact you within 24 hours.
- We will collect all the necessary requirements from you.
- The team of analysts and developers will prepare an estimate.
- We maintain confidentiality with all our clients by signing an NDA.