How to Use Web Scraping with Selenium and BeautifulSoup for Dynamic Pages?
Web scraping can be defined as:
“The creation of an agent for downloading, parsing, and organizing data from the web in an automated manner.”
In other words, instead of a human clicking through pages in a web browser and copy-pasting the interesting parts into a spreadsheet, web scraping hands the job to a computer program that can do it much faster, and more accurately, than a person can.
Web scraping is an important data-collection technique in data science.
Why Is Python a Good Language for Web Scraping?
Python has one of the richest ecosystems for web scraping. Many languages have libraries that help with scraping, but Python’s are among the most mature and full-featured.
A few Python libraries used for web scraping include:
- BeautifulSoup
- lxml
- Requests
- Scrapy
- Selenium
In this post, we use Selenium and BeautifulSoup to extract reviews from a TripAdvisor review page.
Why Use Selenium as Well? Isn’t BeautifulSoup Sufficient on Its Own?
In many cases, web scraping with Python needs nothing more than BeautifulSoup. BeautifulSoup is a powerful library that makes it easy to extract data by navigating the DOM (Document Object Model). However, it only does static scraping. Static scraping ignores JavaScript: it fetches the page from the server without the help of a browser, so you get exactly what you see in “view page source” and then slice and dice it. If the data you are looking for is present in “view page source”, you don’t need to go any further. However, if the data you need sits in components that are rendered only after clicking JavaScript links, dynamic scraping comes to the rescue. The combination of Selenium and BeautifulSoup gets the dynamic scraping job done. Selenium automates web browser interaction from Python. The data hidden behind JavaScript links can therefore be made available by automating the button clicks with Selenium and then scraped with BeautifulSoup.
Installation
Install BeautifulSoup and Selenium, plus lxml, the parser used by the BeautifulSoup code further down:
    pip install beautifulsoup4 selenium lxml
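Once the packages are installed, a quick way to check BeautifulSoup and to see the static-scraping limitation for yourself is the small sketch below. The HTML is hand-written for illustration (the "review" id and the text are assumptions, not TripAdvisor markup), and it uses Python's built-in html.parser so it runs even without lxml. BeautifulSoup parses the markup but never runs the script, so only the truncated text is visible.

```python
from bs4 import BeautifulSoup

# Hand-written page: the full review text only exists inside a script
# that a browser would run when "More" is clicked.
STATIC_HTML = """
<html><body>
  <p id="review">Great flight, but...</p>
  <span class="moreLink">More</span>
  <script>
    function expand() {
      document.getElementById("review").textContent =
        "Great flight, but the seats were cramped.";
    }
  </script>
</body></html>
"""

# BeautifulSoup only parses the markup; it does not execute the script.
soup = BeautifulSoup(STATIC_HTML, "html.parser")
visible_text = soup.find("p", id="review").get_text()
print(visible_text)  # Great flight, but...
```

Getting at the expanded text would need a real browser to run the script, which is the part Selenium takes care of.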
Selenium for JavaScript Link Buttons
First, we use Selenium to automate the button clicks needed to render the hidden but useful data. On TripAdvisor review pages, longer reviews are only partially available in the rendered DOM. They become fully available after clicking the “More” button. So we automate clicking all the “More” buttons with Selenium.
Selenium Needs to Use the Browser’s Driver
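    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    options = webdriver.ChromeOptions()
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--incognito')
    options.add_argument('--headless')
    # Selenium 4 takes the driver path via Service and the options via options=;
    # the older positional path and chrome_options= arguments were removed.
    driver = webdriver.Chrome(service=Service("/usr/lib/chromium-browser/chromedriver"), options=options)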
Here, Selenium drives Chrome through its browser driver, in incognito mode and without opening a browser window (that is what the --headless argument does).
Open the TripAdvisor Review Page and Click the Relevant Buttons
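    import time
    from selenium.webdriver.common.by import By

    driver.get("https://www.tripadvisor.com/Airline_Review-d8729157-Reviews-Spirit-Airlines#REVIEWS")
    # find_elements_by_class_name was removed in Selenium 4; use find_elements with By.
    more_buttons = driver.find_elements(By.CLASS_NAME, "moreLink")
    for button in more_buttons:
        if button.is_displayed():
            # Click through JavaScript, then pause so the expanded text can render.
            driver.execute_script("arguments[0].click();", button)
            time.sleep(1)
    page_source = driver.page_source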
Here, the Selenium web driver navigates the DOM of the TripAdvisor review page and finds all the “More” buttons. It then iterates over them and clicks each one that is displayed. Once the “More” buttons have been clicked, the reviews that were only partially available become fully available.
After that, Selenium hands the manipulated page source over to BeautifulSoup.
BeautifulSoup to Extract Data
The page source received from Selenium now contains the complete reviews.
    from bs4 import BeautifulSoup

    soup = BeautifulSoup(page_source, 'lxml')
    reviews = []
    reviews_selector = soup.find_all('div', class_='reviewSelector')
    for review_selector in reviews_selector:
        # Prefer the expanded review; fall back to the basic one.
        review_div = review_selector.find('div', class_='dyn_full_review')
        if review_div is None:
            review_div = review_selector.find('div', class_='basic_review')
        review = review_div.find('div', class_='entry').find('p').get_text()
        review = review.strip()
        reviews.append(review)
At this point, BeautifulSoup parses the page source and scrapes the review texts by iterating over the review divs. The logic in this code is specific to TripAdvisor review pages and will differ with the HTML structure of other pages. For later use, you can write the scraped reviews to a file.
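To try the extraction logic without a browser, the sketch below runs it on a small hand-written HTML fragment. The markup only imitates the class names used above and is not real TripAdvisor output. The function name extract_reviews is our own, and it uses Python's built-in html.parser so it runs without lxml. As a defensive assumption, it also skips review blocks where neither layout is found instead of raising an error.

```python
from bs4 import BeautifulSoup

# Hand-written sample markup; it only imitates the class names used above.
SAMPLE_HTML = """
<div class="reviewSelector">
  <div class="dyn_full_review"><div class="entry"><p> Full text after clicking More. </p></div></div>
</div>
<div class="reviewSelector">
  <div class="basic_review"><div class="entry"><p>Short review.</p></div></div>
</div>
<div class="reviewSelector"><div class="something_else"></div></div>
"""


def extract_reviews(html):
    """Return the stripped review texts found in the given HTML string."""
    soup = BeautifulSoup(html, "html.parser")
    reviews = []
    for block in soup.find_all("div", class_="reviewSelector"):
        # Prefer the expanded review; fall back to the basic one.
        review_div = block.find("div", class_="dyn_full_review")
        if review_div is None:
            review_div = block.find("div", class_="basic_review")
        if review_div is None:
            continue  # neither layout matched; skip instead of raising
        entry = review_div.find("div", class_="entry")
        paragraph = entry.find("p") if entry is not None else None
        if paragraph is not None:
            reviews.append(paragraph.get_text().strip())
    return reviews


print(extract_reviews(SAMPLE_HTML))
# ['Full text after clicking More.', 'Short review.']
```

The third block in the sample has neither review class, so it is skipped and only two reviews come back.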
In Practice
We scraped the reviews from a TripAdvisor review page and wrote them to a file.
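The file-writing step is not shown in the code above, so here is one possible way to do it, using only the standard library. The function names, the file name reviews.jsonl, and the JSON Lines format (one JSON string per line) are our own choices for this sketch. JSON Lines is used so that a review containing line breaks still occupies exactly one line in the file.

```python
import json
import tempfile
from pathlib import Path


def save_reviews(reviews, path):
    """Write reviews to path as JSON Lines (one JSON string per line); return the count."""
    with open(path, "w", encoding="utf-8") as handle:
        for review in reviews:
            # json.dumps escapes embedded newlines, so each review stays on one line.
            handle.write(json.dumps(review, ensure_ascii=False) + "\n")
    return len(reviews)


def load_reviews(path):
    """Read back a file written by save_reviews."""
    with open(path, encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]


# Example with placeholder reviews written to a temporary directory.
sample = ["Overall an excellent flight.", "Line one.\nLine two."]
with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "reviews.jsonl"
    save_reviews(sample, target)
    assert load_reviews(target) == sample
```

In the scraping script, the call would simply be save_reviews(reviews, "reviews.jsonl") after the BeautifulSoup loop.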
Here are the reviews that we scraped from one TripAdvisor page.
JOKE of an airline. You act like you have such low fares, then turn around and charge people for EVERYTHING you could possibly think of. $65 for carry on, a joke. No seating assignments without an upcharge for newlyweds, a joke. Charge a veteran for a carry on, a f***ing joke. Personally, I will never fly spirit again, and I’ll gladly tell everyone I know the kind of company this airline is. No room, no amenities, nothing. A bunch of penny pinchers, who could give two sh**s about the customers. Take my flight miles and shove them, I won’t be using them with this pathetic a** airline again.
My first travel experience with NK. Checked in on the mobile app and printed the boarding pass at the airport kiosk. My fare was $30.29 for a confirmed ticket. I declined all the extras as I would when renting a car. No, no, no and no. My small backpack passed the free item test as a personal item. I was a bit thirsty so I purchased a cold bottle of water in flight for $3.00 but I brought my own snacks. The plane pushed off the gate in Las Vegas on time and arrived in Dallas early. Overall an excellent flight.
Original flight was at 3:53pm and now the most recent time in 9:28pm. Have waisted an entire day on the airport. Worst airline. I have had the same thing happen in the past were it feels like the are trying to combine two flights to make more money. If I would have know it would have taken this long I would have booked a different airline without a doubt.
Made a bad weather flight great. Bumpy weather but they got the beverage and snack service done in style
Flew Spirit January 23rd and January 26th (flights 1672 from MCO to CMH and 1673 CMH to MCO). IF you plan accordingly you will have a good flight. We made sure our bag was correct, and checked in online. I do think the fees are ridiculous and aren't needed. $10 to check in at the terminal? Really.. That's dumb in my opinion. Frontier does not do that, and they are a no frill airline (pay for extras). I will say the crew members were very nice, and there was decent leg room. We had the Airbus A320. Not sure if I'd fly again because I prefer Frontier Airlines, but Spirit wasn't bad for a quick flight. If you get the right price on it, I would recommend it... just prepare accordingly, and get your bags early. Print your boarding pass at home!
worst flight i have ever been on. the rear cabin flight attendents were the worst i have sever seen. rude, no help. the seats are the most cramped i have every seen. i looked up the seat pitch is the smallest in the airline industry. 28" delta and most other arilines are 32" plus. maybe ok for a short hop but not for a 3 or 4 hour flight no free water or anything. a manwas trying to get settle in with his kids and asked the male flight attendent for some help with luggage in the overhead andthe male flight attendent just said put your bags in the bin and offered no assitance. my son got up and help the manget the kidscarryons put away
I was told incorrect information by the flight counter representative which costed me over $450 i did not have. I spoke with numerous customer service reps who were all very rude and unhelpful. It is not fair for the customer to have to pay the price for being told incorrect information.
We got a great price on this flight. Unfortunately, we were going on a cruise and had to take luggage. By the time we added our luggage and seats the price more than doubled.
Fun crew. Very friendly and happy--from the tag your bag kiosk to the ticket desk to the flight crew--everyone was exceptionally happy to help and friendly. We find this to be true of the many Spirit flights we've taken.
Not impressed with the Spirit check-in staff at either airport. Very rude and just not inviting. The seats were very comfortable and roomy on my first flight in the exit row. On the way back there was very little cushion and narrow seats. The flight attendants and pilots were respectful, direct, and welcoming. Overall would fly Spirit again, but please improve airport staff at check-in.
Conclusion
BeautifulSoup is a powerful tool for web scraping. However, when JavaScript comes into play and hides content, Selenium and BeautifulSoup together get the scraping job done. Selenium can also be used to navigate to the next page. You could also use Scrapy or another scraping tool instead of BeautifulSoup. Finally, once the data is collected, you can feed it into your data science work.
If you have any queries, you can contact 3i Data Scraping and if you want any web scraping services, ask for a free quote!