Python Beautifulsoup Web Scraping





The Internet evolves fast, and modern websites often use dynamic content loading mechanisms to provide the best user experience. On the other hand, this makes it harder to extract data from such web pages, as it requires executing the page's internal JavaScript while scraping. Let's review several conventional techniques that allow data extraction from dynamic websites using Python.

What is a dynamic website?#


A dynamic website is a type of website that can update or load content after the initial HTML load. The browser receives basic HTML with JavaScript and then loads the content using the received JavaScript code. Such an approach increases page load speed and prevents reloading the same layout each time you'd like to open a new page.

Usually, dynamic websites use AJAX to load content dynamically, or even the whole site is based on a Single-Page Application (SPA) technology.

In contrast to dynamic websites, static websites contain all of the requested content at page load.

A great example of a static website is example.com:

The whole content of this website is loaded as plain HTML during the initial page load.
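For instance, a quick sketch of fetching it with requests shows that the final content is already present in the raw HTML (the heading lookup is just for illustration):

import requests
from bs4 import BeautifulSoup

# Fetch the static page - no JavaScript execution is needed
html = requests.get('https://example.com').text
soup = BeautifulSoup(html, 'html.parser')
print(soup.h1.get_text())  # the heading is already in the initial HTML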

To demonstrate the basic idea of a dynamic website, we can create a web page that contains dynamically rendered text. It will not make any request to fetch information; it will just render different HTML after the page loads:

<html>
<head>
<script>
window.addEventListener('DOMContentLoaded', function () {
  document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt';
});
</script>
</head>
<body>
<div id="test">Web Scraping is hard</div>
</body>
</html>

All we have here is an HTML file with a single <div> in the body that contains the text "Web Scraping is hard", but after the page load, that text is replaced with the text generated by the JavaScript:

window.addEventListener('DOMContentLoaded', function () {
  document.getElementById('test').innerHTML = 'I ❤️ ScrapingAnt';
});

To prove this, let's open this page in the browser and observe the dynamically replaced text:

Alright, so the browser displays the text, and HTML tags wrap it.
Can't we use BeautifulSoup or LXML to parse it? Let's find out.

Extract data from a dynamic web page#

BeautifulSoup is one of the most popular Python libraries across the Internet for HTML parsing. Almost 80% of web scraping Python tutorials use this library to extract required content from the HTML.

Let's use BeautifulSoup for extracting the text inside <div> from our sample above.

import os
from bs4 import BeautifulSoup
test_file = open(os.getcwd() + '/test.html')
soup = BeautifulSoup(test_file, 'html.parser')
print(soup.find(id='test').get_text())

This code snippet uses the os library to open our test HTML file (test.html) from the local directory and creates an instance of BeautifulSoup stored in the soup variable. Using the soup object, we find the tag with id test and extract its text.

In the screenshot from the first part of the article, we've seen that the content of the test page is I ❤️ ScrapingAnt, but the code snippet output is the following: Web Scraping is hard

And the result is different from what we expected (unless you've already figured out what is going on there). Everything is correct from the BeautifulSoup perspective: it parsed the data from the provided HTML file, but we want to get the same result as the browser renders. The reason is the dynamic JavaScript, which has not been executed during HTML parsing.

We need the HTML to be run in a browser to see the correct values and then be able to capture those values programmatically.

Below you can find four different ways to execute dynamic website's Javascript and provide valid data for an HTML parser: Selenium, Pyppeteer, Playwright, and Web Scraping API.

Selenium: web scraping with a webdriver#

Selenium is one of the most popular web browser automation tools for Python. It allows communication with different web browsers by using a special connector - a webdriver.

To use Selenium with Chrome/Chromium, we'll need to download the webdriver from the repository and place it into the project folder. Don't forget to install Selenium itself by executing:
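pip install selenium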

Selenium instantiation and the scraping flow are the following:

  • define and set up the Chrome path variable
  • define and set up the Chrome webdriver path variable
  • define browser launch arguments (to use headless mode, proxy, etc.)
  • instantiate a webdriver with the options defined above
  • load a webpage via the instantiated webdriver

From the code perspective, it looks like the following:

import os
from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

opts = Options()
# opts.add_argument('--headless')  # Uncomment if the headless version is needed
# Set the location of the Chrome binary
opts.binary_location = '<path to Chrome executable>'
# Set the location of the webdriver
chrome_driver = os.getcwd() + '/<Chrome webdriver filename>'
# Instantiate a webdriver with the options defined above
driver = webdriver.Chrome(options=opts, executable_path=chrome_driver)
# Load the local HTML page
driver.get('file://' + os.getcwd() + '/test.html')
# Pass the rendered page source to BeautifulSoup and extract the text
soup = BeautifulSoup(driver.page_source, 'html.parser')
print(soup.find(id='test').get_text())
driver.quit()

And finally, we'll receive the required result: I ❤️ ScrapingAnt

Selenium usage for dynamic website scraping with Python is not complicated and allows you to choose a specific browser and version, but it consists of several moving components that should be maintained. The code itself contains some boilerplate parts like the setup of the browser, the webdriver, etc.

I like to use Selenium for my web scraping projects, but you can find easier ways to extract data from dynamic web pages below.

Pyppeteer: Python headless Chrome#

Pyppeteer is an unofficial Python port of Puppeteer, the JavaScript (headless) Chrome/Chromium browser automation library. It is capable of doing mostly the same things Puppeteer can, but using Python instead of NodeJS.

Puppeteer is a high-level API to control headless Chrome, so it allows you to automate actions you would otherwise do manually with the browser: copy a page's text, download images, save a page as HTML or PDF, etc.

To install Pyppeteer you can execute the following command:
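pip install pyppeteer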

The usage of Pyppeteer for our needs is much simpler than Selenium:

import asyncio
import os
from bs4 import BeautifulSoup
from pyppeteer import launch

async def main():
    # Launch the browser and open a new page
    browser = await launch()
    page = await browser.newPage()
    # Create a URI for our test file and open it
    page_path = 'file://' + os.getcwd() + '/test.html'
    await page.goto(page_path)
    # Extract the rendered HTML and parse it with BeautifulSoup
    page_content = await page.content()
    soup = BeautifulSoup(page_content, 'html.parser')
    print(soup.find(id='test').get_text())
    await browser.close()
asyncio.get_event_loop().run_until_complete(main())

I've tried to comment on every atomic part of the code for a better understanding. However, generally, we've just opened a browser page, loaded a local HTML file into it, and extracted the final rendered HTML for further BeautifulSoup processing.

As we can expect, the result is the following: I ❤️ ScrapingAnt

We did it again without worrying about finding, downloading, and connecting a webdriver to a browser. Though, Pyppeteer looks abandoned and not properly maintained. This situation may change in the near future, but I'd suggest looking at a more powerful library.

Playwright: Chromium, Firefox and Webkit browser automation#

Playwright can be considered an extended Puppeteer, as it allows using more browser types (Chromium, Firefox, and WebKit) to automate modern web app testing and scraping. You can use the Playwright API in JavaScript & TypeScript, Python, C#, and Java. And it's excellent, as the original Playwright maintainers support Python.

The API is almost the same as for Pyppeteer, but it has both sync and async versions.

Installation is simple as always:

pip install playwright
playwright install

Let's rewrite the previous example using Playwright.

import os
from bs4 import BeautifulSoup
from playwright.sync_api import sync_playwright

with sync_playwright() as p:
    browser = p.chromium.launch()
    # Open a new browser page
    page = browser.new_page()
    # Create a URI for our test file
    page_path = 'file://' + os.getcwd() + '/test.html'
    # Open our test file in the opened page
    page.goto(page_path)
    page_content = page.content()
    # Process extracted content with BeautifulSoup
    soup = BeautifulSoup(page_content, 'html.parser')
    print(soup.find(id='test').get_text())
    # Close browser
    browser.close()

As a good tradition, we can observe our beloved output: I ❤️ ScrapingAnt

We've gone through several different data extraction methods with Python, but is there any more straightforward way to implement this job? How can we scale our solution and scrape data with several threads?

Meet the web scraping API!

Web Scraping API#

The ScrapingAnt web scraping API provides the ability to scrape dynamic websites with only a single API call. It already handles headless Chrome and rotating proxies, so the response provided will already contain the JavaScript-rendered content. ScrapingAnt's proxy pool prevents blocking and provides a constant and high data extraction success rate.

Usage of a web scraping API is the simplest option and requires only basic programming skills.

You do not need to maintain the browser, library, proxies, webdrivers, or any other aspect of a web scraper, and you can focus on the most exciting part of the work: data analysis.

As the web scraping API runs on the cloud servers, we have to serve our file somewhere to test it. I've created a repository with a single file: https://github.com/kami4ka/dynamic-website-example/blob/main/index.html

To check it out as HTML, we can use another great tool: HTMLPreview

The final test URL to scrape dynamic web data looks like the following: http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html

The scraping code itself is the simplest one across all four described libraries. We'll use the ScrapingAnt client library to access the web scraping API.

Let's install it first:
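pip install scrapingant-client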

And use the installed library:

from bs4 import BeautifulSoup
from scrapingant_client import ScrapingAntClient

# Define the URL with dynamic web content
url = 'http://htmlpreview.github.io/?https://github.com/kami4ka/dynamic-website-example/blob/main/index.html'
# Create a ScrapingAntClient instance
client = ScrapingAntClient(token='<YOUR-SCRAPINGANT-API-TOKEN>')
# Get the rendered HTML page content
page_content = client.general_request(url).content
# Parse content with BeautifulSoup
soup = BeautifulSoup(page_content, 'html.parser')
print(soup.find(id='test').get_text())

To get your API token, please visit the Login page to authorize in the ScrapingAnt user panel. It's free.

And the result is still the required one.

All the headless browser magic happens in the cloud, so you only need to make an API call to get the result.

Check out the documentation for more info about ScrapingAnt API.

Summary#

Today we've checked four free tools that allow scraping dynamic websites with Python. All these libraries use a headless browser (or an API with a headless browser) under the hood to correctly render the internal JavaScript inside an HTML page. Check out each tool's documentation to find out more and choose the handiest one.

Happy web scraping, and don't forget to use proxies to avoid blocking 🚀

Web scraping python beautifulsoup tutorial with example

Web scraping Python BeautifulSoup tutorial with example: The data present on web pages is unstructured, and web scraping helps to collect it and store it in a structured form. There are many ways of scraping websites and online services. One is to use the API of the website; for example, Facebook has the Facebook Graph API, which allows retrieval of data posted on Facebook. Another is to access the HTML of the webpage and extract useful data from it. This technique is called web scraping, web harvesting, or web data extraction.

Steps involved in web scraping with Python BeautifulSoup:-

  1. Send a request to the URL of the webpage you want to access.
  2. The server will respond to the request by returning the HTML content of the webpage.
  3. After accessing the HTML content, we are left with the task of parsing the data.
  4. For this, we need a parser that creates a nested, navigable tree structure of the HTML data, which we can search (see the sketch after this list).
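Put together, these steps look roughly like the following sketch (the URL is only a placeholder; the real code is developed in the sections below):

import requests
from bs4 import BeautifulSoup

# 1-2. Send a request and receive the HTML content of the webpage
r = requests.get('http://www.example.com')
# 3. Parse the HTML content
soup = BeautifulSoup(r.content, 'html5lib')
# 4. Navigate and search the parse tree
print(soup.title.get_text())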

Installing the required third-party libraries:-

The easiest way to install a library in Python is to use pip, which is used to install and manage packages in Python.

pip install requests
pip install html5lib
pip install bs4

Then access the HTML content from the webpage:-

import requests

URL = "http://www.geeksforgeeks.org/data-structures/"
r = requests.get(URL)
print(r.content)

  1. The first step is to import the requests library and specify the URL of the webpage you want to scrape.
  2. Then send an HTTP request to the URL and save the response from the server in a response object called r.
  3. Finally, print r.content to get the raw HTML content of the webpage.

Parse HTML content:-

import requests
from bs4 import BeautifulSoup

URL = "http://www.values.com/inspirational-quotes"
r = requests.get(URL)
soup = BeautifulSoup(r.content, 'html5lib')
print(soup.prettify())

BeautifulSoup is built on top of HTML parsing libraries such as html.parser and lxml, and you specify the parser library when creating the soup object, as in:

soup = BeautifulSoup(r.content, 'html5lib')

In the example above, BeautifulSoup(r.content, 'html5lib') creates a soup object by passing two arguments:
r.content:- the raw HTML content.
html5lib:- the parser library we want to use.

Libraries used for web scraping with Python BeautifulSoup:-


We will use the following libraries:

  1. Selenium: - a web testing library used to automate browser activities.
  2. BeautifulSoup: - a Python package for parsing HTML and XML documents; it creates parse trees that are helpful for extracting data easily.
  3. Pandas: - a library used for data manipulation and analysis, and also to extract data and store it in the desired format.

Automated web scraping can be used to speed up the data collection process.
You can write your code once, and it will get the information you want many times and from many pages.
If you try to get the information manually, you have to spend a lot of time clicking, scrolling, and searching, especially if you need large amounts of data from websites that are regularly updated with new content.
Manual web scraping can take a lot of time and repetition.
There is a huge amount of information on the Web, and new information is constantly added.
Python's Beautiful Soup and requests libraries are both powerful tools for the job.
If you like to learn with hands-on examples, you will need a basic understanding of Python and HTML.
Web scraping is the process of gathering information from the Internet; it extracts the data and presents it in a format you can easily make sense of.

HTML tags:-

<!DOCTYPE html>
<html>
<head>
</head>
<body>
<h1>First scraping</h1>
<p>Hello World</p>
</body>
</html>
1. <!DOCTYPE html>: starts the document with a type declaration.
2. The HTML document is contained between <html> and </html>.
3. The script and meta declarations of the HTML document are between <head> and </head>.
4. The HTML document contains the visible part between the <body> and </body> tags.
5. Title headings are defined with the <h1> through <h6> tags.
6. Paragraphs are defined with the <p> tag.
Other useful tags include <a> for hyperlinks, <table> for tables, <tr> for table rows, and <td> for table columns.
HTML tags sometimes come with id or class attributes.
The id attribute specifies a unique id for an HTML tag, and the value must be unique within the HTML document.
The class attribute is used to define tags with the same class.
We make use of these ids and classes to help us locate the data we want.
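For example, here is a minimal sketch of locating elements by id and by class with Beautiful Soup (the HTML snippet, id, and class names are made up purely for illustration):

from bs4 import BeautifulSoup

html = '<div id="price" class="product-price">$499</div>'
soup = BeautifulSoup(html, 'html.parser')
# Locate an element by its unique id
print(soup.find(id='price').get_text())
# Locate elements by their class
print(soup.find_all('div', class_='product-price'))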

The rules for scraping:-

We have to read a website's Terms and Conditions before we scrape it, and be careful to read the statements about the legal use of data; the data should not be used for commercial purposes.
Do not request data from the website too aggressively with your program, as this may break the website. The layout may change from time to time, so make sure to revisit the site and rewrite your code as needed.

Scraping the Flipkart Website:-

Find the URL that you want to scrape
We are going to scrape the Flipkart website to extract the Price, Name, and Rating of laptops.
The URL for this page is https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2.
Inspecting the page
The data is usually nested in tags, so we inspect the page to see under which tag the data we want to scrape is nested.
To inspect the page, just right-click on the element and click on "Inspect".

The next step is that you will see a "Browser Inspector Box" open.
Find the data you want to extract
Then extract the Price, Name, and Rating, which are nested in "div" tags.

Web scraping python beautifulsoup Example:-

Import the libraries:

from selenium import webdriver
from bs4 import BeautifulSoup
import pandas as pd

For configuration:-

driver = webdriver.Chrome("/usr/lib/chromium-browser/chromedriver")
products = []
prices = []
ratings = []
driver.get("https://www.flipkart.com/laptops/~buyback-guarantee-on-laptops-/pr?sid=6bo%2Cb5g&uniqBStoreParam1=val1&wid=11.productCard.PMU_V2")

The code is as follows:-

content = driver.page_source
soup = BeautifulSoup(content, 'html.parser')
for a in soup.find_all('a', href=True, attrs={'class': '_31qSD5'}):
    name = a.find('div', attrs={'class': '_3wU53n'})
    price = a.find('div', attrs={'class': '_1vC4OE _2rQ-NK'})
    rating = a.find('div', attrs={'class': 'hGSR34 _2beYZw'})
    products.append(name.text)
    prices.append(price.text)
    ratings.append(rating.text)

Run the code and extract the data

To run the code, use the below command:
python web-s.py
Store the data in the required format:-
df = pd.DataFrame({'Product Name': products, 'Price': prices, 'Rating': ratings})
df.to_csv('products.csv', index=False, encoding='utf-8')

APIs: An Alternative to Web Scraping:-

The Web has grown out of many sources and combines a ton of different technologies, styles, and personalities.
APIs (application programming interfaces) allow you to access data in a predefined manner.
You can avoid parsing HTML and instead access the data directly in a structured format such as JSON.
HTML is primarily a way to visually present content to users.
The process is also more stable than gathering the data through web scraping, because APIs are made to be consumed by programs rather than by human eyes.
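As a quick sketch of what consuming an API can look like (the endpoint and field names here are hypothetical, for illustration only):

import requests

# Hypothetical JSON API endpoint - structured data, no HTML parsing needed
response = requests.get('https://api.example.com/jobs', params={'q': 'software-developer'})
data = response.json()
for job in data.get('results', []):
    print(job.get('title'))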
Scraping the Monster Job Site:-
You will build a web scraper that fetches Software Developer job listings from the Monster job aggregator site.
The web scraper will parse the HTML to pick out the pieces of information and filter the content for specific words.
Inspect Your Data Source:-
Click through the site and interact with it just like any normal user would.
In this example, you could search for Software Developer jobs in Australia using the site's native search interface:

Query parameters generally consist of three things, as shown in the example below:-

  1. Start: - The start of the query parameters is denoted by a question mark (?).
  2. Information: - The pieces of information constituting one query parameter are encoded in key-value pairs, where related keys and values are joined together by an equals sign (key=value).
  3. Separator: - Every URL can have multiple query parameters, which are separated from each other by an ampersand (&).
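For example, the Monster search URL used later in this tutorial ends with ?q=software-developer&where=Australia. With requests, you can let the library build that query string for you; a minimal sketch (assuming those two parameter names) looks like this:

import requests

# q and where become the query parameters: ?q=software-developer&where=Australia
response = requests.get(
    'https://www.monster.com/jobs/search/',
    params={'q': 'software-developer', 'where': 'Australia'},
)
print(response.url)  # the full URL with the encoded query string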

Hidden Websites:-
Some information is hidden behind a login, and you need an authenticated session to see it on the page.
The HTTP request made from a Python script is different from accessing the page from your browser.
Some advanced techniques are needed with the requests library to access content behind the login.
Dynamic Websites:-
Static websites are easier to work with because the server sends you an HTML page that already contains all the information as a response.
You can then parse the HTML response with Beautiful Soup and begin to pick out the relevant data.
With a dynamic website, the server might not send back any HTML at all; instead, you may receive JavaScript code as a response.

Parse HTML Code with Beautiful Soup:-

pip3 install beautifulsoup4

After that, import the library and create a Beautiful Soup object:

import requests
from bs4 import BeautifulSoup

URL = 'https://www.monster.com/jobs/search/?q=software-developer&where=Australia'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

Find the URL you want to scrape:-

You might scrape the web, for example, to find speeches by famous politicians, scrape the text of those speeches, and analyze how often they approach certain topics or use certain phrases.
Before you start scraping a site, check the rules of the website first.
The rules can be found in the robots.txt file, which is located by adding a /robots.txt path to the main domain of the site.
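As a quick sketch, you can fetch and read a site's robots.txt with requests (example.com is used here purely for illustration):

import requests

# The robots.txt file lives at the root of the domain
print(requests.get('https://example.com/robots.txt').text)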

Identify the structure of the site's HTML:-


After finding a site to scrape, use Chrome's developer tools to inspect the site's HTML structure.
This is important because, more often than not, you will want to scrape data from certain HTML elements, or from elements with specific classes or IDs.
Using the inspect tool, you can identify which elements you need to target.

Install Beautiful Soup and Requests:-


There are other packages and frameworks, like Scrapy, but Beautiful Soup will allow you to parse the HTML.
Along with Beautiful Soup, we need to install the Requests library, which will fetch the URL content.
The Beautiful Soup documentation has a lot of examples to help get you started as well.
$ pip install requests
$ pip install beautifulsoup4

Web Scraping Code:-
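The original code listing isn't reproduced here, but a minimal sketch of what it could look like (reusing the soup object created above) is:

# Find every <p> element and print only its inner text
for paragraph in soup.find_all('p'):
    print(paragraph.text)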


Results:-


This finds all of the <p> elements in the HTML.
The text attribute allows selecting only the text from inside all of the <p> elements.

The raw result can be messy, so filtering it using Beautiful Soup's text attribute gives us a cleaner return.
There are other ways to search, filter, and isolate the results you want from the HTML.
You can also be more specific, finding an element with a specific class, as in the sketch below.
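A minimal sketch of such a lookup (again assuming a soup object already exists) might be:

cool_paragraphs = soup.find_all('div', class_='cool_paragraph')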

This would find all the <div> elements with the class "cool_paragraph".