Scrape Google Search Results using Python BeautifulSoup


Hello, readers! In this article, we will learn how to scrape Google Search results using BeautifulSoup in Python.

Along the way, we will look at one of the most interesting concepts in Python: scraping a website.

So, let us begin!


What is Web Scraping?

At times, when we surf the web, we come across data that we believe will be useful to us in the future. We then copy it and save it by hand, one piece at a time.

Now, let's consider another scenario.

We often need data to analyze how certain factors behave when building a data model. So we end up creating a dataset from scratch by copying and pasting the data.

This is where Web Scraping, or Web Crawling, comes into the picture.

Web Scraping is an easy way to automate the repetitive task of copying and pasting data from websites. With web scraping, we can crawl through websites, then save and represent the necessary data in a customized format.

Let us now understand the working of Web Scraping in the next section.


How Does Web Scraping Work?

Let us understand how Web Scraping works through the steps below:

  • First, we write a piece of code that sends a request to the server hosting the website we want to crawl or the information we want to scrape.
  • Like a browser, the code then downloads the source code of the webpage.
  • Finally, instead of rendering the page the way a browser does, we filter the values based on the HTML tags and scrape only the information we need, in a customized manner.

In this way, we can load the source code of a webpage and extract data from it quickly and in a customized manner.
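
The three steps above can be sketched in a few lines. This is a minimal illustration, not the only way to do it: the function names download_page and filter_by_tag are hypothetical, the sample HTML is made up, and Python's built-in 'html.parser' is assumed so that no extra parser needs to be installed.

```python
import requests
from bs4 import BeautifulSoup


def download_page(url):
    """Steps 1 and 2: request the page and return its HTML source."""
    response = requests.get(url, timeout=10)
    response.raise_for_status()  # stop early on HTTP errors such as 404
    return response.text


def filter_by_tag(html, tag):
    """Step 3: keep only the text inside the given HTML tag."""
    soup = BeautifulSoup(html, 'html.parser')
    return [element.get_text(strip=True) for element in soup.find_all(tag)]


# Made-up HTML standing in for a downloaded page, so the sketch runs offline
sample_html = """
<html><body>
  <h3>First heading</h3>
  <p>Some paragraph text.</p>
  <h3>Second heading</h3>
</body></html>
"""

headings = filter_by_tag(sample_html, 'h3')
print(headings)
```

For a live page, the string returned by download_page(url) would take the place of sample_html.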

Let us now try to implement Web Scraping in the upcoming section.


Bulk Scraping APIs

If you are looking to build a service that scrapes search results in bulk, chances are high that Google will block you because of the unusually high number of requests. In that case, an online API such as Zenserp can be a big help.

Zenserp performs searches through various IPs and proxies and allows you to focus on your logic rather than on infrastructure. It also supports image search, shopping search, reverse image search, trends, and so on. You can try it out by running any search and inspecting the JSON response.


Implementing steps to Scrape Google Search results using BeautifulSoup

We will use BeautifulSoup to scrape Google Search results here.

BeautifulSoup is a Python library that lets us parse HTML and XML documents, such as downloaded webpages, and extract data from them.
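
As a quick illustration of what BeautifulSoup offers, the sketch below parses a small made-up HTML string (not a real page) and reads a tag's text, a tag's attribute, and a list of matching tags. It uses the built-in 'html.parser' rather than 'lxml' only to avoid an extra install; that choice is an assumption of this sketch.

```python
from bs4 import BeautifulSoup

# Made-up document used purely for illustration
html_doc = """
<html>
  <head><title>Sample page</title></head>
  <body>
    <a href="https://example.com/one">One</a>
    <a href="https://example.com/two">Two</a>
  </body>
</html>
"""

soup = BeautifulSoup(html_doc, 'html.parser')

# Text of the first <title> tag
page_title = soup.find('title').string

# find() returns the first match; attributes are read like dictionary keys
first_link = soup.find('a')['href']

# find_all() returns every match
link_texts = [a.get_text() for a in soup.find_all('a')]

print(page_title)
print(first_link)
print(link_texts)
```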


Scrape Google Search results for Customized search

Example 1:

import requests
from bs4 import BeautifulSoup
import random

# Search term to look up on Google
text = 'python'
url = 'https://google.com/search?q=' + text

# Pool of browser User-Agent strings to choose from
A = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

# Pick one User-Agent at random for this request
Agent = random.choice(A)

headers = {'user-agent': Agent}
# Send a GET request and download the results page
r = requests.get(url, headers=headers)

# Parse the HTML with the lxml parser (requires the lxml package)
soup = BeautifulSoup(r.text, 'lxml')
# Print the text of every <h3> element on the page
for info in soup.find_all('h3'):
    print(info.text)
    print('#######')

Line-by-line explanation of the above code:

1. Importing the necessary libraries: To use BeautifulSoup for scraping, we need to import the library as shown below:

from bs4 import BeautifulSoup

Further, we need the Python requests library to download the webpage. The requests module sends a GET request to the server, which lets us download the HTML contents of the required webpage.

import requests

2. Setting the URL: We need to provide the URL, i.e. the address where we want our search to be performed and scraped. Here, we have provided the URL of Google search and appended the text 'python' so that we scrape the results for text = 'python'.

3. Setting the User-Agent: We need to specify the User-Agent header, which lets the server identify the system, application, and browser making the request. We define a set of User-Agent strings to pick from, as shown below:

A = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

4. The call requests.get(url, headers=headers) sends the request to the web server to download the HTML content of the requested web page, here the search results.

5. Create a BeautifulSoup object from the downloaded data using the 'lxml' parser. The 'lxml' package must be installed for the below code to work.

soup = BeautifulSoup(r.text, 'lxml')

6. Finally, we use soup.find_all('h3') to scrape and display the text of all the h3 (heading level 3) elements on the results page for text = 'python'.

Output:

Welcome to Python.org
#######
Downloads
#######
Documentation
#######
Python For Beginners
#######
Python 3.8.5
#######
Tutorial
#######
Python Software Foundation
#######
Python (programming language) - Wikipedia
#######
Python Tutorial - W3Schools
#######
Introduction to Python - W3Schools
#######
Python Tutorial - Tutorialspoint
#######
Learn Python - Free Interactive Python Tutorial
#######
Learn Python 2 | Codecademy
#######
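
Example 1 appends the search text to the URL with plain string concatenation, which works for a single word such as 'python'. A search text containing spaces or special characters would need to be URL-encoded first. The sketch below shows one way to do that with the standard library; build_search_url is a hypothetical helper name, not part of requests or BeautifulSoup.

```python
from urllib.parse import urlencode


def build_search_url(text):
    """Return a Google search URL with the query text safely encoded."""
    # urlencode turns {'q': 'a b'} into 'q=a+b' and escapes special characters
    return 'https://google.com/search?' + urlencode({'q': text})


print(build_search_url('python'))
print(build_search_url('python web scraping'))
print(build_search_url('c++ & python'))
```

The requests library can also perform this encoding itself when the query is passed separately, as in requests.get('https://google.com/search', params={'q': text}, headers=headers).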

Scrape Search results from a Particular Webpage

In this example, we scrape specific HTML tag values from a single webpage, as shown:

Example 2:

import requests
from bs4 import BeautifulSoup
import random

url = 'https://www.askpython.com/python/examples/python-predict-function'
# Pool of browser User-Agent strings to choose from
A = ("Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2228.0 Safari/537.36",
       "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.1 Safari/537.36",
       "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/41.0.2227.0 Safari/537.36",
       )

# Pick one User-Agent at random for this request
Agent = random.choice(A)

headers = {'user-agent': Agent}
r = requests.get(url, headers=headers)

# Parse the downloaded page with the lxml parser
soup = BeautifulSoup(r.content, 'lxml')

# Text of the page's <title> tag
title = soup.find('title')
print("Title of the webpage--\n")
print(title.string)
# All <div> tags whose class is "site"
search = soup.find_all('div', class_="site")
print("Hyperlink in the div of class-site--\n")
for h in search:
    # Print the href of the first <a> tag in the div, if there is one
    if h.a is not None:
        print(h.a.get('href'))

Here, we have scraped the text of the title tag and the href value of the first a tag inside each div tag whose class is site. Note that the class value differs for each website according to the structure of its HTML.

Output:

Title of the webpage--

Python predict() function - All you need to know! - AskPython
Hyperlink in the div of class-site--

https://www.askpython.com/
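
Example 2 reads h.a, which is only the first a tag in each matching div. The following sketch, run on a made-up HTML snippet rather than the real page, collects every link inside the matching div tags instead. The helper name links_in_divs is hypothetical, and 'html.parser' is used here only to avoid requiring lxml.

```python
from bs4 import BeautifulSoup


def links_in_divs(html, class_name):
    """Return every href found inside div tags with the given class."""
    soup = BeautifulSoup(html, 'html.parser')
    links = []
    for div in soup.find_all('div', class_=class_name):
        # href=True keeps only the <a> tags that actually have an href
        for anchor in div.find_all('a', href=True):
            links.append(anchor['href'])
    return links


# Made-up markup: one div with two links, one div with no link at all
sample_html = """
<div class="site">
  <a href="https://example.com/">Home</a>
  <a href="https://example.com/about">About</a>
</div>
<div class="site"><p>No link here</p></div>
<div class="other"><a href="https://example.com/skip">Skip</a></div>
"""

print(links_in_divs(sample_html, 'site'))
```

A div without any a tag simply contributes nothing to the list, so the loop does not need a separate check for missing links.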

Conclusion

With this, we have come to the end of this topic. Feel free to comment below in case you have any questions.

For more such posts related to Python, stay tuned, and till then, Happy Learning!! 🙂

