As a data analyst or scientist, you need to collect data before any analysis. Sometimes you will get the data directly from the company’s database, but that is not always the case. You may be required to scrape the web to get data for a particular analysis. Here is one solution – Wikipedia scraping using python. It is relatively simple, and in this tutorial we will see how to scrape data in under 5 minutes and with fewer than 10 lines of code.
Let’s dive deep.
1. About the Source
Before writing any code, it is important to identify where the required data is located.
In our case, the data on NBA Finals lives on a Wikipedia page that contains two tables on the same page. Let’s see how we can scrape one of those tables using some basic HTML knowledge.
You can find the tables on the NBA Finals Wikipedia web page (linked in the code below).
- Table 1 – Finals appearances. This table lists each team, the years of its Finals appearances, and other attributes.
2. Import the Libraries
First, we need to import the required libraries for web scraping in python. We require four libraries –

- pandas
- NumPy
- Matplotlib
- unicodedata

#Import the libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from unicodedata import normalize
We will use the pandas read_html() function to parse the HTML tables and extract the desired data. Once you have imported all of these libraries, we are good to go.
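To see what read_html() actually does before pointing it at the live page, here is a minimal sketch run on a small, made-up HTML table (read_html() needs a parser backend such as lxml installed):

```python
from io import StringIO
import pandas as pd

# A tiny made-up HTML table, standing in for a real web page
html = """
<table>
  <tr><th>Team</th><th>Appearances</th></tr>
  <tr><td>Lakers</td><td>32</td></tr>
  <tr><td>Celtics</td><td>22</td></tr>
</table>
"""

# read_html() returns a LIST of DataFrames, one per <table> found
tables = pd.read_html(StringIO(html))
df = tables[0]
print(df)
```

Note that the result is always a list, even when only one table is found – that detail matters when we view the scraped data later.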
3. Read the Data
To read the HTML data, we create a table object using the read_html() function as shown below. Call the function with the link to the web page, and pass the match keyword with text from the table we want to scrape.
#Scraping
NBA_data_scraped = pd.read_html('https://en.wikipedia.org/wiki/NBA_Finals', match='Finals appearances')
read_html() returns a list of every table whose text matches 'Finals appearances', so we select the first one from that list and print it out.

#View data
data = NBA_data_scraped[0]
data
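Scraped Wikipedia cells often carry stray Unicode such as non-breaking spaces, which is why unicodedata was imported earlier. A minimal sketch of cleaning such a value (the team name here is just an illustration, not taken from the scraped table):

```python
from unicodedata import normalize

# Wikipedia cells often contain non-breaking spaces (\xa0)
raw = "Los\xa0Angeles Lakers"

# NFKC normalization replaces them with regular spaces
clean = normalize("NFKC", raw)
print(clean)  # → Los Angeles Lakers
```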
That’s awesome 😛
This is just a simple illustration of web scraping in python. We do have many advanced scraping libraries such as scrapy.
But, you need to get a hang of the basic HTML tags and parsing the data from public sites such as a wiki.
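If you want to work with the HTML tags directly rather than through read_html(), Beautiful Soup is the usual next step. A hedged sketch, assuming beautifulsoup4 is installed and using a made-up HTML snippet:

```python
from bs4 import BeautifulSoup

# A made-up fragment of HTML, standing in for a real page
html = "<table><tr><td>Lakers</td><td>32</td></tr></table>"
soup = BeautifulSoup(html, "html.parser")

# Pull the text out of every <td> cell
cells = [td.get_text() for td in soup.find_all("td")]
print(cells)  # → ['Lakers', '32']
```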
All I can say is, this simple application using basic python modules can serve many purposes effectively. Whenever you require a simple dataset from a public source, you can use this method to get the data in minutes.
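For example, once the table is in a DataFrame you can save it to CSV so later analyses do not need to hit the network again. A minimal sketch, using a small made-up DataFrame in place of the scraped one:

```python
import os
import tempfile
import pandas as pd

# A small made-up DataFrame, standing in for the scraped table
df = pd.DataFrame({"Team": ["Lakers", "Celtics"], "Appearances": [32, 22]})

# Save to CSV so the scrape only has to run once
path = os.path.join(tempfile.gettempdir(), "nba_finals.csv")
df.to_csv(path, index=False)

# Reload later without touching the web page
reloaded = pd.read_csv(path)
print(reloaded.equals(df))  # → True
```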
I hope you found this tutorial helpful.
Wrapping Up – Web scraping in Python
Web scraping in python is a fascinating area. Whenever you need to collect data that is not available in your databases, it is the go-to way. As I said before, we do have many advanced web scraping libraries in python such as Scrapy. But this is a simple tutorial on web scraping using basic python modules.
In the next tutorial, we will see how to scrape more complex datasets from the web using some advanced libraries.
That’s all for now. Happy Python!!!
More read: Beautiful Soup