pandas is an open-source Python library that is widely used for data analysis. It is robust and offers easy-to-use functions and go-to data structures for effective analysis. If you are an analyst or a data scientist, you already know how invaluable pandas is.
Thanks to its wide range of functions, it is used in multiple domains such as finance, economics, business, and statistics. In this tutorial, let’s see how pandas can be used for data analysis and how efficient it is in the process. Without wasting much time, let’s dive in!
Pandas for Data analysis
- Pandas offers robust functions for data manipulation and helps in reading and writing data in different file formats.
- Its labelled data structures make it flexible to work with large labelled or relational datasets.
- It supports efficient operations such as aggregation, merging, concatenation, and reshaping.
- The pandas Series is the basic one-dimensional data structure, and a DataFrame can be built from Series that share an index.
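To make the relationship between a Series and a DataFrame concrete, here is a minimal sketch. The labels h1 to h3 and the numbers are made up for illustration and are not taken from the Housing dataset.

```python
import pandas as pd

# Two Series that share the same index labels (values are made up)
price = pd.Series([13300000, 12250000, 9100000], index=["h1", "h2", "h3"], name="price")
area = pd.Series([7420, 8960, 6000], index=["h1", "h2", "h3"], name="area")

# Placing the Series side by side (axis=1) gives a DataFrame
houses = pd.concat([price, area], axis=1)

# Each column of the DataFrame is itself a Series
print(type(houses["price"]))
print(houses)
```

The Series names become the column names, and the shared index labels become the row labels.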
Things we do here –
- Load the data using read_csv().
- View the data.
- Get the dimensions of the data.
- Summary statistics of the data.
- Unique values and Crosstabs.
- Data types.
- Correlation among features.
Load the Data
For this tutorial, we will be working on a Housing dataset with 545 rows and 13 columns, which serves the purpose well. Using pandas, we can load the data into Python with read_csv().
# load the data
import pandas as pd
data = pd.read_csv('Housing.csv')
data.head(5)
We have successfully loaded the data into Python. Now let’s get to know the data and dive in for analysis.
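If you do not have Housing.csv at hand, the same loading step can be tried on an in-memory stand-in. This is only a sketch: the three rows below are made up, and only a few of the dataset's columns are included.

```python
import io
import pandas as pd

# A tiny stand-in for Housing.csv (made-up rows, a subset of the columns)
csv_text = """price,area,bedrooms,furnishingstatus
13300000,7420,4,furnished
12250000,8960,4,furnished
9100000,6000,3,semi-furnished
"""

# read_csv accepts a file path or any file-like object
sample = pd.read_csv(io.StringIO(csv_text))
print(sample.head())
```

The first line of the text is used as the header, so the column names come straight from the file.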
Peek Into the Data
To get a high-level overview of the data, pandas offers multiple functions. We are going to use the head() and tail() functions to see the first and last n rows of the data. Similarly, we will use the shape attribute and the info() function to learn the dimensions of the data and other information about it.
head() and tail()
# head of the data
data.head(5)
# tail of the data
data.tail(5)
That’s good. The head and tail functions will return the top and bottom n rows of the data. You can always specify the number of rows which should be returned.
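As a quick illustration of choosing the number of rows, here is a sketch on a made-up ten-row frame. The name df and its values are placeholders, not the Housing data.

```python
import pandas as pd

# Made-up data: ten rows with index labels 0..9
df = pd.DataFrame({"price": range(100, 110), "area": range(10)})

first_three = df.head(3)   # rows 0, 1, 2
last_two = df.tail(2)      # rows 8, 9
default_head = df.head()   # n defaults to 5 when omitted

print(first_three)
print(last_two)
```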
To know the dimensions of the data, we can use the shape attribute in pandas. Note that shape is an attribute, not a function, so it is written data.shape without parentheses.
That’s it. The result (545, 13) says our data has 545 rows and 13 columns. To see the names of those features (variables), use data.columns.
Index(['price', 'area', 'bedrooms', 'bathrooms', 'stories', 'mainroad', 'guestroom', 'basement', 'hotwaterheating', 'airconditioning', 'parking', 'prefarea', 'furnishingstatus'], dtype='object')
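As a small self-contained sketch of the same two lookups, the frame below uses three made-up rows rather than the real Housing.csv. The name df and its values are illustrative assumptions.

```python
import pandas as pd

# Made-up rows with a few of the Housing column names
df = pd.DataFrame({
    "price": [13300000, 12250000, 9100000],
    "area": [7420, 8960, 6000],
    "bedrooms": [4, 4, 3],
})

n_rows, n_cols = df.shape          # shape is a (rows, columns) tuple
column_names = list(df.columns)    # columns is an Index; list() gives plain names

print(n_rows, n_cols)
print(column_names)
```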
That’s cool. Now we have all the feature names in the data. Finally, to get an overall summary of the data, call data.info() and look at the results.
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 545 entries, 0 to 544
Data columns (total 13 columns):
 #   Column            Non-Null Count  Dtype
---  ------            --------------  -----
 0   price             545 non-null    int64
 1   area              545 non-null    int64
 2   bedrooms          545 non-null    int64
 3   bathrooms         532 non-null    float64
 4   stories           539 non-null    float64
 5   mainroad          545 non-null    object
 6   guestroom         537 non-null    object
 7   basement          545 non-null    object
 8   hotwaterheating   518 non-null    object
 9   airconditioning   545 non-null    object
 10  parking           538 non-null    float64
 11  prefarea          545 non-null    object
 12  furnishingstatus  545 non-null    object
dtypes: float64(3), int64(3), object(7)
memory usage: 55.5+ KB
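Perfect! Here you get an idea about the null values and the data types as well. For example, bathrooms has only 532 non-null values out of 545 entries, so 13 values are missing. If you want to view only the data types, you can make use of the dtypes attribute (data.dtypes).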
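One way to look at only the data types and the missing values is sketched below. It assumes the dtypes attribute and isna().sum() are a reasonable fit for this step, and it uses a made-up three-row frame with one missing value.

```python
import pandas as pd

# Made-up rows; one bathrooms value is deliberately missing
df = pd.DataFrame({
    "price": [13300000, 12250000, 9100000],
    "bathrooms": [2.0, float("nan"), 1.0],
    "mainroad": ["yes", "yes", "no"],
})

print(df.dtypes)            # one dtype per column

missing = df.isna().sum()   # number of missing values per column
print(missing)
```

A numeric column that contains missing values is stored as float64, which matches what info() reports for bathrooms, stories, and parking above.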
Statistical Analysis Using Pandas
Peeking into the data is not enough to understand it completely. You have to use some statistical measures to dig deep into the data and get meaningful insights. Let’s do it together.
Here are some of the functions which we are going to use: describe(), unique(), sample(), value_counts(), and corr().
Let’s see how we can use these functions and make sense out of our data.
The describe() function helps us find statistical measures such as the min and max values, mean, standard deviation, and more.
By default, describe() considers only the numerical features.
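The behaviour described above can be checked on a small made-up frame. The include="all" call is an extra shown for comparison and is not part of the original walkthrough.

```python
import pandas as pd

# Made-up data with two numeric columns and one text column
df = pd.DataFrame({
    "price": [100, 200, 300, 400],
    "bedrooms": [2, 3, 3, 4],
    "furnishingstatus": ["furnished", "unfurnished", "furnished", "semi-furnished"],
})

numeric_summary = df.describe()             # numeric columns only by default
print(numeric_summary)

full_summary = df.describe(include="all")   # also summarizes the text column
print(full_summary)
```

With include="all", non-numeric columns get count, unique, top, and freq, while the numeric statistics are left empty for them.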
The unique() function helps us find all the unique values in a column. Let’s try it out on furnishingstatus with data['furnishingstatus'].unique().
array(['furnished', 'semi-furnished', 'unfurnished'], dtype=object)
It says that the feature 'furnishingstatus' has 3 unique values.
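Here is a minimal sketch of the same idea on a made-up Series. nunique() is added as a convenient way to get just the count.

```python
import pandas as pd

# Made-up values using the three categories seen above
status = pd.Series(
    ["furnished", "semi-furnished", "furnished", "unfurnished", "semi-furnished"],
    name="furnishingstatus",
)

distinct = status.unique()      # distinct values, in order of first appearance
n_distinct = status.nunique()   # how many distinct values there are

print(distinct)
print(n_distinct)
```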
The sample() function is used to get random records from the data.
You can see the randomly sampled data values.
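Because sampling is random, the rows change on every run unless a seed is fixed. The sketch below uses a made-up ten-row frame, and the random_state values are arbitrary choices.

```python
import pandas as pd

# Made-up data: ten rows with index labels 0..9
df = pd.DataFrame({"price": range(100, 110), "area": range(10)})

picked = df.sample(n=3, random_state=42)         # 3 random rows, reproducible
picked_again = df.sample(n=3, random_state=42)   # same seed, same rows
half = df.sample(frac=0.5, random_state=0)       # a random half of the rows

print(picked)
```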
Value counts and Correlation
The value_counts() and corr() functions give us the frequency of the values and the correlation among the features, respectively.
# Value counts
data['furnishingstatus'].value_counts()
semi-furnished    227
unfurnished       178
furnished         140
Name: furnishingstatus, dtype: int64
This tells us that semi-furnished houses are the most common (227 of 545).
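The counts can also be expressed as shares, and the crosstabs mentioned in the list at the top can be built in the same spirit. This is a sketch on made-up rows; pairing furnishingstatus with mainroad is an arbitrary choice for illustration.

```python
import pandas as pd

# Made-up rows with two categorical columns
df = pd.DataFrame({
    "furnishingstatus": ["semi-furnished", "furnished", "semi-furnished", "unfurnished"],
    "mainroad": ["yes", "yes", "no", "yes"],
})

counts = df["furnishingstatus"].value_counts()                 # absolute frequencies
shares = df["furnishingstatus"].value_counts(normalize=True)   # relative frequencies

# Frequency table of one categorical column against another
table = pd.crosstab(df["furnishingstatus"], df["mainroad"])

print(counts)
print(shares)
print(table)
```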
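The corr() function gives the correlation among the numerical features. Each coefficient ranges from -1 to +1: values near +1 mean a strong positive relationship, values near -1 mean a strong negative relationship, and values near 0 mean a weak or no linear relationship.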
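The three cases can be seen on a made-up frame whose columns are constructed to give coefficients of +1, -1, and 0 against area. Selecting the numeric columns first with select_dtypes is one way to keep text columns out of the calculation. The column names distance and noise are hypothetical and do not exist in the Housing data.

```python
import pandas as pd

# Made-up data built to show perfect positive, perfect negative and zero correlation
df = pd.DataFrame({
    "area": [1, 2, 3, 4],
    "price": [10, 20, 30, 40],      # rises in step with area
    "distance": [8, 6, 4, 2],       # falls in step as area rises
    "noise": [1, -1, -1, 1],        # no linear relationship with area
    "mainroad": ["yes", "no", "yes", "no"],
})

# Keep only the numeric columns, then compute pairwise Pearson correlation
corr = df.select_dtypes(include="number").corr()

print(corr)
```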
Wrapping Up – Pandas
pandas is an open-source and robust library that is widely used for data manipulation and analysis. In this article, I have shown several pandas functions that help us in data analysis. I hope you find this useful, and don’t forget to grab some data and try it yourself.
That’s all for now. Happy Python!!!