Pandas is the go-to library in Python for data manipulation and analysis. Raw data rarely gives up its insights on its own, so as a data analyst or scientist you have to work on the data to uncover hidden patterns. One common way to do this is subsetting, also called data slicing: you focus only on the part of the data you are interested in rather than the entire dataset. Today, let’s discuss what data slicing is and how we can use pandas for it.
Data Slicing Using Python Pandas
In this tutorial, we will be working with the coffee sales dataset, which is fairly large and gives a real-world flavor. Let’s load the data using the read_csv() function in pandas.
#data
import pandas as pd

data = pd.read_csv('coffeesales.csv')
data.head(5)
Well, our data is ready to be sliced and diced!
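If you don’t have the file at hand, here is a minimal sketch of the same loading step using a tiny in-memory CSV. The three rows and their values are made up for illustration and only borrow a few column names from the dataset. The parse_dates argument is an optional extra that the call above does not use.

```python
import io

import pandas as pd

# Hypothetical stand-in for coffeesales.csv: three made-up rows
# that reuse a few of the dataset's column names.
csv_text = """order_date,market,region,product,sales
2012-01-01,Major Market,Central,Amaretto,219
2012-01-01,Major Market,Central,Colombian,190
2012-01-01,Small Market,West,Decaf Espresso,234
"""

# read_csv accepts a file path or any file-like object.
sample = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

print(sample.head(2))   # first two rows
print(sample.shape)     # (rows, columns)
print(sample.dtypes)    # column data types
```

Checking shape and dtypes right after loading is a cheap way to confirm that the file was parsed the way you expect before you start slicing.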
1. Pandas Series
We will first work on the pandas series. Let’s create a simple series and then we will see how we can extract the data from the series.
#series
my_series = pd.Series([11, 22, 33, 44, 55, 66, 77, 88, 99, 0])
my_series
This is our simple pandas series. Now, we can slice the data based on the index.
#index slicing
my_series
That’s it. You can extract a value by specifying its index inside square brackets. I know it will be very easy for you to do this.
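As a quick sketch of index-based extraction on the series above: the positions picked here (0, 3, and the ranges) are my own choices for illustration, and iloc is used for the purely positional lookups.

```python
import pandas as pd

my_series = pd.Series([11, 22, 33, 44, 55, 66, 77, 88, 99, 0])

# With the default index, the label 0 refers to the first element.
first = my_series[0]

# iloc always works by position, whatever the index looks like.
fourth = my_series.iloc[3]

# Positional slices follow Python rules: the stop position is excluded.
head3 = my_series.iloc[:3]         # positions 0, 1, 2
last2 = my_series.iloc[-2:]        # the final two values
every_other = my_series.iloc[::2]  # every second value

print(first, fourth)
print(head3.tolist(), last2.tolist(), every_other.tolist())
```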
Now, let’s create a pandas series with a defined index.
#series with index
dummy = pd.Series([89, 78, 60, 71, 90], index=['Josh', 'Sam', 'Reece', 'Kay', 'Jade'])
dummy

Josh     89
Sam      78
Reece    60
Kay      71
Jade     90
dtype: int64
It looks good. Let’s slice the data based on this defined index.
#indexed slicing
dummy['Josh']

#indexed slicing
dummy['Kay']

#indexed slicing
dummy['Jade']
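You got it right. You can also slice a range of labels. Note that, unlike positional slicing, a label-based slice includes the end label.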
#indexed slicing
dummy['Josh':'Kay']

Josh     89
Sam      78
Reece    60
Kay      71
dtype: int64
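To make the difference between label-based and position-based slices concrete, here is a small sketch on the same series. The variable names are mine, and loc/iloc are used to make the intent explicit.

```python
import pandas as pd

dummy = pd.Series([89, 78, 60, 71, 90],
                  index=['Josh', 'Sam', 'Reece', 'Kay', 'Jade'])

# Label slice: both endpoints are included, so 'Kay' is part of the result.
by_label = dummy.loc['Josh':'Kay']

# Position slice: the stop position (3, i.e. 'Kay') is excluded.
by_position = dummy.iloc[0:3]

# A list of labels picks arbitrary entries in the order given.
picked = dummy.loc[['Jade', 'Sam']]

print(len(by_label), len(by_position))
print(picked.tolist())
```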
That’s all about extracting the data from the pandas series. In the next phase, we will be working with pandas data frames.
2. Pandas Dataframe
A pandas DataFrame is a 2-D data structure whose columns can hold different data types. It is just like a spreadsheet or a SQL table.
It consists of rows and columns which are indexed. This will help us to get the data we need for our analysis. Well, we have already loaded the data (coffeesales) and it should be ready to work on.
To start, let’s look at the different features (columns) present in the data, which the DataFrame exposes through its columns attribute.
Index(['order_date', 'market', 'region', 'product_category', 'product', 'cost', 'inventory', 'net_profit', 'sales'], dtype='object')
We can quickly check for null values.
#null values
data.isnull().sum()

order_date          0
market              0
region              0
product_category    0
product             0
cost                0
inventory           0
net_profit          0
sales               0
dtype: int64
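Since the coffee sales data has no nulls, here is a small sketch on a made-up frame showing what the same check reports when values are missing. The toy frame and its values are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with two missing values.
toy = pd.DataFrame({
    "region": ["Central", "West", None],
    "sales": [219.0, np.nan, 134.0],
    "cost": [89, 83, 95],
})

# isnull() marks each missing cell; sum() counts them per column.
null_counts = toy.isnull().sum()

# A single yes/no answer for the whole frame.
any_nulls = bool(toy.isnull().any().any())

# Rows that have no missing values at all.
complete_rows = toy.dropna()

print(null_counts)
print(any_nulls, len(complete_rows))
```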
Perfect! We don’t have any null values in our dataset. Let’s move to the slicing part.
Now, we can slice the data as we want. Let’s pull up the region column from the data and see how it works.
0       Central
1       Central
2       Central
3       Central
4       Central
         ...
4243       West
4244       West
4245       West
4246       West
4247       West
Name: region, Length: 4248, dtype: object
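Pulling a single column out of a data frame can be sketched as follows, on a made-up three-row frame (the values are hypothetical). Selecting with one pair of brackets gives a Series, like the output above, while wrapping the name in a list keeps the result a DataFrame.

```python
import pandas as pd

# Hypothetical toy frame that borrows column names from the dataset.
toy = pd.DataFrame({
    "region": ["Central", "West", "East"],
    "product": ["Amaretto", "Colombian", "Decaf Espresso"],
    "sales": [219, 190, 234],
})

# Single brackets with a column name return a Series.
region_series = toy["region"]

# A list with one name returns a one-column DataFrame.
region_frame = toy[["region"]]

print(type(region_series).__name__)
print(type(region_frame).__name__)
```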
I know you are getting the idea of how to slice and dice by now. In the next step, we will extract multiple columns in the order that we need. In other words, I will choose the order of the features myself instead of keeping the order of the raw data.
#multiple features
data[['product', 'sales', 'net_profit', 'region']]
I hope you got the idea now. The order here starts with the product, followed by its sales, net profit, and region. This arrangement is easier to read than the mixed order of the raw data.
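Here is a minimal sketch of the same idea on a made-up frame (values are hypothetical). It shows that the result follows the order of the list you pass and that the original frame is left untouched.

```python
import pandas as pd

# Hypothetical toy frame with columns in a "raw" order.
toy = pd.DataFrame({
    "region": ["Central", "West"],
    "product": ["Amaretto", "Colombian"],
    "net_profit": [94, 68],
    "sales": [219, 190],
})

# The selection follows the order of the list, not the original order.
wanted = ["product", "sales", "net_profit", "region"]
reordered = toy[wanted]

print(list(reordered.columns))
print(list(toy.columns))  # the original frame keeps its own order
```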
If you are mainly interested in the sales by region, you can set the index to the region and then slice the data based on that for better insights.
Slicing the Dataframe
#value counts
data['region'].value_counts()

Central    1344
West       1344
East        888
South       672
Name: region, dtype: int64
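As a small sketch of what value_counts() does, here it is on a made-up series of six region labels. The normalize=True variant, which returns shares instead of raw counts, is an optional extra not used above.

```python
import pandas as pd

# Hypothetical region labels for six rows.
regions = pd.Series(["Central", "West", "Central", "East", "West", "Central"])

# Counts per distinct value, sorted from most to least frequent.
counts = regions.value_counts()

# The same information expressed as fractions of the total.
shares = regions.value_counts(normalize=True)

print(counts)
print(shares)
```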
Well, we have 4 regions, and most of the records belong to the Central and West regions. Now, we want to see only the data related to the shops located in the Central region. For this, we use the loc indexer in pandas to select a particular region and the values associated with it.
#region data
df = data.set_index('region')
df
df.loc[['Central']]
The above returns only the rows associated with the Central region. We can also restrict the columns, here everything from product through sales.
#region
df.loc[['Central'], 'product':'sales']
Wow! This tells us a much more interesting story. I hope by now you understand how to listen to the story in your data using data slicing methods.
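To see the region-based slicing end to end, here is a sketch on a made-up four-row frame (all values are hypothetical). It also shows a boolean mask as an alternative that gives the same rows without changing the index; that alternative is my addition, not part of the steps above.

```python
import pandas as pd

# Hypothetical toy frame that borrows column names from the dataset.
toy = pd.DataFrame({
    "region": ["Central", "West", "Central", "East"],
    "product": ["Amaretto", "Colombian", "Decaf Espresso", "Mint"],
    "cost": [89, 83, 95, 44],
    "sales": [219, 190, 234, 100],
})

# Make region the row index so loc can select by region label.
df = toy.set_index("region")

# All rows labelled 'Central' (the list keeps the result a DataFrame).
central = df.loc[["Central"]]

# Same rows, limited to the columns from 'product' through 'sales'.
# Column label slices include both endpoints.
central_cols = df.loc[["Central"], "product":"sales"]

# Alternative: filter with a boolean mask on the original frame.
masked = toy.loc[toy["region"] == "Central", ["product", "sales"]]

print(central_cols)
print(masked)
```

Setting the index is convenient when you slice by the same column repeatedly, while a boolean mask is handy for a one-off filter.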
Wrapping Up – Data slicing
Data slicing is a handy way to narrow the data down to the parts that matter for your analysis. We have covered slicing for both pandas series and data frames, using bracket indexing, set_index(), and loc along the way.
I hope you will find this useful in your future assignments. That’s all for now. Happy Python!!!
More read: Working with data using Pandas