Data manipulation, or transformation, is a key aspect of any analysis. I say this because the chances of getting sensible insights from raw, untransformed data are very low. You should transform raw data into meaningful data. You may need to create new variables, bring the data into a consistent form, or even rearrange it to make sense of it.
This helps in identifying anomalies and extracting more insights than you might expect. Therefore, in this article, we will discuss some of the Python pandas and NumPy functions that help with data mapping and replacement.
1. Create a Data Set
For the data mapping examples, let’s create a simple dataset using the pandas DataFrame constructor.
It is a student grade dataset with 2 columns, one for the student name and another for the student grade.
#Create a dataset
import pandas as pd
student = {'Name':['Mike','Julia','Trevor','Brooks','Murphy'],'Grade':[3.5,4,2.1,4.6,3.1]}
df = pd.DataFrame(student)
df
Name Grade
0 Mike 3.5
1 Julia 4.0
2 Trevor 2.1
3 Brooks 4.6
4 Murphy 3.1
Well, we now have a simple student dataset. Let’s see how we can map and replace values as part of the data transformation process.
2. Replacing Values in the data
So, we have data with 5 records and 2 attributes. Now, we got a message from the class teacher that Murphy actually scored a grade of 5 and is the topper in the class. We need to replace the old grade with the new grade as per the teacher’s words.
So, here we go…
#Replacing data
df['Grade'] = df['Grade'].replace([3.1],5)
#Updated data
Name Grade
0 Mike 3.5
1 Julia 4.0
2 Trevor 2.1
3 Brooks 4.6
4 Murphy 5.0
That’s great! We have successfully replaced the old grade (value) with a new grade (value). This is just an example, but the process has a real-world application.
In a real-world use case, you submit a data quality report to the client and ask for correct values when the current data is not good. After you get the revised data from the client, you replace the old records with it.
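As a rough sketch of that workflow, the example below applies corrections by student name instead of by old value, so another student who happens to share the old grade is left untouched. The `corrections` dictionary is a hypothetical stand-in for whatever revised data the client sends back.

```python
import pandas as pd

df = pd.DataFrame({
    'Name': ['Mike', 'Julia', 'Trevor', 'Brooks', 'Murphy'],
    'Grade': [3.5, 4, 2.1, 4.6, 3.1],
})

# Hypothetical corrections received from the client: name -> revised grade
corrections = {'Murphy': 5.0}

# Update only the rows whose Name appears in the corrections
for name, new_grade in corrections.items():
    df.loc[df['Name'] == name, 'Grade'] = new_grade

print(df)
```

Selecting rows with `.loc` and a boolean mask keeps the change limited to the named student.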
More Examples / Instances
- Now, let’s look at some other requirements. Let’s see how we can replace multiple old values with a set of new values.
#Replace multiple values with new set of values
df['Grade'] = df['Grade'].replace([3.5,4.0,2.1,4.6,5.0],['Average','Good','Needs Improvement','Good','Excellent'])
df
Name Grade
0 Mike Average
1 Julia Good
2 Trevor Needs Improvement
3 Brooks Good
4 Murphy Excellent
That’s cool!
We have replaced multiple values with a set of new values. As you can see, all 5 values were replaced at once.
- Replacing multiple values with a single new value.
#Replacing multiple values with a single new value
df['Grade'] = df['Grade'].replace(['Average','Good','Needs Improvement','Excellent'],'Good')
df
Name Grade
0 Mike Good
1 Julia Good
2 Trevor Good
3 Brooks Good
4 Murphy Good
That’s it. As simple as that. This is how you can replace multiple values with a new set of values or with a single new value.
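When the old and new values come in pairs, a dictionary can be easier to read than two parallel lists. The sketch below uses `Series.map` on a standalone copy of the numeric grades. The `label_for` dictionary is an illustrative name, not part of the original example.

```python
import pandas as pd

grades = pd.Series([3.5, 4.0, 2.1, 4.6, 5.0], name='Grade')

# Illustrative lookup table: old value -> new label
label_for = {
    3.5: 'Average',
    4.0: 'Good',
    2.1: 'Needs Improvement',
    4.6: 'Good',
    5.0: 'Excellent',
}

# map() looks up every value in the dictionary
labels = grades.map(label_for)
print(labels)

# Values missing from the dictionary become NaN instead of staying unchanged
unmapped = pd.Series([1.0]).map(label_for)
print(unmapped)
```

Note the difference from `replace`: `map` returns NaN for values that are not in the dictionary, while `replace` leaves them as they are.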
3. Data Mapping Using Pandas Cut function
So far, we have replaced values manually in several scenarios. Now, let’s see how to do the mapping with the pandas cut function in Python.
Here, instead of listing each value by hand, we create bins and assign a label based on the bin each grade falls into. This works on the numeric Grade column, with Murphy’s corrected grade of 5.
#Pandas cut function
my_bins = [0,2,4,5]
my_comments = ['Poor','Satisfied','Good']
df['New_Grades'] = pd.cut(df['Grade'],my_bins,labels=my_comments)
Name Grade New_Grades
0 Mike 3.5 Satisfied
1 Julia 4.0 Satisfied
2 Trevor 2.1 Satisfied
3 Brooks 4.6 Good
4 Murphy 5.0 Good
Excellent! We have mapped new grades into the data. To recap the steps:
- Define the bin edges.
- Add a label (comment) for each bin range.
- Map the new variable into the data.
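The steps above leave one detail implicit: which edge of each bin is included. By default, `pd.cut` builds right-closed intervals, so with bins `[0, 2, 4, 5]` the ranges are (0, 2], (2, 4] and (4, 5]. The sketch below uses a few boundary grades (made up for illustration) to show the effect, and how `include_lowest=True` brings the lowest edge into the first bin.

```python
import pandas as pd

# Boundary values chosen for illustration only
grades = pd.Series([0.0, 2.0, 2.1, 4.0, 5.0])
my_bins = [0, 2, 4, 5]
my_comments = ['Poor', 'Satisfied', 'Good']

# Default: intervals are open on the left, so 0.0 falls outside every bin
default_cut = pd.cut(grades, my_bins, labels=my_comments)
print(default_cut)

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(grades, my_bins, labels=my_comments, include_lowest=True)
print(with_lowest)
```

A grade outside every bin comes back as NaN, which is a handy signal that the bins do not cover the data.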
4. Data Mapping using Numpy.digitize Function
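This function can produce the same mapping as pandas cut. The difference is that it returns bin indices instead of labels, so we have to create a dictionary of labels and map it to the data.
Here, the bins and the bin labels are the same as above. Passing right=True makes each bin include its right edge, as pandas cut does by default. Without it, a grade of exactly 5 would land past the last bin and get no label.
#Data mapping using numpy
import numpy as np
my_bins = [0,2,4,5]
my_comments = ['Poor','Satisfied','Good']
#Bin indices start at 1, so number the labels from 1 as well
my_dict = dict(enumerate(my_comments,1))
df['Numpy.digitize'] = np.vectorize(my_dict.get)(np.digitize(df['Grade'], my_bins, right=True))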
df
Name Grade New_Grades Numpy.digitize
0 Mike 3.5 Satisfied Satisfied
1 Julia 4.0 Satisfied Satisfied
2 Trevor 2.1 Satisfied Satisfied
3 Brooks 4.6 Good Good
4 Murphy 5.0 Good Good
You can see that the numpy.digitize method produces the same result as the pandas cut function.
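To see what `np.digitize` returns before any labels are attached, the sketch below runs it on three boundary grades (picked for illustration) with and without `right=True`. The returned numbers are bin indices, and an index equal to `len(my_bins)` means the value lies beyond the last edge.

```python
import numpy as np

# Boundary values chosen for illustration only
grades = np.array([2.0, 3.5, 5.0])
my_bins = [0, 2, 4, 5]
my_comments = ['Poor', 'Satisfied', 'Good']

# Default (right=False): bins[i-1] <= x < bins[i]
left_closed = np.digitize(grades, my_bins)

# right=True: bins[i-1] < x <= bins[i], matching pd.cut's default
right_closed = np.digitize(grades, my_bins, right=True)

print(left_closed)
print(right_closed)

# Turn the right-closed indices into labels with a 1-based lookup
my_dict = dict(enumerate(my_comments, 1))
labels = [my_dict.get(i) for i in right_closed.tolist()]
print(labels)
```

With the default setting, the grade 5.0 gets index 4, which has no entry in the label dictionary. This is why the choice of `right` matters at the top edge.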
5. Numpy.select()
If you use this method for data mapping, you have to define a list of conditions and a list of values to return when each condition holds.
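#Numpy.select method
import numpy as np
#Ranges share their edges; the first matching condition wins
conditions = [df['Grade'].between(0,2),
              df['Grade'].between(2,4),
              df['Grade'].between(4,5)]
values = ['Poor', 'Satisfied', 'Good']
#Use a string default so it shares a dtype with the labels
df['Numpy_select'] = np.select(conditions, values, 'Unknown')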
Name Grade New_Grades Numpy.digitize Numpy_select
0 Mike 3.5 Satisfied Satisfied Satisfied
1 Julia 4.0 Satisfied Satisfied Satisfied
2 Trevor 2.1 Satisfied Satisfied Satisfied
3 Brooks 4.6 Good Good Good
4 Murphy 5.0 Good Good Good
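The code is largely self-explanatory: for each row, np.select checks the conditions in order and returns the value paired with the first one that is true, or the default when none match.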
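Two behaviours of `np.select` are worth seeing in isolation: the first matching condition wins, and rows that match nothing receive the default. The sketch below uses a few made-up grades, including one outside every range, and an `'Unknown'` default that is an assumption for this illustration.

```python
import numpy as np
import pandas as pd

# Values chosen for illustration; 7.0 is deliberately out of range
grades = pd.Series([2.0, 4.0, 4.6, 7.0])

# between() includes both ends, so 2.0 and 4.0 each satisfy two conditions
conditions = [
    grades.between(0, 2),
    grades.between(2, 4),
    grades.between(4, 5),
]
values = ['Poor', 'Satisfied', 'Good']

# The first true condition decides the label; 'Unknown' covers no match
labels = np.select(conditions, values, default='Unknown')
print(labels)
```

Because ties go to the earlier condition, 2.0 is labelled from the first range and 4.0 from the second, which mirrors the right-closed bins used with pandas cut.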
6. User-defined Function
Finally, we are going to create a custom function that does the same job as the pandas cut, numpy.digitize and numpy.select functions.
#User defined function
def user_defined(value):
    if value >= 0 and value <= 2:
        return 'Poor'
    elif value > 2 and value <= 4:
        return 'Satisfied'
    else:
        return 'Good'
#Using the custom function
df['user_defined'] = df['Grade'].apply(user_defined)
Name Grade New_Grades Numpy.digitize Numpy_select user_defined
0 Mike 3.5 Satisfied Satisfied Satisfied Satisfied
1 Julia 4.0 Satisfied Satisfied Satisfied Satisfied
2 Trevor 2.1 Satisfied Satisfied Satisfied Satisfied
3 Brooks 4.6 Good Good Good Good
4 Murphy 5.0 Good Good Good Good
Impressive!
We got the same output using different methods. You are free to use any of these methods when you are working on data transformation, data mapping or data replacement.
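One quick way to convince yourself that the methods agree is to compute the labels with each one and compare them. The sketch below rebuilds the grades in a standalone frame and assumes right-closed bins throughout. The helper names (`via_cut`, `via_digitize`, `via_select`) are illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Grade': [3.5, 4.0, 2.1, 4.6, 5.0]})
bins = [0, 2, 4, 5]
labels = ['Poor', 'Satisfied', 'Good']

# pandas cut: right-closed bins by default
via_cut = pd.cut(df['Grade'], bins, labels=labels).astype(str)

# numpy digitize: right=True to match pd.cut, then look up the labels
lookup = dict(enumerate(labels, 1))
indices = np.digitize(df['Grade'].to_numpy(), bins, right=True)
via_digitize = pd.Series([lookup[i] for i in indices.tolist()])

# numpy select: the first matching condition wins
conditions = [
    df['Grade'].between(0, 2),
    df['Grade'].between(2, 4),
    df['Grade'].between(4, 5),
]
via_select = pd.Series(np.select(conditions, labels, default='Unknown'))

# True only if all three methods give the same label for every row
agree = bool(via_cut.eq(via_digitize).all() and via_cut.eq(via_select).all())
print(agree)
```

A check like this is most useful on values that sit exactly on a bin edge, since that is where the methods are most likely to diverge.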
Ending Note – Data Mapping
Data mapping and transformation is a vital part of any analysis. It turns your raw data into a source of patterns and meaningful insights. I hope you found this tutorial useful and enjoyed playing with the above methods.
That’s all for now! Happy Python 🙂
Further reading: Numpy.digitize