Hello folks, today let's shed some light on data sampling using Python pandas. Data sampling is a statistical technique that lets us learn about a large dataset by examining only part of it. In other words, we draw a sample from the population.
But why do we need Data Sampling?
Data is often huge, which is common in big data analytics. With millions of records, analyzing everything effectively becomes difficult. In these cases, you can sample the data and examine a small chunk of it to get some insights.
Let's say you are conducting a large-scale survey.
You have to find the average height of adults in New York City. There are over 6.5 million adults in this city. It would be impractical to reach out to every individual and record their height. You also cannot walk onto a basketball court and measure the people there, because they are generally taller than average.
So we can neither reach everyone nor rely on one specific group. What's next?
Here comes sampling. You take samples at random times, in random places, and from random people, and then compute the average of those values to estimate the average height of adults in NY.
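To make the height example concrete, here is a minimal sketch using only the Python standard library. The population is simulated: the 100,000 people, the mean of 170 cm, and the spread of 10 cm are made-up assumptions for illustration, not real NYC figures.

```python
import random
import statistics

# Hypothetical population: 100,000 simulated adult heights in cm.
# The mean (170) and spread (10) are made-up values for illustration only.
rng = random.Random(42)
population = [rng.gauss(170, 10) for _ in range(100_000)]

# Unbiased approach: pick 1,000 people uniformly at random.
random_sample = rng.sample(population, k=1_000)

# Biased approach: measure only the tallest 1,000 people
# (the "basketball court" scenario).
biased_sample = sorted(population)[-1_000:]

population_mean = statistics.mean(population)
random_mean = statistics.mean(random_sample)
biased_mean = statistics.mean(biased_sample)

print(f"population mean:    {population_mean:.1f}")
print(f"random-sample mean: {random_mean:.1f}")
print(f"biased-sample mean: {biased_mean:.1f}")
```

The random sample should land close to the population mean, while the biased sample overshoots it by a wide margin.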
Types of Data Sampling
Yes, we do have multiple data sampling methods. In this story, we will discuss the following three:
- Random sampling
- Condition-based sampling
- Constant rate sampling
Random Sampling: In this sampling technique, every record has an equal chance of being picked. Because it is unbiased, it is very helpful for drawing conclusions about the whole dataset.
Condition-based sampling: This technique selects samples based on specified conditions or criteria.
Constant rate sampling: Here, you specify the rate at which samples are selected. This keeps a constant distance between the selected samples.
Setting Up Data
We will be using the iris dataset for this purpose. But never ever think real-world data will be this small 😛
# import pandas
import pandas as pd

# load data
data = pd.read_csv('irisdata.csv')
- Import the pandas module.
- Call the read_csv function and load the data.
- Use the data.head() function to peek into the data.
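If you don't have irisdata.csv at hand, the loading steps above can be tried on a tiny stand-in. This is only a sketch: the six rows and the column names (other than Species) are assumptions about the file layout, and io.StringIO is used so that read_csv can read from a string instead of a file.

```python
import io

import pandas as pd

# Tiny stand-in for irisdata.csv; the column names and values are assumptions.
csv_text = """SepalLengthCm,SepalWidthCm,PetalLengthCm,PetalWidthCm,Species
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.3,3.3,6.0,2.5,Iris-virginica
5.8,2.7,5.1,1.9,Iris-virginica
"""

# read_csv accepts a file path or any file-like object
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)                      # (rows, columns)
print(data.head(3))                    # first three rows
print(data["Species"].value_counts())  # rows per species
```

Checking the shape and the per-species counts right after loading is a quick way to confirm the file was read as expected before you start sampling.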
1. Random Sampling
The idea of random sampling is that if we have N rows, we extract X of them at random (X < N). You can use the pandas sample() function for this.
# subset the data
subset_data = data.sample(n=100)
subset_data
Here, we passed the number of rows (n) to the sample function to get this subset of the data. You can also specify the sample size as a fraction of the rows using frac. Let's see how.
# sampling with percentage
subset_data_percentage = data.sample(frac=0.5)
subset_data_percentage
You can confirm the size of the sampled data using the shape attribute as shown below.
# shape of the data
subset_data_percentage.shape
Since we asked for 50% of the data, we get 75 randomly chosen rows, half of the original 150.
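One thing worth knowing: sample() returns different rows on every call unless you fix the seed. Here is a small sketch using the random_state parameter, run on a stand-in frame of 150 rows (the value column is a made-up placeholder, not the iris data).

```python
import pandas as pd

# Stand-in frame with 150 rows, the same length as the iris data
data = pd.DataFrame({"value": range(150)})

# random_state fixes the seed, so repeated calls return the same rows
first_draw = data.sample(n=100, random_state=0)
second_draw = data.sample(n=100, random_state=0)
print(first_draw.equals(second_draw))

# frac is a fraction of the rows: 0.5 of 150 rows gives 75 rows
half = data.sample(frac=0.5, random_state=0)
print(half.shape)

# Sampling is without replacement by default, so no row repeats
print(half.index.is_unique)
```

Fixing random_state is handy when you want others to reproduce your results. Leave it out when you want a fresh sample each time.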
2. Conditional Sampling
Based on the case, you can opt for condition-based sampling. Here, by specifying a condition, you can extract the rows which satisfy it. Let’s see how it works.
# conditional sampling
our_condition = data['Species'] == 'Iris-setosa'

# retrieve the index of the matching rows
index = our_condition[our_condition].index

# sample based on condition
conditional_subset = data[our_condition].sample(n=10)

# output
conditional_subset
Check the shape of the sampled data.
- We have defined the condition.
- Retrieved the index of the rows that satisfy it.
- Sampled the data based on the condition.
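The steps above, together with a shape check, can be sketched on a stand-in frame. The 50 rows per species mirror the usual iris layout, and the PetalLengthCm column with its values is a made-up assumption.

```python
import pandas as pd

# Stand-in data: 50 rows per species, mirroring the iris layout (an assumption)
data = pd.DataFrame({
    "Species": ["Iris-setosa"] * 50
    + ["Iris-versicolor"] * 50
    + ["Iris-virginica"] * 50,
    "PetalLengthCm": [1.0 + 0.04 * i for i in range(150)],  # made-up values
})

# Boolean mask: True for rows that satisfy the condition
is_setosa = data["Species"] == "Iris-setosa"

# Filter first, then draw 10 random rows from the matching ones
conditional_subset = data[is_setosa].sample(n=10, random_state=1)

print(conditional_subset.shape)
print(conditional_subset["Species"].unique())
```

Because the filter runs before sample(), every sampled row is guaranteed to satisfy the condition.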
3. Constant Rate Sampling
In this sampling method, we pick samples at a constant interval, or rate. In the example below, we sample at rate 2. Let's see how it works.
# defining rate
our_rate = 2

# apply the rate
constant_subset = data[::our_rate]

# data
constant_subset
You can observe that every second data record is retrieved as a subset of the original data.
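The same slicing idea works with any rate and any starting position. Here is a small sketch on a ten-row stand-in frame, using iloc for explicit position-based slicing. The frame and the names rate, every_third, and offset_third are made up for illustration.

```python
import pandas as pd

# Ten-row stand-in frame so the selected positions are easy to follow
data = pd.DataFrame({"value": range(10)})

rate = 3

# Every third row, starting at position 0
every_third = data.iloc[::rate]
print(every_third.index.tolist())   # [0, 3, 6, 9]

# Same rate, but starting at position 1
offset_third = data.iloc[1::rate]
print(offset_third.index.tolist())  # [1, 4, 7]
```

Note that this kind of sampling is only as unbiased as the row order: if the data is sorted or has a repeating pattern, a fixed rate can over-represent some rows.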
Now, we have sampled the data using multiple methods. But what if you want to retrieve the remaining data?
Pass to the next heading…
Data Sampling – Data Retrieval
There are two ways to get the remaining data, that is, the rows that were not sampled. Let's see both of them.
The first method drops the sampled rows and returns what remains.
# first method
remaining_data = data.drop(labels=constant_subset.index)
remaining_data
Here, you can observe that the sampled rows are gone and the remaining data is produced as output.
In the second method, we will be selecting only those rows which are not involved in sampling. In simple words, we will be selecting data in the second method and dropping data in the first method.
# second method
remaining_data_method2 = data[~data.index.isin(constant_subset.index)]
remaining_data_method2
Observe the same output here. The method changes, but not the result.
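If you want to confirm that the two methods really agree, here is a quick sketch on a stand-in frame. The frame, the 20% random sample, and the variable names are assumptions for illustration. Any sample works, as long as the index labels are unique.

```python
import pandas as pd

# Stand-in frame with a default, unique integer index
data = pd.DataFrame({"value": range(150)})
sampled = data.sample(frac=0.2, random_state=7)

# Method 1: drop the sampled rows by label
remaining_drop = data.drop(labels=sampled.index)

# Method 2: keep rows whose label is not in the sample
remaining_mask = data[~data.index.isin(sampled.index)]

# Both methods should return the same rows in the same order
print(remaining_drop.equals(remaining_mask))

# Sample and remainder together should account for every row
print(len(sampled) + len(remaining_drop) == len(data))
```

Both methods match rows by index label, so with duplicated labels they would remove every row sharing a sampled label. Calling reset_index(drop=True) first avoids that.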
Data Sampling – Conclusion
Data sampling is one of the key aspects of statistical data analysis. It has many applications, and with it you can extract meaningful insights out of big data. I hope you now have an idea of how to use data sampling in your data work, so that big data is no bigger…
That’s all as of now. Happy Python!!!