Hey learner! In this tutorial, we will take a dataset and learn how to analyze it and get as much information out of it as we can. We will be using the Mountain Deaths dataset, which is available on Kaggle.
Let’s not wait and get started already!
What Does the Dataset Contain?
The dataset we will be using in this tutorial can be found on Kaggle. The dataset description on the Kaggle page says the following:
The International Climbing and Mountaineering Federation, commonly known by its French name Union Internationale des Associations d’Alpinisme (UIAA) recognizes 14 mountains that are more than 8,000 meters (26,247 ft) in height above sea level, and are considered to be sufficiently independent of neighboring peaks. These mountains are popularly called eight-thousanders. Even though all eight-thousanders have been summited, more than 1000 people have died trying to make it to the summits of these mountains.
The dataset contains the following columns for all the 14 mountains:
- Date: Date on which the mountaineer died
- Name: Name of the deceased
- Nationality: The country which the mountaineer belonged to
- Cause of death: Reason for the death
Analyzing the Mountain Deaths Using Python
Firstly, we import all of the libraries that we will need for our analysis in the later sections.
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
The next thing we are going to do is combine the 14 CSV files (one per mountain) into a single DataFrame, so that we can analyze all the peaks together.
The code for this is below. Make sure that all the CSV files are in the same directory as the code file, and then run the code. All the data is stored in a single variable, DATA.
# Collect every CSV file in the current directory (one file per peak)
all_csv = sorted(f for f in os.listdir('.') if f.endswith('.csv'))
frames = []
for csv_file in all_csv:
    temp_DATA = pd.read_csv(csv_file)
    # Tag each row with the peak name, taken from the file name without ".csv"
    temp_DATA['Peak Name'] = os.path.splitext(csv_file)[0]
    frames.append(temp_DATA)
# DataFrame.append was removed in pandas 2.0, so we use pd.concat instead
DATA = pd.concat(frames, ignore_index=True)
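If you want to try the combining step before downloading anything, here is a minimal sketch that writes two tiny stand-in CSV files to a temporary folder and merges them the same way. The file names, the rows, and the variable names (folder, frames, combined) are all made up for illustration; only the column names come from the dataset description above.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Hypothetical stand-in files so the sketch runs anywhere; the rows are made up.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "everest.csv").write_text(
        "Date,Name,Nationality,Cause of death\n"
        "2000-01-01,Climber A,Nepal,Avalanche\n"
    )
    (folder / "k2.csv").write_text(
        "Date,Name,Nationality,Cause of death\n"
        "2001-02-02,Climber B,Japan,Fall\n"
        "2002-03-03,Climber C,Nepal,Avalanche\n"
    )

    frames = []
    for csv_path in sorted(folder.glob("*.csv")):
        frame = pd.read_csv(csv_path)
        # Path.stem is the file name without its extension, e.g. "k2"
        frame["Peak Name"] = csv_path.stem
        frames.append(frame)

    # One concat at the end is cheaper than growing a DataFrame row block by row block
    combined = pd.concat(frames, ignore_index=True)

print(combined)
```

Assigning a single string to a column fills every row with it, so there is no need to build a list of repeated names by hand.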
The combined data now has the four original columns plus the new Peak Name column.

Some Preliminary Analysis
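The first thing we will look at is the describe function. For numeric features it summarizes the count, mean, standard deviation, min, max, and quartiles. For text columns such as Name or Nationality, it reports the count, the number of unique values, the most frequent value, and how often that value occurs.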
DATA.describe()
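Since the columns listed earlier all look like text, it helps to know what describe returns when there are no numeric columns. The sketch below uses a tiny made-up frame (toy is a hypothetical name and the rows are not from the real dataset) just to show the shape of the result.

```python
import pandas as pd

# Made-up rows, only to illustrate the output for text-only columns.
toy = pd.DataFrame({
    "Nationality": ["Nepal", "Nepal", "Japan"],
    "Cause of death": ["Avalanche", "Fall", "Avalanche"],
})

# With no numeric columns, describe() falls back to summarizing every column
summary = toy.describe()
print(summary)
```

The rows of the result are count, unique, top, and freq rather than mean and standard deviation.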

The count function returns the number of non-missing values in each column.
DATA.count()
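Because count skips missing values, comparing it with the number of rows is a quick way to spot gaps. Here is a small sketch on made-up data (toy and its rows are hypothetical) showing the difference.

```python
import numpy as np
import pandas as pd

# Made-up rows with one missing nationality.
toy = pd.DataFrame({
    "Name": ["Climber A", "Climber B", "Climber C"],
    "Nationality": ["Nepal", np.nan, "Japan"],
})

non_null = toy.count()       # non-missing values per column
missing = toy.isna().sum()   # missing values per column

print(non_null)
print(missing)
```

On the real data, any column whose count is lower than len(DATA) has missing entries worth a closer look.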

We can also check the data type of each column in the dataset using this syntax:
DATA.dtypes
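If the Date column shows up as a plain text column, one possible next step is to convert it to real dates. The sketch below assumes ISO-style date strings, which may not match the format used in the actual files, and the values are made up.

```python
import pandas as pd

# Made-up date strings; the real files may use a different format.
raw_dates = pd.Series(["2000-01-01", "2001-02-02", "unknown"], name="Date")

# errors="coerce" turns anything that cannot be parsed into NaT instead of raising
parsed = pd.to_datetime(raw_dates, errors="coerce")

print(parsed.dtype)
print(parsed.dt.year)
```

Counting the NaT values afterwards tells you how many dates failed to parse.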

Next, we can use the unique function to find out the unique values of a particular column. Let’s see what the unique values of the ‘Nationality’ column are in our dataset.
print(DATA['Nationality'].unique())
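Two close relatives of unique are often handy here: nunique gives the number of distinct values, and value_counts gives how many times each one appears. A minimal sketch on a made-up series (the values are illustrative only):

```python
import pandas as pd

# Made-up values standing in for DATA['Nationality'].
nationality = pd.Series(
    ["Nepal", "Japan", "Nepal", "Nepal", "Japan", "Spain"],
    name="Nationality",
)

n_distinct = nationality.nunique()   # how many different values
counts = nationality.value_counts()  # occurrences per value, most frequent first

print(n_distinct)
print(counts)
```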

Some Basic Visualizations for Mountain Deaths
First, let’s have a look at the mountain that has the largest number of deaths over the time period using the code below.
sns.catplot(x='Peak Name',kind='count',data=DATA,height=10,aspect=20/10)
plt.xticks(rotation=90)
plt.show()

From the plot, we can clearly see that Everest has had the maximum number of deaths!
Next, we can see which is the main cause of the deaths over the period using the code below.
sns.catplot(x='Cause of death',kind='count',data=DATA,height=10,aspect=30/10)
plt.xticks(rotation=90)
plt.show()

We can see that Avalanche is the most common cause of death, making it the deadliest of all the listed causes. Unfortunately, avalanches are outside a climber’s control, and they are a risk climbers accept when they set out.
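A count plot shows which bar is tallest, but not what share of the total it represents. As a rough sketch, value_counts with normalize=True turns counts into proportions. The series below is made up, so the numbers say nothing about the real dataset.

```python
import pandas as pd

# Made-up values standing in for DATA['Cause of death'].
causes = pd.Series(
    ["Avalanche", "Avalanche", "Fall", "Exposure"],
    name="Cause of death",
)

shares = causes.value_counts(normalize=True)  # fractions that sum to 1
top_cause = shares.idxmax()                   # label with the largest share

print(shares)
print(top_cause)
```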
Lastly, we can see which nationalities account for the most deaths over the years using the code below.
sns.catplot(x='Nationality',kind='count',data=DATA,height=10,aspect=20/10)
plt.xticks(rotation=90)
plt.show()

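Of all the nationalities, climbers from Nepal account for the highest number of deaths here. Note that this is a raw count, not a death rate, since the dataset does not tell us how many climbers from each country attempted the climbs. You may have to dig further to understand whether cause of death and nationality are related, and whether the causes involved are preventable.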
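One simple way to start digging is a cross-tabulation of nationality against cause of death. The sketch below uses made-up rows (toy is a hypothetical name), so treat it as a template rather than a result.

```python
import pandas as pd

# Made-up rows, only to show how the table is built.
toy = pd.DataFrame({
    "Nationality": ["Nepal", "Nepal", "Nepal", "Japan", "Japan"],
    "Cause of death": ["Avalanche", "Avalanche", "Fall", "Fall", "Exposure"],
})

# Rows are nationalities, columns are causes, cells are counts
table = pd.crosstab(toy["Nationality"], toy["Cause of death"])

print(table)
```

On the real data you would pass DATA['Nationality'] and DATA['Cause of death'] instead. A table like this only shows co-occurrence, so it is a starting point for questions rather than proof of a relationship.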
Conclusion
You now have a basic workflow for exploring a new dataset: load and combine the files, inspect the columns, and plot the counts. There are many more visualizations possible as well!
Keep reading to learn more!
Thank you for reading!