Hey there, coder! Today we are going to do something different with the NetworkX library. You have probably heard of the famous YouTuber Pewdiepie and may well have watched his videos.
In this tutorial, we will learn how to visualize his channel dataset from Kaggle using the NetworkX library in Python.
Loading and Cleaning the Pewdiepie Dataset
We will load the dataset with the pandas module and its read_csv function. The idea is to connect video titles based on how similar they are to each other. To keep things simple, we will use only the first 30 titles in the dataset.
import pandas as pd
data = pd.read_csv('pewdiepie.csv')
print("Number of videos : ",data.shape[0])
data.head()
all_titles_data = list(data['title'])[:30]

We only need the video titles, so we pull them out of the dataset into a separate list. We then apply some NLP cleaning to every title so that only the cleaner, more meaningful words remain.
import re

import contractions
import nltk
from nltk import WordNetLemmatizer
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# One-time downloads of the NLTK resources used below.
# Newer NLTK releases may also need nltk.download('punkt_tab').
nltk.download('punkt')
nltk.download('stopwords')
nltk.download('wordnet')

lemma = WordNetLemmatizer()
# Build the stopword set once instead of reloading the list for every word.
stop_words = set(stopwords.words())

def apply_NLP(x):
    x = contractions.fix(x)          # expand contractions such as "don't"
    x = x.lower()
    x = re.sub(r'\d+', '', x)        # drop digits
    x = re.sub(r'[^\w\s]', '', x)    # drop punctuation
    x = word_tokenize(x)
    x = [w for w in x if w not in stop_words]
    x = [lemma.lemmatize(w, pos="v") for w in x]
    x = [lemma.lemmatize(w, pos="n") for w in x]
    # Filter into a new list; removing items while iterating skips elements.
    x = [w for w in x if len(w) > 1]
    return ' '.join(x)

all_titles_data = [apply_NLP(title) for title in all_titles_data]
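To see what the regular-expression steps of apply_NLP do on their own, here is a minimal sketch that uses only the standard library. The sample title and the helper name basic_clean are made up for illustration, and the contraction, stopword, and lemmatization steps are left out so the snippet runs without any NLTK downloads.

```python
import re

def basic_clean(title):
    """Hypothetical helper: only the regex-based steps of the cleaning."""
    title = title.lower()
    title = re.sub(r'\d+', '', title)        # drop digits
    title = re.sub(r'[^\w\s]', '', title)    # drop punctuation
    words = title.split()                    # simple whitespace tokenization
    words = [w for w in words if len(w) > 1] # drop single-character tokens
    return ' '.join(words)

# Made-up title, not taken from the dataset.
cleaned = basic_clean("I Bought 100 Chairs!!! (Not Clickbait)")
print(cleaned)  # bought chairs not clickbait
```

Note that "not" survives here only because this sketch skips stopword removal; the full pipeline would normally drop it.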
Creating a Similarity Matrix
With the cleaned titles in hand, our next goal is to measure how similar two titles are. We do this by counting the words they have in common.
import numpy as np

def get_common(x, y):
    # Number of distinct words the two cleaned titles share.
    # split() without arguments yields no tokens for an empty title.
    return len(set(x.split()) & set(y.split()))

size = len(all_titles_data)
Matrix = np.zeros((size, size))
for i in range(size):
    for j in range(size):
        if i != j:
            Matrix[i][j] = get_common(all_titles_data[i], all_titles_data[j])
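As a quick sanity check of the shared-word idea, the sketch below builds the same kind of matrix for three made-up cleaned titles using plain Python lists. The titles and the name shared_word_count are assumptions for illustration only; they do not come from the dataset.

```python
def shared_word_count(a, b):
    """Count the distinct words that appear in both titles."""
    return len(set(a.split()) & set(b.split()))

# Hypothetical cleaned titles.
titles = ["minecraft house build", "minecraft epic house", "react meme review"]

n = len(titles)
matrix = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        # The count is symmetric, so fill both halves at once.
        matrix[i][j] = matrix[j][i] = shared_word_count(titles[i], titles[j])

for row in matrix:
    print(row)
# [0, 2, 0]
# [2, 0, 0]
# [0, 0, 0]
```

The first two titles share "minecraft" and "house", so their entry is 2, while the third title shares nothing with the others and would end up as an isolated node in the network.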
Creating Network for the Channel Data
In this step, we generate the network from the similarity matrix: each title becomes a node, and two nodes are connected whenever their titles share at least one word. The code is below.
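import networkx as nx
import matplotlib.pyplot as plt
# The 'seaborn' style was renamed 'seaborn-v0_8' in Matplotlib 3.6;
# use 'seaborn' on older versions.
plt.style.use('seaborn-v0_8')

G = nx.Graph()
G.add_nodes_from(range(size))      # one node per title, even if it has no edges
for i in range(size):
    for j in range(i + 1, size):   # the matrix is symmetric, so visit each pair once
        if Matrix[i][j] > 0:
            # Store the shared-word count as the edge's 'value' attribute.
            G.add_edge(i, j, value=int(Matrix[i][j]))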
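Before visualizing, it can help to inspect the graph with a few NetworkX calls. The sketch below builds a tiny graph from a made-up 4x4 similarity matrix (the values and the name G_demo are assumptions, not results from the dataset) and reports its edge count, node degrees, and isolated nodes.

```python
import networkx as nx

# Hypothetical symmetric similarity matrix for four titles.
similarity = [
    [0, 2, 0, 0],
    [2, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
]

G_demo = nx.Graph()
G_demo.add_nodes_from(range(len(similarity)))
for i in range(len(similarity)):
    for j in range(i + 1, len(similarity)):
        if similarity[i][j] > 0:
            G_demo.add_edge(i, j, value=similarity[i][j])

edge_count = G_demo.number_of_edges()
degrees = dict(G_demo.degree())
isolated = list(nx.isolates(G_demo))

print(edge_count)  # 2
print(degrees)     # {0: 1, 1: 2, 2: 1, 3: 0}
print(isolated)    # [3]
```

Running the same three calls on G shows how many title pairs are connected and which titles share no words with any other.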
Using NetworkX Library to Visualize the Social Network
As the final step, we will visualize the channel's network graph interactively. The net.Network class used below comes from the pyvis library, which can load a NetworkX graph directly through from_nx. To make the graph more interactive, I added the physics control buttons as well.
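from pyvis import network as net
from IPython.display import display, HTML

# notebook=True assumes the code runs inside a Jupyter notebook.
g = net.Network(height='400px', width='100%', heading='Pewdiepie Channel Network Graph',
                bgcolor='black', font_color="white", notebook=True)
g.from_nx(G)
g.show_buttons(filter_=['physics'])   # must be set before the HTML file is written
g.show('karate.html')
display(HTML(filename='karate.html'))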
Conclusion
Congratulations! You just learned how to work with a real dataset, build a title-similarity network from it, and visualize that network as an interactive graph.
Thank you for reading! Hope you like it!