Pewdiepie Dataset along with NetworkX Library


Hey there coder! Today we are going to do something different using the NetworkX library. I am sure you have heard of the famous YouTuber Pewdiepie and have probably watched some of his videos as well.

In this tutorial, we will learn how to visualize his channel dataset, which is available on Kaggle, as a network graph using the NetworkX and Pyvis libraries in Python.



Loading and Cleaning the Pewdiepie Dataset

We will load the dataset with the pandas module and its read_csv function. The idea is to connect the video titles to each other on the basis of how similar they are. To keep things simple, we will use only the first 30 titles in the dataset.

import pandas as pd
data = pd.read_csv('pewdiepie.csv')
print("Number of videos : ",data.shape[0])
data.head()
all_titles_data = list(data['title'])[:30]

[Figure: Pewdiepie Dataset]

Since we only need the titles of the videos, we take them out of the dataset separately. We then apply a few NLP preprocessing steps to every title so that only the cleaner, more meaningful words remain.

import contractions
import re

import nltk
nltk.download('punkt')
from nltk.tokenize import word_tokenize

from nltk.corpus import stopwords
nltk.download('stopwords')

from nltk import WordNetLemmatizer
nltk.download('wordnet')
lemma = WordNetLemmatizer()

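# Build the stop-word set once (all languages in the NLTK corpus) for fast lookups
stop_words = set(stopwords.words())

def apply_NLP(x):
  x = contractions.fix(x)            # expand contractions such as "don't"
  x = x.lower()
  x = re.sub(r'\d+', '', x)          # remove digits
  x = re.sub(r'[^\w\s]', '', x)      # remove punctuation
  x = word_tokenize(x)
  x = [w for w in x if w not in stop_words]
  x = [lemma.lemmatize(w, pos="v") for w in x]
  x = [lemma.lemmatize(w, pos="n") for w in x]
  # Drop single-character tokens
  x = [w for w in x if len(w) > 1]
  return ' '.join(x)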

for i in range(len(all_titles_data)):
  all_titles_data[i] = apply_NLP(all_titles_data[i])
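
To get a feel for what the cleaning step does to a title, here is a simplified sketch that needs no extra packages. It is not the apply_NLP function above: it skips contraction expansion and lemmatization, and the small SAMPLE_STOPWORDS set, the simple_clean helper, and the sample title are all made up for illustration.

```python
import re

# Hypothetical, hand-picked stop words; the NLTK list is far longer.
SAMPLE_STOPWORDS = {"i", "a", "the", "to", "in", "my", "and", "of", "for"}

def simple_clean(title):
    """Lowercase, strip digits and punctuation, drop stop words and 1-letter tokens."""
    title = title.lower()
    title = re.sub(r"\d+", "", title)       # remove digits
    title = re.sub(r"[^\w\s]", "", title)   # remove punctuation
    tokens = title.split()
    tokens = [t for t in tokens if t not in SAMPLE_STOPWORDS and len(t) > 1]
    return " ".join(tokens)

sample_title = "I Played Minecraft for 24 Hours!"   # made-up title
cleaned = simple_clean(sample_title)
print(cleaned)   # played minecraft hours
```

The real pipeline would additionally reduce "played" to "play" through lemmatization, which makes it easier for two titles to end up sharing a word.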

Creating a Similarity Matrix

With the cleaned titles in hand, our next goal is to measure how similar two titles are. We do this by counting the distinct words that the two titles have in common.

def get_common(x, y):
  # split() without arguments ignores empty strings, so two empty titles share nothing
  x = x.split()
  y = y.split()
  return len(set(x) & set(y))

import numpy as np
size = len(all_titles_data)
Matrix = np.zeros((size,size))
for i in range(size):
  for j in range(size):
    if(i!=j):
      Matrix[i][j] = get_common(all_titles_data[i],all_titles_data[j])
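
As a small worked example of the same idea, the sketch below builds the similarity matrix for three made-up, already-cleaned titles. It uses plain Python lists instead of NumPy so that it runs without extra packages, and it fills only the upper triangle and mirrors it, since the number of shared words does not depend on the order of the two titles. The toy_titles and toy_matrix names are illustrative only.

```python
def get_common(x, y):
    """Count the distinct words two cleaned titles share."""
    return len(set(x.split()) & set(y.split()))

# Made-up, already-cleaned titles used only for illustration.
toy_titles = ["play minecraft hour", "minecraft house build", "react meme"]

n = len(toy_titles)
toy_matrix = [[0] * n for _ in range(n)]
for i in range(n):
    for j in range(i + 1, n):
        shared = get_common(toy_titles[i], toy_titles[j])
        toy_matrix[i][j] = shared
        toy_matrix[j][i] = shared   # similarity is symmetric

for row in toy_matrix:
    print(row)
# [0, 1, 0]
# [1, 0, 0]
# [0, 0, 0]
```

Only the first two titles share a word ("minecraft"), so they are the only pair with a non-zero entry.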

Creating Network for the Channel Data

In this step, we build the network from the similarity matrix: every title becomes a node, and two titles are joined by an edge whenever they share at least one word. The code for the same is below.

import networkx as nx

G = nx.Graph()

for i in range(size):
  G.add_node(i)

for i in range(size):
  for j in range(i + 1, size):   # the matrix is symmetric, so visit each pair once
    if Matrix[i][j] > 0:
      G.add_edge(i, j, value=Matrix[i][j])
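
Before drawing anything, it can help to check which pairs will actually become edges. The sketch below turns a small, made-up similarity matrix into a weighted edge list, that is, the same (i, j, value) triples that the loop above passes to G.add_edge, sorted with the strongest connection first. The matrix_to_edges helper and the toy values are hypothetical and not part of NetworkX.

```python
def matrix_to_edges(matrix):
    """Return (i, j, weight) for every pair with a positive similarity."""
    edges = []
    size = len(matrix)
    for i in range(size):
        for j in range(i + 1, size):   # each undirected pair once
            if matrix[i][j] > 0:
                edges.append((i, j, matrix[i][j]))
    # Strongest connections first
    return sorted(edges, key=lambda e: e[2], reverse=True)

# Made-up symmetric similarity matrix for four titles.
toy_matrix = [
    [0, 1, 0, 2],
    [1, 0, 0, 0],
    [0, 0, 0, 0],
    [2, 0, 0, 0],
]
edges = matrix_to_edges(toy_matrix)
print(edges)   # [(0, 3, 2), (0, 1, 1)]
```

Title 2 shares no words with any other title, so it has no edges and would show up in the graph as an isolated node.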

Using Pyvis to Visualize the Social Network

As the final step, we will visualize the channel network graph. The graph itself was built with NetworkX; for the interactive view we hand it over to the Pyvis library, which can import a NetworkX graph directly. The code for the same is below. To increase the interactivity, I added the physics buttons as well.

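from pyvis import network as net
from IPython.display import display, HTML

# notebook=True is assumed here because the result is displayed inline
g = net.Network(height='400px', width='100%', heading='Pewdiepie Channel Network Graph', bgcolor='black', font_color="white", notebook=True)
g.from_nx(G)
g.show_buttons(filter_=['physics'])   # configure the buttons before the HTML file is written
g.show('karate.html')
display(HTML(filename='karate.html'))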

Conclusion

Congratulations! You just learned how to clean a real dataset, build a network graph from it with NetworkX, and visualize the channel data as an interactive graph.

Thank you for reading! Hope you liked it!

