Hierarchical Clustering
A summary of the lecture "Cluster Analysis in Python", via DataCamp
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
Basics of hierarchical clustering
 Creating a distance matrix using linkage

 method: how to calculate the proximity of clusters
 metric: distance metric used between observations
 optimal_ordering: whether to reorder the result so that the distance between successive leaves is minimal
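The parameters above can be tried on a tiny array. This is a minimal sketch with hypothetical toy points (not part of the course data). Note that the array returned by linkage() is a linkage matrix describing the merges, even though these notes store it in a variable called distance_matrix.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical toy data: two tight pairs of 2-D points, far apart from each other
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])

# method: how cluster proximity is computed; metric: distance between observations;
# optimal_ordering: whether to reorder leaves (False is the default)
Z = linkage(points, method='single', metric='euclidean', optimal_ordering=False)

# One row per merge, so n - 1 rows for n points. The four columns are:
# the two merged cluster ids, the merge distance, and the size of the new cluster
print(Z.shape)
print(Z)
```

Each row is one merge step, and the last row is the final merge that joins everything into a single cluster.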

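 Types of methods
 single: based on the two closest objects
 complete: based on the two farthest objects
 average: based on the arithmetic mean of the distances between all objects
 centroid: based on the distance between cluster centroids (the mean of each cluster's objects)
 median: like centroid, but the centroid of a merged cluster is the midpoint of the two merged centroids
 ward: based on the increase in the within-cluster sum of squares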
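To get a feel for how the methods listed above differ, the following sketch runs each of them on the same synthetic data and prints the height of the final merge. The two blobs are an assumption made for illustration and are generated at random; they are not the course data.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage

# Hypothetical data: two well-separated 2-D blobs of 20 points each
rng = np.random.default_rng(0)
data = np.vstack([
    rng.normal(loc=0.0, scale=0.5, size=(20, 2)),
    rng.normal(loc=4.0, scale=0.5, size=(20, 2)),
])

final_heights = {}
for method in ['single', 'complete', 'average', 'centroid', 'median', 'ward']:
    Z = linkage(data, method=method, metric='euclidean')
    # Column 2 of the last row is the distance at which the last two clusters merge
    final_heights[method] = Z[-1, 2]
    print(f"{method:>8}: last merge at distance {Z[-1, 2]:.3f}")
```

On data like this, where the last merge joins the two blobs, single reports the smallest gap between them, complete the largest, and average something in between.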
Hierarchical clustering: ward method
It is time for ComicCon! ComicCon is an annual comic-based convention held in major cities around the world. You have last year's footfall data: the number of people at the convention ground at a given time. You would like to decide the location of your stall to maximize sales. Using the ward method, apply hierarchical clustering to find the two points of attraction in the area.
 Preprocess
comic_con = pd.read_csv('./dataset/comic_con.csv', index_col=0)
comic_con.head()
from scipy.cluster.vq import whiten
comic_con['x_scaled'] = whiten(comic_con['x_coordinate'])
comic_con['y_scaled'] = whiten(comic_con['y_coordinate'])
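As a quick check of what whiten() does in the preprocessing step above, this sketch compares it with dividing each column by its standard deviation. The coordinates are hypothetical values chosen for illustration, not rows from comic_con.csv.

```python
import numpy as np
from scipy.cluster.vq import whiten

# Hypothetical x/y coordinates on different scales
coords = np.array([[17.0, 4.0], [20.0, 6.0], [35.0, 0.0], [14.0, 0.0], [37.0, 4.0]])

scaled = whiten(coords)

# whiten() rescales each column by its standard deviation (it does not subtract the mean)
manual = coords / coords.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.std(axis=0))  # each column now has unit standard deviation
```

Scaling this way keeps one coordinate from dominating the Euclidean distances used by linkage().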
from scipy.cluster.hierarchy import linkage, fcluster
# Use linkage() with the ward method to compute the linkage matrix
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method='ward', metric='euclidean')
# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con);
# Repeat the clustering with the single method
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method='single', metric='euclidean')
# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con);
# Repeat the clustering with the complete method
distance_matrix = linkage(comic_con[['x_scaled', 'y_scaled']], method='complete', metric='euclidean')
# Assign cluster labels
comic_con['cluster_labels'] = fcluster(distance_matrix, 2, criterion='maxclust')
# Plot clusters
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con);
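# Map each cluster label to a color for the pandas/matplotlib plot
colors = {1:'red', 2:'blue'}
# Plot the scatter plot
comic_con.plot.scatter(x='x_scaled', y='y_scaled', c=comic_con['cluster_labels'].apply(lambda x: colors[x]));
# The same plot with seaborn, which colors the points by label through hue
sns.scatterplot(x='x_scaled', y='y_scaled', hue='cluster_labels', data=comic_con)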
from scipy.cluster.hierarchy import dendrogram
# Create a dendrogram
dn = dendrogram(distance_matrix)
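One way to read a dendrogram like the one above: the height of each link is the merge distance, and cutting the tree at a chosen height gives the clusters. The sketch below uses hypothetical toy points and no_plot=True so that it runs without a display. The cut height, placed halfway between the last two merges, is an illustrative choice and not something the course prescribes.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster, dendrogram

# Hypothetical toy data: two tight pairs of points
points = np.array([[0.0, 0.0], [0.0, 1.0], [5.0, 5.0], [5.0, 6.0]])
Z = linkage(points, method='ward', metric='euclidean')

# no_plot=True returns the dendrogram layout as a dict instead of drawing it
info = dendrogram(Z, no_plot=True)
print(info['leaves'])  # order of the observations along the x-axis

# Cut between the second-to-last and the last merge: this leaves two clusters
cut_height = 0.5 * (Z[-2, 2] + Z[-1, 2])
labels_by_height = fcluster(Z, cut_height, criterion='distance')

# Asking for two clusters directly gives the same grouping
labels_by_count = fcluster(Z, 2, criterion='maxclust')
print(labels_by_height, labels_by_count)
```

A large vertical gap between the last merges in the dendrogram is a hint that cutting there yields well-separated clusters.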
Timing run of hierarchical clustering
In earlier exercises of this chapter, you have used the data of ComicCon footfall to create clusters. In this exercise you will time how long it takes to run the algorithm on DataCamp's system.
Remember that you can time the execution of small code snippets with:
%timeit sum([1, 3, 2])
%timeit linkage(comic_con[['x_scaled', 'y_scaled']], method='ward', metric='euclidean')
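%timeit is an IPython magic, so it only works in a notebook or IPython shell. Outside of one, a similar measurement can be sketched with the standard timeit module. The random data and the chosen sizes below are assumptions for illustration; absolute timings depend on the machine.

```python
import timeit
import numpy as np
from scipy.cluster.hierarchy import linkage

rng = np.random.default_rng(42)
timings = {}

for n in (100, 200, 400):
    # Hypothetical data: n random 2-D points
    data = rng.normal(size=(n, 2))
    # Run linkage() 5 times per repeat, keep the best of 3 repeats, report seconds per call
    best = min(timeit.repeat(
        lambda: linkage(data, method='ward', metric='euclidean'),
        number=5, repeat=3,
    ))
    timings[n] = best / 5
    print(f"n={n:>4}: {timings[n] * 1000:.3f} ms per run")
```

Run time typically grows faster than linearly with the number of points, which is why hierarchical clustering becomes slow on large datasets.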
FIFA 18: exploring defenders
In the FIFA 18 dataset, various attributes of players are present. Two such attributes are:
 sliding tackle: a number between 0 and 99 which signifies how accurately a player is able to perform sliding tackles
 aggression: a number between 0 and 99 which signifies the commitment and will of a player
These are typically high in defense-minded players. In this exercise, you will perform clustering based on these attributes in the data.
This data consists of 5000 rows, and is considerably larger than earlier datasets. Running hierarchical clustering on this data can take up to 10 seconds.
 Preprocess
fifa = pd.read_csv('./dataset/fifa_18_dataset.csv')
fifa.head()
fifa['scaled_sliding_tackle'] = whiten(fifa['sliding_tackle'])
fifa['scaled_aggression'] = whiten(fifa['aggression'])
# Fit the scaled data with hierarchical clustering (ward method)
distance_matrix = linkage(fifa[['scaled_sliding_tackle', 'scaled_aggression']], method='ward')
# Assign cluster labels to each row of data
fifa['cluster_labels'] = fcluster(distance_matrix, 3, criterion='maxclust')
# Display cluster centers of each cluster
print(fifa[['scaled_sliding_tackle', 'scaled_aggression', 'cluster_labels']].groupby('cluster_labels').mean())
# Create a scatter plot through seaborn
sns.scatterplot(x='scaled_sliding_tackle', y='scaled_aggression', hue='cluster_labels', data=fifa)
plt.savefig('../images/fifa_cluster.png')