Scikit-learn: Fewer points plotted than initial data samples after clustering with DBSCAN





I was using the DBSCAN implementation from scikit-learn when I noticed that the number of points plotted was smaller than the number of initial samples.
In particular, the official DBSCAN demo (http://scikit-learn.org/stable/auto_examples/cluster/plot_dbscan.html) generates 750 samples. However, when I print the number of points in each cluster and the number of outliers, the result is:
CLUSTER 1: 224,
CLUSTER 2: 228,
CLUSTER 3: 227,
OUTLIERS : 18,
--> TOTAL = 697
As you can see from the following code, I have only added a few lines to the original demo to print the number of points in each cluster and the number of outliers. I am confused by this and would like to know why it happens and where the missing points are.
Thanks in advance for the answers!


print(__doc__)

import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)


X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)



print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt


unique_labels = set(labels)

# Lines added to the demo: count the points per cluster and the outliers
i = 1
for k in zip(unique_labels):

    class_member_mask = (labels == k)

    if k == (-1,):
        xy = X[class_member_mask & ~core_samples_mask]
        current_outliers = len(xy)
        print("OUTLIERS :", current_outliers)
    else:
        xy = X[class_member_mask & core_samples_mask]
        print("CLUSTER", i, " :", len(xy))
        i += 1

colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    # Core samples of this label: large markers
    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    # Non-core samples of this label: small markers
    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()




1 Answer



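You are including only the core samples in your count. For each cluster, the mask class_member_mask & core_samples_mask keeps the core samples and drops the non-core (border) samples that DBSCAN still assigned to that cluster. Those are the 750 - 697 = 53 missing points. If you want all the points to be accounted for, remove the constraint on core_samples_mask: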




if k == (-1,):
    xy = X[class_member_mask]
    current_outliers = len(xy)
    print("OUTLIERS :", current_outliers)
else:
    xy = X[class_member_mask]
    print("CLUSTER", i, " :", len(xy))
    i += 1
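
To check where the points go, here is a minimal sketch that reuses the data and DBSCAN parameters from the question and splits every label into core and non-core members. The `counts` dictionary and the variable names `n_core` and `n_non_core` are only illustrative, and the import path `sklearn.datasets.make_blobs` assumes a reasonably recent scikit-learn. The exact per-cluster numbers may differ between versions, but the total should always equal the number of samples.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Same data and parameters as in the question
centers = [[1, 1], [-1, -1], [1, -1]]
X, _ = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                  random_state=0)
X = StandardScaler().fit_transform(X)

db = DBSCAN(eps=0.3, min_samples=10).fit(X)
labels = db.labels_
core_samples_mask = np.zeros_like(labels, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True

# For each label, count core and non-core members separately
counts = {}
for k in sorted(int(label) for label in set(labels)):
    class_member_mask = (labels == k)
    n_core = int(np.sum(class_member_mask & core_samples_mask))
    n_non_core = int(np.sum(class_member_mask & ~core_samples_mask))
    counts[k] = (n_core, n_non_core)
    name = "OUTLIERS" if k == -1 else "CLUSTER %d" % (k + 1)
    print(name, ": core =", n_core, ", non-core =", n_non_core)

# Every sample is a core point, a non-core cluster member, or noise
total = sum(n_core + n_non_core for n_core, n_non_core in counts.values())
print("TOTAL =", total)
```

The non-core column for the clusters is what the counting loop in the question leaves out. Points labelled -1 are never core samples, which is why the outlier count in the question was already complete.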





