[scikit-learn] 03: Using the sklearn library for unsupervised learning (clustering)

2018-05-18 15:32:48  Author: kevinelstri  Source: [link]

# -*- coding: utf-8 -*-
# ----------------------
#   Author: kevinelstri
#   Datetime: 2017.2.16
# ----------------------
# -----------------------
# Unsupervised learning: seeking representations of the data
# http://scikit-learn.org/stable/tutorial/statistical_inference/unsupervised_learning.html
# -----------------------
import numpy as np

'''
    Unsupervised learning
'''
'''
    Clustering: grouping observations together
    The problem solved in clustering:
    Given the iris dataset, suppose we know there are three types of iris but have no labels that separate them.
    We can then try a clustering task: splitting the observations into well-separated groups is called clustering.
'''
'''
    K-means clustering
    There are many different clustering criteria and associated algorithms; the simplest one is K-means.
'''
from sklearn import cluster, datasets

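iris = datasets.load_iris()  # load the dataset
x_iris = iris.data  # feature matrix
y_iris = iris.target  # true labels (used only for comparison)
print(x_iris)
print(y_iris)

k_means = cluster.KMeans(n_clusters=3)  # K-means estimator; n_clusters=3 splits the data into 3 clusters
print(k_means.fit(x_iris))  # fit the estimator, i.e. cluster the data
print(k_means.labels_[::10])  # cluster labels (the numbering is arbitrary and need not match y_iris)
print(y_iris[::10])
print('-------------------------------------------------')
'''
    Application example: vector quantization
    Clustering in general, and K-means in particular, can be seen as a way of choosing a small number of
    exemplars to compress the information. This problem is known as vector quantization.
'''
import matplotlib.pyplot as plt

try:
    from scipy.datasets import face as load_face  # SciPy >= 1.10 (needs the pooch package)
except ImportError:
    from scipy.misc import face as load_face  # older SciPy releases

face = load_face(gray=True)

plt.gray()
plt.imshow(face)
plt.show()  # show the original image

# Cluster the pixel intensities
X = face.reshape((-1, 1))
k_means = cluster.KMeans(n_clusters=5, n_init=1)  # build the estimator; n_clusters is the value of K
print(k_means.fit(X))  # cluster the data; no separate predict step is needed
values = k_means.cluster_centers_.squeeze()
labels = k_means.labels_
face_compressed = np.choose(labels, values)  # replace each pixel by its cluster centre
face_compressed.shape = face.shape
print(face_compressed)  # value of each pixel in the compressed image
print(face_compressed.shape)  # image size
plt.gray()
plt.imshow(face_compressed)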
plt.show()  # show the image after quantization
'''
    Hierarchical agglomerative clustering: Ward
    Hierarchical clustering is a classic family of cluster-analysis methods that aims to build a hierarchy
    of clusters. The approaches generally fall into two kinds:
    Agglomerative (bottom-up): each observation starts in its own cluster, and clusters are merged step by step.
    Divisive (top-down): all observations start in one cluster, which is split recursively.
'''
'''
    Connectivity-constrained clustering:

'''
import matplotlib.pyplot as plt
from scipy import ndimage
from sklearn.feature_extraction.image import grid_to_graph
from sklearn.cluster import AgglomerativeClustering

try:
    from scipy.datasets import face as load_face  # SciPy >= 1.10 (needs the pooch package)
except ImportError:
    from scipy.misc import face as load_face  # older SciPy releases

face = load_face(gray=True)
# scipy.misc.imresize no longer exists in current SciPy; scipy.ndimage.zoom is used here
# as a stand-in to downsample to 10% (its interpolation is not identical).
face = ndimage.zoom(face, 0.10) / 255.  # values end up approximately in [0, 1]
plt.gray()
plt.imshow(face)
plt.show()
'''
    Feature agglomeration: merging similar features
'''
digits = datasets.load_digits()
images = digits.images
x = np.reshape(images, (len(images), -1))  # flatten each 8x8 image into 64 features
connectivity = grid_to_graph(*images[0].shape)  # pixel adjacency graph
agglo = cluster.FeatureAgglomeration(connectivity=connectivity, n_clusters=32)
print(agglo.fit(x))
x_reduced = agglo.transform(x)  # 64 features merged into 32
x_approx = agglo.inverse_transform(x_reduced)
images_approx = np.reshape(x_approx, images.shape)
'''
    Principal component analysis (PCA): dimensionality reduction
'''
x1 = np.random.normal(size=100)
x2 = np.random.normal(size=100)
x3 = x1 + x2
X = np.c_[x1, x2, x3]
from sklearn import decomposition

pca = decomposition.PCA()  # PCA estimator
print(pca.fit(X))  # fit PCA on the data, keeping all components
print(pca.explained_variance_)  # the third value is close to zero because x3 = x1 + x2
pca.n_components = 2  # keep only the two informative components
X_reduced = pca.fit_transform(X)
print(X_reduced.shape)
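
The connectivity-constrained clustering section of the listing above imports AgglomerativeClustering and grid_to_graph but stops before clustering anything. The following is a minimal sketch of what that step could look like, not the author's code. It assumes a single 8x8 image from the bundled digits dataset in place of the face picture (so nothing has to be downloaded), and the choice of 4 clusters is arbitrary.

```python
import numpy as np
from sklearn import datasets
from sklearn.cluster import AgglomerativeClustering
from sklearn.feature_extraction.image import grid_to_graph

# Assumption: one 8x8 digits image stands in for the face picture.
image = datasets.load_digits().images[0]
X = image.reshape(-1, 1)  # one sample per pixel, one feature (intensity)

# Pixel adjacency graph: each pixel is connected to its grid neighbours.
connectivity = grid_to_graph(*image.shape)

# Ward linkage merges clusters bottom-up; the connectivity matrix only
# allows merges between neighbouring pixels. n_clusters=4 is arbitrary.
ward = AgglomerativeClustering(n_clusters=4, linkage="ward",
                               connectivity=connectivity)
ward.fit(X)

# Reshape the per-pixel labels back into the image grid.
label_image = ward.labels_.reshape(image.shape)
print(label_image)
```

Because merges are restricted to neighbouring pixels, each resulting cluster forms a contiguous region of the image, which is the point of adding the connectivity constraint.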

KMeans example:

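# -*- coding: utf-8 -*-
"""
Part 1: imports
Import the KMeans estimator from sklearn.cluster, scikit-learn's clustering package
"""
from sklearn.cluster import KMeans

"""
Part 2: dataset
X is a two-dimensional data matrix of basketball players' game statistics
21 rows in total, two columns per row
Column 1 is each player's assists per minute: assists_per_minute
Column 2 is each player's points per minute: points_per_minute
"""
X = [[0.0888, 0.5885],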
     [0.1399, 0.8291],
     [0.0747, 0.4974],
     [0.0983, 0.5772],
     [0.1276, 0.5703],
     [0.1671, 0.5835],
     [0.1906, 0.5276],
     [0.1061, 0.5523],
     [0.2446, 0.4007],
     [0.1670, 0.4770],
     [0.2485, 0.4313],
     [0.1227, 0.4909],
     [0.1240, 0.5668],
     [0.1461, 0.5113],
     [0.2315, 0.3788],
     [0.0494, 0.5590],
     [0.1107, 0.4799],
     [0.2521, 0.5735],
     [0.1007, 0.6318],
     [0.1067, 0.4326],
     [0.1956, 0.4280]
     ]
# Print the dataset
print(X)
"""
Part 3: KMeans clustering
clf = KMeans(n_clusters=3) sets the number of clusters to 3; clf holds the KMeans estimator
y_pred = clf.fit_predict(X) fits the estimator on X and assigns the cluster labels to y_pred
"""
clf = KMeans(n_clusters=3)  # clustering estimator; n_clusters=3 groups the data into 3 clusters
y_pred = clf.fit_predict(X)  # cluster the data directly; no separate predict step is needed
# Print the KMeans estimator and its parameters
print(clf)
# Print the clustering result: 21 values, one label per row of X (one per player), with labels 0, 1 and 2
print(y_pred)
"""
Part 4: visualization
Matplotlib is the Python package used for plotting
import matplotlib.pyplot as plt: "as" gives the module the short alias plt
"""
import numpy as np
import matplotlib.pyplot as plt

# Extract the first and second columns with list comprehensions; n[0] is the first column of X
x = np.array([n[0] for n in X])
print(x)
y = np.array([n[1] for n in X])
print(y)
# Scatter plot, one call per cluster so that every cluster gets its own legend entry
# Marker styles: 'o' is a circle, '*' is a star, 'x' is a cross
for k, name in enumerate(["A", "B", "C"]):
    plt.scatter(x[y_pred == k], y[y_pred == k], marker='x', label=name)
# Title
plt.title("Kmeans-Basketball Data")
# Axis labels
plt.xlabel("assists_per_minute")
plt.ylabel("points_per_minute")
# Legend in the upper-right corner
plt.legend(loc="upper right")
# Show the figure
plt.show()
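
Once the basketball data has been clustered, the fitted estimator can also report where each cluster sits and assign a cluster to a player it has not seen. The sketch below is an illustration under stated assumptions rather than part of the original example: random_state and n_init are added only to make the run repeatable, and new_player is a made-up data point.

```python
import numpy as np
from sklearn.cluster import KMeans

# The basketball data: [assists_per_minute, points_per_minute]
X = np.array([
    [0.0888, 0.5885], [0.1399, 0.8291], [0.0747, 0.4974],
    [0.0983, 0.5772], [0.1276, 0.5703], [0.1671, 0.5835],
    [0.1906, 0.5276], [0.1061, 0.5523], [0.2446, 0.4007],
    [0.1670, 0.4770], [0.2485, 0.4313], [0.1227, 0.4909],
    [0.1240, 0.5668], [0.1461, 0.5113], [0.2315, 0.3788],
    [0.0494, 0.5590], [0.1107, 0.4799], [0.2521, 0.5735],
    [0.1007, 0.6318], [0.1067, 0.4326], [0.1956, 0.4280],
])

# Assumption: random_state and n_init are set only for reproducibility.
clf = KMeans(n_clusters=3, n_init=10, random_state=0)
y_pred = clf.fit_predict(X)

# One centroid per cluster, in the same two-column layout as X.
print(clf.cluster_centers_)

# Hypothetical new player, not part of the original data.
new_player = np.array([[0.15, 0.55]])
print(clf.predict(new_player))  # index of the nearest centroid
```

The cluster indices themselves carry no meaning; only the grouping does, so the centroids are the more useful thing to inspect when describing what each cluster represents.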