
Python Data Analysis and Mining: Clustering Algorithms

Published: 2020-07-25 09:33:49 | Source: Web | Views: 929 | Author: nineteens | Category: Programming Languages

Chapter 5: Data Modeling

(I) Cluster Analysis

1. Main Methods

2. Distance Analysis

Similarity between samples is measured with distance metrics; the same measures are used for document similarity.
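As a minimal illustration (my addition, not from the original notes), two metrics commonly used for sample and document similarity, computed with SciPy:

import numpy as np
from scipy.spatial import distance

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 2.0, 1.0])

print('Euclidean distance:', distance.euclidean(a, b))  # straight-line distance
print('Cosine distance:', distance.cosine(a, b))  # 1 - cosine similarity; angle-based, a common choice for documents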

3. K-means Clustering

# -*- coding: utf-8 -*-
# Cluster consumer-behavior data with the K-means algorithm
import pandas as pd
from sklearn.cluster import KMeans

# Parameter initialization
inputfile = '../data/consumption_data.xls'  # sales and other attribute data
outputfile = '../tmp/data_type.xls'  # file the results are saved to
k = 3  # number of clusters
iteration = 500  # maximum number of iterations

data = pd.read_excel(inputfile, index_col='Id')  # read the data
data_zs = 1.0 * (data - data.mean()) / data.std()  # standardize the data (z-scores)

model = KMeans(n_clusters=k, max_iter=iteration)  # cluster into k classes (the old n_jobs argument was removed from recent scikit-learn)
model.fit(data_zs)  # run the clustering

# Print a brief summary of the result
r1 = pd.Series(model.labels_).value_counts()  # number of samples in each cluster
r2 = pd.DataFrame(model.cluster_centers_)  # the cluster centers
r = pd.concat([r2, r1], axis=1)  # horizontal concatenation (axis=0 would be vertical): centers plus per-cluster counts
r.columns = list(data.columns) + [u'類別數(shù)目']  # rename the header
print(r)

# Output every original sample together with its cluster label
r = pd.concat([data, pd.Series(model.labels_, index=data.index)],
              axis=1)  # append each sample's cluster label
r.columns = list(data.columns) + [u'聚類類別']  # rename the header
r.to_excel(outputfile)  # save the result

def density_plot(data):  # custom plotting function
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly
    p = data.plot(kind='kde', linewidth=2, subplots=True, sharex=False)
    [p[i].set_ylabel(u'密度') for i in range(len(data.columns))]  # one subplot per column (the original used range(k), which only works when k equals the column count)
    plt.legend()
    return plt

pic_output = '../tmp/pd_'  # filename prefix for the probability density plots
for i in range(k):
    density_plot(data[r[u'聚類類別'] == i]).savefig(u'%s%s.png' % (pic_output, i))

# Visualize the clusters with t-SNE
# -*- coding: utf-8 -*-
# continues from k_means.py
from sklearn.manifold import TSNE

tsne = TSNE()
tsne.fit_transform(data_zs)  # reduce the standardized data to two dimensions
tsne = pd.DataFrame(tsne.embedding_, index=data_zs.index)  # wrap the embedding in a DataFrame

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # render Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # render minus signs correctly

# Draw each cluster with its own color and marker
d = tsne[r[u'聚類類別'] == 0]
plt.plot(d[0], d[1], 'r.')
d = tsne[r[u'聚類類別'] == 1]
plt.plot(d[0], d[1], 'go')
d = tsne[r[u'聚類類別'] == 2]
plt.plot(d[0], d[1], 'b*')
plt.show()

Evaluating clustering algorithms (textbook p. 111); a small sketch of the first two measures follows the list.

(1) Purity

(2) Rand Index (RI)

(3) F-measure
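A small sketch of the first two measures (my addition, not from the book; it assumes ground-truth labels are available). Purity is computed from the contingency matrix; for RI, scikit-learn's adjusted variant is used:

import numpy as np
from sklearn import metrics

def purity_score(y_true, y_pred):
    # Purity: for each predicted cluster, count its most frequent true
    # label, sum those counts, and divide by the number of samples.
    contingency = metrics.cluster.contingency_matrix(y_true, y_pred)
    return np.sum(np.amax(contingency, axis=0)) / np.sum(contingency)

y_true = [0, 0, 0, 1, 1, 1, 2, 2]
y_pred = [0, 0, 1, 1, 1, 1, 2, 2]
print('Purity =', purity_score(y_true, y_pred))
print('Adjusted Rand Index =', metrics.adjusted_rand_score(y_true, y_pred))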

4. Mean Shift

Unlike k-means, the mean shift algorithm determines the number of clusters automatically. Like k-means, it moves each center using the mean of the data points within a set.

Core of the algorithm: the key operation is computing the center's shift vector from the change in data density inside a region of interest, then moving the center by that vector for the next iteration, until the density maximum is reached (the center stops moving). This procedure can be started from every data point; along the way, the algorithm counts how often each data point falls inside a region of interest, and this count ultimately serves as the basis for cluster assignment.

In the mean shift algorithm, the bandwidth is a key parameter.
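To make the shift operation concrete, here is a rough from-scratch sketch (my addition, not from the book) of iterating a single center with a flat kernel; all names are illustrative:

import numpy as np

def mean_shift_step(center, X, bandwidth):
    # Flat kernel: average all points that lie within `bandwidth`
    # of the current center; the returned mean is the new center.
    neighbors = X[np.linalg.norm(X - center, axis=1) < bandwidth]
    return neighbors.mean(axis=0)

rng = np.random.RandomState(0)
X = rng.normal(0, 1, (100, 2))
center = X[0]
for _ in range(50):  # iterate until the center stops moving (a density peak)
    new_center = mean_shift_step(center, X, bandwidth=1.0)
    if np.linalg.norm(new_center - center) < 1e-6:
        break
    center = new_center
print('Converged center:', center)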

Source: AI with Python (mean_shift.py)

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

# Load data from input file
X = np.loadtxt('../code/data_clustering.txt', delimiter=',')

# Estimate the bandwidth of X
bandwidth_X = estimate_bandwidth(X, quantile=0.1, n_samples=len(X))

# Cluster data with MeanShift
meanshift_model = MeanShift(bandwidth=bandwidth_X, bin_seeding=True)
meanshift_model.fit(X)

# Extract the centers of clusters
cluster_centers = meanshift_model.cluster_centers_
print('\nCenters of clusters:\n', cluster_centers)

# Estimate the number of clusters
labels = meanshift_model.labels_
num_clusters = len(np.unique(labels))
print("\nNumber of clusters in input data =", num_clusters)

# Plot the points and cluster centers
plt.figure()
markers = 'o*xvs'
for i, marker in zip(range(num_clusters), markers):
    # Plot points that belong to the current cluster
    plt.scatter(X[labels == i, 0], X[labels == i, 1], marker=marker, color='black')
    # Plot the cluster center
    cluster_center = cluster_centers[i]
    plt.plot(cluster_center[0], cluster_center[1], marker='o',
             markerfacecolor='black', markeredgecolor='black',
             markersize=15)
plt.title('Clusters')
plt.show()

5. GMM

The GMM approach mainly uses the EM algorithm to estimate the parameters of a Gaussian mixture model, and then clusters the data according to the resulting probabilities.

The GMM distribution

A Gaussian mixture assumes that the population distribution is a blend of several different Gaussian distributions, each carrying its own weight.
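In standard notation (my addition, not from the original notes), the mixture density is

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,

where the weights \pi_k are the mixing proportions that EM estimates together with the means \mu_k and covariances \Sigma_k.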

An intuitive comparison of GMM and K-means

Finally, we compare the steps of the two algorithms.

GMM:

first compute every data point's responsibility to each mixture component

update each component's parameters from those responsibilities

iterate

K-means:

first compute every data point's distance to the K centers and assign each point to its nearest center

update the center positions from the assignments of the previous step (the center positions can be viewed as the model parameters)

iterate

So GMM and K-means have a great deal in common: the responsibilities in GMM play the role of the distance computation in K-means, and updating the Gaussian components from the responsibilities corresponds to recomputing the cluster centers in K-means; both iterate until they converge. The difference is that GMM reports, for each observation, the probability that it was generated by each Gaussian component, whereas K-means assigns each observation outright to a single cluster.
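A small sketch (my addition, not from the original notes) of that contrast on synthetic data: GaussianMixture.predict_proba returns soft responsibilities, while KMeans.predict returns hard labels:

import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(4, 1, (50, 2))])

km = KMeans(n_clusters=2, n_init=10).fit(X)
gmm = GaussianMixture(n_components=2, random_state=0).fit(X)

print(km.predict(X[:3]))  # hard assignment: one label per point
print(gmm.predict_proba(X[:3]))  # soft assignment: per-component probabilities; each row sums to 1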

Source: AI with Python (gmm_classifier.py)

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches
from sklearn import datasets
from sklearn.mixture import GaussianMixture  # the old sklearn GMM class has been replaced by GaussianMixture
from sklearn.model_selection import StratifiedKFold

# Load the iris dataset
iris = datasets.load_iris()
# print(iris)  # the dataset consists of a data array and a target array

# Split dataset into training and testing (80/20 split)
skf = StratifiedKFold(n_splits=5)  # split the data into 5 folds
indices = skf.split(iris.data, iris.target)  # 4 folds for training, 1 fold for testing

# Take the first fold
train_index, test_index = next(iter(indices))

# Extract training data and labels
X_train = iris.data[train_index]
y_train = iris.target[train_index]

# Extract testing data and labels
X_test = iris.data[test_index]
y_test = iris.target[test_index]

# Extract the number of classes
num_classes = len(np.unique(y_train))

# Initialize the GMM means from the per-class training means.
# GaussianMixture re-initializes means_ inside fit(), so presetting the
# attribute after construction has no effect; the preset is passed
# through the means_init parameter instead.
means_init = np.array([X_train[y_train == i].mean(axis=0)
                       for i in range(num_classes)])

# Build GMM: n_components is the number of mixture components (here one per class);
# covariance_type selects the form of the covariance matrices;
# init_params controls how the weights and covariances are initialized;
# max_iter is the number of EM iterations
classifier = GaussianMixture(n_components=num_classes, covariance_type='full',
                             means_init=means_init, init_params='kmeans',
                             max_iter=20)

# Train the GMM classifier
classifier.fit(X_train)

# Draw boundaries
plt.figure()
colors = 'bgr'
for i, color in enumerate(colors):
    # Extract eigenvalues and eigenvectors of the component covariance.
    # covariances_ (note the trailing underscore) is an attribute of
    # GaussianMixture, not a method, so it must not be called with parentheses.
    eigenvalues, eigenvectors = np.linalg.eigh(classifier.covariances_[i][:2, :2])

    # Normalize the first eigenvector
    norm_vec = eigenvectors[0] / np.linalg.norm(eigenvectors[0])

    # Extract the angle of tilt
    angle = np.arctan2(norm_vec[1], norm_vec[0])
    angle = 180 * angle / np.pi

    # Scaling factor to magnify the ellipses
    # (random value chosen to suit our needs)
    scaling_factor = 8
    eigenvalues *= scaling_factor

    # Draw the ellipse
    ellipse = patches.Ellipse(classifier.means_[i, :2],
                              eigenvalues[0], eigenvalues[1],
                              angle=180 + angle, color=color)
    axis_handle = plt.subplot(1, 1, 1)
    ellipse.set_clip_box(axis_handle.bbox)
    ellipse.set_alpha(0.6)
    axis_handle.add_artist(ellipse)

# Plot the data
colors = 'bgr'
for i, color in enumerate(colors):
    cur_data = iris.data[iris.target == i]
    plt.scatter(cur_data[:, 0], cur_data[:, 1], marker='o',
                facecolors='none', edgecolors='black', s=40,
                label=iris.target_names[i])
    test_data = X_test[y_test == i]
    plt.scatter(test_data[:, 0], test_data[:, 1], marker='s',
                facecolors='black', edgecolors='black', s=40,
                label=iris.target_names[i])

# Compute predictions for training and testing data
y_train_pred = classifier.predict(X_train)
accuracy_training = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
print('Accuracy on training data =', accuracy_training)

y_test_pred = classifier.predict(X_test)
accuracy_testing = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
print('Accuracy on testing data =', accuracy_testing)

plt.title('GMM classifier')
plt.xticks(())
plt.yticks(())
plt.show()

Open issues: the generated figure differs from the reference, and the prediction accuracy is also off; the use of eigenvalues and eigenvectors needs further study.

6. Affinity Propagation

A plain-language account of Affinity Propagation: the algorithm exchanges "responsibility" and "availability" messages between data points and selects a set of exemplars automatically, so the number of clusters need not be fixed in advance (the preference parameter steers how many exemplars emerge). The example below is an AP example found online; tested and working.

https://blog.csdn.net/notHeadache/article/details/89003044

print(__doc__)

from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs  # the old sklearn.datasets.samples_generator module has been removed

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)

# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_
n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels, metric='sqeuclidean'))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

(II) Key Points

1. Concatenating data

r = pd.concat([r2, r1], axis=1)  # horizontal concatenation (axis=0 would be vertical): cluster centers plus the per-cluster counts
r.columns = list(data.columns) + [u'類別數(shù)目']  # rename the header

2. Plotting each cluster separately

pic_output = '../tmp/pd_'  # filename prefix for the probability density plots
for i in range(k):
    density_plot(data[r[u'聚類類別'] == i]).savefig(u'%s%s.png' % (pic_output, i))

3. Standardizing the data

data_zs = 1.0 * (data - data.mean()) / data.std()  # standardize the data (z-scores)
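An equivalent route (my addition, not in the original notes) is scikit-learn's StandardScaler; note that pandas' std() uses ddof=1 while StandardScaler uses ddof=0, so the two results differ by a small constant factor:

import pandas as pd
from sklearn.preprocessing import StandardScaler

data_scaled = pd.DataFrame(StandardScaler().fit_transform(data),
                           index=data.index, columns=data.columns)  # data_scaled is an illustrative name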

4. The k-means++ algorithm

kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)  # k-means++ picks well-spread initial centers; n_init is the number of independent runs with different seeds (not the iteration count)
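For intuition, a from-scratch sketch (my addition, not scikit-learn's actual implementation) of the k-means++ seeding rule: each new center is drawn with probability proportional to the squared distance from the nearest center chosen so far:

import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.RandomState(seed)
    centers = [X[rng.randint(len(X))]]  # first center: uniform at random
    while len(centers) < k:
        # Squared distance from every point to its nearest chosen center
        d2 = np.min([np.sum((X - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(X[rng.choice(len(X), p=d2 / d2.sum())])  # D^2 sampling
    return np.array(centers)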

5. Choosing the optimal number of clusters (judged by the silhouette score)

import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans

# Load data from input file
X = np.loadtxt('../code/data_quality.txt', delimiter=',')

# Plot input data
plt.figure()
plt.scatter(X[:, 0], X[:, 1], color='black', s=80, marker='o', facecolors='none')
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
plt.title('Input data')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

# Initialize variables
scores = []
values = np.arange(2, 10)

# Iterate through the defined range of cluster counts
for num_clusters in values:  # try every value from 2 to 9
    # Train the KMeans clustering model
    kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)
    kmeans.fit(X)
    score = metrics.silhouette_score(X, kmeans.labels_, metric='euclidean',
                                     sample_size=len(X))  # silhouette score
    print("\nNumber of clusters =", num_clusters)
    print("Silhouette score =", score)
    scores.append(score)

# Plot silhouette scores
plt.figure()
plt.bar(values, scores, width=0.7, color='black', align='center')
plt.title('Silhouette score vs number of clusters')

# Extract best score and optimal number of clusters
num_clusters = np.argmax(scores) + values[0]
print('\nOptimal number of clusters =', num_clusters)
plt.show()

