Chapter 5: Data Modeling
(I) Cluster Analysis
1. Main Methods
2. Distance Measures
Similarity between samples is measured with distance metrics, for example:
Document similarity measures (a minimal sketch of two common metrics follows).
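As a minimal sketch (my own illustration, not from the source; the vectors are made up), Euclidean and cosine distances between two sample vectors can be computed like this:
# Minimal sketch (illustrative): two common distance metrics
# for measuring similarity between samples.
import numpy as np
from scipy.spatial.distance import euclidean, cosine

a = np.array([1.0, 2.0, 3.0])
b = np.array([2.0, 4.0, 6.0])
print(euclidean(a, b))  # magnitude-sensitive straight-line distance
print(cosine(a, b))     # 1 - cosine similarity; ~0 here since a and b are parallel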
3. K-means Clustering
# -*- coding: utf-8 -*-
# Cluster consumer-behaviour data with the K-means algorithm
import pandas as pd

# Parameter initialization
inputfile = '../data/consumption_data.xls'  # sales and other attribute data
outputfile = '../tmp/data_type.xls'  # file in which to save the results
k = 3  # number of clusters
iteration = 500  # maximum number of iterations

data = pd.read_excel(inputfile, index_col='Id')  # read the data
data_zs = 1.0 * (data - data.mean()) / data.std()  # standardize the data (z-score)

from sklearn.cluster import KMeans
model = KMeans(n_clusters=k, max_iter=iteration)  # k clusters (the n_jobs argument was removed in recent scikit-learn)
model.fit(data_zs)  # run the clustering

# Print a brief summary of the results
r1 = pd.Series(model.labels_).value_counts()  # count the members of each cluster
r2 = pd.DataFrame(model.cluster_centers_)  # the cluster centers
r = pd.concat([r2, r1], axis=1)  # horizontal concat (axis=0 would be vertical): centers plus member counts
r.columns = list(data.columns) + [u'類別數(shù)目']  # rename the header ('cluster size')
print(r)

# Write out the original data together with its cluster label
r = pd.concat([data, pd.Series(model.labels_, index=data.index)],
              axis=1)  # one cluster label per sample
r.columns = list(data.columns) + [u'聚類類別']  # rename the header ('cluster label')
r.to_excel(outputfile)  # save the results

def density_plot(data):  # custom plotting function
    import matplotlib.pyplot as plt
    plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
    plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly
    p = data.plot(kind='kde', linewidth=2, subplots=True, sharex=False)
    [p[i].set_ylabel(u'密度') for i in range(k)]
    plt.legend()
    return plt

pic_output = '../tmp/pd_'  # filename prefix for the density plots
for i in range(k):
    density_plot(data[r[u'聚類類別'] == i]).savefig(u'%s%s.png' % (pic_output, i))
# Visualize the clusters with t-SNE
# -*- coding: utf-8 -*-
# continues from k_means.py
from sklearn.manifold import TSNE
tsne = TSNE()
tsne.fit_transform(data_zs)  # reduce the data to two dimensions
tsne = pd.DataFrame(tsne.embedding_, index=data_zs.index)  # convert to a DataFrame

import matplotlib.pyplot as plt
plt.rcParams['font.sans-serif'] = ['SimHei']  # display Chinese labels correctly
plt.rcParams['axes.unicode_minus'] = False  # display the minus sign correctly

# Plot each cluster with its own colour and marker
d = tsne[r[u'聚類類別'] == 0]
plt.plot(d[0], d[1], 'r.')
d = tsne[r[u'聚類類別'] == 1]
plt.plot(d[0], d[1], 'go')
d = tsne[r[u'聚類類別'] == 2]
plt.plot(d[0], d[1], 'b*')
plt.show()
Evaluating clustering algorithms:
(1) Purity (a minimal sketch follows this list)
(2) Rand Index (RI)
(3) F-measure
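As a rough sketch (my own illustration, not from the source text), purity credits each cluster with its most frequent true label and divides by the total number of samples:
# Minimal purity sketch (illustrative): each cluster is credited
# with the count of its majority true label.
import numpy as np

def purity(labels_true, labels_pred):
    total = 0
    for c in np.unique(labels_pred):
        members = labels_true[labels_pred == c]
        total += np.bincount(members).max()  # majority-label count in cluster c
    return total / len(labels_true)

y_true = np.array([0, 0, 1, 1, 2, 2])
y_pred = np.array([0, 0, 1, 2, 2, 2])
print(purity(y_true, y_pred))  # 5/6 ≈ 0.833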
4. Mean Shift
Unlike K-means, the mean shift algorithm determines the number of clusters automatically. Like K-means, it moves each center using the mean of the data points within a set.
Core of the algorithm: the key operation is to compute a shift vector for the center from the change in data density inside a region of interest, then move the center accordingly for the next iteration, until the center reaches the density maximum (i.e., it stops moving). This procedure can be started from every data point, and along the way the number of times each data point appears inside the region of interest is counted; that count is used at the end to assign cluster membership.
In mean shift, the bandwidth is a key parameter.
Example source: AI with Python (mean_shift.py)
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift, estimate_bandwidth

# Load data from input file
X = np.loadtxt('../code/data_clustering.txt', delimiter=',')

# Estimate the bandwidth of X
bandwidth_X = estimate_bandwidth(X, quantile=0.1, n_samples=len(X))

# Cluster data with MeanShift
meanshift_model = MeanShift(bandwidth=bandwidth_X, bin_seeding=True)
meanshift_model.fit(X)

# Extract the centers of clusters
cluster_centers = meanshift_model.cluster_centers_
print('\nCenters of clusters:\n', cluster_centers)

# Estimate the number of clusters
labels = meanshift_model.labels_
num_clusters = len(np.unique(labels))
print("\nNumber of clusters in input data =", num_clusters)

# Plot the points and cluster centers
plt.figure()
markers = 'o*xvs'
for i, marker in zip(range(num_clusters), markers):
    # Plot points that belong to the current cluster
    plt.scatter(X[labels == i, 0], X[labels == i, 1], marker=marker, color='black')
    # Plot the cluster center
    cluster_center = cluster_centers[i]
    plt.plot(cluster_center[0], cluster_center[1], marker='o',
             markerfacecolor='black', markeredgecolor='black',
             markersize=15)
plt.title('Clusters')
plt.show()
5. The GMM Algorithm
GMM uses the EM algorithm to estimate the parameters of a Gaussian mixture model, then clusters the data according to the resulting probabilities.
The GMM distribution
A Gaussian mixture distribution assumes the population is composed of several different Gaussian distributions mixed together, each Gaussian carrying its own weight.
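For reference (the standard textbook form, not spelled out in the source), the mixture density is

p(x) = \sum_{k=1}^{K} \pi_k \, \mathcal{N}(x \mid \mu_k, \Sigma_k), \qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,

where \pi_k is the weight of component k and \mathcal{N}(x \mid \mu_k, \Sigma_k) is a Gaussian density with mean \mu_k and covariance \Sigma_k.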
A direct comparison of GMM and K-means
Finally, let us compare the steps of the two algorithms.
GMM:
first compute every data point's responsibility to each component;
then update each component's parameters from those responsibilities;
iterate.
K-means:
first compute every data point's distance to the K centers and assign the point to the nearest one;
then update the center positions from that assignment (the center positions can be viewed as the model parameters);
iterate.
Clearly GMM and K-means have much in common. The responsibility of a data point to a Gaussian component in GMM plays the role of the distance computation in K-means, and updating the Gaussian parameters from the responsibilities corresponds to recomputing the cluster centers in K-means; both reach their optimum by iterating. The difference is that GMM outputs, for each observation, the probability that it was generated by each Gaussian component, whereas K-means directly assigns each observation to a single cluster. A small sketch of this contrast follows.
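A minimal sketch of the soft-vs-hard contrast (my own illustration; the toy data are made up):
# Illustrative sketch (not from the source): GMM soft responsibilities
# vs. K-means hard assignments on the same toy data.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.mixture import GaussianMixture

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(0, 1, (50, 2)), rng.normal(5, 1, (50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(X)
km = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)

print(gmm.predict_proba(X[:3]))  # soft: a probability per component for each point
print(km.labels_[:3])            # hard: a single cluster index per point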
Example source: AI with Python (gmm_classifier.py)
import numpy as np
import matplotlib.pyplot as plt
from matplotlib import patches
from sklearn import datasets
from sklearn.mixture import GaussianMixture  # the old GMM class is replaced by GaussianMixture
from sklearn.model_selection import StratifiedKFold

# Load the iris dataset
iris = datasets.load_iris()
# print(iris)  # the dataset consists of two parts: data and target

# Split dataset into training and testing (80/20 split)
skf = StratifiedKFold(n_splits=5)  # split the data into 5 folds
indices = skf.split(iris.data, iris.target)  # 4 folds for training, 1 for testing

# Take the first fold
train_index, test_index = next(iter(indices))

# Extract training data and labels
X_train = iris.data[train_index]
y_train = iris.target[train_index]

# Extract testing data and labels
X_test = iris.data[test_index]
y_test = iris.target[test_index]

# Extract the number of classes
num_classes = len(np.unique(y_train))

# Build GMM: n_components is the number of component distributions (here
# num_classes); covariance_type selects the form of the covariance matrices;
# init_params selects how the parameters are initialized ('kmeans' or
# 'random'); max_iter is the number of EM iterations
classifier = GaussianMixture(n_components=num_classes, covariance_type='full',
                             init_params='kmeans', max_iter=20)

# Initialize the GMM means
classifier.means_ = np.array([X_train[y_train == i].mean(axis=0)
                              for i in range(num_classes)])

# Train the GMM classifier
classifier.fit(X_train)

# Draw boundaries
plt.figure()
colors = 'bgr'
for i, color in enumerate(colors):
    # Extract eigenvalues and eigenvectors; note that covariances_ is an
    # attribute of GaussianMixture, not a method, so no parentheses
    eigenvalues, eigenvectors = np.linalg.eigh(classifier.covariances_[i][:2, :2])
    # Normalize the first eigenvector
    norm_vec = eigenvectors[0] / np.linalg.norm(eigenvectors[0])
    # Extract the angle of tilt
    angle = np.arctan2(norm_vec[1], norm_vec[0])
    angle = 180 * angle / np.pi
    # Scaling factor to magnify the ellipses
    # (random value chosen to suit our needs)
    scaling_factor = 8
    eigenvalues *= scaling_factor
    # Draw the ellipse (angle is keyword-only in recent matplotlib)
    ellipse = patches.Ellipse(classifier.means_[i, :2],
                              eigenvalues[0], eigenvalues[1],
                              angle=180 + angle, color=color)
    axis_handle = plt.subplot(1, 1, 1)
    ellipse.set_clip_box(axis_handle.bbox)
    ellipse.set_alpha(0.6)
    axis_handle.add_artist(ellipse)

# Plot the data
colors = 'bgr'
for i, color in enumerate(colors):
    cur_data = iris.data[iris.target == i]
    plt.scatter(cur_data[:, 0], cur_data[:, 1], marker='o',
                facecolors='none', edgecolors='black', s=40,
                label=iris.target_names[i])
    test_data = X_test[y_test == i]
    plt.scatter(test_data[:, 0], test_data[:, 1], marker='s',
                facecolors='black', edgecolors='black', s=40,
                label=iris.target_names[i])

# Compute predictions for training and testing data
y_train_pred = classifier.predict(X_train)
accuracy_training = np.mean(y_train_pred.ravel() == y_train.ravel()) * 100
print('Accuracy on training data =', accuracy_training)

y_test_pred = classifier.predict(X_test)
accuracy_testing = np.mean(y_test_pred.ravel() == y_test.ravel()) * 100
print('Accuracy on testing data =', accuracy_testing)

plt.title('GMM classifier')
plt.xticks(())
plt.yticks(())
plt.show()
Open issues: the generated figure differs from the expected one, and the prediction accuracy is also questionable; the use of the eigenvalues and eigenvectors needs further review.
6. Affinity Propagation
A plain-language account of the Affinity Propagation clustering algorithm. The example below is an AP example found online; I have tested it and it works.
https://blog.csdn.net/notHeadache/article/details/89003044
print(__doc__)

from sklearn.cluster import AffinityPropagation
from sklearn import metrics
from sklearn.datasets import make_blobs  # samples_generator was removed in recent scikit-learn

# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=300, centers=centers, cluster_std=0.5,
                            random_state=0)

# #############################################################################
# Compute Affinity Propagation
af = AffinityPropagation(preference=-50).fit(X)
cluster_centers_indices = af.cluster_centers_indices_
labels = af.labels_

n_clusters_ = len(cluster_centers_indices)

print('Estimated number of clusters: %d' % n_clusters_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels, metric='sqeuclidean'))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt
from itertools import cycle

plt.close('all')
plt.figure(1)
plt.clf()

colors = cycle('bgrcmykbgrcmykbgrcmykbgrcmyk')
for k, col in zip(range(n_clusters_), colors):
    class_members = labels == k
    cluster_center = X[cluster_centers_indices[k]]
    plt.plot(X[class_members, 0], X[class_members, 1], col + '.')
    plt.plot(cluster_center[0], cluster_center[1], 'o', markerfacecolor=col,
             markeredgecolor='k', markersize=14)
    for x in X[class_members]:
        plt.plot([cluster_center[0], x[0]], [cluster_center[1], x[1]], col)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()
(II) Key Points
1. Concatenating data
r = pd.concat([r2, r1], axis=1)  # horizontal concat (axis=0 would be vertical): member count per cluster center
r.columns = list(data.columns) + [u'類別數(shù)目']  # rename the header
2. Plotting each cluster separately
pic_output = '../tmp/pd_'  # filename prefix for the density plots
for i in range(k):
    density_plot(data[r[u'聚類類別'] == i]).savefig(u'%s%s.png' % (pic_output, i))
3. Data standardization
data_zs = 1.0 * (data - data.mean()) / data.std()  # z-score standardization
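The same z-score standardization can also be done with scikit-learn (a sketch, not from the source; note the two differ slightly in the std denominator):
# Equivalent standardization with scikit-learn. StandardScaler divides by
# the population std (ddof=0), while pandas DataFrame.std() defaults to
# the sample std (ddof=1), so the results differ by a small factor.
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame(np.random.rand(10, 3), columns=list('abc'))
scaled = StandardScaler().fit_transform(df)  # ndarray of z-scores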
4. The k-means++ algorithm
kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)  # 'k-means++' chooses the initial centers intelligently; n_init is the number of random restarts, not the number of iterations
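A self-contained sketch (made-up data, my own illustration) comparing the default k-means++ seeding with purely random seeding:
# Sketch: k-means++ seeding typically reaches a good solution (low
# inertia) more reliably than random seeding on the same data.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.RandomState(0)
X = np.vstack([rng.normal(c, 0.5, (30, 2)) for c in (0, 4, 8)])

for init in ('k-means++', 'random'):
    km = KMeans(init=init, n_clusters=3, n_init=10, random_state=0).fit(X)
    print(init, 'inertia =', km.inertia_)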
5. Finding the optimal number of clusters (judged by the silhouette coefficient)
import numpy as np
import matplotlib.pyplot as plt
from sklearn import metrics
from sklearn.cluster import KMeans

# Load data from input file
X = np.loadtxt('../code/data_quality.txt', delimiter=',')

# Plot input data
plt.figure()
plt.scatter(X[:, 0], X[:, 1], color='black', s=80, marker='o', facecolors='none')
x_min, x_max = X[:, 0].min() - 1, X[:, 0].max() + 1
y_min, y_max = X[:, 1].min() - 1, X[:, 1].max() + 1
plt.title('Input data')
plt.xlim(x_min, x_max)
plt.ylim(y_min, y_max)
plt.xticks(())
plt.yticks(())

# Initialize variables
scores = []
values = np.arange(2, 10)

# Iterate through the defined range
for num_clusters in values:  # try every cluster count from 2 through 9
    # Train the KMeans clustering model
    kmeans = KMeans(init='k-means++', n_clusters=num_clusters, n_init=10)
    kmeans.fit(X)
    score = metrics.silhouette_score(X, kmeans.labels_, metric='euclidean',
                                     sample_size=len(X))  # silhouette coefficient
    print("\nNumber of clusters =", num_clusters)
    print("Silhouette score =", score)
    scores.append(score)

# Plot silhouette scores
plt.figure()
plt.bar(values, scores, width=0.7, color='black', align='center')
plt.title('Silhouette score vs number of clusters')

# Extract best score and optimal number of clusters
num_clusters = np.argmax(scores) + values[0]
print('\nOptimal number of clusters =', num_clusters)
plt.show()