Implementing canopy clustering in Python

Published: 2020-09-23 15:46:42 | Source: 億速云 | Views: 298 | Author: 小新 | Category: Programming languages

This article walks through an implementation of canopy clustering in Python. I found it quite practical, so I am sharing it here as a reference, and I hope you get something out of it.

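The Canopy algorithm was proposed in 2000 by Andrew McCallum, Kamal Nigam and Lyle Ungar. It serves as a preprocessing step for k-means and hierarchical clustering. A well-known weakness of k-means is that the value of k has to be tuned by hand. The elbow method and the silhouette coefficient can help settle on a final k, but both are after-the-fact checks: they need clustering results before they can say anything. Canopy works the other way around. It runs a cheap, coarse clustering up front and hands k-means both the number of initial cluster centers and the centers themselves.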

Packages used:

import math
import random
import numpy as np
from datetime import datetime
from pprint import pprint as p
import matplotlib.pyplot as plt

1. First, I define a two-dimensional dataset (two dimensions so that the result can later be plotted on a plane).

Higher-dimensional data works too. The core canopy algorithm lives in a class, so it can be called directly on data of any dimensionality, though only on small batches. For large datasets, look to Mahout and Hadoop instead.

# Randomly generate 500 two-dimensional points with coordinates in [0, 1)
dataset = np.random.rand(500, 2)
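
For repeatable experiments, the same kind of data can be drawn from a seeded generator. This is a minimal sketch, assuming NumPy's `default_rng` API; the seed value, the names `dataset_2d` and `dataset_5d`, and the 5-dimensional array are illustrative and not part of the original code.

```python
import numpy as np

# Seeded generator so every run produces the same points (the seed is arbitrary).
rng = np.random.default_rng(42)

# 500 points with coordinates in [0, 1), the same shape as `dataset` above.
dataset_2d = rng.random((500, 2))

# The same call covers the higher-dimensional case, e.g. 500 points in 5 dimensions.
dataset_5d = rng.random((500, 5))

print(dataset_2d.shape, dataset_5d.shape)
```

Only the 2-D array can be passed to the plotting code later in the article, since that code reads just the first two coordinates.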

2. Next, create a class with the following attributes:

class Canopy:
    def __init__(self, dataset):
        self.dataset = dataset
        self.t1 = 0
        self.t2 = 0

Add a method that sets t1 and t2 and checks that t1 is larger than t2:

    # Set the initial thresholds
    def setThreshold(self, t1, t2):
        if t1 > t2:
            self.t1 = t1
            self.t2 = t2
        else:
            print('t1 needs to be larger than t2!')
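
The check above only prints a message when the order is wrong, so the thresholds silently stay at 0. A stricter variant might raise instead. The sketch below is one way to do that; `validate_thresholds` is a hypothetical helper, not part of the original class, and the extra "t2 must be positive" rule is my own assumption.

```python
def validate_thresholds(t1, t2):
    """Return (t1, t2) if 0 < t2 < t1, otherwise raise ValueError."""
    if t2 <= 0:
        raise ValueError('t2 must be positive')
    if t1 <= t2:
        raise ValueError('t1 needs to be larger than t2')
    return t1, t2


print(validate_thresholds(0.6, 0.4))

try:
    validate_thresholds(0.4, 0.6)  # wrong order
except ValueError as err:
    print('rejected:', err)
```

The ordering matters because t1 is the loose radius that decides canopy membership and t2 is the tight radius that removes points from further consideration. With t2 < t1, every removed point also belongs to the canopy it was removed by.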

3. Distance computation. Distances between a center point and the other points are computed with the Euclidean distance.

    # Compute the distance with the Euclidean metric
    def euclideanDistance(self, vec1, vec2):
        return math.sqrt(((vec1 - vec2)**2).sum())
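
The method above handles one pair of points per call. As a possible alternative, the distances from one center to all points can be computed in a single NumPy call. This is a sketch only; `distances_to_center` is a hypothetical name and the sample points are made up.

```python
import math

import numpy as np


def distances_to_center(points, center):
    """Euclidean distance from `center` to every row of `points`."""
    # Broadcasting subtracts `center` from each row; norm over axis 1 gives one value per row.
    return np.linalg.norm(points - center, axis=1)


points = np.array([[0.0, 0.0], [3.0, 4.0], [1.0, 1.0]])
center = np.array([0.0, 0.0])
dist = distances_to_center(points, center)

# Same result as the per-pair formula used in euclideanDistance.
per_pair = [math.sqrt(((p - center) ** 2).sum()) for p in points]
print(dist, per_pair)
```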

4. Then write a method that picks a random index based on the current length of dataset:

    # Pick a random index based on the current length of the dataset
    def getRandIndex(self):
        return random.randint(0, len(self.dataset) - 1)
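
The `- 1` is needed because `random.randint` includes both endpoints. A small sketch of the two standard-library spellings, with an arbitrary seed and a made-up length `n`, shows the difference in how the upper bound is written:

```python
import random

random.seed(0)  # arbitrary seed, only to make the draws repeatable
n = 10          # stand-in for len(self.dataset)

# randint includes the upper bound, so it must be n - 1 to stay a valid index.
idx_inclusive = [random.randint(0, n - 1) for _ in range(1000)]

# randrange excludes the upper bound, so n can be passed directly.
idx_exclusive = [random.randrange(n) for _ in range(1000)]

print(min(idx_inclusive), max(idx_inclusive))
print(min(idx_exclusive), max(idx_exclusive))
```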

5. The core algorithm

    def clustering(self):
        if self.t1 == 0:
            print('Please set the threshold.')
        else:
            canopies = []  # holds the final grouping result
            while len(self.dataset) != 0:
                rand_index = self.getRandIndex()
                current_center = self.dataset[rand_index]  # pick a random center, call it P
                current_center_list = []  # points that belong to P's canopy
                delete_list = []  # indices to delete for P
                self.dataset = np.delete(
                    self.dataset, rand_index, 0)  # remove the chosen center P from the dataset
                for datum_j in range(len(self.dataset)):
                    datum = self.dataset[datum_j]
                    distance = self.euclideanDistance(
                        current_center, datum)  # distance from P to each remaining point
                    if distance < self.t1:
                        # closer than t1: the point joins P's canopy
                        current_center_list.append(datum)
                    if distance < self.t2:
                        delete_list.append(datum_j)  # closer than t2: mark for deletion
                # delete the marked indices from the dataset
                self.dataset = np.delete(self.dataset, delete_list, 0)
                canopies.append((current_center, current_center_list))
            return canopies
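
The same procedure can be written without the inner Python loop by using boolean masks. The following is a sketch under my own assumptions, not the author's code: `canopy_vectorized` is a hypothetical function, it uses NumPy's `default_rng` for the random choice, and it raises on bad thresholds instead of printing.

```python
import numpy as np


def canopy_vectorized(points, t1, t2, seed=None):
    """Return a list of (center, members) pairs, one pair per canopy."""
    if not 0 < t2 < t1:
        raise ValueError('expected 0 < t2 < t1')
    rng = np.random.default_rng(seed)
    remaining = np.asarray(points, dtype=float)
    canopies = []
    while len(remaining) > 0:
        # Pick a random remaining point as the center P and remove it.
        idx = rng.integers(len(remaining))
        center = remaining[idx]
        remaining = np.delete(remaining, idx, axis=0)
        # Distances from P to all remaining points in one call.
        dist = np.linalg.norm(remaining - center, axis=1)
        members = remaining[dist < t1]     # loose threshold: canopy membership
        remaining = remaining[dist >= t2]  # tight threshold: drop points close to P
        canopies.append((center, members))
    return canopies


points = np.random.default_rng(0).random((500, 2))
result = canopy_vectorized(points, t1=0.6, t2=0.4, seed=0)
print('canopies:', len(result))
```

As in the loop version, a point that lies between t2 and t1 of a center stays in the pool, so it can end up in several canopies or become a center itself later.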

To make the later visualization easier, canopies is defined here as a list. A dict would work as well.

6. The main() function

def main():
    t1 = 0.6
    t2 = 0.4
    gc = Canopy(dataset)
    gc.setThreshold(t1, t2)
    canopies = gc.clustering()
    print('Get %s initial centers.' % len(canopies))    
    #showCanopy(canopies, dataset, t1, t2)
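
As the introduction notes, the point of the canopy pass is to give k-means its k and its starting centers. The sketch below shows what that hand-off might look like with a plain NumPy version of Lloyd's iterations. `kmeans_from_seeds` and `nearest_center` are hypothetical helpers, and the `seeds` values are made up; in practice they would come from something like `[center for center, _ in canopies]`.

```python
import numpy as np


def nearest_center(points, centers):
    """Index of the closest center for every point."""
    dist = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dist.argmin(axis=1)


def kmeans_from_seeds(points, seeds, n_iter=20):
    """Plain Lloyd iterations started from the given seed centers."""
    centers = np.array(seeds, dtype=float)
    for _ in range(n_iter):
        labels = nearest_center(points, centers)
        # Move each center to the mean of its points; leave it alone if it has none.
        for k in range(len(centers)):
            assigned = points[labels == k]
            if len(assigned) > 0:
                centers[k] = assigned.mean(axis=0)
    # Final assignment against the updated centers.
    return centers, nearest_center(points, centers)


points = np.random.default_rng(0).random((500, 2))
# Made-up stand-in for the canopy centers; k is simply the number of seeds.
seeds = [[0.2, 0.2], [0.8, 0.3], [0.5, 0.8]]
centers, labels = kmeans_from_seeds(points, seeds)
print(centers.shape, np.bincount(labels, minlength=len(seeds)))
```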

Visualization code for the canopy clustering

def showCanopy(canopies, dataset, t1, t2):
    fig = plt.figure()
    sc = fig.add_subplot(111)
    colors = ['brown', 'green', 'blue', 'y', 'r', 'tan', 'dodgerblue', 'deeppink', 'orangered', 'peru', 'blue', 'y', 'r',
              'gold', 'dimgray', 'darkorange', 'peru', 'blue', 'y', 'r', 'cyan', 'tan', 'orchid', 'peru', 'blue', 'y', 'r', 'sienna']
    markers = ['*', 'h', 'H', '+', 'o', '1', '2', '3', ',', 'v', 'H', '+', '1', '2', '^',
               '<', '>', '.', '4', 'H', '+', '1', '2', 's', 'p', 'x', 'D', 'd', '|', '_']
    for i in range(len(canopies)):
        canopy = canopies[i]
        center = canopy[0]
        components = canopy[1]
        # Wrap around so that having more canopies than styles does not raise IndexError
        color = colors[i % len(colors)]
        marker = markers[i % len(markers)]
        sc.plot(center[0], center[1], marker=marker,
                color=color, markersize=10)
        t1_circle = plt.Circle(
            xy=(center[0], center[1]), radius=t1, color='dodgerblue', fill=False)
        t2_circle = plt.Circle(
            xy=(center[0], center[1]), radius=t2, color='skyblue', alpha=0.2)
        sc.add_artist(t1_circle)
        sc.add_artist(t2_circle)
        for component in components:
            sc.plot(component[0], component[1],
                    marker=marker, color=color, markersize=1.5)
    maxvalue = np.amax(dataset)
    minvalue = np.amin(dataset)
    plt.xlim(minvalue - t1, maxvalue + t1)
    plt.ylim(minvalue - t1, maxvalue + t1)
    plt.show()

The resulting plot is shown below:

[Figure: canopy clustering result]

That is all for implementing canopy clustering in Python. I hope the material above is helpful and that you picked up something new.
