How to Implement the K-Nearest Neighbors Algorithm in Python

Published: 2022-09-09 10:06:38 · Source: Yisu Cloud (億速云) · Author: iii · Category: Development Technology

This article explains how to implement the K-nearest neighbors algorithm in Python. It covers how the algorithm works, then walks through three worked examples: classifying a movie, judging matches on a dating site, and recognizing handwritten digits.

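1. Introduction

The k-nearest neighbors algorithm (K-Nearest Neighbour algorithm), also called KNN, is one of the simplest algorithms in data mining in terms of its underlying principle.

How it works: given a training dataset whose class labels are known, when new unlabeled data comes in, find the k instances in the training set that are closest to it. If the majority of those k instances belong to one class, the new data is assigned to that class. Put simply, the k points closest to X vote on which class X belongs to.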

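2. Steps of the k-nearest neighbors algorithm

(1) Compute the distance between each point in the labeled dataset and the current point;

(2) Sort the points by increasing distance;

(3) Select the k points closest to the current point;

(4) Count how often each class occurs among those k points;

(5) Return the most frequent class among the k points as the predicted class of the current point.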
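The five steps above can be sketched in plain Python before bringing in pandas. This is a minimal illustration rather than the article's own implementation: the function name `knn_classify` and the four toy points are assumptions made up for the example.

```python
import math
from collections import Counter


def knn_classify(train_points, train_labels, query, k):
    """Classify `query` by majority vote among its k nearest training points."""
    # (1) distance from every labeled point to the query
    distances = [math.dist(p, query) for p in train_points]
    # (2) sort by increasing distance, (3) keep the k closest
    nearest = sorted(zip(distances, train_labels), key=lambda pair: pair[0])[:k]
    # (4) count how often each class appears among the k neighbors
    votes = Counter(label for _, label in nearest)
    # (5) return the most frequent class
    return votes.most_common(1)[0][0]


# Hypothetical toy data: two points per class
points = [(1.0, 1.1), (1.0, 1.0), (0.0, 0.0), (0.0, 0.1)]
labels = ["A", "A", "B", "B"]
print(knn_classify(points, labels, (0.1, 0.2), k=3))  # prints B
```

`math.dist` requires Python 3.8 or later.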

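3. Python implementation

The task: decide which genre (comedy, action, or romance) an unlabeled movie belongs to, based on how many funny, hugging, and fight shots it contains.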

Index	Movie title	Funny shots	Hugging shots	Fight shots	Genre
0	功夫熊貓	39	0	31	Comedy
1	葉問3	3	2	65	Action
2	倫敦陷落	2	3	55	Action
3	代理情人	9	38	2	Romance
4	新步步驚心	8	34	17	Romance
5	諜影重重	5	2	57	Action
6	功夫熊貓	39	0	31	Comedy
7	美人魚	21	17	5	Comedy
8	寶貝當家	45	2	9	Comedy
9	唐人街探案	23	3	17	?

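Euclidean distance

For two points $x$ and $y$ with $n$ features each, the Euclidean distance is $d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.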
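As a worked example of the Euclidean distance, the snippet below measures how far the unlabeled movie 唐人街探案 (23, 3, 17) is from 美人魚 (21, 17, 5), using the feature values from the table. The variable names are illustrative only.

```python
import math

# Feature order: funny shots, hugging shots, fight shots
unknown = (23, 3, 17)   # 唐人街探案
mermaid = (21, 17, 5)   # 美人魚

# Square each per-feature difference, sum them, take the square root
by_hand = math.sqrt(sum((a - b) ** 2 for a, b in zip(unknown, mermaid)))
print(round(by_hand, 2))  # prints 18.55

# The standard library computes the same quantity (Python 3.8+)
print(math.isclose(by_hand, math.dist(unknown, mermaid)))  # prints True
```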

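Build the dataset

import pandas as pd

# Training data: one entry per labeled movie
rowdata = {
    "Movie title": ['功夫熊貓', '葉問3', '倫敦陷落', '代理情人', '新步步驚心', '諜影重重', '功夫熊貓', '美人魚', '寶貝當家'],
    "Funny shots": [39,3,2,9,8,5,39,21,45],
    "Hugging shots": [0,2,3,38,34,2,0,17,2],
    "Fight shots": [31,65,55,2,17,57,31,5,9],
    "Genre": ["Comedy", "Action", "Action", "Romance", "Romance", "Action", "Comedy", "Comedy", "Comedy"]
}
movie_data = pd.DataFrame(rowdata)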

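Compute the distance between each point in the labeled dataset and the current point

# Features of the unlabeled movie 唐人街探案: funny, hugging, and fight shots
new_data = [23, 3, 17]
# Euclidean distance to every labeled movie (feature columns 1 to 3)
dist = list((((movie_data.iloc[:, 1:4] - new_data) ** 2).sum(1)) ** 0.5)

Sort the distances in ascending order, then select the k points with the smallest distances (the choice of k affects how prone the model is to overfitting; the author leaves that topic for a later column)

k = 4
# Pair each distance with the genre label stored in column 4
dist_l = pd.DataFrame({'dist': dist, 'labels': movie_data.iloc[:, 4]})
dr = dist_l.sort_values(by='dist')[:k]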

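Count how often each class occurs among the k nearest points

re = dr.loc[:,'labels'].value_counts()  # class counts, largest first
re.index[0]  # the most frequent class

Take the most frequent class as the predicted class of the current point

result = []
result.append(re.index[0])
result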
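With an even k such as the k = 4 used here, two classes can end up with the same number of votes, and it is easy to overlook which of them is reported first. One possible way to make the outcome explicit is to break ties in favor of the class whose member is closest to the query. The helper below is a hypothetical sketch, not part of the original walkthrough; it expects the neighbor labels ordered from nearest to farthest.

```python
from collections import Counter


def vote_with_tie_break(labels_nearest_first):
    """Return the majority label; on a tie, prefer the label seen closest."""
    counts = Counter(labels_nearest_first)
    first_seen = {}
    for position, label in enumerate(labels_nearest_first):
        # Remember only the first (nearest) position of each label
        first_seen.setdefault(label, position)
    # Highest count wins; among equal counts, the smaller first position wins
    return max(counts, key=lambda label: (counts[label], -first_seen[label]))


print(vote_with_tie_break(["A", "B", "B", "A"]))  # prints A
print(vote_with_tie_break(["A", "B", "B", "B"]))  # prints B
```

Choosing an odd k avoids ties in two-class problems, but with three or more classes ties remain possible.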

4. Judging match quality on a dating site

# Load the dataset
import pandas as pd
datingTest = pd.read_table('datingTestSet.txt',header=None)
datingTest.head()

# Explore the data
%matplotlib inline
import matplotlib.pyplot as plt

# Give each label its own color
Colors = []
for i in range(datingTest.shape[0]):
    m = datingTest.iloc[i,-1]  # label of row i
    if m=='didntLike':
        Colors.append('black')
    elif m=='smallDoses':
        Colors.append('orange')
    elif m=='largeDoses':
        Colors.append('red')

# Scatter plots for each pair of features
pl=plt.figure(figsize=(12,8))  # create the canvas

fig1=pl.add_subplot(221)  # 2x2 grid, first cell
plt.scatter(datingTest.iloc[:,1],datingTest.iloc[:,2],marker='.',c=Colors)
plt.xlabel('Share of time spent playing video games')
plt.ylabel('Liters of ice cream consumed per week')

fig2=pl.add_subplot(222)
plt.scatter(datingTest.iloc[:,0],datingTest.iloc[:,1],marker='.',c=Colors)
plt.xlabel('Frequent-flyer miles per year')
plt.ylabel('Share of time spent playing video games')

fig3=pl.add_subplot(223)
plt.scatter(datingTest.iloc[:,0],datingTest.iloc[:,2],marker='.',c=Colors)
plt.xlabel('Frequent-flyer miles per year')
plt.ylabel('Liters of ice cream consumed per week')
plt.show()


# Normalize the data (min-max scaling of each feature column to [0, 1])
def minmax(dataSet):
    minDf = dataSet.min()
    maxDf = dataSet.max()
    normSet = (dataSet - minDf )/(maxDf - minDf)
    return normSet

# Scale the three feature columns and put the label column back
datingT = pd.concat([minmax(datingTest.iloc[:, :3]), datingTest.iloc[:,3]], axis=1)
datingT.head()
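Normalization matters because Euclidean distance adds up raw differences, so a feature measured in tens of thousands (flight miles) drowns out one measured in single digits (liters of ice cream). The sketch below illustrates this with four made-up rows; the values and the helper name `minmax_columns` are hypothetical, and the helper assumes every column contains at least two distinct values.

```python
import math


def minmax_columns(rows):
    """Rescale every column of `rows` to the range [0, 1]."""
    columns = list(zip(*rows))
    lows = [min(c) for c in columns]
    highs = [max(c) for c in columns]
    return [
        tuple((v - lo) / (hi - lo) for v, lo, hi in zip(row, lows, highs))
        for row in rows
    ]


# Hypothetical rows: (flight miles per year, liters of ice cream per week)
rows = [(40000, 0.1), (41000, 1.9), (45000, 0.2), (80000, 1.0)]
scaled = minmax_columns(rows)

# Raw distances: row 1 looks closer to row 0 because only miles matter
print(math.dist(rows[0], rows[1]) < math.dist(rows[0], rows[2]))      # prints True
# Scaled distances: row 2 is closer once both features carry equal weight
print(math.dist(scaled[0], scaled[2]) < math.dist(scaled[0], scaled[1]))  # prints True
```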

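# Split into training and test sets
# Note: despite its name, this takes the first `rate` share of rows in file order; it does not shuffle
def randSplit(dataSet,rate=0.9):
    n = dataSet.shape[0]
    m = int(n*rate)  # number of training rows
    train = dataSet.iloc[:m,:]
    # Copy the test rows and renumber them from 0 so a column can be added later
    test = dataSet.iloc[m:,:].reset_index(drop=True)
    return train,test

train,test = randSplit(datingT)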
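Because `randSplit` slices the rows in their stored order, the test set is only representative if the file itself is not sorted in some meaningful way. A shuffled split is a common alternative. The sketch below uses only the standard library; `shuffled_split` and its `seed` parameter are hypothetical names introduced for this illustration.

```python
import random


def shuffled_split(rows, rate=0.9, seed=0):
    """Shuffle a copy of `rows`, then split it into train and test parts."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)  # fixed seed keeps the split reproducible
    cut = int(len(rows) * rate)
    return rows[:cut], rows[cut:]


train_rows, test_rows = shuffled_split(range(10))
print(len(train_rows), len(test_rows))  # prints 9 1
```

With a pandas DataFrame, `dataSet.sample(frac=1, random_state=0)` should give a comparable shuffle before slicing.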


# Test the classifier on the dating-site data
def datingClass(train,test,k):
    n = train.shape[1] - 1  # number of feature columns (everything except the label)
    m = test.shape[0]  # number of test rows
    result = []
    for i in range(m):
        # Euclidean distance from test row i to every training row
        dist = list((((train.iloc[:, :n] - test.iloc[i, :n]) ** 2).sum(1))**0.5)
        dist_l = pd.DataFrame({'dist': dist, 'labels': (train.iloc[:, n])})
        dr = dist_l.sort_values(by = 'dist')[: k]  # the k nearest neighbors
        re = dr.loc[:, 'labels'].value_counts()
        result.append(re.index[0])  # most frequent label wins
    result = pd.Series(result)
    test['predict'] = result  # append the predictions as a new column
    acc = (test.iloc[:,-1]==test.iloc[:,-2]).mean()
    print(f'Model prediction accuracy: {acc}')
    return test


datingClass(train,test,5)  # 95% in the author's run
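The call above fixes k = 5. One simple way to explore that choice is to measure held-out accuracy for several values of k and compare. The following is a rough sketch on six made-up training points, not on the dating data; `knn_predict`, `accuracy`, and all the numbers are assumptions for illustration.

```python
import math
from collections import Counter


def knn_predict(train_rows, point, k):
    """Majority label among the k training rows nearest to `point`."""
    nearest = sorted(train_rows, key=lambda row: math.dist(row[0], point))[:k]
    return Counter(label for _, label in nearest).most_common(1)[0][0]


def accuracy(train_rows, test_rows, k):
    """Share of test rows whose predicted label matches the true label."""
    hits = sum(knn_predict(train_rows, x, k) == y for x, y in test_rows)
    return hits / len(test_rows)


# Hypothetical data: two well-separated clusters
train_rows = [((0.0, 0.1), "a"), ((0.1, 0.0), "a"), ((0.2, 0.2), "a"),
              ((0.9, 1.0), "b"), ((1.0, 0.8), "b"), ((0.8, 0.9), "b")]
test_rows = [((0.1, 0.1), "a"), ((0.95, 0.9), "b")]

for k in (1, 3, 5):
    print(k, accuracy(train_rows, test_rows, k))
```

On real data the accuracies usually differ across k, and the comparison is more trustworthy when done on a validation set kept separate from the final test set.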

5. Handwritten digit recognition

import os
import pandas as pd


# Build the labeled training set
def get_train():
    path = 'digits/trainingDigits'
    trainingFileList = os.listdir(path)
    train = pd.DataFrame()
    img = []  # column 1: each image flattened into one string of 0s and 1s
    labels = []  # column 2: the digit the image shows
    for filename in trainingFileList:
        # Each file holds 32 lines; dtype=str keeps them as text with leading zeros
        txt = pd.read_csv(f'{path}/{filename}', header=None, dtype=str)
        # Join the 32 lines into a single string
        num = ''.join(txt.iloc[:, 0])
        img.append(num)
        filelable = filename.split('_')[0]  # the label is the part before '_'
        labels.append(filelable)
    train['img'] = img
    train['labels'] = labels
    return train

train = get_train()



# Build the labeled test set
def get_test():
    path = 'digits/testDigits'
    testFileList = os.listdir(path)
    test = pd.DataFrame()
    img = []  # column 1: each image flattened into one string of 0s and 1s
    labels = []  # column 2: the digit the image shows
    for filename in testFileList:
        # Each file holds 32 lines; dtype=str keeps them as text with leading zeros
        txt = pd.read_csv(f'{path}/{filename}', header=None, dtype=str)
        # Join the 32 lines into a single string
        num = ''.join(txt.iloc[:, 0])
        img.append(num)
        filelable = filename.split('_')[0]  # the label is the part before '_'
        labels.append(filelable)
    test['img'] = img
    test['labels'] = labels
    return test

test = get_test()
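The two loaders do the same two things for every file: flatten the lines of a 0/1 text image into one string, and read the digit label from the part of the filename before the underscore. The sketch below shows both steps on a tiny in-memory 4x4 image so it runs without the `digits` folder; the helper names and the filename `3_12.txt` are hypothetical.

```python
import io


def flatten_digit(text_file):
    """Join every line of a 0/1 text image into one long string."""
    return "".join(line.strip() for line in text_file)


def label_from_filename(filename):
    """Return the part of the filename before the first underscore."""
    return filename.split("_")[0]


# A made-up 4x4 image standing in for a real 32x32 file
tiny_image = io.StringIO("0110\n1001\n1001\n0110\n")
flat = flatten_digit(tiny_image)
print(flat)                             # prints 0110100110010110
print(label_from_filename("3_12.txt"))  # prints 3
```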

# Test the classifier on the handwritten digits
from Levenshtein import hamming

def handwritingClass(train, test, k):
    n = train.shape[0]
    m = test.shape[0]
    result = []
    for i in range(m):
        dist = []
        for j in range(n):
            # Hamming distance: number of positions where the two strings differ.
            # Keep it as a number so sorting is by value, not alphabetical.
            d = hamming(train.iloc[j,0], test.iloc[i,0])
            dist.append(d)
        dist_l = pd.DataFrame({'dist':dist, 'labels':(train.iloc[:,1])})
        dr = dist_l.sort_values(by='dist')[:k]  # the k nearest neighbors
        re = dr.loc[:,'labels'].value_counts()
        result.append(re.index[0])  # most frequent label wins
    result = pd.Series(result)
    test['predict'] = result  # append the predictions as a new column
    acc = (test.iloc[:,-1] == test.iloc[:,-2]).mean()
    print(f'Model prediction accuracy: {acc}')
    return test

handwritingClass(train, test, 3)  # 97.8% in the author's run
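The classifier relies on `hamming` from the third-party `Levenshtein` package. If that package is not installed, a pure-Python stand-in is easy to write, since the Hamming distance is just the number of positions at which two equal-length strings differ. The function below is a sketch under that definition; its name is an assumption.

```python
def hamming_distance(a, b):
    """Count the positions at which two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("strings must have the same length")
    return sum(ch_a != ch_b for ch_a, ch_b in zip(a, b))


print(hamming_distance("0110", "0011"))  # prints 2
print(hamming_distance("0110", "0110"))  # prints 0
```

It is likely slower than a compiled implementation, which matters here because every test image is compared against every training image.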

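6. Strengths and weaknesses of the algorithm

Strengths

(1) Simple to use and easy to understand, often accurate, backed by mature theory, and usable for both classification and regression;

(2) Works with both numeric and discrete data;

(3) Makes no assumptions about the input data;

(4) Well suited to classifying rare events.

Weaknesses

(1) High computational complexity and high space complexity;

(2) Heavy computation, so it is generally avoided when the dataset is very large; yet the number of samples cannot be too small either, or misclassification becomes likely;

(3) Sensitive to class imbalance (some classes have many samples while others have very few);

(4) Poor interpretability: it cannot reveal the underlying meaning of the data.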
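For the class-imbalance weakness, one commonly used mitigation is to weight each neighbor's vote by the inverse of its distance, so that a close neighbor from a rare class can outweigh several distant neighbors from a common class. This is a swapped-in variant, not the plain majority vote used in this article; the function name, the `eps` guard, and the sample neighbors are assumptions.

```python
from collections import defaultdict


def weighted_vote(neighbors, eps=1e-9):
    """Pick a label from (distance, label) pairs using inverse-distance weights."""
    scores = defaultdict(float)
    for distance, label in neighbors:
        # eps avoids division by zero when a neighbor coincides with the query
        scores[label] += 1.0 / (distance + eps)
    return max(scores, key=scores.get)


# Hypothetical k = 3 neighborhood: one close rare point, two distant common ones
neighbors = [(0.2, "rare"), (1.5, "common"), (1.6, "common")]
print(weighted_vote(neighbors))  # prints rare
```

A plain majority vote over the same three neighbors would return "common".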

That concludes this walkthrough of implementing the K-nearest neighbors algorithm in Python. Theory is best learned alongside practice, so try running the examples yourself.
