This article explains how to use the k-nearest neighbors (kNN) algorithm in Python. The method is simple and practical, so let's walk through it step by step.
kNN classifies a sample by measuring the distances between its feature values and those of known samples.
Pros: high accuracy, insensitive to outliers, no assumptions about the input data.
Cons: high computational complexity, high space complexity.
Applicable data types: numeric and nominal values.
There is a sample data set (the training set) in which every sample carries a label, i.e. we know which class each sample in the set belongs to.
When a new, unlabeled sample arrives, each of its features is compared with the corresponding features of the samples in the training set, and the algorithm extracts the class labels of the most similar (nearest) samples.
Typically only the k most similar samples are considered; this is where the "k" in k-nearest neighbors comes from, and k is usually an integer no larger than 20.
Finally, the class that occurs most often among those k samples is assigned to the new sample.
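The majority-vote step just described can be sketched with the standard library's collections.Counter (a standalone illustration, not the book's kNN.py code; the function name majority_vote is my own):

```python
from collections import Counter

def majority_vote(k_nearest_labels):
    # Tally the class labels of the k nearest neighbors and
    # return the label that occurs most often.
    return Counter(k_nearest_labels).most_common(1)[0][0]

print(majority_vote(['A', 'B', 'B', 'A', 'B']))  # prints B
```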
kNN.py
Pseudocode
compute the distance between every point in the known data set and the current point;
sort by distance in increasing order;
take the k points closest to the current point;
determine the frequency of each class among those k points;
return the most frequent of those classes as the predicted class of the current point.
Step 1 uses the Euclidean distance formula: d = sqrt((xA0 - xB0)^2 + (xA1 - xB1)^2)
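As a quick check of the formula, here is the distance between two of the sample points used later, computed with NumPy:

```python
import numpy as np

# Euclidean distance d = sqrt(sum_i (a_i - b_i)^2)
a = np.array([1.0, 1.1])
b = np.array([0.0, 0.1])
dist = np.sqrt(np.sum((a - b) ** 2))
print(dist)  # sqrt(1^2 + 1^2) = sqrt(2) ≈ 1.4142
```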
from numpy import *
import operator

def classify0(inX, dataSet, labels, k):
    # Distance computation
    dataSetSize = dataSet.shape[0]  # number of rows in the NumPy array
    diffMat = tile(inX, (dataSetSize, 1)) - dataSet  # tile repeats inX dataSetSize times so it can be subtracted row-wise
    sqDiffMat = diffMat ** 2
    sqDistances = sqDiffMat.sum(axis=1)  # axis=0 sums down columns, axis=1 sums across rows
    distances = sqDistances ** 0.5  # square root
    sortedDistIndicies = distances.argsort()  # indices that would sort the distances in ascending order
    # Pick the k points with the smallest distances
    classCount = {}
    for i in range(k):
        voteIlabel = labels[sortedDistIndicies[i]]
        classCount[voteIlabel] = classCount.get(voteIlabel, 0) + 1  # increment the vote for this label
    # Sort by vote count; operator.itemgetter(1) keys the sort on the count,
    # and reverse=True orders it from largest to smallest
    sortedClassCount = sorted(classCount.items(), key=operator.itemgetter(1), reverse=True)
    return sortedClassCount[0][0]
Example usage
def createDataSet():
    group = array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
    labels = ['A', 'A', 'B', 'B']
    return group, labels

group, labels = createDataSet()
# Print the predicted class of the new point
print(classify0([0, 0], group, labels, 3))  # prints B
def file2matrix(filename):
    fr = open(filename)
    arrayOLines = fr.readlines()
    numberOfLines = len(arrayOLines)  # number of lines in the file
    returnMat = zeros((numberOfLines, 3))  # matrix to return
    classLabelVector = []  # labels to return
    index = 0
    for line in arrayOLines:
        line = line.strip()
        listFromLine = line.split('\t')
        returnMat[index, :] = listFromLine[0:3]
        classLabelVector.append(int(listFromLine[-1]))
        index += 1
    return returnMat, classLabelVector
Run result
>>> datingDataMat, datingLabels = kNN.file2matrix('datingTestSet2.txt')
>>> print(datingDataMat)
[[  4.09200000e+04   8.32697600e+00   9.53952000e-01]
 [  1.44880000e+04   7.15346900e+00   1.67390400e+00]
 [  2.60520000e+04   1.44187100e+00   8.05124000e-01]
 ...,
 [  2.65750000e+04   1.06501020e+01   8.66627000e-01]
 [  4.81110000e+04   9.13452800e+00   7.28045000e-01]
 [  4.37570000e+04   7.88260100e+00   1.33244600e+00]]
>>> print(datingLabels[0:20])
[3, 2, 1, 1, 1, 1, 3, 3, 1, 3, 1, 1, 2, 1, 1, 1, 1, 1, 2, 3]
import matplotlib.pyplot as plt
from numpy import *

fig = plt.figure()
ax = fig.add_subplot(111)  # 1 row, 1 column, plot 1
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2])  # all rows, columns 2 and 3
plt.show()
# With marker size and color derived from the class labels
ax.scatter(datingDataMat[:, 1], datingDataMat[:, 2],
           15.0 * array(datingLabels), 15.0 * array(datingLabels))
plt.show()
Sometimes the values of different features differ by orders of magnitude, which skews the distance calculation.
Fix
Normalize the values, e.g. rescale them into the range 0 to 1 or -1 to 1, using the formula:
newValue = (oldValue - min) / (max - min)
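A minimal sketch of this formula applied to one column of values (assuming NumPy, as in the rest of the code):

```python
import numpy as np

def min_max_normalize(column):
    # newValue = (oldValue - min) / (max - min)
    return (column - column.min()) / (column.max() - column.min())

col = np.array([0.0, 5.0, 10.0])
print(min_max_normalize(col))  # [0.  0.5 1. ]
```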
def autoNorm(dataSet):
    minVals = dataSet.min(0)  # column-wise minimum
    maxVals = dataSet.max(0)  # column-wise maximum
    ranges = maxVals - minVals  # per-column range
    m = dataSet.shape[0]  # number of samples in dataSet
    # tile repeats minVals m times along the rows so its shape matches dataSet
    normDataSet = dataSet - tile(minVals, (m, 1))
    normDataSet = normDataSet / tile(ranges, (m, 1))  # element-wise divide
    return normDataSet, ranges, minVals
Run result
>>> normMat, ranges, minVals = kNN.autoNorm(datingDataMat)
>>> normMat
[[ 0.44832535  0.39805139  0.56233353]
 [ 0.15873259  0.34195467  0.98724416]
 [ 0.28542943  0.06892523  0.47449629]
 ...,
 [ 0.29115949  0.50910294  0.51079493]
 [ 0.52711097  0.43665451  0.4290048 ]
 [ 0.47940793  0.3768091   0.78571804]]
>>> ranges
[  9.12730000e+04   2.09193490e+01   1.69436100e+00]
>>> minVals
[ 0.        0.        0.001156]
An important task in machine learning is evaluating an algorithm's accuracy. A common approach is to train the classifier on 90% of the available data and use the remaining 10% to test it and measure its error rate.
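The hold-out idea can be sketched as follows (holdout_split is a hypothetical helper of my own, not part of the book's code; like datingClassTest below, it assumes the rows are already in random order):

```python
import numpy as np

def holdout_split(data, labels, ho_ratio=0.10):
    # Reserve the first ho_ratio fraction of the rows as the test set
    # and use the remaining rows as the training set.
    num_test = int(data.shape[0] * ho_ratio)
    return (data[num_test:], labels[num_test:],  # training portion
            data[:num_test], labels[:num_test])  # test portion

X = np.arange(20).reshape(10, 2)
y = list(range(10))
train_X, train_y, test_X, test_y = holdout_split(X, y, 0.10)
print(len(train_y), len(test_y))  # prints 9 1
```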
def datingClassTest():
    hoRatio = 0.50  # hold out 50% of the data for testing
    # Load the data set from the file
    datingDataMat, datingLabels = file2matrix('datingTestSet2.txt')
    # Normalize the data
    normMat, ranges, minVals = autoNorm(datingDataMat)
    m = normMat.shape[0]
    numTestVecs = int(m * hoRatio)  # number of test entries
    errorCount = 0.0
    for i in range(numTestVecs):
        # Core kNN call: classify test row i against the remaining rows
        classifierResult = classify0(normMat[i, :], normMat[numTestVecs:m, :],
                                     datingLabels[numTestVecs:m], 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, datingLabels[i]))
        # Count misclassifications
        if classifierResult != datingLabels[i]:
            errorCount += 1.0
    print("the total error rate is: %f" % (errorCount / float(numTestVecs)))
    print('errorCount: ' + str(errorCount))
Run result
>>> datingClassTest()
...
the classifier came back with: 2, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 1, the real answer is: 1
the classifier came back with: 2, the real answer is: 2
the total error rate is: 0.064000
errorCount: 32.0
def classifyPerson(file):
    resultList = ['not at all', 'in small doses', 'in large doses']
    percentTats = float(input("percentage of time spent playing video games?"))
    ffMiles = float(input("frequent flier miles earned per year?"))
    iceCream = float(input("liters of ice cream consumed per year?"))
    datingDataMat, datingLabels = file2matrix(file)
    normMat, ranges, minVals = autoNorm(datingDataMat)
    inArr = array([ffMiles, percentTats, iceCream])
    # Normalize the new sample with the same ranges before classifying it
    classifierResult = classify0((inArr - minVals) / ranges,
                                 normMat, datingLabels, 3)
    print("You will probably like this person: ",
          resultList[classifierResult - 1])

kNN.classifyPerson('..\\datingTestSet2.txt')
Result
percentage of time spent playing video games?10
frequent flier miles earned per year?10000
liters of ice cream consumed per year?0.5
You will probably like this person:  in small doses
trainingDigits: 2,000 training samples
testDigits: 900 test samples
An example file:
00000000000001111000000000000000
00000000000011111110000000000000
00000000001111111111000000000000
00000001111111111111100000000000
00000001111111011111100000000000
00000011111110000011110000000000
00000011111110000000111000000000
00000011111110000000111100000000
00000011111110000000011100000000
00000011111110000000011100000000
00000011111100000000011110000000
00000011111100000000001110000000
00000011111100000000001110000000
00000001111110000000000111000000
00000001111110000000000111000000
00000001111110000000000111000000
00000001111110000000000111000000
00000011111110000000001111000000
00000011110110000000001111000000
00000011110000000000011110000000
00000001111000000000001111000000
00000001111000000000011111000000
00000001111000000000111110000000
00000001111000000001111100000000
00000000111000000111111000000000
00000000111100011111110000000000
00000000111111111111110000000000
00000000011111111111110000000000
00000000011111111111100000000000
00000000001111111110000000000000
00000000000111110000000000000000
00000000000011000000000000000000
Convert the 32 x 32 binary image matrix into a 1 x 1024 vector.
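As a side note (not the book's approach), if the image were already held in a NumPy array rather than a text file, the same flattening would be a one-line reshape:

```python
import numpy as np

img = np.zeros((32, 32))    # a 32 x 32 binary image matrix
vec = img.reshape(1, 1024)  # flattened into a 1 x 1024 row vector
print(vec.shape)  # prints (1, 1024)
```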
def img2vector(filename):
    returnVect = zeros((1, 1024))  # 1 x 1024 array of zeros
    fr = open(filename)
    for i in range(32):
        lineStr = fr.readline()
        for j in range(32):
            returnVect[0, 32 * i + j] = int(lineStr[j])
    return returnVect
Run result
>>> testVector = kNN.img2vector('testDigits/0_13.txt')
>>> testVector[0, 0:31]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.
  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
>>> testVector[0, 32:63]
[ 0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  1.  1.  1.  1.  1.  1.
  1.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.  0.]
import os

def handwritingClassTest():
    # Prepare the training and test data
    hwLabels = []
    trainingFileList = os.listdir('trainingDigits')  # load the training set
    m = len(trainingFileList)
    trainingMat = zeros((m, 1024))
    for i in range(m):
        fileNameStr = trainingFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])  # the digit is encoded in the file name
        hwLabels.append(classNumStr)
        trainingMat[i, :] = img2vector('trainingDigits/%s' % fileNameStr)
    testFileList = os.listdir('testDigits')  # iterate through the test set
    errorCount = 0.0
    mTest = len(testFileList)
    # Run the test
    for i in range(mTest):
        fileNameStr = testFileList[i]
        fileStr = fileNameStr.split('.')[0]  # take off .txt
        classNumStr = int(fileStr.split('_')[0])
        vectorUnderTest = img2vector('testDigits/%s' % fileNameStr)
        classifierResult = classify0(vectorUnderTest, trainingMat, hwLabels, 3)
        print("the classifier came back with: %d, the real answer is: %d"
              % (classifierResult, classNumStr))
        if classifierResult != classNumStr:
            errorCount += 1.0
    print("\nthe total number of errors is: %d" % errorCount)
    print("\nthe total error rate is: %f" % (errorCount / float(mTest)))
Run result
>>> handwritingClassTest()
...
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the classifier came back with: 9, the real answer is: 9
the total number of errors is: 11
the total error rate is: 0.011628
In practice the algorithm is not very efficient: every classification requires computing the distance from the new sample to all of the training samples.
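Part of the cost comes from calling classify0 once per test sample inside a Python loop. As a hedged sketch (my own code, not from the book), NumPy broadcasting can compute every test-to-training distance in one expression:

```python
import numpy as np

def classify_batch(test_X, train_X, train_labels, k):
    # Pairwise squared Euclidean distances via broadcasting:
    # resulting shape is (num_test, num_train)
    diff = test_X[:, np.newaxis, :] - train_X[np.newaxis, :, :]
    dists = np.sum(diff ** 2, axis=2)
    # Indices of the k nearest training points for each test point
    nearest = np.argsort(dists, axis=1)[:, :k]
    preds = []
    for row in nearest:
        votes = {}
        for idx in row:
            votes[train_labels[idx]] = votes.get(train_labels[idx], 0) + 1
        preds.append(max(votes, key=votes.get))  # majority vote
    return preds

group = np.array([[1.0, 1.1], [1.0, 1.0], [0, 0], [0, 0.1]])
labels = ['A', 'A', 'B', 'B']
print(classify_batch(np.array([[0.0, 0.0], [1.0, 1.0]]), group, labels, 3))
# prints ['B', 'A']
```

For large training sets, space-partitioning structures such as kd-trees reduce the per-query cost further, at the price of a more involved implementation.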