How to Use KNN for CAPTCHA Recognition in Python

Published: 2021-08-03 12:23:44 | Source: 億速云 | Views: 139 | Author: 小新 | Column: Development Technology

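This article explains how to use KNN for CAPTCHA recognition in Python. It goes into a fair amount of detail and should be a useful reference, so if you are interested, be sure to read it through!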

Analysis

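Our school's CAPTCHA looks like this: [image: sample CAPTCHA]. It is made by simply rotating the characters and then adding some faint noise. To recognize it, we reverse that process. The idea is: first binarize the image to remove the noise, then segment out the individual characters, and finally rotate each character to a standard orientation. From these processed images we pick a set of templates. Whenever a new CAPTCHA arrives, we process it the same way, compare each character with the templates, and take the template with the smallest distance as the answer (this is the KNN idea; this article uses K=1). The steps are described one by one below.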
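To make the K=1 idea concrete, here is a minimal sketch in pure Python. It is not the article's code (the article's matching step is written in MATLAB further down); the function names and the toy 3x3 "glyphs" are hypothetical, and the distance is assumed to be the sum of absolute pixel differences.

```python
def pixel_distance(a, b):
    """Sum of absolute per-pixel differences between two same-sized binary images."""
    return sum(abs(pa - pb)
               for row_a, row_b in zip(a, b)
               for pa, pb in zip(row_a, row_b))


def nearest_template(image, templates):
    """Return (label, distance) of the template closest to image (K=1)."""
    best_label, best_dist = None, float("inf")
    for label, template in templates:
        dist = pixel_distance(image, template)
        if dist < best_dist:
            best_label, best_dist = label, dist
    return best_label, best_dist


# Toy 3x3 "glyphs" (hypothetical data: 1 = ink, 0 = background).
TEMPLATES = [
    ("L", [[1, 0, 0],
           [1, 0, 0],
           [1, 1, 1]]),
    ("T", [[1, 1, 1],
           [0, 1, 0],
           [0, 1, 0]]),
]

# An "L" with one extra noisy pixel switched on.
query = [[1, 0, 0],
         [1, 0, 1],
         [1, 1, 1]]

print(nearest_template(query, TEMPLATES))
```

With several templates per character, the same loop still works: the label of whichever template is closest wins, which is why adding templates for easily confused characters can help.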

Obtaining the CAPTCHAs

First we need a large number of CAPTCHAs, which we collect with a crawler. The code is as follows:

# -*- coding: UTF-8 -*-
import http.cookiejar
import os
import urllib.error
import urllib.request


def getchk(number):
    # Create the cookie jar and install an opener that uses it
    cookie = http.cookiejar.LWPCookieJar()
    cookieSupport = urllib.request.HTTPCookieProcessor(cookie)
    opener = urllib.request.build_opener(cookieSupport, urllib.request.HTTPHandler)
    urllib.request.install_opener(opener)
    # First request to the academic affairs system, to obtain the cookie
    # Pretend to be a browser
    headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8',
        'Accept-Encoding': 'gzip,deflate',
        'Accept-Language': 'zh-CN,zh;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/38.0.2125.111 Safari/537.36'
    }
    req0 = urllib.request.Request(
        url='http://mis.teach.ustc.edu.cn',
        headers=headers  # request headers
    )
    # Catch HTTP errors
    try:
        result0 = urllib.request.urlopen(req0)
    except urllib.error.HTTPError as e:
        print(e.code)
    # Extract the cookies as "name=value" pairs separated by "; "
    getcookie = "; ".join(item.name + "=" + item.value for item in cookie)

    # Update the headers
    headers["Origin"] = "http://mis.teach.ustc.edu.cn"
    headers["Referer"] = "http://mis.teach.ustc.edu.cn/userinit.do"
    headers["Content-Type"] = "application/x-www-form-urlencoded"
    headers["Cookie"] = getcookie
    # Make sure the output directory exists before writing into it
    os.makedirs("./source", exist_ok=True)
    for i in range(number):
        req = urllib.request.Request(
            url="http://mis.teach.ustc.edu.cn/randomImage.do?date='1469451446894'",
            headers=headers  # request headers
        )
        response = urllib.request.urlopen(req)
        status = response.getcode()
        picData = response.read()
        if status == 200:
            with open("./source/" + str(i) + ".jpg", "wb") as localPic:
                localPic.write(picData)
        else:
            print("failed to get Check Code")


if __name__ == '__main__':
    getchk(500)

This downloads 500 CAPTCHAs into the source directory.

[Figure: the downloaded CAPTCHA images in the source directory]

Binarization

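MATLAB's rich set of image-processing functions saves us a lot of time. We traverse the source folder, binarize every CAPTCHA image, and save the processed images into the bw directory. The code is as follows: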

mydir = './source/';
bw = './bw/';
% Make sure the input directory ends with a path separator
if mydir(end) ~= '/' && mydir(end) ~= '\'
    mydir = [mydir, filesep];
end
DIRS = dir([mydir, '*.jpg']); % file extension
n = length(DIRS);
for i = 1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir, DIRS(i).name));
        img = rgb2gray(img); % convert to grayscale
        img = im2bw(img);    % binarize to 0/1
        name = strcat(bw, DIRS(i).name);
        imwrite(img, name);
    end
end

The result:

[Figure: binarized CAPTCHA images]
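For readers without MATLAB, the two calls above can be approximated in a few lines of pure Python. This is only a sketch under assumptions: the luminance weights are the ones MATLAB documents for rgb2gray as far as I know, the 0.5 cutoff mirrors the default level of im2bw, and the function names and the tiny test image are made up for illustration.

```python
def to_gray(rgb_image):
    """Weighted luminance of each (r, g, b) pixel, values in 0..255."""
    return [[0.2989 * r + 0.5870 * g + 0.1140 * b for (r, g, b) in row]
            for row in rgb_image]


def binarize(gray_image, level=0.5):
    """Pixels brighter than level (as a fraction of 255) become 1, the rest 0."""
    cutoff = level * 255
    return [[1 if value > cutoff else 0 for value in row] for row in gray_image]


# Hypothetical 2x3 image: white background, light-gray noise, dark ink.
image = [
    [(255, 255, 255), (200, 200, 200), (30, 30, 30)],
    [(20, 20, 20), (255, 255, 255), (180, 190, 185)],
]

print(binarize(to_gray(image)))
```

Faint noise that is lighter than the cutoff ends up as background (1), which is why a plain threshold already removes most of it; darker specks survive and are dealt with in the segmentation step.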

Segmentation

mydir = './bw/';
letter = './letter/';
% Make sure the input directory ends with a path separator
if mydir(end) ~= '/' && mydir(end) ~= '\'
    mydir = [mydir, filesep];
end
DIRS = dir([mydir, '*.jpg']); % file extension
n = length(DIRS);
for i = 1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir, DIRS(i).name));
        img = im2bw(img); % binarize
        img = 1 - img;    % invert so the characters become connected components, which makes noise easy to remove
        for ii = 0:3
            % Split one CAPTCHA into four 20*20 character images
            % (imcrop includes both end columns, so a width of 19 yields 20 columns)
            region = [ii*20+1, 1, 19, 20];
            subimg = imcrop(img, region);
            imlabel = bwlabel(subimg);
            % imshow(imlabel);
            if max(max(imlabel)) > 1 % more than one component means there is noise to remove
                % max(max(imlabel))
                % imshow(subimg);
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                area(maxindex) = 0;
                secondindex = find(area == max(area));
                imindex = ismember(imlabel, secondindex);
                % Remove the second-largest component: noise cannot be larger
                % than the character, so the second largest must be noise
                subimg(imindex == 1) = 0;
            end
            name = strcat(letter, DIRS(i).name(1:length(DIRS(i).name)-4), '_', num2str(ii), '.jpg');
            imwrite(subimg, name);
        end
    end
end

The result:

[Figure: segmented single-character images]
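The same two ideas, fixed-width splitting and noise removal by connected components, can be sketched in pure Python. Note that this is a variant rather than a port: the MATLAB code removes only the second-largest component, whereas the hypothetical keep_largest below keeps just the largest one and discards everything else. It assumes 8-connectivity (the default of bwlabel, as far as I know) and images stored as lists of rows with 1 for ink.

```python
from collections import deque


def split_columns(img, n=4, width=20):
    """Cut an image (list of rows) into n tiles of the given width."""
    return [[row[k * width:(k + 1) * width] for row in img] for k in range(n)]


def components(img):
    """Return the 8-connected components of 1-pixels as lists of (row, col)."""
    h, w = len(img), len(img[0])
    seen = [[False] * w for _ in range(h)]
    found = []
    for r in range(h):
        for c in range(w):
            if img[r][c] and not seen[r][c]:
                seen[r][c] = True
                queue, comp = deque([(r, c)]), []
                while queue:
                    y, x = queue.popleft()
                    comp.append((y, x))
                    for dy in (-1, 0, 1):
                        for dx in (-1, 0, 1):
                            ny, nx = y + dy, x + dx
                            if (0 <= ny < h and 0 <= nx < w
                                    and img[ny][nx] and not seen[ny][nx]):
                                seen[ny][nx] = True
                                queue.append((ny, nx))
                found.append(comp)
    return found


def keep_largest(img):
    """Return a copy of img containing only its largest component."""
    out = [[0] * len(img[0]) for _ in img]
    comps = components(img)
    if comps:
        for y, x in max(comps, key=len):
            out[y][x] = 1
    return out


# Hypothetical tile: a 3-pixel stroke plus one isolated noise pixel.
tile = [[1, 1, 0, 0],
        [1, 0, 0, 0],
        [0, 0, 0, 1]]
print(keep_largest(tile))
```

Keeping only the largest component is more aggressive than the article's approach: it also removes a third or fourth speck, but it would damage characters that legitimately consist of two pieces, such as "i" or "j", if the character set contained them.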

Rotation

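Next comes rotation. What should the standard orientation be? From observation, these characters are rotated by no more than 60 degrees, so we try every angle between -60 and +60 degrees and keep the one at which the character is narrowest. The code is as follows: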

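mydir = './letter/';   % assumed input folder: the output of the segmentation step
rotate = './rotate/';  % assumed output folder: the rotate folder named in the text
% Make sure the input directory ends with a path separator
if mydir(end) ~= '/' && mydir(end) ~= '\'
    mydir = [mydir, filesep];
end
DIRS = dir([mydir, '*.jpg']); % file extension
n = length(DIRS);
for i = 1:n
    if ~DIRS(i).isdir
        img = imread(strcat(mydir, DIRS(i).name));
        img = im2bw(img);
        minwidth = 20;
        imgrr = img; % fall back to the unrotated image if no angle is narrower than 20 pixels
        for angle = -60:60
            imgr = imrotate(img, angle, 'bilinear', 'crop'); % 'crop' keeps the image size unchanged
            imlabel = bwlabel(imgr);
            stats = regionprops(imlabel, 'Area');
            area = cat(1, stats.Area);
            maxindex = find(area == max(area));
            imindex = ismember(imlabel, maxindex); % mask of the largest connected component
            [y, x] = find(imindex == 1);
            width = max(x) - min(x) + 1;
            if width < minwidth
                minwidth = width;
                imgrr = imgr;
            end
        end
        name = strcat(rotate, DIRS(i).name);
        imwrite(imgrr, name);
    end
end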

The result is shown below. In total, 2000 character images are stored in the rotate folder.

[Figure: rotated character images in the rotate folder]
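The "narrowest width" criterion can be illustrated without any image library by rotating the coordinates of the ink pixels instead of resampling the image. This is a simplification of what the MATLAB code does (no interpolation, no cropping, no largest-component filter), and width_at, best_angle and the synthetic stroke are hypothetical names and data.

```python
import math


def width_at(points, angle_deg):
    """Horizontal extent of the points after rotating them by angle_deg."""
    t = math.radians(angle_deg)
    xs = [x * math.cos(t) - y * math.sin(t) for (x, y) in points]
    return max(xs) - min(xs)


def best_angle(points, limit=60):
    """Whole-degree angle in [-limit, limit] that gives the narrowest extent."""
    return min(range(-limit, limit + 1), key=lambda a: width_at(points, a))


# A straight stroke tilted 30 degrees away from vertical (synthetic data).
tilt = math.radians(30)
stroke = [(k * math.sin(tilt), k * math.cos(tilt)) for k in range(11)]

print(best_angle(stroke))
```

For a single straight stroke the minimum is sharp. For rounder characters the width changes slowly with the angle, so several angles can be nearly tied, which is one reason a character may still end up in more than one final orientation.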

Template Selection

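Now pick a set of templates from the rotate folder, covering every character. A character may have several template images, because even after all the preceding processing there is no guarantee that a character ends up with only one final appearance; choosing several is what ensures coverage. Save the chosen template images in the samples folder. This process is time-consuming and laborious, so you may want to ask classmates to help~

[Figure: template images in the samples folder]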
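Since the matching code takes the first character of each template's file name as its label, a small script can check whether every character has at least one template. The sketch below assumes that naming convention; the file names and the alphabet passed in are hypothetical examples, not the article's actual data.

```python
from collections import defaultdict


def group_by_label(filenames):
    """Group template file names by their label (the first character)."""
    groups = defaultdict(list)
    for name in filenames:
        groups[name[0]].append(name)
    return dict(groups)


def missing_labels(filenames, alphabet):
    """Characters of alphabet that have no template yet."""
    covered = {name[0] for name in filenames}
    return sorted(set(alphabet) - covered)


# Hypothetical contents of the samples folder.
samples = ["A_1.jpg", "A_2.jpg", "B_1.jpg", "7_1.jpg"]

print(group_by_label(samples))
print(missing_labels(samples, "AB73"))
```

In practice the file list would come from something like os.listdir on the samples folder, and the alphabet would be whatever characters the CAPTCHA is observed to use.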

Testing

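For testing, each test CAPTCHA first goes through the steps above and is then compared with the selected templates; the template with the smallest difference determines the character assigned to the test sample. The code is as follows: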

% The template with the smallest difference gives the answer

mydir = './test/';
samples = './samples/';
% Make sure both directories end with a path separator
if mydir(end) ~= '/' && mydir(end) ~= '\'
    mydir = [mydir, filesep];
end
if samples(end) ~= '/' && samples(end) ~= '\'
    samples = [samples, filesep];
end
DIRS = dir([mydir, '*.jpg']);    % test CAPTCHAs
DIRS1 = dir([samples, '*.jpg']); % templates
n = length(DIRS);  % total number of CAPTCHA images
singleerror = 0;   % number of wrong characters
uniterror = 0;     % number of CAPTCHAs with at least one wrong character
for i = 1:n
    if ~DIRS(i).isdir
        realcodes = DIRS(i).name(1:4); % the file name starts with the true characters
        fprintf('Actual CAPTCHA characters:%s\n', realcodes);
        img = imread(strcat(mydir, DIRS(i).name));
        img = rgb2gray(img);
        img = im2bw(img);
        img = 1 - img; % invert so the characters become connected components
        subimgs = cell(1, 4);
        for ii = 0:3
            % Strange that this is what splits evenly: imcrop includes both
            % end columns, so a width of 19 yields 20 columns
            region = [ii*20+1, 1, 19, 20];
            subimg = imcrop(img, region);
            imlabel = bwlabel(subimg);
            if max(max(imlabel)) > 1 % more than one component means there is noise
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                area(maxindex) = 0;
                secondindex = find(area == max(area));
                imindex = ismember(imlabel, secondindex);
                subimg(imindex == 1) = 0; % remove the second-largest component
            end
            subimgs{ii+1} = subimg;
        end
        codes = [];
        for ii = 0:3
            subimg = subimgs{ii+1}; % use the denoised character image
            minwidth = 20;
            imgrr = subimg; % fall back to the unrotated image if no angle is narrower
            for angle = -60:60
                imgr = imrotate(subimg, angle, 'bilinear', 'crop'); % 'crop' keeps the image size unchanged
                imlabel = bwlabel(imgr);
                stats = regionprops(imlabel, 'Area');
                area = cat(1, stats.Area);
                maxindex = find(area == max(area));
                imindex = ismember(imlabel, maxindex); % mask of the largest connected component
                [y, x] = find(imindex == 1);
                width = max(x) - min(x) + 1;
                if width < minwidth
                    minwidth = width;
                    imgrr = imgr;
                end
            end
            mindiffv = 1000000;
            for jj = 1:length(DIRS1)
                imgsample = imread(strcat(samples, DIRS1(jj).name));
                imgsample = im2bw(imgsample);
                diffv = abs(imgsample - imgrr);
                alldiffv = sum(sum(diffv));
                if alldiffv < mindiffv
                    mindiffv = alldiffv;
                    code = DIRS1(jj).name;
                    code = code(1); % the first character of the template file name is its label
                end
            end
            codes = [codes, code];
        end
        fprintf('Predicted CAPTCHA characters:%s\n', codes);
        num = codes - realcodes;
        num = length(find(num ~= 0));
        singleerror = singleerror + num;
        if num > 0
            uniterror = uniterror + 1;
        end
        fprintf('Number of errors:%d\n', num);
    end
end
fprintf('\n-----Summary of results-----\n\n');
fprintf('Number of characters tested:%d\n', n*4);
fprintf('Number of character errors:%d\n', singleerror);
fprintf('Single-character recognition accuracy:%.2f%%\n', (1-singleerror/(n*4))*100);
fprintf('Number of CAPTCHA images tested:%d\n', n);
fprintf('Number of CAPTCHA images with errors:%d\n', uniterror);
fprintf('Probability of a fully correct CAPTCHA:%.2f%%\n', (1-uniterror/n)*100);

Results:

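Actual CAPTCHA characters:2B4E
Predicted CAPTCHA characters:2B4F
Number of errors:1
Actual CAPTCHA characters:4572
Predicted CAPTCHA characters:4572
Number of errors:0
Actual CAPTCHA characters:52CY
Predicted CAPTCHA characters:52LY
Number of errors:1
Actual CAPTCHA characters:83QG
Predicted CAPTCHA characters:85QG
Number of errors:1
Actual CAPTCHA characters:9992
Predicted CAPTCHA characters:9992
Number of errors:0
Actual CAPTCHA characters:A7Y7
Predicted CAPTCHA characters:A7Y7
Number of errors:0
Actual CAPTCHA characters:D993
Predicted CAPTCHA characters:D995
Number of errors:1
Actual CAPTCHA characters:F549
Predicted CAPTCHA characters:F5A9
Number of errors:1
Actual CAPTCHA characters:FMC6
Predicted CAPTCHA characters:FMLF
Number of errors:2
Actual CAPTCHA characters:R4N4
Predicted CAPTCHA characters:R4N4
Number of errors:0

-----Summary of results-----

Number of characters tested:40
Number of character errors:7
Single-character recognition accuracy:82.50%
Number of CAPTCHA images tested:10
Number of CAPTCHA images with errors:6
Probability of a fully correct CAPTCHA:40.00%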

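Single-character accuracy is already fairly high, but whole-CAPTCHA accuracy is still poor, because one wrong character out of four is enough to fail the CAPTCHA. Looking at the results, the errors are all easily confused characters, such as E and F, C and L, 5 and 3, and 4 and A. What we can do is add more samples to the templates, in the hope of reducing the confusion as much as possible.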
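The two accuracy figures of this first run can be recomputed from the (actual, predicted) pairs in the listing, which also shows why they differ so much. The sketch below is illustrative only; the score function is a hypothetical helper, and the last line assumes, purely as a rough model, that the four characters of a CAPTCHA fail independently.

```python
# (actual, predicted) pairs copied from the first test run.
results = [
    ("2B4E", "2B4F"), ("4572", "4572"), ("52CY", "52LY"), ("83QG", "85QG"),
    ("9992", "9992"), ("A7Y7", "A7Y7"), ("D993", "D995"), ("F549", "F5A9"),
    ("FMC6", "FMLF"), ("R4N4", "R4N4"),
]


def score(pairs):
    """Return (char_errors, image_errors, char_accuracy, image_accuracy)."""
    char_errors = sum(a != p
                      for actual, predicted in pairs
                      for a, p in zip(actual, predicted))
    image_errors = sum(actual != predicted for actual, predicted in pairs)
    total_chars = sum(len(actual) for actual, _ in pairs)
    return (char_errors, image_errors,
            1 - char_errors / total_chars,
            1 - image_errors / len(pairs))


char_errors, image_errors, char_acc, image_acc = score(results)
print(char_errors, image_errors)
print(f"{char_acc:.2%} per character, {image_acc:.2%} per CAPTCHA")

# Under the independence assumption, a 4-character CAPTCHA is fully
# correct with probability char_acc ** 4.
print(f"{char_acc ** 4:.2%} expected per CAPTCHA if errors were independent")
```

The independence estimate comes out at roughly 46%, in the same range as the 40% observed on these ten images, so a per-character accuracy in the low 80s cannot be expected to solve much more than half of the CAPTCHAs.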

After adding a few dozen more samples, we ran the test again. Results:

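Actual CAPTCHA characters:2B4E
Predicted CAPTCHA characters:2B4F
Number of errors:1
Actual CAPTCHA characters:4572
Predicted CAPTCHA characters:4572
Number of errors:0
Actual CAPTCHA characters:52CY
Predicted CAPTCHA characters:52LY
Number of errors:1
Actual CAPTCHA characters:83QG
Predicted CAPTCHA characters:83QG
Number of errors:0
Actual CAPTCHA characters:9992
Predicted CAPTCHA characters:9992
Number of errors:0
Actual CAPTCHA characters:A7Y7
Predicted CAPTCHA characters:A7Y7
Number of errors:0
Actual CAPTCHA characters:D993
Predicted CAPTCHA characters:D993
Number of errors:0
Actual CAPTCHA characters:F549
Predicted CAPTCHA characters:F5A9
Number of errors:1
Actual CAPTCHA characters:FMC6
Predicted CAPTCHA characters:FMLF
Number of errors:2
Actual CAPTCHA characters:R4N4
Predicted CAPTCHA characters:R4N4
Number of errors:0

-----Summary of results-----

Number of characters tested:40
Number of character errors:5
Single-character recognition accuracy:87.50%
Number of CAPTCHA images tested:10
Number of CAPTCHA images with errors:4
Probability of a fully correct CAPTCHA:60.00%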

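Both single-character accuracy (82.50% to 87.50%) and whole-CAPTCHA accuracy (40.00% to 60.00%) improved on this ten-image test. It is reasonable to expect accuracy to keep rising as more templates are added.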

That is all for "How to Use KNN for CAPTCHA Recognition in Python". Thank you for reading, and we hope it was helpful!
