How to Write a Spelling Corrector in Python

Published: 2022-01-18 15:50:48  Source: Yisu Cloud  Author: iii  Category: Programming Languages

This article explains how to write a spelling corrector in Python. Many people are unsure how to approach this in practice, so the material below collects a simple, workable method. Let's go through it together.

The code is as follows:

Python

# coding: utf-8

import re
from collections import Counter

def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())

# Count word frequencies
WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    """Probability of `word`."""
    return float(WORDS[word]) / N

def correction(word):
    """Most probable spelling correction for `word`."""
    return max(candidates(word), key=P)

def candidates(word):
    """Generate the set of candidate corrections for `word`."""
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    """The subset of `words` that appear in the dictionary WORDS."""
    return set(w for w in words if w in WORDS)

def edits1(word):
    """All strings at edit distance 1 from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    # `if R` skips the empty right half, so there are exactly 26n replacements
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings at edit distance 2 from `word`."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

The function correction(word) returns the most likely spelling correction:

Python

>>> correction('speling')
'spelling'
>>> correction('korrectud')
'corrected'

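How It Works: The Python Part

The program has four parts:
1. Selection mechanism: In Python, max() with a key argument does the job of argmax.
2. Candidate model: First, a new concept. A simple edit to a word is a deletion (remove one letter), a transposition (swap two adjacent letters), a replacement (change one letter to another), or an insertion (add one letter). The function edits1(word) returns the set of all simple edits of a word (translator's note: that is, all strings at edit distance 1), whether or not the result is a valid word: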

Python

def edits1(word):
    """All strings at edit distance 1 from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    # `if R` skips the empty right half, so there are exactly 26n replacements
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

This set can be very large. A word of length n has n deletions, n-1 transpositions, 26n replacements, and 26(n+1) insertions, for a total of 54n+25 simple edits (some of which are duplicates). For example:

Python

>>> len(edits1('somthing'))
442
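The arithmetic above can be checked directly. The following is a minimal sketch, assuming the same split-based construction as edits1; the helper name edit_counts is hypothetical and not part of the original program. It counts each edit type before deduplication and compares the total with 54n+25.

```python
def edit_counts(word):
    """Count simple edits per type, before and after deduplication.

    Hypothetical helper for illustration; it mirrors the structure of edits1.
    """
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    all_edits = deletes + transposes + replaces + inserts
    return {
        'deletes': len(deletes),
        'transposes': len(transposes),
        'replaces': len(replaces),
        'inserts': len(inserts),
        'total': len(all_edits),        # counts duplicates
        'unique': len(set(all_edits)),  # what edits1 would return
    }

counts = edit_counts('somthing')
n = len('somthing')
print(counts)
print(counts['total'] == 54 * n + 25)
```

For the 8-letter string 'somthing' the raw total is 457, and 442 strings remain after duplicates are removed. Duplicates arise, for example, when a replacement puts back the letter that was already there, or when the same letter is inserted immediately before or after an identical letter.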

However, if we restrict ourselves to known words (translator's note: words that appear in the WORDS dictionary), the set becomes much smaller:

Python

def known(words):
    """The subset of `words` that appear in the dictionary WORDS."""
    return set(w for w in words if w in WORDS)

>>> known(edits1('somthing'))
{'something', 'soothing'}

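We also need to consider words that are two simple edits away (translator's note: "two edits" means edit distance 2; the author neatly reuses edits1 here, applying it again to every element of the set that edits1 returns). This set is much larger, but still only a small fraction of it consists of known words: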

Python

def edits2(word):
    """All strings at edit distance 2 from `word`."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

>>> len(set(edits2('something')))
90902
>>> known(edits2('something'))
{'seething', 'smoothing', 'something', 'soothing'}
>>> known(edits2('somthing'))
{'loathing', 'nothing', 'scathing', 'seething', 'smoothing', 'something', 'soothing', 'sorting'}

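We say that every word produced by edits2(w) is at edit distance 2 from w (strictly, at most 2: a second edit can undo the first, which is why 'something' appears in known(edits2('something')) above).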
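To try the candidate model without big.txt, here is a minimal sketch that swaps the real dictionary for a small hand-picked vocabulary. TOY_WORDS is an assumption made for illustration only; edits1, edits2, and known follow the definitions in the article.

```python
def edits1(word):
    """All strings one simple edit away from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two simple edits away from `word` (a lazy generator)."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# Toy vocabulary, assumed for illustration; the article uses the words in big.txt.
TOY_WORDS = {'something', 'soothing', 'smoothing', 'nothing', 'sorting', 'banana'}

def known(words):
    """The subset of `words` that appear in TOY_WORDS."""
    return set(w for w in words if w in TOY_WORDS)

one_edit = known(edits1('somthing'))
two_edits = known(edits2('somthing'))

print(sorted(one_edit))    # known words reachable with one edit
print(sorted(two_edits))   # known words reachable with two edits
print('somthing' in set(edits2('somthing')))  # two edits can cancel out
```

Note that the two-edit result also contains the one-edit words, because a replacement that writes back the same letter leaves the string unchanged.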

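3. Language model: We estimate P(w) by counting how often each word appears in big.txt, a text file of roughly a million words. Its data comes from public-domain book excerpts from Project Gutenberg, lists of the most frequent words from Wiktionary, and the British National Corpus. The function words(text) splits the text into words, the variable WORDS holds the count of each word, and P estimates the probability of each word from those counts: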

Python

def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())

# Count word frequencies
WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    """Probability of `word`."""
    return float(WORDS[word]) / N

As we can see, there are 32,192 distinct words, which together appear 1,115,504 times. "the" is the most common word, appearing 79,808 times (about 7%); the other words are less probable.

Python

>>> len(WORDS)
32192
>>> sum(WORDS.values())
1115504
>>> WORDS.most_common(20)
[('the', 79808),
 ('of', 40024),
 ('and', 38311),
 ('to', 28765),
 ('in', 22020),
 ('a', 21124),
 ('that', 12512),
 ('he', 12401),
 ('was', 11410),
 ('it', 10681),
 ('his', 10034),
 ('is', 9773),
 ('with', 9739),
 ('as', 8064),
 ('i', 7679),
 ('had', 7383),
 ('for', 6938),
 ('at', 6789),
 ('by', 6735),
 ('on', 6639)]
>>> max(WORDS, key=P)
'the'
>>> P('the')
0.07154434228832886
>>> P('outrivaled')
8.9645577245801e-07
>>> P('unmentioned')
0.0
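The same language model can be tried on a tiny corpus. This is a minimal sketch, assuming Python 3 and an inline sample string (SAMPLE_TEXT is made up for illustration and stands in for big.txt); words and P otherwise follow the article.

```python
import re
from collections import Counter

def words(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r'\w+', text.lower())

# Tiny inline corpus, assumed for illustration in place of big.txt.
SAMPLE_TEXT = "The cat sat on the mat. The dog sat too."
WORDS = Counter(words(SAMPLE_TEXT))

def P(word, N=sum(WORDS.values())):
    """Relative frequency of `word`; N is computed once, when P is defined."""
    return WORDS[word] / N

print(WORDS.most_common(2))       # the two most frequent tokens
print(P('the'))                   # 3 of the 10 tokens
print(P('unicorn'))               # unseen words get probability 0.0
print(sum(P(w) for w in WORDS))   # probabilities sum to about 1
```

Two details are worth noticing. A Counter returns 0 for a missing key, which is why an unseen word gets probability 0.0 instead of raising an error. And because the default value of N is evaluated once, when P is defined, counts added to WORDS later would not change N unless P is redefined or N is passed explicitly.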

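4. Error model: When I wrote this program in 2007, sitting in an airplane cabin, I had no data on spelling errors and no network connection (I know that may be hard to imagine today). Without data I could not build a spelling error model, so I took a shortcut and defined a simple, flawed model: every known word at edit distance 1 is assumed to be more probable than any known word at edit distance 2, and less probable than a known word at edit distance 0 (the original word itself). The function candidates(word) therefore uses the following order of priority:
1. The original word, if it is known; otherwise go to 2.
2. All known words at edit distance 1; if there are none, go to 3.
3. All known words at edit distance 2; if there are none, go to 4.
4. The original word, even though it is not known.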
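The priority order above, together with the max-with-key selection step, can be sketched end to end on a toy dictionary. The word counts below are assumptions made for illustration (the article derives them from big.txt), and Python 3 division is assumed; the function bodies follow the article.

```python
from collections import Counter

# Toy word counts, assumed for illustration only.
WORDS = Counter({'spelling': 5, 'spewing': 2, 'corrected': 3, 'the': 50})
N = sum(WORDS.values())

def P(word):
    """Relative frequency of `word` in the toy counts."""
    return WORDS[word] / N

def edits1(word):
    """All strings one simple edit away from `word`."""
    letters = 'abcdefghijklmnopqrstuvwxyz'
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    """All strings two simple edits away from `word`."""
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

def known(words):
    """The subset of `words` that appear in WORDS."""
    return set(w for w in words if w in WORDS)

def candidates(word):
    # `or` returns its first truthy operand; an empty set is falsy,
    # so the four tiers are tried strictly in priority order.
    return known([word]) or known(edits1(word)) or known(edits2(word)) or [word]

def correction(word):
    # max with key=P plays the role of argmax over the candidates.
    return max(candidates(word), key=P)

print(candidates('the'))        # tier 1: the word itself is known
print(candidates('speling'))    # tier 2: known words one edit away
print(correction('speling'))    # the more frequent tier-2 candidate
print(correction('korrectud'))  # tier 3: a known word two edits away
print(candidates('zzzzzz'))     # tier 4: nothing known, word returned as-is
```

Because each tier is consulted only when the previous one is empty, a frequent word two edits away can never beat a rare word one edit away. That is exactly the simplification, and the flaw, described above.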

