This article shows how to do data mining in Python. It covers a lot of ground and analyzes the topic from a practitioner's perspective; hopefully you will take something useful away from it.
Step 1: Load the data and take a look
Let's skip the real first step (fleshing out the specification and understanding what we are actually trying to do, which matters a great deal in practice) and go straight to https://archive.ics.uci.edu/ml/datasets/SMS+Spam+Collection to download the zip file used in this demo, then unzip it into the data subdirectory. You should see a file of roughly 0.5 MB named SMSSpamCollection:
$ ls -l data
total 1352
-rw-r--r--@ 1 kofola staff 477907 Mar 15 2011 SMSSpamCollection
-rw-r--r--@ 1 kofola staff 5868 Apr 18 2011 readme
-rw-r-----@ 1 kofola staff 203415 Dec 1 15:30 smsspamcollection.zip
This file contains more than 5,000 SMS messages (see the readme file for more information):
In [2]:
messages = [line.rstrip() for line in open('./data/SMSSpamCollection')]
print len(messages)
5574
A collection of texts is sometimes called a "corpus". Let's print the first 10 messages in the SMS corpus:
In [3]:
for message_no, message in enumerate(messages[:10]):
    print message_no, message
0 ham Go until jurong point, crazy.. Available only in bugis n great world la e buffet... Cine there got amore wat...
1 ham Ok lar... Joking wif u oni...
2 spam Free entry in 2 a wkly comp to win FA Cup final tkts 21st May 2005. Text FA to 87121 to receive entry question(std txt rate)T&C's apply 08452810075over18's
3 ham U dun say so early hor... U c already then say...
4 ham Nah I don't think he goes to usf, he lives around here though
5 spam FreeMsg Hey there darling it's been 3 week's now and no word back! I'd like some fun you up for it still? Tb ok! XxX std chgs to send, £1.50 to rcv
6 ham Even my brother is not like to speak with me. They treat me like aids patent.
7 ham As per your request 'Melle Melle (Oru Minnaminunginte Nurungu Vettam)' has been set as your callertune for all Callers. Press *9 to copy your friends Callertune
8 spam WINNER!! As a valued network customer you have been selected to receivea £900 prize reward! To claim call 09061701461. Claim code KL341. Valid 12 hours only.
9 spam Had your mobile 11 months or more? U R entitled to Update to the latest colour mobiles with camera for Free! Call The Mobile Update Co FREE on 08002986030
We can see this is a TSV (tab-separated values) file: the first column is a label marking each message as legitimate ("ham") or junk ("spam"), and the second column is the message itself.
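To make the file layout concrete, here is a minimal sketch that parses two of the lines shown above with the standard csv module. It is written for Python 3 (the article's own cells are Python 2); the in-memory `sample` string stands in for the real file, the second message is shortened, and `parse_tsv` is a hypothetical helper name. Quoting is switched off so that quote characters inside messages are treated as ordinary text.

```python
import csv
import io

# Two lines taken from the listing above (the second one shortened);
# label and text are separated by a single tab.
sample = (
    "ham\tOk lar... Joking wif u oni...\n"
    "spam\tWINNER!! As a valued network customer you have been selected\n"
)

def parse_tsv(handle):
    """Return (label, message) pairs from a tab-separated file-like object."""
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    return [(row[0], row[1]) for row in reader]

rows = parse_tsv(io.StringIO(sample))
for label, message in rows:
    print(label, "->", message)
```

For the real data, the same function could be given an open file handle instead of the `io.StringIO` wrapper.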
This corpus will serve as our labeled training set. Using these labeled ham/spam examples, we'll train a machine learning model that learns to tell ham from spam automatically. Then we can use the trained model to label arbitrary unlabeled messages as ham or spam.
We can let Python's pandas library handle the TSV file (or CSV, or Excel file) for us:
In [4]:
import csv
import pandas

messages = pandas.read_csv('./data/SMSSpamCollection', sep='\t', quoting=csv.QUOTE_NONE,
                           names=["label", "message"])
print messages
     label  message
0     ham   Go until jurong point, crazy.. Available only ...
1     ham   Ok lar... Joking wif u oni...
2     spam  Free entry in 2 a wkly comp to win FA Cup fina...
3     ham   U dun say so early hor... U c already then say...
4     ham   Nah I don't think he goes to usf, he lives aro...
5     spam  FreeMsg Hey there darling it's been 3 week's n...
6     ham   Even my brother is not like to speak with me. ...
7     ham   As per your request 'Melle Melle (Oru Minnamin...
8     spam  WINNER!! As a valued network customer you have...
9     spam  Had your mobile 11 months or more? U R entitle...
10    ham   I'm gonna be home soon and i don't want to tal...
11    spam  SIX chances to win CASH! From 100 to 20,000 po...
12    spam  URGENT! You have won a 1 week FREE membership ...
13    ham   I've been searching for the right words to tha...
14    ham   I HAVE A DATE ON SUNDAY WITH WILL!!
15    spam  XXXMobileMovieClub: To use your credit, click ...
16    ham   Oh k...i'm watching here:)
17    ham   Eh u remember how 2 spell his name... Yes i di...
18    ham   Fine if that?s the way u feel. That?s the way ...
19    spam  England v Macedonia - dont miss the goals/team...
20    ham   Is that seriously how you spell his name?
21    ham   I‘m going to try for 2 months ha ha only joking
22    ham   So ü pay first lar... Then when is da stock co...
23    ham   Aft i finish my lunch then i go str down lor. ...
24    ham   Ffffffffff. Alright no way I can meet up with ...
25    ham   Just forced myself to eat a slice. I'm really ...
26    ham   Lol your always so convincing.
27    ham   Did you catch the bus ? Are you frying an egg ...
28    ham   I'm back &amp; we're packing the car now, I'll...
29    ham   Ahhh. Work. I vaguely remember that! What does...
...   ...   ...
5544  ham   Armand says get your ass over to epsilon
5545  ham   U still havent got urself a jacket ah?
5546  ham   I'm taking derek &amp; taylor to walmart, if I...
5547  ham   Hi its in durban are you still on this number
5548  ham   Ic. There are a lotta childporn cars then.
5549  spam  Had your contract mobile 11 Mnths? Latest Moto...
5550  ham   No, I was trying it all weekend ;V
5551  ham   You know, wot people wear. T shirts, jumpers, ...
5552  ham   Cool, what time you think you can get here?
5553  ham   Wen did you get so spiritual and deep. That's ...
5554  ham   Have a safe trip to Nigeria. Wish you happines...
5555  ham   Hahaha..use your brain dear
5556  ham   Well keep in mind I've only got enough gas for...
5557  ham   Yeh. Indians was nice. Tho it did kane me off ...
5558  ham   Yes i have. So that's why u texted. Pshew...mi...
5559  ham   No. I meant the calculation is the same. That ...
5560  ham   Sorry, I'll call later
5561  ham   if you aren't here in the next &lt;#&gt; hou...
5562  ham   Anything lor. Juz both of us lor.
5563  ham   Get me out of this dump heap. My mom decided t...
5564  ham   Ok lor... Sony ericsson salesman... I ask shuh...
5565  ham   Ard 6 like dat lor.
5566  ham   Why don't you wait 'til at least wednesday to ...
5567  ham   Huh y lei...
5568  spam  REMINDER FROM O2: To get 2.50 pounds free call...
5569  spam  This is the 2nd time we have tried 2 contact u...
5570  ham   Will ü b going to esplanade fr home?
5571  ham   Pity, * was in mood for that. So...any other s...
5572  ham   The guy did some bitching but I acted like i'd...
5573  ham   Rofl. Its true to its name

[5574 rows x 2 columns]
With pandas, we can also view aggregate statistics easily:
In [5]:
messages.groupby('label').describe()
Out[5]:
label           message
ham    count    4827
       unique   4518
       top      Sorry, I'll call later
       freq     30
spam   count    747
       unique   653
       top      Please call our customer service representativ...
       freq     4
How long are the messages?
In [6]:
messages['length'] = messages['message'].map(lambda text: len(text))
print messages.head()
  label  message                                              length
0 ham    Go until jurong point, crazy.. Available only ...    111
1 ham    Ok lar... Joking wif u oni...                        29
2 spam   Free entry in 2 a wkly comp to win FA Cup fina...    155
3 ham    U dun say so early hor... U c already then say...    49
4 ham    Nah I don't think he goes to usf, he lives aro...    61
In [7]:
messages.length.plot(bins=20, kind='hist')
Out[7]:
<matplotlib.axes._subplots.AxesSubplot at 0x10dd7a990>
[Figure: histogram of message lengths]
In [8]:
messages.length.describe()
Out[8]:
count    5574.000000
mean       80.604593
std        59.919970
min         2.000000
25%        36.000000
50%        62.000000
75%       122.000000
max       910.000000
Name: length, dtype: float64
Which messages are the super long ones?
In [9]:
print list(messages.message[messages.length > 900])
["For me the love should start with attraction.i should feel that I need her every time around me.she should be the first thing which comes in my thoughts.I would start the day and end it with her.she should be there every time I dream.love will be then when my every breath has her name.my life should happen around her.my life will be named to her.I would cry for her.will give all my happiness and take all her sorrows.I will be ready to fight with anyone for her.I will be in love when I will be doing the craziest things for her.love will be when I don't have to proove anyone that my girl is the most beautiful lady on the whole planet.I will always be singing praises for her.love will be when I start up making chicken curry and end up makiing sambar.life will be the most beautiful then.will get every morning and thank god for the day because she is with me.I would like to say a lot..will tell later.."]
Is there any difference in message length between spam and ham?
In [10]:
messages.hist(column='length', by='label', bins=50)
Out[10]:
array([<matplotlib.axes._subplots.AxesSubplot object at 0x11270da50>,
       <matplotlib.axes._subplots.AxesSubplot object at 0x1126c7750>], dtype=object)
Great, but how do we get the computer to make sense of the text messages themselves? Can it understand this gibberish at all?
In this section we'll convert the raw messages (sequences of characters) into vectors (sequences of numbers).
The mapping is not one-to-one; we'll use the bag-of-words approach, in which each unique word is represented by one number.
As a first step, let's write a function that splits a message into its individual words:
In [11]:
from textblob import TextBlob

def split_into_tokens(message):
    message = unicode(message, 'utf8')  # convert bytes into proper unicode
    return TextBlob(message).words
Here are some of the original texts again:
In [12]:
messages.message.head()
Out[12]:
0    Go until jurong point, crazy.. Available only ...
1    Ok lar... Joking wif u oni...
2    Free entry in 2 a wkly comp to win FA Cup fina...
3    U dun say so early hor... U c already then say...
4    Nah I don't think he goes to usf, he lives aro...
Name: message, dtype: object
And here are the same messages after tokenization:
In [13]:
messages.message.head().apply(split_into_tokens)
Out[13]:
0    [Go, until, jurong, point, crazy, Available, o...
1    [Ok, lar, Joking, wif, u, oni]
2    [Free, entry, in, 2, a, wkly, comp, to, win, F...
3    [U, dun, say, so, early, hor, U, c, already, t...
4    [Nah, I, do, n't, think, he, goes, to, usf, he...
Name: message, dtype: object
Natural language processing (NLP) questions:
Do capital letters carry information?
Do different forms of a word ("goes" vs. "go") carry information?
Do interjections and determiners carry information?
In other words, we want to normalize the text better.
With textblob, we can get part-of-speech (POS) tags:
In [14]:
TextBlob("Hello world, how is it going?").tags  # list of (word, POS) pairs
Out[14]:
[(u'Hello', u'UH'),
 (u'world', u'NN'),
 (u'how', u'WRB'),
 (u'is', u'VBZ'),
 (u'it', u'PRP'),
 (u'going', u'VBG')]
and normalize words into their base form (lemmas):
In [15]:
def split_into_lemmas(message):
    message = unicode(message, 'utf8').lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words]

messages.message.head().apply(split_into_lemmas)
Out[15]:
0    [go, until, jurong, point, crazy, available, o...
1    [ok, lar, joking, wif, u, oni]
2    [free, entry, in, 2, a, wkly, comp, to, win, f...
3    [u, dun, say, so, early, hor, u, c, already, t...
4    [nah, i, do, n't, think, he, go, to, usf, he, ...
Name: message, dtype: object
Much better. You can probably think of many more ways to improve the preprocessing: decoding HTML entities (the &amp; and &lt; we saw above); filtering out stop words (pronouns etc.); adding more features, such as a word-in-all-caps indicator, and so on.
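The extra preprocessing ideas listed above can be sketched with the standard library alone. This is an illustration in Python 3, not the article's pipeline: the stop-word list is a tiny made-up sample, the regular-expression tokenizer is a rough stand-in for TextBlob, and `preprocess` is a hypothetical helper name.

```python
import html
import re

# A tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"i", "we", "the", "a", "to", "now"}

def preprocess(message):
    """Decode HTML entities, tokenize, lowercase and drop stop words.

    Returns (tokens, has_all_caps_word).
    """
    text = html.unescape(message)              # "&amp;" -> "&", "&lt;" -> "<"
    raw_tokens = re.findall(r"[A-Za-z']+", text)
    # Flag messages containing a fully upper-case word of two or more letters.
    has_caps = any(tok.isupper() and len(tok) > 1 for tok in raw_tokens)
    tokens = [tok.lower() for tok in raw_tokens if tok.lower() not in STOP_WORDS]
    return tokens, has_caps

print(preprocess("I'm back &amp; we're packing the car now"))
print(preprocess("URGENT! You have won a 1 week FREE membership"))
```

Whether any of these steps actually helps is an empirical question; each one is a candidate to evaluate, not a guaranteed improvement.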
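Now we'll convert each message, represented above as a list of tokens (lemmas), into a vector that machine learning models can understand.
In the bag-of-words model, this takes three steps:
1. counting how many times each word occurs in each message (term frequency);
2. weighting the counts, so that frequently occurring words get a lower weight (inverse document frequency);
3. normalizing the vectors to unit length, to abstract away from the original text length (L2 norm).
Each vector has as many dimensions as there are unique words in the SMS corpus.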
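The three steps above can be sketched on a toy corpus in plain Python 3. This is a hand-rolled illustration, not scikit-learn's implementation: the three tiny documents are made up, `tfidf_vectors` is a hypothetical helper, and the smoothed IDF formula is one common variant chosen here as an assumption.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Turn tokenized documents into L2-normalized TF-IDF vectors."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    df = Counter(tok for doc in docs for tok in set(doc))
    # Smoothed IDF (an assumed variant): rarer tokens get larger weights.
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    vectors = []
    for doc in docs:
        counts = Counter(doc)                                  # step 1: term counts
        weighted = [counts[tok] * idf[tok] for tok in vocab]   # step 2: IDF weighting
        norm = math.sqrt(sum(w * w for w in weighted)) or 1.0
        vectors.append([w / norm for w in weighted])           # step 3: unit length
    return vocab, vectors

docs = [["free", "entry", "win"], ["ok", "lar"], ["win", "free", "free"]]
vocab, vectors = tfidf_vectors(docs)
print(vocab)
print([round(v, 3) for v in vectors[2]])
```

Each vector has one slot per vocabulary entry, which is why the real vectors are as wide as the number of unique words in the corpus.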
In [16]:
from sklearn.feature_extraction.text import CountVectorizer

bow_transformer = CountVectorizer(analyzer=split_into_lemmas).fit(messages['message'])
print len(bow_transformer.vocabulary_)
8874
Here we use scikit-learn (sklearn), a powerful Python library for machine learning that offers a multitude of methods and options.
Let's take one message and get its bag-of-words counts as a vector, using our new bow_transformer:
In [17]:
message4 = messages['message'][3]
print message4
U dun say so early hor... U c already then say...
In [18]:
bow4 = bow_transformer.transform([message4])
print bow4
print bow4.shape
  (0, 1158)  1
  (0, 1899)  1
  (0, 2897)  1
  (0, 2927)  1
  (0, 4021)  1
  (0, 6736)  2
  (0, 7111)  1
  (0, 7698)  1
  (0, 8013)  2
(1, 8874)
So message 4 contains nine unique words, two of which appear twice and the rest only once. As a sanity check: which words appear twice?
In [19]:
print bow_transformer.get_feature_names()[6736]
print bow_transformer.get_feature_names()[8013]
say
u
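The same sanity check can be reproduced without scikit-learn. The sketch below (Python 3) counts the words of message 4 with `collections.Counter`; the regular-expression tokenizer is a rough stand-in for `split_into_lemmas`, which happens to give the same tokens for this particular message.

```python
import re
from collections import Counter

message4 = "U dun say so early hor... U c already then say..."

# Lowercase and keep alphabetic runs only; a rough stand-in for the
# TextBlob-based split_into_lemmas used in the article.
tokens = re.findall(r"[a-z']+", message4.lower())
counts = Counter(tokens)

print(len(counts))                                      # number of unique words
print(sorted(w for w, c in counts.items() if c == 2))   # words occurring twice
```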
The bag-of-words counts for the entire SMS corpus form a large, sparse matrix:
In [20]:
messages_bow = bow_transformer.transform(messages['message'])
print 'sparse matrix shape:', messages_bow.shape
print 'number of non-zeros:', messages_bow.nnz
print 'sparsity: %.2f%%' % (100.0 * messages_bow.nnz / (messages_bow.shape[0] * messages_bow.shape[1]))
sparse matrix shape: (5574, 8874)
number of non-zeros: 80272
sparsity: 0.16%
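A quick worked check of that last figure, using only the numbers printed above (Python 3). Note that the value labeled "sparsity" is the share of cells that are non-zero, so the share of zero cells is its complement.

```python
rows, cols = 5574, 8874      # matrix shape printed above
non_zeros = 80272            # number of stored entries printed above

total_cells = rows * cols
density = non_zeros / total_cells    # fraction of cells that are non-zero

print(total_cells)
print("non-zero cells: %.2f%%" % (100.0 * density))
print("zero cells:     %.2f%%" % (100.0 * (1 - density)))
```

With so few non-zero cells, storing only the non-zero entries (as the sparse matrix does) takes a small fraction of the memory a dense array would need.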
Finally, after counting, term weighting and normalization can be done with TF-IDF, using scikit-learn's TfidfTransformer:
In [21]:
from sklearn.feature_extraction.text import TfidfTransformer

tfidf_transformer = TfidfTransformer().fit(messages_bow)
tfidf4 = tfidf_transformer.transform(bow4)
print tfidf4
  (0, 8013)  0.305114653686
  (0, 7698)  0.225299911221
  (0, 7111)  0.191390347987
  (0, 6736)  0.523371210191
  (0, 4021)  0.456354991921
  (0, 2927)  0.32967579251
  (0, 2897)  0.303693312742
  (0, 1899)  0.24664322833
  (0, 1158)  0.274934159477
What is the IDF (inverse document frequency) of the word "u"? And of the word "university"?
In [22]:
print tfidf_transformer.idf_[bow_transformer.vocabulary_['u']]
print tfidf_transformer.idf_[bow_transformer.vocabulary_['university']]
2.85068150539
8.23975323521
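To see where such numbers can come from, here is a sketch of a smoothed IDF, idf = ln((1 + n) / (1 + df)) + 1, which to my knowledge is what TfidfTransformer computes with its default settings. The document frequencies used below (875 and 3) are not given in the article; they are back-solved from the printed values under that assumption, so treat them as illustrative.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency: ln((1 + n) / (1 + df)) + 1."""
    return math.log((1.0 + n_docs) / (1.0 + doc_freq)) + 1.0

N_DOCS = 5574  # messages in the corpus

# Document frequencies are back-solved from the printed IDF values under
# the smoothing assumption; they are not taken from the article.
print(smoothed_idf(N_DOCS, 875))   # a frequent token such as "u"
print(smoothed_idf(N_DOCS, 3))     # a rare token such as "university"
```

The rarer the word, the larger its IDF, which is why "university" is weighted far more heavily than "u".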
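To transform the entire bag-of-words corpus into a TF-IDF corpus at once:
In [23]:
messages_tfidf = tfidf_transformer.transform(messages_bow)
print messages_tfidf.shape
(5574, 8874)
There are many ways to preprocess and vectorize data. These two steps, together also called "feature engineering", are typically the most time-consuming and least glamorous parts of building a predictive pipeline, but they are very important and take some experience. The trick is to evaluate constantly: analyze the model's errors, improve data cleaning and preprocessing, brainstorm new features, evaluate again, and so on.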
With the messages represented as vectors, we can finally train our spam/ham classifier. This part is fairly straightforward, and many libraries implement the training algorithms.
We'll use scikit-learn here, starting with the Naive Bayes classifier:
In [24]:
from sklearn.naive_bayes import MultinomialNB

%time spam_detector = MultinomialNB().fit(messages_tfidf, messages['label'])
CPU times: user 4.51 ms, sys: 987 µs, total: 5.49 ms
Wall time: 4.77 ms
Let's try classifying a single random message:
In [25]:
print 'predicted:', spam_detector.predict(tfidf4)[0]
print 'expected:', messages.label[3]
predicted: ham
expected: ham
Hooray! You can try it with your own texts, too.
A natural question to ask is: how many messages do we classify correctly overall?
In [26]:
all_predictions = spam_detector.predict(messages_tfidf)
print all_predictions
['ham' 'ham' 'spam' ..., 'ham' 'ham' 'ham']
In [27]:
from sklearn.metrics import accuracy_score, confusion_matrix

print 'accuracy', accuracy_score(messages['label'], all_predictions)
print 'confusion matrix\n', confusion_matrix(messages['label'], all_predictions)
print '(row=expected, col=predicted)'
accuracy 0.969501255831
confusion matrix
[[4827    0]
 [ 170  577]]
(row=expected, col=predicted)
In [28]:
import matplotlib.pyplot as plt

plt.matshow(confusion_matrix(messages['label'], all_predictions), cmap=plt.cm.binary, interpolation='nearest')
plt.title('confusion matrix')
plt.colorbar()
plt.ylabel('expected label')
plt.xlabel('predicted label')
Out[28]:
<matplotlib.text.Text at 0x11643f6d0>
From this confusion matrix, we can compute precision and recall, or their combination (harmonic mean), the F1 score:
In [29]:
from sklearn.metrics import classification_report

print classification_report(messages['label'], all_predictions)
             precision    recall  f1-score   support

        ham       0.97      1.00      0.98      4827
       spam       1.00      0.77      0.87       747

avg / total       0.97      0.97      0.97      5574
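The per-class figures in that report follow directly from the four cells of the confusion matrix. A minimal sketch in Python 3, using only the counts printed above (`per_class_metrics` is a hypothetical helper name):

```python
def per_class_metrics(tp, fp, fn):
    """Precision, recall and F1 from true-positive, false-positive and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Confusion matrix printed above: rows = expected, columns = predicted.
ham_ham, ham_spam = 4827, 0
spam_ham, spam_spam = 170, 577

spam = per_class_metrics(tp=spam_spam, fp=ham_spam, fn=spam_ham)
ham = per_class_metrics(tp=ham_ham, fp=spam_ham, fn=ham_spam)
accuracy = (ham_ham + spam_spam) / (ham_ham + ham_spam + spam_ham + spam_spam)

print("spam: precision=%.2f recall=%.2f f1=%.2f" % spam)
print("ham:  precision=%.2f recall=%.2f f1=%.2f" % ham)
print("accuracy=%.4f" % accuracy)
```

The 170 spam messages predicted as ham are what pull spam recall down to 0.77, even though no ham message was flagged as spam.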
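There are quite a few metrics for evaluating model performance, and which one is most suitable depends on the task. For example, mispredicting "spam" as "ham" probably costs much less than mispredicting "ham" as "spam".
In the "evaluation" above, we committed a cardinal sin. To keep the demonstration simple, we measured accuracy on the same data we used for training. Never evaluate on your training data. It is wrong.
Such an evaluation tells us nothing about the model's true predictive power. If we simply memorized every example during training, accuracy on the training data would be very close to 100%, yet we could not classify any new message.
The proper way is to split the data into a training set and a test set. The model may only use the training data during fitting and parameter tuning, and must not use the test data in any way. This ensures the model is not "cheating", so the final evaluation on test data is representative of true predictive performance.
In [30]:
from sklearn.cross_validation import train_test_split  # module path in scikit-learn < 0.18

msg_train, msg_test, label_train, label_test = \
    train_test_split(messages['message'], messages['label'], test_size=0.2)

print len(msg_train), len(msg_test), len(msg_train) + len(msg_test)
4459 1115 5574
So, as requested, the test set is 20% of the entire dataset (1115 messages out of 5574), and the training set is the rest (4459 out of 5574).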
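The idea behind such a split fits in a few lines of plain Python 3. This is a sketch, not scikit-learn's implementation: `split_indices` is a hypothetical helper, the fixed seed is only there for repeatability, and rounding the test share up is an assumption chosen because it reproduces the sizes printed above.

```python
import math
import random

def split_indices(n_samples, test_size=0.2, seed=0):
    """Shuffle range(n_samples) and split it into (train, test) index lists."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)        # fixed seed for repeatability
    n_test = math.ceil(n_samples * test_size)   # round the test share up
    return indices[n_test:], indices[:n_test]

train_idx, test_idx = split_indices(5574, test_size=0.2)
print(len(train_idx), len(test_idx), len(train_idx) + len(test_idx))
```

The important property is that the two index lists never overlap, so no test message can leak into training.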
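Let's recap the entire process so far, putting all the steps into a scikit-learn Pipeline:
In [31]:
from sklearn.pipeline import Pipeline

def split_into_lemmas(message):
    message = unicode(message, 'utf8').lower()
    words = TextBlob(message).words
    # for each word, take its "base form" = lemma
    return [word.lemma for word in words]

pipeline = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),  # strings to token integer counts
    ('tfidf', TfidfTransformer()),  # integer counts to weighted TF-IDF scores
    ('classifier', MultinomialNB()),  # train on TF-IDF vectors w/ Naive Bayes classifier
])
A common practice is to partition the training set again into smaller subsets, for example 5 equally sized ones. We then train the model on four of them and compute accuracy on the remaining one (called the "validation set"). Repeating this five times, with a different subset held out for validation each time, gives a sense of the model's "stability". If the scores differ wildly between subsets, something is probably wrong (bad data, or bad model variance). Go back, analyze the errors, re-check the input data for garbage, and re-check the data cleaning.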
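The fold rotation described above can be sketched in plain Python 3. This illustration only produces index lists; `k_fold_indices` is a hypothetical helper, and it neither shuffles nor stratifies by label, which real cross-validation utilities usually do.

```python
def k_fold_indices(n_samples, k=5):
    """Yield (train, validation) index lists for k roughly equal folds."""
    indices = list(range(n_samples))
    # Spread the remainder over the first folds so sizes differ by at most 1.
    fold_sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in fold_sizes:
        validation = indices[start:start + size]
        train = indices[:start] + indices[start + size:]
        yield train, validation
        start += size

# 4459 is the size of the training set from the split above.
for fold_no, (train, validation) in enumerate(k_fold_indices(4459, k=5)):
    print(fold_no, len(train), len(validation))
```

Every example lands in the validation part exactly once, so each one contributes to exactly one of the k scores.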
In our case, everything goes smoothly:
In [32]:
from sklearn.cross_validation import cross_val_score  # module path in scikit-learn < 0.18

scores = cross_val_score(pipeline,  # steps to convert raw messages into models
                         msg_train,  # training data
                         label_train,  # training labels
                         cv=10,  # split data randomly into 10 parts: 9 for training, 1 for scoring
                         scoring='accuracy',  # which scoring metric?
                         n_jobs=-1,  # -1 = use all cores = faster
                         )
print scores
[ 0.93736018  0.96420582  0.94854586  0.94183445  0.96412556  0.94382022
  0.94606742  0.96404494  0.94831461  0.94606742]
The scores are indeed a little worse than when we trained on the entire dataset (accuracy 0.97 on 5574 training examples), but they are fairly stable:
In [33]:
print scores.mean(), scores.std()
0.9504386476 0.00947200821389
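The two summary numbers can be recomputed from the ten fold scores with the standard library (Python 3). The population standard deviation is used on the assumption that it matches the default of the `std()` call above.

```python
import statistics

# The ten fold accuracies printed above.
scores = [0.93736018, 0.96420582, 0.94854586, 0.94183445, 0.96412556,
          0.94382022, 0.94606742, 0.96404494, 0.94831461, 0.94606742]

mean = statistics.mean(scores)
spread = statistics.pstdev(scores)   # population standard deviation

print(round(mean, 4), round(spread, 4))
print(min(scores), max(scores))
```

A spread of about one percentage point across folds is what "fairly stable" means here.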
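A natural question is: how can we improve this model? The scores are already high here, but how would we go about improving a model in general?
Naive Bayes is a high-bias, low-variance classifier (simple and stable, not prone to overfitting). Examples of the opposite, low bias and high variance (prone to overfitting), are the k-nearest-neighbour (kNN) classifier and decision trees. Bagging (random forests) is a way to lower variance by training many (high-variance) models and averaging them.
In other words:
high bias = the classifier is opinionated. It has its own ideas and leaves limited room for the data to change them. On the other hand, there is also not much room for overfitting (left image).
low bias = the classifier is more obedient, but also more neurotic. As everybody knows, doing exactly what you are told can cause trouble (right image).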
In [34]:
import numpy as np
import matplotlib.pyplot as plt
from sklearn.learning_curve import learning_curve  # module path in scikit-learn < 0.18

def plot_learning_curve(estimator, title, X, y, ylim=None, cv=None,
                        n_jobs=-1, train_sizes=np.linspace(.1, 1.0, 5)):
    """
    Generate a simple plot of the test and training learning curve.

    Parameters
    ----------
    estimator : object type that implements the "fit" and "predict" methods
        An object of that type which is cloned for each validation.

    title : string
        Title for the chart.

    X : array-like, shape (n_samples, n_features)
        Training vector, where n_samples is the number of samples and
        n_features is the number of features.

    y : array-like, shape (n_samples) or (n_samples, n_features), optional
        Target relative to X for classification or regression;
        None for unsupervised learning.

    ylim : tuple, shape (ymin, ymax), optional
        Defines minimum and maximum yvalues plotted.

    cv : integer, cross-validation generator, optional
        If an integer is passed, it is the number of folds (defaults to 3).
        Specific cross-validation objects can be passed, see
        sklearn.cross_validation module for the list of possible objects

    n_jobs : integer, optional
        Number of jobs to run in parallel (default 1).
    """
    plt.figure()
    plt.title(title)
    if ylim is not None:
        plt.ylim(*ylim)
    plt.xlabel("Training examples")
    plt.ylabel("Score")
    train_sizes, train_scores, test_scores = learning_curve(
        estimator, X, y, cv=cv, n_jobs=n_jobs, train_sizes=train_sizes)
    train_scores_mean = np.mean(train_scores, axis=1)
    train_scores_std = np.std(train_scores, axis=1)
    test_scores_mean = np.mean(test_scores, axis=1)
    test_scores_std = np.std(test_scores, axis=1)
    plt.grid()

    plt.fill_between(train_sizes, train_scores_mean - train_scores_std,
                     train_scores_mean + train_scores_std, alpha=0.1,
                     color="r")
    plt.fill_between(train_sizes, test_scores_mean - test_scores_std,
                     test_scores_mean + test_scores_std, alpha=0.1, color="g")
    plt.plot(train_sizes, train_scores_mean, 'o-', color="r",
             label="Training score")
    plt.plot(train_sizes, test_scores_mean, 'o-', color="g",
             label="Cross-validation score")

    plt.legend(loc="best")
    return plt
In [35]:
%time plot_learning_curve(pipeline, "accuracy vs. training set size", msg_train, label_train, cv=5)
CPU times: user 382 ms, sys: 83.1 ms, total: 465 ms
Wall time: 28.5 s
Out[35]:
<module 'matplotlib.pyplot' from '/Volumes/work/workspace/vew/sklearn_intro/lib/python2.7/site-packages/matplotlib/pyplot.pyc'>
(We are effectively training on 64% of all available data: we set aside 20% as the test set above, and the 5-fold cross-validation holds out another 20% of the remainder for validation, so 0.8*0.8*5574 = 3567 training examples are left.)
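The arithmetic in that note, spelled out as a small Python 3 check (the variable names are mine):

```python
total = 5574
test_share = 0.2          # held out as the final test set
validation_share = 0.2    # one of five folds, held out in each CV round

train_share = (1 - test_share) * (1 - validation_share)
print(round(train_share, 2))        # share of all data used for fitting
print(int(total * train_share))     # examples fitted on in each CV round
```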
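Since the scores keep improving for both training and cross-validation, we see that, given so little data, the model is not complex/flexible enough to capture all the nuances. In this particular case the effect is not very pronounced, because the accuracies are high anyway.
At this point, we have two options:
1. use more training data, to overcome the low model complexity;
2. use a more complex (lower-bias) model, to get more out of the existing data.
Over the last few years, as large training collections have become easier to gather and machines have become faster, option 1 has become more and more popular (simpler algorithms, more data). Simple algorithms such as Naive Bayes also have the added benefit of being easier to interpret (compared to more complex, black-box models such as neural networks).
Knowing how to evaluate models properly, we can now explore how different parameters affect performance.
What we've seen so far is only the tip of the iceberg; there are many other parameters to tune. One example is which algorithm to use for training.
We've used Naive Bayes above, but scikit-learn supports many classifiers: support vector machines, nearest neighbours, decision trees, ensemble methods and more.
We can ask: what is the effect of IDF weighting on accuracy? Does the extra processing cost of lemmatization (compared with plain words) really help?
Let's find out: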
In [37]:
from sklearn.grid_search import GridSearchCV  # module path in scikit-learn < 0.18
from sklearn.cross_validation import StratifiedKFold  # module path in scikit-learn < 0.18

params = {
    'tfidf__use_idf': (True, False),
    'bow__analyzer': (split_into_lemmas, split_into_tokens),
}

grid = GridSearchCV(
    pipeline,  # pipeline from above
    params,  # parameters to tune via cross validation
    refit=True,  # fit using all available data at the end, on the best found param combination
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(label_train, n_folds=5),  # what type of cross validation to use
)
In [38]:
%time nb_detector = grid.fit(msg_train, label_train)

print nb_detector.grid_scores_
CPU times: user 4.09 s, sys: 291 ms, total: 4.38 s
Wall time: 20.2 s
[mean: 0.94752, std: 0.00357, params: {'tfidf__use_idf': True, 'bow__analyzer': <function split_into_lemmas at 0x1131e8668>},
 mean: 0.92958, std: 0.00390, params: {'tfidf__use_idf': False, 'bow__analyzer': <function split_into_lemmas at 0x1131e8668>},
 mean: 0.94528, std: 0.00259, params: {'tfidf__use_idf': True, 'bow__analyzer': <function split_into_tokens at 0x11270b7d0>},
 mean: 0.92868, std: 0.00240, params: {'tfidf__use_idf': False, 'bow__analyzer': <function split_into_tokens at 0x11270b7d0>}]
(The best parameter combination here is the first one listed: use_idf=True together with analyzer=split_into_lemmas.)
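What a grid search does with such a parameter dictionary can be sketched with `itertools.product` (Python 3). The analyzer functions are replaced by their names so the snippet stands alone, and the mean scores are copied from the output above rather than recomputed; enumerating keys in sorted order is an assumption that happens to reproduce the order of that output.

```python
from itertools import product

# Same grid as above, with the analyzer functions replaced by their names.
params = {
    "tfidf__use_idf": (True, False),
    "bow__analyzer": ("split_into_lemmas", "split_into_tokens"),
}

names = sorted(params)
combos = [dict(zip(names, values))
          for values in product(*(params[n] for n in names))]

# Mean cross-validation accuracies copied from the output above, same order.
mean_scores = [0.94752, 0.92958, 0.94528, 0.92868]

best_score, best_combo = max(zip(mean_scores, combos), key=lambda pair: pair[0])
print(len(combos))
print(best_score, best_combo)
```

Two parameters with two values each give four candidates; every extra parameter multiplies that count, which is why larger grids get expensive quickly.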
A quick sanity check:
In [39]:
print nb_detector.predict_proba(["Hi mom, how are you?"])[0]
print nb_detector.predict_proba(["WINNER! Credit for free!"])[0]
[ 0.99383955  0.00616045]
[ 0.29663109  0.70336891]
predict_proba returns the predicted probability for each class (ham, spam). In the first case, the message is predicted to be ham with a probability above 99%, and spam with a probability below 1%. So if forced to choose, the model will say "ham":
In [40]:
print nb_detector.predict(["Hi mom, how are you?"])[0]
print nb_detector.predict(["WINNER! Credit for free!"])[0]
ham
spam
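The step from probabilities to a label is simply "pick the most probable class". A minimal Python 3 sketch using the probabilities printed above; the (ham, spam) column order follows the article's description, and `pick_label` is a hypothetical helper.

```python
CLASSES = ["ham", "spam"]   # assumed column order of the probabilities

def pick_label(probabilities, classes=CLASSES):
    """Return the class with the highest predicted probability."""
    best_index = max(range(len(classes)), key=lambda i: probabilities[i])
    return classes[best_index]

print(pick_label([0.99383955, 0.00616045]))
print(pick_label([0.29663109, 0.70336891]))
```

If false alarms on ham are costlier than missed spam, one could instead require the spam probability to clear a higher threshold before flagging a message.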
And the overall scores on the test set, which we have not used at all during training:
In [41]:
predictions = nb_detector.predict(msg_test)
print confusion_matrix(label_test, predictions)
print classification_report(label_test, predictions)
[[973   0]
 [ 46  96]]
             precision    recall  f1-score   support

        ham       0.95      1.00      0.98       973
       spam       1.00      0.68      0.81       142

avg / total       0.96      0.96      0.96      1115
This is the realistic predictive performance we can expect from our spam detection pipeline when using lemmatization, TF-IDF and a Naive Bayes classifier.
Let's try another classifier: support vector machines (SVM). SVMs are a good starting point for text data: they produce results very quickly and need pleasantly little parameter tuning (although a bit more than Naive Bayes).
In [42]:
from sklearn.svm import SVC

pipeline_svm = Pipeline([
    ('bow', CountVectorizer(analyzer=split_into_lemmas)),
    ('tfidf', TfidfTransformer()),
    ('classifier', SVC()),  # <== change here
])

# pipeline parameters to automatically explore and tune
param_svm = [
    {'classifier__C': [1, 10, 100, 1000], 'classifier__kernel': ['linear']},
    {'classifier__C': [1, 10, 100, 1000], 'classifier__gamma': [0.001, 0.0001], 'classifier__kernel': ['rbf']},
]

grid_svm = GridSearchCV(
    pipeline_svm,  # pipeline from above
    param_grid=param_svm,  # parameters to tune via cross validation
    refit=True,  # fit using all data, on the best detected classifier
    n_jobs=-1,  # number of cores to use for parallelization; -1 for "all cores"
    scoring='accuracy',  # what score are we optimizing?
    cv=StratifiedKFold(label_train, n_folds=5),  # what type of cross validation to use
)
In [43]:
%time svm_detector = grid_svm.fit(msg_train, label_train)  # find the best combination from param_svm

print svm_detector.grid_scores_
CPU times: user 5.24 s, sys: 170 ms, total: 5.41 s
Wall time: 1min 8s
[mean: 0.98677, std: 0.00259, params: {'classifier__kernel': 'linear', 'classifier__C': 1},
 mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 10},
 mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 100},
 mean: 0.98654, std: 0.00100, params: {'classifier__kernel': 'linear', 'classifier__C': 1000},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 1},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 10},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 10},
 mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 100},
 mean: 0.86432, std: 0.00006, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 100},
 mean: 0.98722, std: 0.00280, params: {'classifier__gamma': 0.001, 'classifier__kernel': 'rbf', 'classifier__C': 1000},
 mean: 0.97040, std: 0.00587, params: {'classifier__gamma': 0.0001, 'classifier__kernel': 'rbf', 'classifier__C': 1000}]
So a linear kernel with C=1 is among the best parameter combinations: its mean accuracy (0.98677) is within one fold standard deviation of the top score in the list, 0.98722 for the RBF kernel with C=1000 and gamma=0.001.
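When reading a grid like this, it helps to rank the candidates and compare the gap between the top two with the fold-to-fold spread. The Python 3 sketch below does that for three entries copied from the output above (the tuple layout and shortened parameter names are mine):

```python
# (mean accuracy, std, parameters) for three entries copied from the output above.
results = [
    (0.98677, 0.00259, {"kernel": "linear", "C": 1}),
    (0.98654, 0.00100, {"kernel": "linear", "C": 10}),
    (0.98722, 0.00280, {"kernel": "rbf", "C": 1000, "gamma": 0.001}),
]

ranked = sorted(results, key=lambda r: r[0], reverse=True)
best, runner_up = ranked[0], ranked[1]
gap = best[0] - runner_up[0]

print(best[2], runner_up[2])
print("gap=%.5f, smaller than fold std: %s" % (gap, gap < min(best[1], runner_up[1])))
```

A gap well below the standard deviation across folds suggests the two settings are practically tied, so the simpler linear kernel is a reasonable pick.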
A sanity check again:
In [44]:
print svm_detector.predict(["Hi mom, how are you?"])[0]
print svm_detector.predict(["WINNER! Credit for free!"])[0]
ham
spam
In [45]:
print confusion_matrix(label_test, svm_detector.predict(msg_test))
print classification_report(label_test, svm_detector.predict(msg_test))
[[965   8]
 [ 13 129]]
             precision    recall  f1-score   support

        ham       0.99      0.99      0.99       973
       spam       0.94      0.91      0.92       142

avg / total       0.98      0.98      0.98      1115
This is the realistic predictive performance we can expect from our spam detection pipeline when using SVMs.
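After the basic analysis and tuning, the real work (engineering) begins.
The final step in building a production predictor is to train it on the entire dataset again, to make full use of all available data. We would of course use the best parameters found via cross-validation above. This is very similar to what we did at the very beginning, but this time we have insight into the model's behaviour and stability, because the evaluation was done honestly, on distinct train/test subsets.
The final predictor can be serialized to disk, so that the next time we want to use it, we can skip all training and use the trained model directly:
In [46]:
import cPickle

# store the spam detector to disk after training
with open('sms_spam_detector.pkl', 'wb') as fout:
    cPickle.dump(svm_detector, fout)

# ...and load it back, whenever needed, possibly on a different machine
with open('sms_spam_detector.pkl', 'rb') as fin:
    svm_detector_reloaded = cPickle.load(fin)
The loaded result is an object that behaves identically to the original:
In [47]:
print 'before:', svm_detector.predict([message4])[0]
print 'after:', svm_detector_reloaded.predict([message4])[0]
before: ham
after: ham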
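The same save-and-reload round trip in Python 3, where the module is called `pickle`. The "model" here is a made-up keyword set standing in for a trained estimator, and `predict` is a hypothetical helper; the point is only the binary file modes and the check that the reloaded object behaves like the original.

```python
import os
import pickle
import tempfile

# Hypothetical stand-in for a trained model: plain data that a predict
# function consults. Trained estimators are pickled the same way.
model = {"spam_keywords": {"winner!", "free", "free!"}}

def predict(model, messages):
    """Label a message spam if it contains any of the model's keywords."""
    return ["spam" if model["spam_keywords"] & set(m.lower().split()) else "ham"
            for m in messages]

path = os.path.join(tempfile.mkdtemp(), "toy_detector.pkl")
with open(path, "wb") as fout:      # pickle files are binary: write with "wb"
    pickle.dump(model, fout)
with open(path, "rb") as fin:       # ...and read them back with "rb"
    reloaded = pickle.load(fin)

sample = ["Hi mom, how are you?", "WINNER! Credit for free!"]
print(predict(model, sample), predict(reloaded, sample))
```

Unpickling can execute arbitrary code, so only load pickle files that come from a source you trust.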
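Another important part of a production implementation is performance. After a rapid, iterative round of model tuning and parameter search like the one shown here, a well-performing model can be translated into a different language and optimized. Would trading a few points of accuracy give us a smaller, faster model? Is it worth optimizing memory usage, perhaps using mmap to share memory across processes?
Note that optimization is not always necessary; let the actual situation guide the decision.
Other things to consider for a production pipeline are robustness (service failover, redundancy, load balancing), monitoring (including automatic alerts on anomalies) and HR fungibility (avoiding "knowledge silos" about how things are done, arcane or lock-in technologies, and black-art tuning of results). These days the open source world offers viable solutions in all of these areas. All the tools shown here are free for commercial use under OSI-approved open source licenses.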