Python Basics: Feature Engineering

Published: 2020-08-16 15:36:07 | Source: ITPUB blog | Views: 165 | Author: ckxllf | Category: Programming Languages

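1. Feature Extraction

Loading features from dictionaries: DictVectorizer

Text feature extraction: term-frequency vectors (CountVectorizer), TF-IDF vectors (TfidfVectorizer, TfidfTransformer), and feature-hashing vectors (HashingVectorizer)

Image feature extraction: pixel matrices, edges, and interest points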

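1.1 Loading Features from Dictionaries

Storing features in Python dictionaries is common practice because dictionaries are easy to read, but scikit-learn expects its input features as NumPy arrays or SciPy sparse matrices. DictVectorizer converts a list of dictionaries into such a matrix and applies one-hot encoding to categorical (string-valued) features.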

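    from sklearn.feature_extraction import DictVectorizer

    # Each sample is a dict: 'city' is categorical, 'temperature' is numeric
    me = [
        {'city': 'Dubai', 'temperature': 33.},
        {'city': 'London', 'temperature': 12.},
        {'city': 'San Francisco', 'temperature': 18.},
    ]

    vec = DictVectorizer()
    print(vec.fit_transform(me).toarray())
    # get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
    vec.get_feature_names_out()

Output:

    [[ 1.  0.  0. 33.]
     [ 0.  1.  0. 12.]
     [ 0.  0.  1. 18.]]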
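To make the one-hot encoding concrete, here is a minimal pure-Python sketch of roughly what happens to the samples above. The helper name `dict_vectorize` is hypothetical and is not part of scikit-learn; it only mirrors the `key=value` column naming for string features and ignores everything else the real class does (sparse output, other value types).

```python
def dict_vectorize(samples):
    """Hypothetical helper: one-hot encode string values, pass numbers through."""
    # Collect column names: 'key=value' for strings, plain 'key' for numbers
    names = set()
    for sample in samples:
        for key, value in sample.items():
            names.add(f"{key}={value}" if isinstance(value, str) else key)
    names = sorted(names)
    index = {name: i for i, name in enumerate(names)}

    rows = []
    for sample in samples:
        row = [0.0] * len(names)
        for key, value in sample.items():
            if isinstance(value, str):
                row[index[f"{key}={value}"]] = 1.0  # one-hot slot for this category
            else:
                row[index[key]] = float(value)      # numeric value copied as-is
        rows.append(row)
    return names, rows


samples = [
    {'city': 'Dubai', 'temperature': 33.},
    {'city': 'London', 'temperature': 12.},
    {'city': 'San Francisco', 'temperature': 18.},
]
names, rows = dict_vectorize(samples)
print(names)
for row in rows:
    print(row)
```

Each distinct city gets its own 0/1 column, while the numeric temperature keeps a single column, which matches the shape of the matrix printed above.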

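1.2 Term-Frequency Vectors

The bag-of-words model is the most common way to represent text numerically. It assigns one feature to each word, on the assumption that documents using similar words have similar meanings.

CountVectorizer converts every document to lowercase, splits each sentence into tokens (meaningful character sequences), and counts how often each token occurs.

The stop_words option excludes common function words that carry little meaning.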

    from sklearn.feature_extraction.text import CountVectorizer

    co = [
        'UNC played Duke in basketball',
        'Duke lost the basketball game ,game over',
        'I ate a sandwich',
    ]

    # Drop scikit-learn's built-in English stop words before counting
    vec = CountVectorizer(stop_words='english')
    print(vec.fit_transform(co).toarray())
    # Mapping from each token to its column index
    print(vec.vocabulary_)

Output (one row per document, followed by the vocabulary):

    [[0 1 1 0 0 1 0 1]
     [0 1 1 2 1 0 0 0]
     [1 0 0 0 0 0 1 0]]
    {'unc': 7, 'played': 5, 'duke': 2, 'basketball': 1, 'lost': 4, 'game': 3, 'ate': 0, 'sandwich': 6}
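The counting steps described above (lowercase, tokenize, drop stop words, count) can be sketched by hand with the standard library. This is only an illustration: the three-word `STOP_WORDS` set is a hand-picked assumption standing in for scikit-learn's much longer English list, and the regular expression keeps tokens of two or more word characters.

```python
import re
from collections import Counter

docs = [
    'UNC played Duke in basketball',
    'Duke lost the basketball game ,game over',
    'I ate a sandwich',
]

# Assumption: a tiny hand-picked stop list, not scikit-learn's English list
STOP_WORDS = {'in', 'the', 'over'}

def tokenize(text):
    # Lowercase, then keep runs of two or more word characters
    tokens = re.findall(r"\b\w\w+\b", text.lower())
    return [t for t in tokens if t not in STOP_WORDS]

tokenized = [tokenize(d) for d in docs]

# Column index of each term, assigned in alphabetical order
all_terms = sorted({t for doc in tokenized for t in doc})
vocabulary = {term: i for i, term in enumerate(all_terms)}

matrix = []
for doc in tokenized:
    counts = Counter(doc)
    matrix.append([counts[term] for term in all_terms])

print(vocabulary)
for row in matrix:
    print(row)
```

On these three sentences the sketch yields the same count matrix and column indices as the output shown above; on other text the results would differ wherever the real stop-word list applies.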

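    import jieba
    from sklearn.feature_extraction.text import CountVectorizer

    corpus = [
        '朋友，小紅是我的',
        '小明對小紅說：“小紅，我們還是不是朋友”',
        '小明與小紅是朋友',
    ]

    # Chinese has no spaces between words, so segment each sentence with jieba first
    cutcorpus = ["/".join(jieba.cut(x)) for x in corpus]
    # A custom stop-word list can be passed as a Python list
    vec = CountVectorizer(stop_words=['好的', '是的'])
    counts = vec.fit_transform(cutcorpus).todense()
    print(counts)

    # Inspect the token-to-column mapping
    print(vec.vocabulary_)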

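The Euclidean distance (L2 norm) between term-frequency vectors can be used to measure how far apart two documents are: the smaller the distance, the more similar the documents.

    import numpy as np
    # Pairwise Euclidean distance
    from sklearn.metrics.pairwise import euclidean_distances

    # Use a plain 2-D ndarray; newer scikit-learn releases do not accept np.matrix
    counts = np.asarray(counts)
    for x, y in [[0, 1], [0, 2], [1, 2]]:
        # counts[[x]] keeps the row two-dimensional, as euclidean_distances expects
        dist = euclidean_distances(counts[[x]], counts[[y]])
        print('Distance between document {} and document {}: {}'.format(x, y, dist))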
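As a worked example that needs no word segmentation, the same distance can be computed by hand on the count vectors of the three English sentences from the CountVectorizer output earlier in this section. This sketch uses `math.dist` from the standard library (Python 3.8+) rather than scikit-learn.

```python
import math

# Count vectors of the three English sentences (rows of the matrix printed earlier)
doc_vectors = [
    [0, 1, 1, 0, 0, 1, 0, 1],
    [0, 1, 1, 2, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 1, 0],
]

distances = {}
for x, y in [(0, 1), (0, 2), (1, 2)]:
    # math.dist computes sqrt(sum((a - b) ** 2 for each coordinate))
    distances[(x, y)] = math.dist(doc_vectors[x], doc_vectors[y])
    print(f'Distance between document {x} and document {y}: {distances[(x, y)]:.4f}')
```

The squared differences sum to 7, 6, and 9, so the distances are about 2.6458, 2.4495, and 3.0. Note that the first basketball sentence comes out slightly closer to the sandwich sentence than to the second basketball sentence, because raw counts are sensitive to document length and repeated words. That is one reason to normalize the vectors or weight them, as in the next section.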

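1.3 TF-IDF Weight Vectors

    from sklearn.feature_extraction.text import TfidfTransformer

    # smooth_idf=False: compute idf without adding 1 to the document counts
    transformer = TfidfTransformer(smooth_idf=False)
    # Rows are documents, columns are raw term counts
    counts = [[3, 0, 1],
              [2, 0, 0],
              [3, 0, 0],
              [4, 0, 0],
              [3, 2, 0],
              [3, 0, 2]]
    tfidf = transformer.fit_transform(counts)
    tfidf.toarray()

Output:

    array([[0.81940995, 0.        , 0.57320793],
           [1.        , 0.        , 0.        ],
           [1.        , 0.        , 0.        ],
           [1.        , 0.        , 0.        ],
           [0.47330339, 0.88089948, 0.        ],
           [0.58149261, 0.        , 0.81355169]])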
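To see where these numbers come from, the weights can be recomputed by hand. The sketch below assumes the formula scikit-learn documents for `smooth_idf=False`, namely idf(t) = ln(n / df(t)) + 1, followed by L2 normalization of each row. It uses only the standard library and is an illustration, not the library's implementation.

```python
import math

counts = [[3, 0, 1],
          [2, 0, 0],
          [3, 0, 0],
          [4, 0, 0],
          [3, 2, 0],
          [3, 0, 2]]

n_docs = len(counts)
n_terms = len(counts[0])

# Document frequency: number of documents in which each term appears
df = [sum(1 for row in counts if row[j] > 0) for j in range(n_terms)]

# Unsmoothed idf: ln(n / df) + 1
idf = [math.log(n_docs / d) + 1 for d in df]

tfidf = []
for row in counts:
    weighted = [tf * w for tf, w in zip(row, idf)]
    norm = math.sqrt(sum(v * v for v in weighted))
    tfidf.append([v / norm for v in weighted])  # L2-normalize each row

for row in tfidf:
    print([round(v, 8) for v in row])
```

The first term appears in all six documents, so its idf is ln(1) + 1 = 1. The third term appears in two documents, giving ln(3) + 1, which is why it carries more weight per occurrence in the first row (0.5732 for a count of 1 against 0.8194 for a count of 3).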

    from sklearn.feature_extraction.text import TfidfVectorizer

    # TfidfVectorizer combines CountVectorizer and TfidfTransformer in one step
    vectorizer = TfidfVectorizer()
    vectorizer.fit_transform(cutcorpus).toarray()
    vectorizer.vocabulary_

Output:

    {'小明': 0, '小紅': 1, '我們': 2, '是不是': 3, '朋友': 4}

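1.4 Feature Hashing

The bag-of-words approach is simple and works well, but it becomes hard to use in some situations. If the vocabulary after word segmentation is very large, say more than a million entries, loading the full term-frequency or TF-IDF feature matrix can exhaust memory.

In that case the hashing trick can be used to reduce the dimensionality.

A hash function maps a string of arbitrary length to a fixed-length hash value, so it is a typical many-to-one mapping. A good hash function has the following properties:

Fast to compute: given the plaintext and the hash algorithm, the hash value can be computed in bounded time with bounded resources.

Hard to invert: given one or more hash values, it is practically impossible to recover the plaintext in a reasonable amount of time.

Input sensitivity: changing even a small part of the input should produce a hash value that looks very different.

Collision resistance: it is hard to find two different plaintexts with the same hash value (a collision). That is, two different data blocks are extremely unlikely to share a hash value, and for a given data block it is extremely hard to find another block with the same hash value.

Widely known hash functions include MD4, MD5, and the SHA family.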
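The hashing trick itself can be sketched in a few lines: hash each token, take the result modulo the vector length to choose a column, and accumulate. The helper `hash_features` is hypothetical, and MD5 is used only because it ships with Python; scikit-learn's HashingVectorizer uses MurmurHash3 and additionally normalizes each row, so the numbers here will not match its output.

```python
import hashlib

def hash_features(tokens, n_features=6):
    """Hypothetical helper: map tokens into a fixed-length signed count vector."""
    vector = [0] * n_features
    for token in tokens:
        # MD5 is only a stand-in hash for this sketch
        digest = hashlib.md5(token.encode('utf-8')).digest()
        h = int.from_bytes(digest[:4], 'big')   # first 32 bits of the digest
        index = h % n_features                  # column = hash modulo vector length
        sign = 1 if (h >> 31) == 0 else -1      # top bit decides the sign
        vector[index] += sign
    return vector

print(hash_features('duke lost the basketball game game over'.split()))
```

No vocabulary is stored, so memory use depends only on `n_features`, not on how many distinct words occur. The price is that different words can collide in the same column; the alternating sign is one common way to let such collisions partly cancel instead of always adding up.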

    from sklearn.feature_extraction.text import HashingVectorizer

    corpus = ['smart boy', 'ate', 'bacon', 'a cat']
    # HashingVectorizer is stateless, so there is no need to call fit
    vectorizer = HashingVectorizer(n_features=6, stop_words='english')
    print(vectorizer.transform(corpus).todense())

Output:

    [[-0.70710678 -0.70710678  0.          0.          0.          0.        ]
     [ 0.          0.          0.          1.          0.          0.        ]
     [ 0.          0.          0.          0.         -1.          0.        ]
     [ 0.          1.          0.          0.          0.          0.        ]]

    from sklearn.feature_extraction.text import HashingVectorizer

    corpus = [
        'UNC played Duke in basketball',
        'Duke lost the basketball game ,game over',
        'I ate a sandwich',
    ]

    # Every document maps to 6 columns, regardless of vocabulary size
    vectorizer = HashingVectorizer(n_features=6)
    counts = vectorizer.transform(corpus).todense()
    print(counts)
    print(counts.shape)

Output:

    [[ 0.          0.         -0.89442719  0.          0.         -0.4472136 ]
     [-0.37796447 -0.75592895 -0.37796447  0.          0.         -0.37796447]
     [ 0.          0.          0.70710678  0.70710678  0.          0.        ]]
    (3, 6)

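2. Feature Selection

Once data preprocessing is complete, we need to select meaningful features to feed into the machine learning algorithm or model for training. Features are usually judged on two criteria:

Whether the feature varies: if a feature does not vary, for example its variance is close to 0, the samples are essentially identical on that feature, so it does nothing to distinguish them.

Correlation between the feature and the target: this one is more obvious. Features that are highly correlated with the target should be preferred. Apart from the variance method, all other methods covered in this article are based on correlation.

By the form they take, feature selection methods fall into three groups:

1. Filter: score each feature by its variability or correlation, then select features by setting a score threshold or the number of features to keep.

2. Wrapper: based on an objective function (usually a predictive performance score), add or remove several features at a time.

3. Embedded: first train a machine learning algorithm or model to obtain a weight coefficient for each feature, then select features by coefficient from largest to smallest. This resembles the Filter approach, except that the quality of each feature is determined through training.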

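2.1 Filter Methods

2.1.1 Variance Threshold

To use the variance threshold method, first compute the variance of each feature, then keep the features whose variance is greater than a chosen threshold. (This method is not used very often.)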

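    from sklearn.datasets import load_iris
    from sklearn.feature_selection import VarianceThreshold

    iris = load_iris()

    # Variance threshold selection; returns the data with only the selected features
    # threshold is the variance cutoff: keep features with variance greater than 3
    vardata = VarianceThreshold(threshold=3).fit_transform(iris.data)
    vardata.shape

Output:

    (150, 1)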
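The selection rule is simple enough to write out by hand. The sketch below uses a small made-up data set (the values and the 0.01 threshold are hypothetical) and the population variance, which divides by n, to decide which columns survive.

```python
from statistics import pvariance

# Hypothetical toy data: 4 samples (rows), 3 features (columns)
X = [
    [1.0, 10.0, 0.5],
    [1.0, 20.0, 0.6],
    [1.0, 30.0, 0.5],
    [1.0, 40.0, 0.6],
]
threshold = 0.01

columns = list(zip(*X))
# Population variance of each feature column
variances = [pvariance(col) for col in columns]
# Keep only the columns whose variance exceeds the threshold
keep = [j for j, v in enumerate(variances) if v > threshold]

X_selected = [[row[j] for j in keep] for row in X]
print(variances)
print(keep)
print(X_selected)
```

The first column is constant (variance 0) and the third barely moves (variance about 0.0025), so only the middle column is kept. Because variance depends on the unit of measurement, a single threshold is only meaningful when the features are on comparable scales.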

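2.1.2 Correlation Coefficient

To use the correlation coefficient method, first compute each feature's correlation coefficient with the target. The SelectKBest class from the feature_selection module can then select features based on those coefficients.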

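    from sklearn.feature_selection import SelectKBest
    from scipy.stats import pearsonr
    import numpy as np

    # Select the K best features and return the data with only those features.
    # The first argument is the scoring function: it takes the feature matrix and
    # the target vector and returns an array whose i-th entry is the score of the
    # i-th feature (a second array of p-values is optional).
    # abs() is used so that strongly negative correlations also rank highly.
    f = lambda X, Y: np.array([abs(pearsonr(x, Y)[0]) for x in X.T])
    SelectKBest(f, k=2).fit_transform(iris.data, iris.target)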
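For readers who want to see the scoring step without SciPy, here is a minimal sketch that computes Pearson's r by hand and picks the top k features. The toy data and the helper name `pearson_r` are made up for illustration.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical toy data: 3 features (columns) and a numeric target
X = [
    [1.0, 5.0, 2.0],
    [2.0, 3.0, 9.0],
    [3.0, 4.0, 4.0],
    [4.0, 1.0, 7.0],
    [5.0, 2.0, 5.0],
]
y = [1.1, 2.0, 2.9, 4.2, 5.0]

# Score every column by the absolute value of its correlation with y
scores = [abs(pearson_r(col, y)) for col in zip(*X)]

k = 2
ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
top_k = sorted(ranked[:k])  # indices of the k best features, in column order
print(scores)
print(top_k)
```

The second feature falls as the target rises, so its raw coefficient is negative; taking the absolute value is what lets it rank above the weakly related third feature.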

2.1.3 Chi-Square Test

    from sklearn.feature_selection import SelectKBest
    from sklearn.feature_selection import chi2

    # chi2 requires non-negative feature values, which holds for the iris measurements
    SelectKBest(chi2, k=2).fit_transform(iris.data, iris.target)

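2.1.4 Mutual Information

Classic mutual information also measures the dependence between a qualitative (categorical) independent variable and a qualitative dependent variable. The correlation coefficient, the chi-square test, and mutual information work on similar principles, but the correlation coefficient is generally suitable only for continuous features.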

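    import numpy as np
    from sklearn.feature_selection import SelectKBest
    from sklearn import metrics

    # mutual_info_score treats both of its inputs as discrete labels
    mic = metrics.mutual_info_score
    # Score each feature column by its mutual information with the target
    g = lambda X, Y: np.array([mic(x, Y) for x in X.T])
    SelectKBest(g, k=2).fit_transform(iris.data, iris.target)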
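For categorical data, mutual information can be computed directly from counts. The sketch below is a plain-Python illustration on made-up data; the function name `mutual_information` and the toy variables are hypothetical, and the result is in nats (natural logarithm).

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Mutual information (in nats) between two discrete label sequences."""
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x, y) * ln(p(x, y) / (p(x) * p(y))), written with raw counts
        mi += (c / n) * math.log(c * n / (count_x[x] * count_y[y]))
    return mi

# Hypothetical toy data: two categorical features and a class label
weather = ['sunny', 'sunny', 'rainy', 'rainy', 'sunny', 'rainy']
weekday = ['mon', 'tue', 'mon', 'tue', 'mon', 'tue']
label = ['out', 'out', 'in', 'in', 'out', 'in']

print(mutual_information(weather, label))
print(mutual_information(weekday, label))
```

Here `weather` determines the label completely, so its score equals the entropy of the label, ln 2. `weekday` is only loosely related to the label and scores much lower, so a top-k selection would keep `weather` first.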

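2.2 Wrapper Methods

Recursive Feature Elimination (RFE)

Recursive feature elimination trains a base model over multiple rounds. After each round it removes the features with the smallest weight coefficients, then trains the next round on the reduced feature set.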

    from sklearn.feature_selection import RFE
    from sklearn.linear_model import LogisticRegression

    # Recursive feature elimination; returns the data with only the selected features
    # estimator is the base model
    # n_features_to_select=2 is the number of features to keep
    RFE(estimator=LogisticRegression(), n_features_to_select=2).fit_transform(iris.data, iris.target)
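The elimination loop can be sketched without scikit-learn. Everything below is an assumption made for illustration: `rfe` and `fit_linear_weights` are hypothetical names, the base model is a small linear regression fitted by gradient descent rather than logistic regression, one feature is dropped per round, and the toy data is constructed so that the features are on the same scale.

```python
from itertools import product

def fit_linear_weights(X, y, epochs=200, lr=0.1):
    """Stand-in base model: linear regression fitted by gradient descent."""
    n, d = len(X), len(X[0])
    w, b = [0.0] * d, 0.0
    for _ in range(epochs):
        grad_w, grad_b = [0.0] * d, 0.0
        for row, target in zip(X, y):
            err = sum(wj * xj for wj, xj in zip(w, row)) + b - target
            for j in range(d):
                grad_w[j] += err * row[j] / n
            grad_b += err / n
        w = [wj - lr * g for wj, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w

def rfe(X, y, n_features_to_select, fit=fit_linear_weights):
    """Hypothetical sketch of RFE that drops one feature per round."""
    remaining = list(range(len(X[0])))
    while len(remaining) > n_features_to_select:
        # Retrain the base model on the features that are still in play
        X_sub = [[row[j] for j in remaining] for row in X]
        weights = fit(X_sub, y)
        # Remove the feature whose weight has the smallest magnitude
        weakest = min(range(len(remaining)), key=lambda i: abs(weights[i]))
        del remaining[weakest]
    return remaining

# Hypothetical toy data: three +/-1 features plus their product as a fourth
X = [[a, b, c, a * b * c] for a, b, c in product([-1.0, 1.0], repeat=3)]
# The target depends strongly on features 0 and 2, weakly on 1, not at all on 3
y = [3.0 * r[0] + 0.1 * r[1] - 2.0 * r[2] for r in X]

selected = rfe(X, y, n_features_to_select=2)
print(selected)
```

In this setup the unused fourth feature is dropped in the first round and the weakly weighted second feature in the next, leaving features 0 and 2. Comparing weight magnitudes is only meaningful when the features share a scale, which is why standardizing the inputs before this kind of selection is common.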

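2.3 Embedded Methods

https://blog.csdn.net/jinping_shi/article/details/52433975

Using a base model with a penalty term not only selects features but also reduces dimensionality. The following code uses the SelectFromModel class from the feature_selection module together with an L1-penalized logistic regression model to select features: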

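    from sklearn.feature_selection import SelectFromModel
    from sklearn.linear_model import LogisticRegression

    # Logistic regression with L1 regularization. The 'liblinear' solver supports
    # the L1 penalty, which the default 'lbfgs' solver does not.
    SelectFromModel(LogisticRegression(penalty='l1', C=0.1, solver='liblinear')).fit_transform(iris.data, iris.target)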

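L1-penalized dimensionality reduction works by keeping only one of several features that are equally correlated with the target, so a feature that was not selected is not necessarily unimportant. It can therefore be combined with an L2 penalty to refine the result. Concretely: if a feature has weight 1 under L1, take the features whose weights under L2 are close to it and whose weights under L1 are 0, group them into one set of similar features, and split the L1 weight evenly among the features in that set. This requires building a new logistic regression model.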
