This article introduces several practical feature selection methods for data mining in Python. Many people run into questions on this topic in day-to-day work, so I have gone through a range of material and put together some simple, usable methods that will hopefully clear things up. Let's work through them together!
對(duì)于從事數(shù)據(jù)分析、數(shù)據(jù)挖掘的小伙伴來(lái)說(shuō),特征選擇是繞不開(kāi)的話題,是數(shù)據(jù)挖掘過(guò)程中不可或缺的環(huán)節(jié)。好的特征選擇能夠提升模型的性能,更能幫助我們理解數(shù)據(jù)的特點(diǎn)、底層結(jié)構(gòu),這對(duì)進(jìn)一步改善模型、算法都有著重要作用。
What feature selection does
Reduces the number of features (dimensionality reduction), which strengthens the model's ability to generalize and reduces overfitting
Deepens our understanding of the features and their values
Feature selection methods
1. Feature importance
When the learner is a tree model, effective features can be screened by their feature importance. In sklearn, GBDT and random forests compute feature importance in the same way: the importance of each feature is computed on every individual tree, measuring how much that feature contributes on that tree, and the values are then averaged across trees. On a single tree, a feature's importance is defined as the weighted impurity decrease summed over all the non-leaf nodes that split on that feature; the larger the decrease, the more important the feature.
import numpy as np
from io import StringIO  # sklearn.externals.six has been removed from recent scikit-learn; io.StringIO works the same here
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree
import pydotplus

clf = DecisionTreeClassifier()
# toy dataset: one row per feature (4 features, 15 samples), transposed below
x = [[1,1,1,1,1,2,2,2,2,2,3,3,3,3,3],
     [1,1,2,2,1,1,1,2,1,1,1,1,2,2,1],
     [1,1,1,2,1,1,1,2,2,2,2,2,1,1,1],
     [1,2,2,1,1,1,2,2,3,3,3,2,2,3,1]]
y = [1,1,2,2,1,1,1,2,2,2,2,2,2,2,1]
x = np.array(x)
x = np.transpose(x)  # shape (15, 4): samples as rows, features as columns
clf.fit(x, y)
print(clf.feature_importances_)  # impurity-based importance of A1..A4

# export the fitted tree to a PDF for inspection
feature_name = ['A1', 'A2', 'A3', 'A4']
target_name = ['1', '2']
dot_data = StringIO()
tree.export_graphviz(clf, out_file=dot_data, feature_names=feature_name,
                     class_names=target_name, filled=True, rounded=True,
                     special_characters=True)
graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
graph.write_pdf("Tree.pdf")
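The explanation above mentions GBDT and random forests specifically, while the snippet uses a single decision tree. As a minimal sketch (not from the original article; the synthetic data and parameters are assumptions for illustration), the ensemble models expose the same feature_importances_ attribute:

import numpy as np
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier

# synthetic data: the target depends mainly on the first two of four features
rng = np.random.RandomState(0)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + 2 * X[:, 1] + rng.normal(scale=0.5, size=200) > 0).astype(int)

for model in (RandomForestClassifier(n_estimators=100, random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X, y)
    # averaged impurity-based importance across all trees in the ensemble
    print(type(model).__name__, model.feature_importances_)

With data like this, the first two columns should receive most of the importance mass, mirroring what the single-tree example shows for A1..A4.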
2. Regression model coefficients
The more important a feature is, the larger its coefficient in the model tends to be, while features unrelated to the output have coefficients close to 0. On data with little noise, or where the number of samples is far larger than the number of features, and when the features are relatively independent of each other, even the simplest linear regression model can achieve very good results.
from sklearn.linear_model import LinearRegression
import numpy as np

np.random.seed(0)
size = 5000
# A dataset with 3 features
X = np.random.normal(0, 1, (size, 3))
# Y = X0 + 2*X1 + noise
Y = X[:, 0] + 2 * X[:, 1] + np.random.normal(0, 2, size)

lr = LinearRegression()
lr.fit(X, Y)

# A helper method for pretty-printing linear models
def pretty_print_linear(coefs, names=None, sort=False):
    if names is None:
        names = ["X%s" % x for x in range(len(coefs))]
    lst = zip(coefs, names)
    if sort:
        lst = sorted(lst, key=lambda x: -np.abs(x[0]))
    return " + ".join("%s * %s" % (round(coef, 3), name)
                      for coef, name in lst)

print("Linear model:", pretty_print_linear(lr.coef_))
3. Mean decrease accuracy
The next example measures importance directly from model performance: after fitting a random forest, the values of one feature at a time are shuffled in the test set, and the resulting drop in the R² score tells us how much the model relies on that feature.

from collections import defaultdict
import numpy as np
from sklearn.datasets import load_boston
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import ShuffleSplit  # sklearn.cross_validation was removed in newer releases

# note: load_boston has been removed from recent scikit-learn releases;
# any regression dataset with named features can be substituted here
boston = load_boston()
X = boston["data"]
Y = boston["target"]
names = boston["feature_names"]

rf = RandomForestRegressor()
scores = defaultdict(list)

# crossvalidate the scores on a number of different random splits of the data
for train_idx, test_idx in ShuffleSplit(n_splits=100, test_size=.3).split(X):
    X_train, X_test = X[train_idx], X[test_idx]
    Y_train, Y_test = Y[train_idx], Y[test_idx]
    rf.fit(X_train, Y_train)
    acc = r2_score(Y_test, rf.predict(X_test))
    for i in range(X.shape[1]):
        X_t = X_test.copy()
        np.random.shuffle(X_t[:, i])   # destroy the information in feature i
        shuff_acc = r2_score(Y_test, rf.predict(X_t))
        scores[names[i]].append((acc - shuff_acc) / acc)

print("Features sorted by their score:")
print(sorted([(round(np.mean(score), 4), feat)
              for feat, score in scores.items()], reverse=True))
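Recent scikit-learn versions also ship a built-in helper for this idea. As a hedged sketch (assuming scikit-learn 0.22 or later; the synthetic dataset and parameters here are illustrative assumptions), sklearn.inspection.permutation_importance repeats the shuffling internally and reports the mean and standard deviation of the score drop per feature:

from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# synthetic regression data so the example is self-contained
X, y = make_regression(n_samples=500, n_features=5, n_informative=2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rf = RandomForestRegressor(random_state=0).fit(X_train, y_train)
result = permutation_importance(rf, X_test, y_test, n_repeats=10, random_state=0)
# print features from most to least important
for i in result.importances_mean.argsort()[::-1]:
    print("feature %d: %.3f +/- %.3f" % (i, result.importances_mean[i],
                                         result.importances_std[i]))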
That wraps up this look at several practical feature selection methods for Python data mining; hopefully it has cleared up your questions. Pairing the theory with hands-on practice is the best way to learn, so give the examples a try! For more articles like this, keep following the 億速云 site.