
LightGBM調(diào)參貝葉斯全局優(yōu)化方法是什么

發(fā)布時(shí)間:2021-12-16 09:36:20 來(lái)源:億速云 閱讀:192 作者:iii 欄目:編程語(yǔ)言

This article explains how to tune LightGBM hyperparameters with Bayesian global optimization. If you have been unsure how to approach this, the walkthrough below collects the steps into a simple, practical procedure you can follow along with.

  這里結(jié)合Kaggle比賽的一個(gè)數(shù)據(jù)集,記錄一下使用貝葉斯全局優(yōu)化和高斯過(guò)程來(lái)尋找最佳參數(shù)的方法步驟。

1. Install the Bayesian global optimization library

Install the latest release from pip:

    pip install bayesian-optimization

2. Load the dataset

    import pandas as pd
    import numpy as np
    from sklearn.model_selection import StratifiedKFold
    from scipy.stats import rankdata
    from sklearn import metrics
    import lightgbm as lgb
    import warnings
    import gc

    pd.set_option('display.max_columns', 200)

    train_df = pd.read_csv('../input/train.csv')
    test_df = pd.read_csv('../input/test.csv')

  目標(biāo)變量的分布

    target = 'target'
    predictors = train_df.columns.values.tolist()[2:]

    train_df.target.value_counts()

LightGBM調(diào)參貝葉斯全局優(yōu)化方法是什么

  問(wèn)題是不平衡。這里使用50%分層行作為保持行,以便驗(yàn)證集獲得最佳參數(shù)。 稍后將在最終模型擬合中使用5折交叉驗(yàn)證。

    bayesian_tr_index, bayesian_val_index = list(
        StratifiedKFold(n_splits=2, shuffle=True, random_state=1)
        .split(train_df, train_df.target.values))[0]

These bayesian_tr_index and bayesian_val_index indices will be used as the training and validation dataset indices for Bayesian optimization.
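As a quick, optional sanity check (not part of the original workflow), one can confirm that the stratified split preserves the positive-class rate on both halves:

    # Optional sanity check: the positive rate should match on both halves.
    tr_rate = train_df.iloc[bayesian_tr_index][target].mean()
    val_rate = train_df.iloc[bayesian_val_index][target].mean()
    print('positive rate - train: {:.4f}, valid: {:.4f}'.format(tr_rate, val_rate))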

3. The black-box function to optimize (LightGBM)

  在加載數(shù)據(jù)時(shí),為L(zhǎng)ightGBM創(chuàng)建黑盒函數(shù)以查找參數(shù)。

    def LGB_bayesian(
            num_leaves,               # int
            min_data_in_leaf,         # int
            learning_rate,
            min_sum_hessian_in_leaf,
            feature_fraction,
            lambda_l1,
            lambda_l2,
            min_gain_to_split,
            max_depth):               # int

        # LightGBM expects the next three parameters to be integers,
        # but the optimizer proposes floats, so cast them here.
        num_leaves = int(num_leaves)
        min_data_in_leaf = int(min_data_in_leaf)
        max_depth = int(max_depth)

        assert type(num_leaves) == int
        assert type(min_data_in_leaf) == int
        assert type(max_depth) == int

        param = {
            'num_leaves': num_leaves,
            'max_bin': 63,
            'min_data_in_leaf': min_data_in_leaf,
            'learning_rate': learning_rate,
            'min_sum_hessian_in_leaf': min_sum_hessian_in_leaf,
            'bagging_fraction': 1.0,
            'bagging_freq': 5,
            'feature_fraction': feature_fraction,
            'lambda_l1': lambda_l1,
            'lambda_l2': lambda_l2,
            'min_gain_to_split': min_gain_to_split,
            'max_depth': max_depth,
            'save_binary': True,
            'seed': 1337,
            'feature_fraction_seed': 1337,
            'bagging_seed': 1337,
            'drop_seed': 1337,
            'data_random_seed': 1337,
            'objective': 'binary',
            'boosting_type': 'gbdt',
            'verbose': 1,
            'metric': 'auc',
            'is_unbalance': True,
            'boost_from_average': False,
        }

        xg_train = lgb.Dataset(train_df.iloc[bayesian_tr_index][predictors].values,
                               label=train_df.iloc[bayesian_tr_index][target].values,
                               feature_name=predictors,
                               free_raw_data=False)
        xg_valid = lgb.Dataset(train_df.iloc[bayesian_val_index][predictors].values,
                               label=train_df.iloc[bayesian_val_index][target].values,
                               feature_name=predictors,
                               free_raw_data=False)

        num_round = 5000
        clf = lgb.train(param, xg_train, num_round, valid_sets=[xg_valid],
                        verbose_eval=250, early_stopping_rounds=50)
        predictions = clf.predict(train_df.iloc[bayesian_val_index][predictors].values,
                                  num_iteration=clf.best_iteration)
        score = metrics.roc_auc_score(train_df.iloc[bayesian_val_index][target].values,
                                      predictions)
        return score

The LGB_bayesian function above will serve as the black-box function for Bayesian optimization. I have defined the training and validation datasets for LightGBM inside LGB_bayesian.

  LGB_bayesian函數(shù)從貝葉斯優(yōu)化框架獲取num_leaves,min_data_in_leaf,learning_rate,min_sum_hessian_in_leaf,feature_fraction,lambda_l1,lambda_l2,min_gain_to_split,max_depth的值。 請(qǐng)記住,對(duì)于LightGBM,num_leaves,min_data_in_leaf和max_depth應(yīng)該是整數(shù)。 但貝葉斯優(yōu)化會(huì)發(fā)送連續(xù)的函數(shù)。 所以我強(qiáng)制它們是整數(shù)。 我只會(huì)找到它們的最佳參數(shù)值。 讀者可以增加或減少要優(yōu)化的參數(shù)數(shù)量。

  現(xiàn)在需要為這些參數(shù)提供邊界,以便貝葉斯優(yōu)化僅在邊界內(nèi)搜索。

    bounds_LGB = {
        'num_leaves': (5, 20),
        'min_data_in_leaf': (5, 20),
        'learning_rate': (0.01, 0.3),
        'min_sum_hessian_in_leaf': (0.00001, 0.01),
        'feature_fraction': (0.05, 0.5),
        'lambda_l1': (0, 5.0),
        'lambda_l2': (0, 5.0),
        'min_gain_to_split': (0, 1.0),
        'max_depth': (3, 15),
    }

Let's put all of this into a BayesianOptimization object:

    from bayes_opt import BayesianOptimization

    LGB_BO = BayesianOptimization(LGB_bayesian, bounds_LGB, random_state=13)

  現(xiàn)在,讓我們來(lái)優(yōu)化key space (parameters):

    print(LGB_BO.space.keys)

LightGBM調(diào)參貝葉斯全局優(yōu)化方法是什么

  我創(chuàng)建了BayesianOptimization對(duì)象(LGB_BO),在調(diào)用maxime之前它不會(huì)工作。在調(diào)用之前,解釋一下貝葉斯優(yōu)化對(duì)象(LGB_BO)的兩個(gè)參數(shù),我們可以傳遞給它們進(jìn)行最大化:

init_points: the number of initial random exploration steps to perform. In our case, LGB_bayesian will be run init_points times with randomly drawn parameters.

n_iter: how many Bayesian optimization steps to run after the init_points random steps.

  現(xiàn)在,是時(shí)候從貝葉斯優(yōu)化框架調(diào)用函數(shù)來(lái)最大化。 我允許LGB_BO對(duì)象運(yùn)行5個(gè)init_points和5個(gè)n_iter。

    init_points = 5
    n_iter = 5

    print('-' * 130)
    with warnings.catch_warnings():
        warnings.filterwarnings('ignore')
        LGB_BO.maximize(init_points=init_points, n_iter=n_iter,
                        acq='ucb', xi=0.0, alpha=1e-6)

  優(yōu)化完成后,讓我們看看我們得到的最大值是多少。

    LGB_BO.max['target']

LightGBM調(diào)參貝葉斯全局優(yōu)化方法是什么

  參數(shù)的驗(yàn)證AUC是0.89, 讓我們看看參數(shù):

    LGB_BO.max['params']

  現(xiàn)在我們可以將這些參數(shù)用于我們的最終模型!

There is another cool option in the BayesianOptimization library: you can probe the LGB_bayesian function with parameter sets you already believe are good, for example parameters taken from another kernel. Here I copy and paste parameters from another kernel. You can probe as follows:

    LGB_BO.probe(
        params={'feature_fraction': 0.1403,
                'lambda_l1': 4.218,
                'lambda_l2': 1.734,
                'learning_rate': 0.07,
                'max_depth': 14,
                'min_data_in_leaf': 17,
                'min_gain_to_split': 0.1501,
                'min_sum_hessian_in_leaf': 0.000446,
                'num_leaves': 6},
        lazy=True,
    )

By default these points are explored lazily (lazy=True), meaning they are only evaluated the next time you call maximize. So let's make that maximize call on the LGB_BO object:

    LGB_BO.maximize(init_points=0, n_iter=0)  # note: no init_points or n_iter this time

Finally, the full list of probed parameters and their corresponding target values is available through the LGB_BO.res attribute:

    for i, res in enumerate(LGB_BO.res):
        print("Iteration {}: \n\t{}".format(i, res))

  我們?cè)谡{(diào)查中獲得了更好的驗(yàn)證分?jǐn)?shù)!和以前一樣,我只運(yùn)行LGB_BO 10次。在實(shí)踐中,我將它增加到100。

    LGB_BO.max['target']
    LGB_BO.max['params']

Let's build a model using these parameters.

  4.訓(xùn)練LightGBM模型

    param_lgb = {
        'num_leaves': int(LGB_BO.max['params']['num_leaves']),              # remember to cast to int
        'max_bin': 63,
        'min_data_in_leaf': int(LGB_BO.max['params']['min_data_in_leaf']),  # remember to cast to int
        'learning_rate': LGB_BO.max['params']['learning_rate'],
        'min_sum_hessian_in_leaf': LGB_BO.max['params']['min_sum_hessian_in_leaf'],
        'bagging_fraction': 1.0,
        'bagging_freq': 5,
        'feature_fraction': LGB_BO.max['params']['feature_fraction'],
        'lambda_l1': LGB_BO.max['params']['lambda_l1'],
        'lambda_l2': LGB_BO.max['params']['lambda_l2'],
        'min_gain_to_split': LGB_BO.max['params']['min_gain_to_split'],
        'max_depth': int(LGB_BO.max['params']['max_depth']),                # remember to cast to int
        'save_binary': True,
        'seed': 1337,
        'feature_fraction_seed': 1337,
        'bagging_seed': 1337,
        'drop_seed': 1337,
        'data_random_seed': 1337,
        'objective': 'binary',
        'boosting_type': 'gbdt',
        'verbose': 1,
        'metric': 'auc',
        'is_unbalance': True,
        'boost_from_average': False,
    }

As you can see, I saved the best parameters from LGB_BO into the param_lgb dictionary; they will be used to train the 5-fold model. A programmatic alternative is sketched below.
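As a sketch (not from the original post), the same dictionary can be assembled by merging the tuned values into the fixed settings and casting the integer-valued parameters, which avoids copying values by hand. The helper name build_param_dict is hypothetical:

    # Hypothetical helper: merge tuned params into fixed settings, casting ints.
    INT_PARAMS = {'num_leaves', 'min_data_in_leaf', 'max_depth'}

    def build_param_dict(best_params, fixed_params):
        merged = dict(fixed_params)
        for name, value in best_params.items():
            merged[name] = int(value) if name in INT_PARAMS else value
        return merged

    fixed = {'max_bin': 63, 'bagging_fraction': 1.0, 'bagging_freq': 5,
             'save_binary': True, 'seed': 1337, 'feature_fraction_seed': 1337,
             'bagging_seed': 1337, 'drop_seed': 1337, 'data_random_seed': 1337,
             'objective': 'binary', 'boosting_type': 'gbdt', 'verbose': 1,
             'metric': 'auc', 'is_unbalance': True, 'boost_from_average': False}

    param_lgb = build_param_dict(LGB_BO.max['params'], fixed)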

  Kfolds數(shù)量:無(wú)錫婦科檢查醫(yī)院 http://www.87554006.com/

    nfold = 5
    gc.collect()
    skf = StratifiedKFold(n_splits=nfold, shuffle=True, random_state=2019)

    oof = np.zeros(len(train_df))
    predictions = np.zeros((len(test_df), nfold))

    i = 1
    for train_index, valid_index in skf.split(train_df, train_df.target.values):
        print("\nfold {}".format(i))
        xg_train = lgb.Dataset(train_df.iloc[train_index][predictors].values,
                               label=train_df.iloc[train_index][target].values,
                               feature_name=predictors,
                               free_raw_data=False)
        xg_valid = lgb.Dataset(train_df.iloc[valid_index][predictors].values,
                               label=train_df.iloc[valid_index][target].values,
                               feature_name=predictors,
                               free_raw_data=False)

        clf = lgb.train(param_lgb, xg_train, 5000, valid_sets=[xg_valid],
                        verbose_eval=250, early_stopping_rounds=50)
        oof[valid_index] = clf.predict(train_df.iloc[valid_index][predictors].values,
                                       num_iteration=clf.best_iteration)
        predictions[:, i - 1] += clf.predict(test_df[predictors],
                                             num_iteration=clf.best_iteration)
        i = i + 1

    print("\n\nCV AUC: {:<0.2f}".format(metrics.roc_auc_score(train_df.target.values, oof)))

  所以我們?cè)?折交叉驗(yàn)證中獲得了0.90 AUC。

  讓我們對(duì)5折預(yù)測(cè)進(jìn)行排名平均。

5. Rank averaging

  print("Rank averaging on", nfold, "fold predictions")

  rank_predictions = np.zeros((predictions.shape[0],1))

  for i in range(nfold):

  rank_predictions[:, 0] = np.add(rank_predictions[:, 0], rankdata(predictions[:, i].reshape(-1,1))/rank_predictions.shape[0])

  rank_predictions /= nfold
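To make the idea concrete, here is a tiny illustrative example (the numbers are invented) of what rank averaging does: each fold's predictions are replaced by their normalized ranks before averaging, so folds with different score scales contribute equally:

    import numpy as np
    from scipy.stats import rankdata

    # Hypothetical predictions from two folds on the same 4 test rows,
    # on deliberately different scales.
    fold_a = np.array([0.10, 0.90, 0.40, 0.20])
    fold_b = np.array([0.02, 0.08, 0.05, 0.01])

    ranks_a = rankdata(fold_a) / len(fold_a)   # [0.25, 1.00, 0.75, 0.50]
    ranks_b = rankdata(fold_b) / len(fold_b)   # [0.50, 1.00, 0.75, 0.25]
    print((ranks_a + ranks_b) / 2)             # rank-averaged scores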

6. Submission

    sub_df = pd.DataFrame({"ID_code": test_df.ID_code.values})
    sub_df["target"] = rank_predictions
    sub_df.to_csv("Customer_Transaction_rank_predictions.csv", index=False)

That concludes this walkthrough of tuning LightGBM with Bayesian global optimization. Pairing the theory with hands-on practice is the best way to make it stick, so give it a try on your own data.
