機器學習算法：AdaBoost

發(fā)布時間：2020-08-03 07:36:59 來源：網(wǎng)絡閱讀：3791 作者：BoyTNT 欄目：編程語言

AdaBoost算法（Adaptive Boost）的核心思想是：如果一個弱分類器的分類效果不好，那么就構建多個弱分類器，綜合考慮它們的分類結果和權重來決定最終的分類結果。很多人認為AdaBoost是監(jiān)督學習中最強大的兩種算法之一（另一個是支持向量機SVM）。

AdaBoost的訓練過程如下：

為每個訓練樣本初始化相同的權重；
針對訓練樣本及權重，找到一個弱分類器；
計算出這個弱分類器的錯誤率ε與權重α；
對正確分類的樣本，降低其權重，對錯誤分類的樣本，提升其權重；
返回2不斷迭代，直至弱分類器數(shù)量足夠；

其中錯誤率ε定義為分錯的樣本數(shù)除以總樣本數(shù)。權重α定義為：

機器學習算法：AdaBoost

權重提升與降低的公式如下：

機器學習算法：AdaBoost

對未知樣本分類時，用每個弱分類器進行分類，將結果乘以這個分類器的權重α，累加到一起得到一個猜測值，根據(jù)其正負決定最終輸出1還是-1。注意AdaBoost只能解決二分類問題，如果多于2個分類，需要做一定變通。

在AdaBoost中，廣泛使用單層決策樹（Decision Stump）作為弱分類器。之前研究ID3算法的決策樹時，樣本特征是離散型數(shù)據(jù)，這回研究連續(xù)型數(shù)據(jù)，因此之前的方法就不可用了，需要基于邊界比較來確定分類。另外需要注意要同時考慮權重D來決定最佳切分方式。基本思路如下：

foreach(每一個維度)
{
    在此維度最大最小值之間切出N個邊界
    foreach(每一個邊界)
    {
        針對"<="與">"兩種情況
        {
            獲取此條件下的分類結果（滿足記為1，不滿足記為-1）
            foreach(每一個結果)
            {
                if(猜測錯誤)
                    累加加權錯誤率
            }
            保留最小加權錯誤率的切分方式
        }
    }
}

由于只在一個維度上切分一次，可以想像單層決策樹的錯誤率應該相當高，是典型的弱分類器。但是綜合考慮它們的結果及權重，最終的分類錯誤率可以做到大大低于單個單層決策樹。上完整C#版AdaBoost的實現(xiàn)代碼：

首先是DataVector，之前Label一直是字符串，這回要出現(xiàn)1和-1了，因此改造一下，使Label的類型可定義：

using System;

namespace MachineLearning
{
    /// <summary>
    /// 數(shù)據(jù)向量
    /// </summary>
    /// <typeparam name="TData">特征類型</typeparam>
    /// <typeparam name="TLabel">標簽數(shù)據(jù)</typeparam>
    public class DataVector<TData, TLabel>
    {
        /// <summary>
        /// N維數(shù)據(jù)
        /// </summary>
        public TData[] Data { get; private set; }
        /// <summary>
        /// 分類標簽
        /// </summary>
        public TLabel Label { get; set; }
        /// <summary>
        /// 構造
        /// </summary>
        /// <param name="dimension">數(shù)據(jù)維度</param>
        public DataVector(int dimension)
        {
            Data = new TData[dimension];
        }
        /// <summary>
        /// 維度數(shù)量
        /// </summary>
        public int Dimension
        {
            get { return this.Data.Length; }
        }
    }
}

然后是一個存儲單層決策樹的結構：

using System;
using System.Collections.Generic;

namespace MachineLearning
{
    /// <summary>
    /// 單層決策樹
    /// </summary>
    public class DecisionStump
    {
        /// <summary>
        /// 分類器權重
        /// </summary>
        public double Alpha { get; set; }
        /// <summary>
        /// 加權錯誤率
        /// </summary>
        public double Error { get; set; }
        /// <summary>
        /// 在哪個維度上切分
        /// </summary>
        public int Dimension { get; set; }
        /// <summary>
        /// 切分邊界
        /// </summary>
        public double Boundary { get; set; }
        /// <summary>
        /// 是否按大于來切分
        /// </summary>
        public bool GreaterThan { get; set; }
        /// <summary>
        /// 此分類器在訓練數(shù)據(jù)集上的分類結果
        /// </summary>
        public List<int> Results { get; set; }
    }
}

最后是AdaBoost：

using System;
using System.Collections.Generic;
using System.Linq;

namespace MachineLearning
{
    /// <summary>
    /// AdaBoost(基于單層決策樹)
    /// </summary>
    public class AdaBoost
    {
        /// <summary>
        /// 弱分類器列表
        /// </summary>
        private List<DecisionStump> m_WeakClassifiers;
        
        /// <summary>
        /// 訓練
        /// </summary>
        /// <param name="trainingSet">訓練數(shù)據(jù)集</param>
        /// <param name="iterateCount">迭代次數(shù)，即弱分類器數(shù)量</param>
        public void Train(List<DataVector<double, int>> trainingSet, int iterateCount = 50)
        {
            m_WeakClassifiers = new List<DecisionStump>();
            
            var D = new List<double>();
            var gue***esults = new List<double>();
            for(int i = 0;i < trainingSet.Count;++i)
            {
                //權重初始化為1/n
                D.Add(1.0 / trainingSet.Count);
                //猜測結果初始化為0，后面累加要用
                gue***esults.Add(0.0);
            }
            
            //迭代指定次數(shù)
            for(int i = 0;i < iterateCount;++i)
            {
                //在當前權重下生成一棵錯誤率最低的單層決策樹
                var stump = CreateDecisionStump(trainingSet, D);
                
                //計算Alpha（注意stump.Error有可能為0，要防止除0錯誤）
                stump.Alpha = 0.5 * Math.Log((1 - stump.Error) / Math.Max(stump.Error, 1e-16));
                
                //保存這個決策樹到弱分類器
                m_WeakClassifiers.Add(stump);
                
                //根據(jù)猜測結果，重新計算下一輪的權重向量D(暫時未除以Sum(D)，下一步再處理)
                for(int j = 0;j < trainingSet.Count;++j)
                {
                    if(stump.Results[j] == trainingSet[j].Label)
                        D[j] = D[j] * Math.Exp(-1.0 * stump.Alpha);
                    else
                        D[j] = D[j] * Math.Exp(stump.Alpha);
                }
                
                //保證Sum(D)==1
                double sum = D.Sum();
                for(int j = 0;j < trainingSet.Count;++j)
                {
                    D[j] = D[j] / sum;
                    gue***esults[j] += stump.Alpha * stump.Results[j];
                }
                
                //計算總錯誤率
                int errors = 0;
                for(int j = 0;j < trainingSet.Count;++j)
                {
                    if(Math.Sign(gue***esults[j]) != trainingSet[j].Label)
                        ++errors;
                }
                
                //如果沒有錯誤，可以提前退出循環(huán)，但一般很難達到
                if(errors == 0)
                    break;
            }
        }
        
        /// <summary>
        /// 分類
        /// </summary>
        /// <param name="vector">待測數(shù)據(jù)</param>
        /// <returns>分類結果，1或-1</returns>
        public int Classify(DataVector<double, int> vector)
        {
            double result = 0.0;
            var dataSet = new List<DataVector<double, int>>();
            dataSet.Add(vector);
            
            //用每一個弱分類器的結果乘以相應的alpha，累加得到最終的猜測結果
            foreach(var c in m_WeakClassifiers)
            {
                var stumpResults = ClassifyByDecisionStump(dataSet, c.Dimension, c.Boundary, c.GreaterThan);
                result += stumpResults[0] * c.Alpha;
            }
            
            //根據(jù)正負決定輸出1還是-1
            return Math.Sign(result);
        }
        
        /// <summary>
        /// 根據(jù)單層決策樹進行一次分類
        /// </summary>
        /// <param name="dataSet">數(shù)據(jù)集</param>
        /// <param name="dimension">在哪個維度上分類</param>
        /// <param name="boundary">分類邊界</param>
        /// <param name="greaterThan">是否按大于來劃分數(shù)據(jù)</param>
        /// <returns>結果</returns>
        private List<int> ClassifyByDecisionStump(List<DataVector<double, int>> dataSet, int dimension, double boundary, bool greaterThan)
        {
            var result = new List<int>();
            foreach(var item in dataSet)
            {
                if(greaterThan)
                    result.Add(item.Data[dimension] > boundary ? 1 : -1);
                else
                    result.Add(item.Data[dimension] <= boundary ? 1 : -1);
            }
            
            return result;
        }
        
        /// <summary>
        /// 構建一個單層決策樹
        /// </summary>
        /// <param name="dataSet">數(shù)據(jù)集</param>
        /// <param name="D">權重</param>
        /// <returns>此權重下的最佳單層決策樹</returns>
        private DecisionStump CreateDecisionStump(List<DataVector<double, int>> dataSet, List<double> D)
        {
            var stump = new DecisionStump();
            double minError = double.PositiveInfinity;
            
            //遍歷每一個維度
            for(int i = 0;i < dataSet[0].Dimension;++i)
            {
                //找此維度的最大最小值
                double maxValue = double.NegativeInfinity;
                double minValue = double.PositiveInfinity;
                
                foreach(var item in dataSet)
                {
                    if(item.Data[i] > maxValue)
                        maxValue = item.Data[i];
                    if(item.Data[i] < minValue)
                        minValue = item.Data[i];
                }
                
                //做10次切分，計算步長
                double stepSize = (maxValue - minValue) / 10.0;
                for(int j = 0;j < 10;++j)
                {
                    //邊界
                    double boundary = minValue + stepSize * j;
                    
                    //分別計算邊界兩邊的情況
                    for(int k = 0;k < 2;++k)
                    {
                        var results = ClassifyByDecisionStump(dataSet, i, boundary, k == 0);
                        
                        //計算錯誤，注意是加權的錯誤
                        double weightError = 0.0;
                        for(int idx = 0;idx < results.Count;++idx)
                        {
                            if(results[idx] != dataSet[idx].Label)
                                weightError += D[idx];
                        }
                        
                        //保留最小錯誤的分類器
                        if(weightError < minError)
                        {
                            minError = weightError;
                            stump.Error = Math.Min(weightError, 1.0);        //此分類器的錯誤比例
                            stump.Boundary = boundary;                       //分類邊界
                            stump.Dimension = i;                             //在哪個維度上分類
                            stump.GreaterThan = false;                       //大于還是小于
                            stump.Results = results;                         //用此分類器得出的結果
                        }
                    }
                }
            }
            
            return stump;
        }
    }
}

最后上測試，還是使用上次的breast-cancer-wisconsin.txt做測試，順便也和kNN對比一下。上測試代碼：

public void TestAdaBoost()
{
    var trainingSet = new List<DataVector<double, int>>();
    var testSet = new List<DataVector<double, int>>();
    
    //讀取數(shù)據(jù)
    var file = new StreamReader(@"breast-cancer-wisconsin.txt", Encoding.Default);
    for(int i = 0;i < 699;++i)
    {
        string line = file.ReadLine();
        var parts = line.Split(',');
        
        var p = new DataVector<double, int>(9);
        for(int j = 0;j < p.Dimension;++j)
        {
            if(parts[j + 1] == "?")
                parts[j + 1] = "0";
            p.Data[j] = Convert.ToDouble(parts[j + 1]);
        }
        p.Label = Convert.ToInt32(parts[10]) == 2 ? 1 : -1;
        
        //和上次一樣，600個做訓練，99個做測試
        if(i < 600)
            trainingSet.Add(p);
        else
            testSet.Add(p);
    }
    file.Close();
    
    //檢驗
    var boost = new AdaBoost();
    boost.Train(trainingSet);               //默認50次迭代
    int error = 0;
    foreach(var p in testSet)
    {
        var label = boost.Classify(p);
        if(label != p.Label)
            error++;
    }
    
    Console.WriteLine("Error = {0}/{1}, {2}%", error, testSet.Count, (error * 100.0 / testSet.Count));
}

最終結果是99個測試樣本猜錯1個，錯誤率1.01%，相當不錯。

在Train方法中計算了錯誤率，可以把它同時也輸出出來，另外改變不同的迭代次數(shù)，可以對比一下訓練錯誤率與測試錯誤率：

弱分類器數(shù)量（迭代次數(shù)）	訓練錯誤率	測試錯誤率
5	4.8%	4.0%
10	4.7%	2.0%
50	3.0%	1.0%
100	2.5%	2.0%
200	2.0%	2.0%
500	0.8%	2.0%
1000	0.0%	2.0%

可以看到，隨著弱分類器越來越多，AdaBoost的訓練錯誤率不斷下降，最終收斂到0，但測試錯誤率在下降到最低點后又有所上升。

向AI問一下細節(jié)

機器學習算法：AdaBoost

猜你喜歡

最新資訊

相關推薦

相關標簽