求CART算法matlab实现_软件运维

function D = CART(train_features, train_targets, params, region)

% Classify using classification and regression trees

% Inputs:

% features - Train features

% targets- Train targets

% params - [Impurity type, Percentage of incorrectly assigned samples at a node]

% Impurity can be: Entropy, Variance (or Gini), or Missclassification

% region- Decision region vector: [-x x -y y number_of_points]

% Outputs

% D - Decision sufrace

[Ni, M] = size(train_features)

%Get parameters

[split_type, inc_node] = process_params(params)

%For the decision region

N = region(5)

mx = ones(N,1) * linspace (region(1),region(2),N)

my = linspace (region(3),region(4),N)' * ones(1,N)

flatxy = [mx(:), my(:)]'

%Preprocessing

[f, t, UW, m] = PCA(train_features, train_targets, Ni, region)

train_features = UW * (train_features - m*ones(1,M))

flatxy = UW * (flatxy - m*ones(1,N^2))

%Build the tree recursively

disp('Building tree')

tree= make_tree(train_features, train_targets, M, split_type, inc_node, region)

%Make the decision region according to the tree

disp('Building decision surface using the tree')

targets = use_tree(flatxy, 1:N^2, tree)

D = reshape(targets,N,N)

%END

function targets = use_tree(features, indices, tree)

%Classify recursively using a tree

if isnumeric(tree.Raction)

%Reached an end node

targets = zeros(1,size(features,2))

targets(indices) = tree.Raction(1)

else

%Reached a branching, so:

%Find who goes where

in_right= indices(find(eval_r(tree.Raction)))

in_left = indices(find(eval_r(tree.Laction)))

Ltargets = use_tree(features, in_left, tree.left)

Rtargets = use_tree(features, in_right, tree.right)

targets = Ltargets + Rtargets

end

%END use_tree

function tree = make_tree(features, targets, Dlength, split_type, inc_node, region)

%Build a tree recursively

if (length(unique(targets)) == 1),

%There is only one type of targets, and this generates a warning, so deal with it separately

tree.right = []

tree.left = []

tree.Raction= targets(1)

tree.Laction= targets(1)

break

end

[Ni, M] = size(features)

Nt = unique(targets)

N = hist(targets, Nt)

if ((sum(N <Dlength*inc_node) == length(Nt) - 1) | (M == 1)),

%No further splitting is neccessary

tree.right = []

tree.left = []

if (length(Nt) ~= 1),

MLlabel = find(N == max(N))

else

MLlabel = 1

end

tree.Raction= Nt(MLlabel)

tree.Laction= Nt(MLlabel)

else

%Split the node according to the splitting criterion

deltaI = zeros(1,Ni)

split_point = zeros(1,Ni)

op = optimset('Display', 'off')

for i = 1:Ni,

split_point(i) = fminbnd('CARTfunctions', region(i*2-1), region(i*2), op, features, targets, i, split_type)

I(i) = feval_r('CARTfunctions', split_point(i), features, targets, i, split_type)

end

[m, dim] = min(I)

loc = split_point(dim)

%So, the split is to be on dimention 'dim' at location 'loc'

indices = 1:M

tree.Raction= ['features(' num2str(dim) ',indices) > ' num2str(loc)]

tree.Laction= ['features(' num2str(dim) ',indices) <= ' num2str(loc)]

in_right= find(eval_r(tree.Raction))

in_left = find(eval_r(tree.Laction))

if isempty(in_right) | isempty(in_left)

%No possible split found

tree.right = []

tree.left = []

if (length(Nt) ~= 1),

MLlabel = find(N == max(N))

else

MLlabel = 1

end

tree.Raction= Nt(MLlabel)

tree.Laction= Nt(MLlabel)

else

%...It's possible to build new nodes

tree.right = make_tree(features(:,in_right), targets(in_right), Dlength, split_type, inc_node, region)

tree.left = make_tree(features(:,in_left), targets(in_left), Dlength, split_type, inc_node, region)

end

一、基本概念

1.cart使用基尼系数作为划分标准。基尼系数越小，则不纯度越低，区分的越彻底。

2.假设有k个类别，第k个类别的概率为 ,则基尼系数表达式为：

Gini(p)=(1- )=1-

3.对于样本D，如果根据特征A 的值把样本分为D1,D2两部分，则在特征A条件下，D的基尼系数

Gini(D,A)= Gini(D1)+ Gini(D2)

4.CART建立起来的是二叉树，如果特征A有A1,A2，A3三个类别，CART会考虑把A分成{A1},{A2 ,A3}两组，或者是其他两种情况。由于这次A并没有完全分开，所以下次还有机会在子节点把A2,A3分开.

5.对于连续值的切分.假如有1 2 3 4 5 那么cart会有4个切分点 [1.5 2.5 3.5 4.5]

二.实例推导树的建立过程

1.假设我有以下源数据

序号天气周末促销销量

1 坏是是高

2 坏是是高

3 坏是是高

4 坏否是高

5 坏是是高

6 坏否是高

7 坏是否高

8 好是是高

9 好是否高

10 好是是高

11 好是是高

12 好是是高

13 好是是高

14 坏是是低

15 好否是高

16 好否是高

17 好否是高

18 好否是高

19 好否否高

20 坏否否低

21 坏否是低

22 坏否是低

23 坏否是低

24 坏否否低

25 坏是否低

26 好否是低

27 好否是低

28 坏否否低

29 坏否否低

30 好否否低

31 坏是否低

32 好否是低

33 好否否低

34 好否否低

该数据集有三个特征天气周末促销

2.为了简化建立树的过程,我将忽略基尼系数与样本个数阀值

2.1 首先计算各个特征值对数据集的基尼系数,公式见---- 基本概念.3

Gini(D|天气)=17/34*(1-(11/17)^2-(6/17)^2)+17/34*(1-(7/17)^2-(10/17)^2)=0.4706

Gini(D|周末)=20/34*(1-(7/20)^2-(13/20)^2)+14/34*(1-(11/14)^2-(3/14)^2)=0.4063

Gini(D|促销)=12/34*(1-(9/12)^2-(3/12)^2)+22/34*(1-(7/22)^2-(15/22)^2)=0.4131

周末的基尼系数最小，这也符合我们的一般认识

2.2 第一个分列特征选择周末。此时数据集按照是否周末分成两个。

Gini(周末|天气)=0.2679

Gini(周末|促销)=0.2714

Gini(非周末|天气)=0.3505

Gini(非周末|促销)=0.3875

此时，周末应该以天气作为划分，非周末也是以天气作为划分，下面放个图

三、CART树对于连续特征的处理

假如特征A为连续型变量，则把特征A按照从小到大进行排序，取相邻两点的平均值为切分点，计算基尼系数。则基尼系数最小的点为切分点，大于切分点的为一类，小于切分点的为另一类。举例：特征A的值为 1，2，3，4，5，6 目标变量是高、低、高、低、高、低。则1.5处的基尼系数为 (1/6)*(1-1^2)+(5/6)*(1-(2/5)^2-(3/5)^2)=0.4 2.5处的基尼系数为 (2/6)*(1-(1/2)^2-(1/2)^2)+(4/6)*(1-(2/4)^2-(2/4)^2)=0.5 3.5处的基尼系数为 (3/6)*(1-(1/3)^2-(2/3)^2)+(3/6)*(1-(1/3)^2-(2/3)^2)=0.44 4.5处的基尼系数为 (4/6)*(1-(2/4)^2-(2/4)^2)+(2/6)*(1-(1/2)^2-(1/2)^2)=0.5 5.5处的基尼系数为 (5/6)*(1-(2/5)^2-(3/5)^2)+(1/6)*(1-1^2)=0.4 结论： 1.5和5.5处的基尼系数最小，可以把1分为一类，2-6分为另一类。或者6分为一类，1-5另一类。

四、关于回归树

1.回归树和分类树的区别在于输出值类型不同。分类树输出的是离散值，回归树输出的是连续值。

2.和分类树使用基尼系数不同，回归树使用和均方差来度量最佳分隔点。假设有1 2 3 4 5 6 六个数。假设3.5处把数据分开最合适，那么(1-2)^2+(2-2)^2+(3-2)^2+(4-5)^2+(5-5)^2+(6-5)^2在所有分割点中取得最小值。2，5为各自数据段的平均值。

3.回归树采用最后叶子的平均值或者中值作为输出结果

CART算法就是分类回归树，它只支持二叉树，既可以作分类树，又可以作回归树。

那什么是分类树，什么是回归树呢？假如有个数据集，分别给出了，不同年龄、职业、性别的不同学习时间。如果我构造了一棵决策树，想要基于数据判断这个人的职业身份，这个就属于分类树，因为是从几个分类中来做选择。如果是给定了数据，想要预测这个人的年龄，那就属于回归树。分类树可以处理离散数据，也就是数据种类有限的数据，它输出的是样本的类别，而回归树可以对连续型的数值进行预测，也就是数据在某个区间内都有取值的可能，它输出的是一个数值。

CART算法属性指标是基尼系数。基尼系数本身反应了样本的不确定性。当基尼系数越小的时候，说明样本之间的差异性小，不确定程度低。分类的过程本身是一个不确定度降低的过程，即纯度的提升过程。所以CART算法在构造分类树的时候，会选择基尼系数最小的属性作为属性的划分。

欢迎分享，转载请注明来源：内存溢出

原文地址:https://54852.com/yw/11277435.html

求CART算法matlab实现

发表评论

评论列表（0条）