Comparison of fuzzy and crisp classification trees using gini index, chi-square statistic and the gain ratio
Date
2017-07
Authors
Muchai, Eunice Wambui
Publisher
Kenyatta University
Abstract
Discriminant (classification) analysis is a classification problem in which a new individual is allocated
to one of several known populations or classes based on the measured characteristics of the individual.
Different models are used in allocating the new individual into one of the populations (classes). Some
of the models depend on the underlying distribution of the populations; these are known as parametric
models. If the model does not depend on any underlying distribution, it is known as a distribution-free
or nonparametric model. In this work, a distribution-free model known as a classification tree is used.
A classification tree is a structure made up of edges and nodes. It is a model that is used to assign an
individual to one of many classes or populations. At each node a test is applied on a value of one of
the attributes (variables) of the individual. The individual moves to the next node (child node) along
an edge depending on the result of the test. The attribute on which the test is applied is known as the
splitting attribute, and the value it is tested against is known as the splitting value. Tests are carried
out at each node until no further tests are possible. The final nodes are known as terminal or leaf nodes.
Classification is done at the terminal nodes by assigning all the individuals at that node to a class. If the
splitting value is a fuzzy value, the tree is known as a fuzzy classification tree; otherwise, it is known
as a crisp classification tree. When there are only two possible answers to the test at each node, the
resulting tree is known as a binary tree. Classification trees have been used to model many situations.
These include speech recognition, data mining and market surveys among others. In this study the
performance of crisp and fuzzy classification trees was compared. The performance was based on
probabilities of correct allocation and probabilities of misclassification. Simulated data and real
data were used. The data were simulated using R, and the real data were obtained from a machine learning
repository. The Gini Index, Chi-Square Statistic and Gain Ratio impurity measures were applied to both
the simulated data and real data. The performance of Gini Index, Chi-Square Statistic and Gain
Ratio impurity measures was also compared. Finally, the performance of the trees under varied
sample sizes was compared. It was found that, for the simulated data, the fuzzy classification tree
performed better than the crisp classification tree under all three impurity measures.
The Gini Index and Chi-Square Statistic were found to be appropriate impurity measures for the data
used in the study and gave similar results. However, the Gain Ratio did not perform as well as the
other two impurity measures. It was also found that there was no significant difference in the
probabilities of misclassification across the different sample sizes in the populations.
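As an illustration of the three impurity measures and of the crisp/fuzzy distinction described above, the following is a minimal sketch in Python. It uses common textbook definitions for a binary split, which may differ from the exact formulations in the thesis. All function names are hypothetical, and the linear-ramp fuzzy membership function and its width parameter are assumptions made for illustration only; the abstract does not state how the fuzzy splitting values were constructed.

```python
import math


def gini(counts):
    """Gini impurity of a node from (possibly fractional) class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (base 2) of a node from class counts."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def gini_gain(left, right):
    """Decrease in Gini impurity achieved by a binary split."""
    parent = [l + r for l, r in zip(left, right)]
    n, nl, nr = sum(parent), sum(left), sum(right)
    return gini(parent) - (nl / n) * gini(left) - (nr / n) * gini(right)


def gain_ratio(left, right):
    """Information gain divided by the split information."""
    parent = [l + r for l, r in zip(left, right)]
    n, nl, nr = sum(parent), sum(left), sum(right)
    info_gain = (entropy(parent)
                 - (nl / n) * entropy(left)
                 - (nr / n) * entropy(right))
    split_info = entropy([nl, nr])
    return info_gain / split_info if split_info > 0 else 0.0


def chi_square(left, right):
    """Pearson chi-square statistic of the child-by-class table."""
    parent = [l + r for l, r in zip(left, right)]
    n = sum(parent)
    stat = 0.0
    for child in (left, right):
        nc = sum(child)
        for observed, class_total in zip(child, parent):
            expected = nc * class_total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


def crisp_membership(x, threshold):
    """Crisp split: an individual goes entirely left or entirely right."""
    return 1.0 if x <= threshold else 0.0


def fuzzy_membership(x, threshold, width):
    """Assumed fuzzy split: degree of 'left' falls linearly from 1 to 0
    over the interval [threshold - width, threshold + width], width > 0."""
    lo, hi = threshold - width, threshold + width
    if x <= lo:
        return 1.0
    if x >= hi:
        return 0.0
    return (hi - x) / (hi - lo)


def split_counts(xs, ys, membership, n_classes=2):
    """Accumulate class counts in each child, weighted by membership."""
    left = [0.0] * n_classes
    right = [0.0] * n_classes
    for x, y in zip(xs, ys):
        m = membership(x)
        left[y] += m
        right[y] += 1.0 - m
    return left, right


# Example usage on made-up values (not data from the study).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [0, 0, 1, 1]
crisp = split_counts(xs, ys, lambda x: crisp_membership(x, 2.5))
fuzzy = split_counts(xs, ys, lambda x: fuzzy_membership(x, 2.5, 1.0))
scores = {name: (gini_gain(*c), chi_square(*c), gain_ratio(*c))
          for name, c in (("crisp", crisp), ("fuzzy", fuzzy))}
```

In this sketch a crisp split is the special case where every membership degree is 0 or 1, so the same three scoring functions apply to both kinds of tree; only the way individuals are counted in the child nodes changes.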
Description
A thesis submitted in fulfilment of the requirements for the award of the degree of Doctor of Philosophy (Statistics) in the School of Pure and Applied Sciences of Kenyatta University, July 2017.