Comparison of fuzzy and crisp classification trees using gini index, chi-square statistic and the gain ratio
Muchai, Eunice Wambui
ABSTRACT Discriminant (classification) analysis addresses the problem of allocating a new individual to one of several known populations or classes based on the measured characteristics of the individual. Different models are used to allocate the new individual to one of the populations (classes). Models that depend on the underlying distribution of the populations are known as parametric models. A model that does not depend on any underlying distribution is known as a distribution-free or nonparametric model. In this work a distribution-free model known as a classification tree is used. A classification tree is a structure of nodes connected by edges that is used to assign an individual to one of several classes or populations. At each node a test is applied to the value of one of the attributes (variables) of the individual. The individual moves to the next node (child node) along an edge depending on the result of the test. The attribute on which the test is applied is known as the splitting attribute, and the value it is tested against is known as the splitting value. Tests are carried out at each node until no further tests are possible. The final nodes are known as terminal or leaf nodes. Classification is done at the terminal nodes by assigning all the individuals at a node to a class. If the splitting value is a fuzzy value, the tree is known as a fuzzy classification tree; otherwise it is known as a crisp classification tree. When there are only two possible answers to the test at each node, the resulting tree is known as a binary tree. Classification trees have been used to model many situations, including speech recognition, data mining and market surveys. In this study the performance of crisp and fuzzy classification trees was compared. Performance was assessed using the probabilities of correct allocation and the probabilities of misclassification. Both simulated data and real data were used.
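The difference between a crisp and a fuzzy test at a single node can be illustrated with a minimal Python sketch. The abstract does not say which membership function the study used, so the linear ramp below, along with the names `crisp_split`, `fuzzy_split`, `threshold` and `width`, is an assumption made purely for illustration.

```python
def crisp_split(x, threshold):
    """Crisp test: the individual goes entirely to the left child (1.0)
    or entirely to the right child (0.0)."""
    return 1.0 if x <= threshold else 0.0


def fuzzy_split(x, threshold, width):
    """Fuzzy test (assumed linear ramp, not taken from the thesis):
    the degree of membership in the left child falls linearly from 1 to 0
    across the band [threshold - width, threshold + width]."""
    if x <= threshold - width:
        return 1.0
    if x >= threshold + width:
        return 0.0
    return (threshold + width - x) / (2.0 * width)


if __name__ == "__main__":
    # Hypothetical attribute values around a splitting value of 5.0.
    for value in (3.0, 4.5, 5.0, 5.5, 7.0):
        left_crisp = crisp_split(value, 5.0)
        left_fuzzy = fuzzy_split(value, 5.0, 1.0)
        print(value, left_crisp, left_fuzzy)
```

Under this assumption, an individual whose attribute value lies close to the splitting value belongs partly to both child nodes, whereas the crisp test sends it wholly to one side.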
The simulated data were generated in R, and the real data were obtained from a machine learning repository. The Gini Index, Chi-Square Statistic and Gain Ratio impurity measures were applied to both the simulated data and the real data, and their performance was compared. Finally, the performance of the trees was compared across varied sample sizes. For the simulated data, the fuzzy classification tree performed better than the crisp classification tree under all three impurity measures. The Gini Index and the Chi-Square Statistic were found to be appropriate impurity measures for the data used in the study and gave similar results. However, the Gain Ratio did not perform as well as the other two impurity measures. There was also no significant difference in the probabilities of misclassification across the different sample sizes in the populations.
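For readers unfamiliar with the three split criteria named above, the following Python sketch computes each one from the class counts in the child nodes of a candidate split. These are the common textbook forms (decrease in Gini impurity, Pearson chi-square on the child-by-class table, and information gain divided by split information); the abstract does not give the exact formulas used in the study, so the function names and definitions here are assumptions rather than the author's implementation.

```python
import math


def gini(counts):
    """Gini impurity of a node given its class counts."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)


def entropy(counts):
    """Shannon entropy (base 2) of a node given its class counts."""
    n = sum(counts)
    return -sum((c / n) * math.log2(c / n) for c in counts if c > 0)


def parent_counts(children):
    """Class counts of the parent node, summed over the child nodes."""
    return [sum(column) for column in zip(*children)]


def gini_decrease(children):
    """Decrease in Gini impurity achieved by the split."""
    parent = parent_counts(children)
    n = sum(parent)
    weighted = sum(sum(ch) / n * gini(ch) for ch in children)
    return gini(parent) - weighted


def gain_ratio(children):
    """Information gain divided by the split information."""
    parent = parent_counts(children)
    n = sum(parent)
    gain = entropy(parent) - sum(sum(ch) / n * entropy(ch) for ch in children)
    split_info = entropy([sum(ch) for ch in children])
    return gain / split_info if split_info > 0 else 0.0


def chi_square(children):
    """Pearson chi-square statistic for the child-by-class count table."""
    parent = parent_counts(children)
    n = sum(parent)
    stat = 0.0
    for ch in children:
        for observed, class_total in zip(ch, parent):
            expected = sum(ch) * class_total / n
            if expected > 0:
                stat += (observed - expected) ** 2 / expected
    return stat


if __name__ == "__main__":
    # Hypothetical binary split of 100 individuals from two classes:
    # each inner list holds the class counts in one child node.
    split = [[40, 10], [10, 40]]
    print(gini_decrease(split), chi_square(split), gain_ratio(split))
```

With each criterion, a larger value indicates a split that separates the classes better, so a tree-growing procedure would choose the splitting attribute and splitting value that maximise the chosen measure.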