Please use this identifier to cite or link to this item: http://hdl.handle.net/1942/20134
Title: Class association rule mining using multidimensional numbered information spaces
Authors: MITOV, Iliya 
Advisors: VANHOOF, Koen
MARKOV, Krassimir
Issue Date: 2011
Abstract: Data mining is of great importance in the overall process of knowledge discovery. In this dissertation we focus our attention on discovery-oriented methods, and especially on classification algorithms. Class-Association Rule (CAR) algorithms have a special place within the family of classification algorithms. This type of classifier offers a number of advantages: efficient training regardless of the size of the training set; easy handling of high dimensionality; very fast classification; high accuracy; and a classification model that is easily comprehensible for humans. The main classification workflow of CAR algorithms usually involves three phases: generating the rules, pruning, and recognition. The mining of association rules is a typical data mining task that works in an unsupervised manner. A major advantage of association rules is that they are theoretically capable of revealing all interesting relationships in a database. For practical applications, however, the number of mined rules is usually too large to be exploited entirely. Hence, a pruning phase is applied in order to build accurate and compact classifiers. Pruning can be applied during preprocessing, during the association rule mining itself, or during post-processing. Different rule quality measures and rule ordering schemes can be applied in the process of rule selection. There are also different options for the recognition phase – e.g. using a single rule or a set of rules with different types of ordering schemes. On the other hand, the process of creating classification models inevitably touches upon the use of appropriate access methods which facilitate access to the different kinds of structures used in such algorithms. Our effort has been focused on the memory organization called Multidimensional numbered information spaces, which makes it possible to operate with context-free multidimensional data structures. The program realization of such structures is named ArM 32. 
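The three-phase CAR workflow described above (rule generation, pruning by quality measures, recognition) can be sketched in a few lines. This is an illustrative toy, not the dissertation's implementation: the function names, thresholds, and data are all assumptions made for the example.

```python
# Minimal sketch of a generic three-phase CAR workflow:
# 1) generate (itemset -> class) rules, 2) prune by confidence/support,
# 3) recognize with a first-matching-rule scheme. Purely illustrative.
from itertools import combinations

def generate_rules(train, max_len=2, min_support=2):
    """Enumerate candidate (itemset -> class) rules with raw counts."""
    counts = {}  # (itemset, label) -> instances matched with that label
    covers = {}  # itemset -> total instances matched
    for items, label in train:
        for k in range(1, max_len + 1):
            for subset in combinations(sorted(items), k):
                counts[(subset, label)] = counts.get((subset, label), 0) + 1
                covers[subset] = covers.get(subset, 0) + 1
    rules = []
    for (subset, label), sup in counts.items():
        if sup >= min_support:
            rules.append((subset, label, sup, sup / covers[subset]))
    return rules

def prune(rules, min_conf=0.8):
    """Keep confident rules, ordered by confidence, support, generality."""
    kept = [r for r in rules if r[3] >= min_conf]
    return sorted(kept, key=lambda r: (-r[3], -r[2], len(r[0])))

def classify(rules, items, default=None):
    """Recognition phase: fire the first rule whose pattern matches."""
    for subset, label, _, _ in rules:
        if set(subset) <= set(items):
            return label
    return default

train = [({"a", "b"}, "pos"), ({"a", "b"}, "pos"),
         ({"a", "c"}, "neg"), ({"c"}, "neg")]
model = prune(generate_rules(train))
print(classify(model, {"a", "b"}))  # prints pos
```

With the support-first ordering shown here, rules below the thresholds never reach the recognition phase; PGN, discussed below, inverts this priority.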
The Multi-Domain Information Model (MDIM), and respectively ArM 32, are based on replacing names with numbers, which allows the use of mathematical functions and addressing vectors for accessing the information. Our approach is to use such structures and operations in the implementation of a class association rule classifier in order to provide evidence on the vitality of the idea of using context-free multidimensional data structures and direct access as a powerful tool for knowledge discovery. We have proposed two classification algorithms – Pyramidal Growing Networks (PGN) and Multi-layer Pyramidal Growing Networks (MPGN). PGN creates association rules optimized for the maximal accuracy of the produced rules. One of the main characteristics of PGN is that it is a parameter-free classifier. The association rule mining is executed from the longest rules to the shorter ones until no intersections between patterns in the classes are possible. In the pruning phase, the contradictions and inconsistencies of more general rules are cleared; after that the pattern set is compacted by excluding all more concrete rules within the classes. PGN is introduced as a useful tool for questioning the support-first principle used by many associative classifiers when mining for association rules. PGN reverses the common approach and focuses primarily on the confidence of the association rules and only at a later stage on their support. The main purpose is twofold: to provide a proof of concept for this new approach and to gather evidence on its potential. MPGN is based on a multilayer structure. It makes it possible to escape combinatorial explosion by smart disposal of the information in multilayer structures called "pyramids". These structures can be easily implemented using ArM structures. 
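The two PGN pruning steps described above – clearing contradictory general rules, then compacting away more-concrete rules inside each class – can be sketched as follows. This is a simplified illustration under assumed data structures (patterns as Python sets), not the dissertation's ArM-based implementation:

```python
# Illustrative sketch of the two PGN pruning steps: (1) drop any rule whose
# pattern also matches instances of another class (a contradiction), then
# (2) keep only the most general rule of each class, excluding more
# concrete (superset) patterns. Toy data; not the actual PGN code.

def clear_contradictions(patterns, train):
    """Keep only rules whose pattern never matches another class's instances."""
    kept = []
    for pat, label in patterns:
        if all(lab == label for items, lab in train if pat <= items):
            kept.append((pat, label))
    return kept

def compact(patterns):
    """Drop a rule if a more general rule of the same class already exists."""
    kept = []
    for pat, label in sorted(patterns, key=lambda p: len(p[0])):
        if not any(g < pat and lab == label for g, lab in kept):
            kept.append((pat, label))
    return kept

train = [({"x", "y"}, "A"), ({"x", "z"}, "A"), ({"y", "z"}, "B")]
candidates = [({"x"}, "A"), ({"x", "y"}, "A"),
              ({"y"}, "B"), ({"y", "z"}, "B")]
# ({"y"}, "B") contradicts instance ({"x", "y"}, "A") and is cleared;
# ({"x", "y"}, "A") is more concrete than ({"x"}, "A") and is compacted away.
rules = compact(clear_contradictions(candidates, train))
print(rules)
```

Note that confidence comes first here: only patterns that match a single class survive step one, regardless of how frequent they are.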
These algorithms are implemented in the data mining environment PaGaNe, developed by the team from the Institute of Mathematics and Informatics – Bulgarian Academy of Sciences; Iliya Mitov and Krassimira Ivanova are the principal developers. PaGaNe incorporates different types of statistical analysis methods, discretization algorithms, an association rule miner, as well as classification algorithms, all of which are based on the use of multidimensional numbered information spaces. The Lenses dataset is used as a test example to illustrate the specifics of the proposed algorithms, the process of creating classification models, and the process of recognition. We demonstrate that PGN produces a pattern set that is both minimal and complete for covering the learning set, an indicator that PGN can be expected to produce a tight model and good accuracy results. In the case of MPGN we have demonstrated the process of creating the main construction elements. We have also illustrated the functionality that visualizes how the pyramids are created and how queries are recognized. We carried out experiments with 25 datasets from the UCI machine learning repository [Frank and Asuncion, 2010]. The experiments were conducted using the data mining environment PaGaNe, the knowledge analysis system Weka, and the LUCS-KDD Repository. A comparison was made between PGN, MPGN, and some other CAR algorithms, as well as decision tree and decision rule classifiers, which exhibit similar behavior in creating the task model. One series of experiments studied the accuracy obtained when preprocessing real data with the different discretizers implemented in PaGaNe. We found that, in general, the PGN classifier trained on data preprocessed by ChiMerge with a 95% significance level achieves a lower classification error than classifiers trained on data preprocessed by the other discretization methods. 
The main reason for this is that using the chi-square statistic as a criterion for class dependency in adjacent intervals of a feature results in a good separation between class labels. A second set of experiments studied the process of growing the learning sets and how this is reflected in the classification model and the accuracy of PGN and MPGN; more specifically, we studied the critical size of the learning set at which the classification model becomes relatively compact and the achieved accuracy stabilizes. Of course, this critical point depends highly on the choice of dataset. A third set of experiments focused on analyzing the different exit points of MPGN. The results showed that in many cases the built constructs single out only one class as the best competitor. Other cases usually fall into a competition between classes, where different strategies for ordering the competitors can be applied. Very few cases fall into the branch where the MPGN algorithm does not reach a decision and an alternative choice is offered. A fourth set of experiments analyzed the dependencies of the classifiers' behavior when noise enters the dataset attributes; for this set we used the Monks1 dataset. The experiments demonstrated that noise in the dataset considerably worsens the accuracy of PGN, which was designed to perform well on clean datasets. However, experiments with other existing classifiers showed that they also were not able to resist noise. We compared the overall accuracy of PGN, MPGN (with two recognition strategies – S1 and S2), CMAR, OneR, JRip, J48 and REPTree. The Friedman test showed a statistically significant difference between the tested classifiers. The post-hoc Nemenyi test showed that PGN has the best overall performance among the examined classifiers and that MPGN is competitive with CMAR, J48, JRip and REPTree. 
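The chi-square criterion that ChiMerge applies to adjacent intervals can be shown with a small worked example. This sketch is an assumption-laden illustration of the standard ChiMerge statistic, not PaGaNe's implementation: a low value means the class distribution is independent of the interval boundary, so the intervals may be merged; a high value keeps the boundary.

```python
# Sketch of the ChiMerge criterion: chi-square statistic over the 2 x k
# contingency table formed by two adjacent intervals and k class labels.
# Pure Python, for illustration only.

def chi2_adjacent(counts_a, counts_b):
    """counts_a, counts_b: class-frequency lists for two adjacent intervals."""
    total = sum(counts_a) + sum(counts_b)
    class_totals = [a + b for a, b in zip(counts_a, counts_b)]
    chi2 = 0.0
    for row in (counts_a, counts_b):
        row_total = sum(row)
        for observed, col_total in zip(row, class_totals):
            expected = row_total * col_total / total
            if expected > 0:
                chi2 += (observed - expected) ** 2 / expected
    return chi2

# Identical class distributions -> statistic 0, merge the intervals;
# disjoint class distributions -> large statistic, keep the boundary.
print(chi2_adjacent([5, 5], [5, 5]))    # 0.0
print(chi2_adjacent([10, 0], [0, 10]))  # 20.0
```

In ChiMerge the statistic is compared against the chi-square threshold for the chosen significance level (here, 95%); pairs of intervals below the threshold are merged repeatedly.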
The experimental results are very positive and show that PGN is competitive with classification methods that exhibit similar classification behavior. At the same time, it has an essential advantage over the other classifiers: it is parameter-free. Furthermore, the empirical results showed that PGN is slightly more sensitive to noise than techniques such as C4.5 and RIPPER. However, its overall accuracy was still very good compared to these classifiers. In general, the results provide evidence that the confidence-first approach yields interesting opportunities for knowledge discovery.
Category: T1
Type: Theses and Dissertations
Appears in Collections:PhD theses
Research publications

Files in This Item:
File: 1989 D-2011-2451-50 Iliya MITOV.pdf (5.97 MB, Adobe PDF)


