Keywords
Data mining, Text mining, Classification, AttributeSelectedClassifier, Filtered Classifier, LogitBoost.
INTRODUCTION
Text mining, or knowledge discovery from text (KDT), deals with the machine-supported analysis of text. It uses methods from information retrieval, information extraction and natural language processing (NLP), and connects them with the algorithms and methods of knowledge discovery in data, data mining, machine learning and statistics. Current research in text mining tackles problems of text representation, classification, clustering, and the search for and modelling of hidden patterns [5].
The term text mining describes the application of data mining techniques to the automated discovery of useful or interesting knowledge from unstructured or semi-structured text. It is the procedure of synthesizing information by analysing the relations, patterns and procedures within semi-structured or unstructured textual data. Text mining, sometimes referred to as text data mining, is the process of deriving high-quality information from text. High-quality information is typically derived by identifying patterns and trends through means such as statistical pattern learning [6]. Text mining involves structuring the input text (usually by analyzing it, adding some derived linguistic features, removing others, and inserting the result into a database), deriving patterns within the structured data, and finally evaluating and interpreting the output.
Important applications of text mining include enterprise business intelligence and data mining, competitive intelligence, e-discovery, national security and intelligence, scientific discovery (especially in the life sciences), records management, search or information access, and social media monitoring [13]. Technologies that have been developed for, and can be used in, the text mining process include information extraction, concept linkage, summarization, categorization, clustering, topic tracking, information visualization and question answering.
The rest of this paper is organized as follows. Section 2 reviews the literature. Section 3 describes the classification meta techniques and the algorithms used for classification. Experimental results are analyzed in Section 4, and conclusions are given in Section 5.
LITERATURE REVIEW
P. Kalaiselvi et al. [7] compared the performance of several classifier methods: Bagging, Dagging, Decorate, MultiClassClassifier and MultiBoostAB. Bagging achieved the best accuracy among these algorithms. The experiments used Robot Navigation datasets, and classification accuracy and time were measured with 10-fold cross-validation. As future work, the authors propose repeating the experiments on different datasets, including multiclass datasets, and combining several ensembles with different base classifiers to study how ensembles combined with base classifiers boost accuracy.
Nikita Bhatt et al. [10] discussed different approaches to meta-learning based on dataset characteristics. They describe a system that automatically ranks classifiers by considering the characteristics of both the datasets and the classifiers. After the meta-knowledge base has been generated, the ranking is based on the Adjusted Ratio of Ratios (ARR), accuracy or time, which helps non-experts in the algorithm selection task.
Pfahringer et al. [14] presented a novel meta-feature generation method in the context of meta-learning, based on procedures that compare the performance of individual base learners in a one-to-one manner. In addition to these new meta-features, they introduced a new meta-learner called Approximate Ranking Tree Forests (ART Forests), which performs very competitively when compared with several state-of-the-art meta-learners. The experimental results are based on a large collection of datasets and show that the proposed techniques can significantly improve the overall performance of meta-learning for algorithm ranking. A main point of this approach is that each performance figure of a base learner on a specific dataset is generated by optimizing the parameters of the base learner separately for that dataset.
Artur Ferreira et al. [3] presented an overview of boosting algorithms for building ensembles of classifiers. The basic boosting technique and its variants are described and compared for supervised learning, and the extension of these techniques to semi-supervised learning is also addressed. For face detection, boosting algorithms have been the most effective of all those developed so far, achieving the best results.
METHODOLOGY
Text classification is one of the important research issues in the field of text mining, in which documents are classified using supervised knowledge. In this research work, computer files are classified based on their extension, for example pdf, doc, ppt and xls. The main objective of this work is to find the best classification algorithm among Attribute Selected Classifier, Filtered Classifier and LogitBoost. The methodology of the research work is as follows:
1. Dataset: computer files collected from the system hard disk
2. Classification meta algorithms
• Attribute Selected Classifier
• Filtered Classifier
• LogitBoost
3. Performance factors
• Classification accuracy
• Error rate
4. Best technique among the classification meta algorithms
• LogitBoost
A. DATASET
A synthetic dataset was assembled from files stored on the hard disks of computer systems. The dataset contains 9000 instances and four attributes: file name, file size, extension and file path. The Weka data mining tool is used to analyze the performance of the classification algorithms.
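As an illustration only, the four attributes described above could be gathered with a short script along the following lines. This is a minimal Python sketch, not the procedure used in the study: the function name collect_file_records and the dictionary keys are hypothetical, and the choice to lower-case the extension is an assumption.

```python
import os
from pathlib import Path


def collect_file_records(root):
    """Walk `root` and return one record per file with the four attributes
    named in the text: file name, file size, extension and file path.
    (Hypothetical helper; the key names are illustrative assumptions.)"""
    records = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in sorted(filenames):
            path = Path(dirpath) / name
            try:
                size = path.stat().st_size  # file size in bytes
            except OSError:
                continue  # skip files that vanish or cannot be read
            records.append({
                "file_name": name,
                "file_size": size,
                # The extension serves as the class label, e.g. "pdf".
                "extension": path.suffix.lstrip(".").lower(),
                "file_path": str(path),
            })
    return records
```

Calling collect_file_records on a directory returns a list of dictionaries that could then be written out in a format Weka can read, such as CSV or ARFF.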
B. CLASSIFICATION META ALGORITHMS
Classification is an important data mining technique with broad applications. It is used to assign each item in a set of data to one of a predefined set of classes or groups. Classification algorithms play an important role in document classification. There are various meta classification algorithms, such as AttributeSelectedClassifier, Bagging, Decorate, Vote, FilteredClassifier, LogitBoost, END, Dagging and Rotation Forest. In this research work, we have analyzed three classification meta algorithms: AttributeSelectedClassifier, FilteredClassifier and LogitBoost.
C. ATTRIBUTE SELECTED CLASSIFIER
The dimensionality of the training and test data is reduced by attribute selection before the data are passed on to a classifier. Some of the important options of the Attribute Selected Classifier are as follows:
• Classifier -- The base classifier to be used.
• Debug -- If set to true, the classifier may output additional information to the console.
• Evaluator -- The attribute evaluator to use. It is applied during the attribute selection phase, before the classifier is invoked.
• Search -- The search method to use. It is applied during the attribute selection phase, before the classifier is invoked.
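The interplay of the Evaluator, Search and Classifier options can be sketched in a few lines of Python. This is not Weka's implementation. It assumes nominal attributes, uses information gain as the evaluator, a simple top-k ranking as the search method, and a toy lookup-table base classifier. All class and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())


def info_gain(rows, labels, attr):
    """Evaluator (assumed): information gain of one nominal attribute."""
    groups = defaultdict(list)
    for row, label in zip(rows, labels):
        groups[row[attr]].append(label)
    remainder = sum(len(g) / len(labels) * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def select_attributes(rows, labels, k):
    """Search (assumed): rank attributes by the evaluator, keep the top k."""
    ranked = sorted(range(len(rows[0])),
                    key=lambda a: info_gain(rows, labels, a), reverse=True)
    return sorted(ranked[:k])


class LookupTableClassifier:
    """Toy base classifier: majority class per seen attribute tuple."""

    def fit(self, rows, labels):
        votes = defaultdict(Counter)
        for row, label in zip(rows, labels):
            votes[tuple(row)][label] += 1
        self.table = {key: c.most_common(1)[0][0] for key, c in votes.items()}
        self.default = Counter(labels).most_common(1)[0][0]
        return self

    def predict(self, row):
        return self.table.get(tuple(row), self.default)


class AttributeSelectedSketch:
    """Reduce dimensionality first, then hand the data to the base classifier."""

    def __init__(self, base, k):
        self.base, self.k = base, k

    def _project(self, row):
        return [row[i] for i in self.kept]

    def fit(self, rows, labels):
        self.kept = select_attributes(rows, labels, self.k)
        self.base.fit([self._project(r) for r in rows], labels)
        return self

    def predict(self, row):
        # Test rows are reduced to the same attributes chosen during training.
        return self.base.predict(self._project(row))


# Attribute 0 is uninformative noise; attribute 1 determines the class.
rows = [["x", "small"], ["y", "small"], ["x", "large"], ["y", "large"]]
labels = ["doc", "doc", "pdf", "pdf"]
model = AttributeSelectedSketch(LookupTableClassifier(), k=1).fit(rows, labels)
print(model.kept, model.predict(["z", "large"]))  # [1] pdf
```

In this toy data the unseen value "z" in the discarded attribute does not affect the prediction, which is the practical benefit of selecting attributes before classification.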
D. FILTERED CLASSIFIER
This class is used for running an arbitrary classifier on data that has been passed through an arbitrary filter. As with the classifier, the structure of the filter is based exclusively on the training data, and test instances are processed by the filter without changing their structure. Some of the important options of the Filtered Classifier are as follows:
• Classifier -- The base classifier to be used.
• Debug -- If set to true, the classifier may output additional information to the console.
• Filter -- The filter to be used.
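The key property described above, that the filter is built from the training data only and then applied unchanged to test instances, can be sketched as follows. This is a minimal Python illustration rather than Weka's FilteredClassifier: the min-max filter, the nearest-centroid base classifier, all class names, and the "path depth" feature in the example are assumptions made for the sketch.

```python
class MinMaxFilter:
    """Rescales each numeric attribute to [0, 1] using training-set ranges."""

    def fit(self, rows):
        cols = list(zip(*rows))
        self.lo = [min(c) for c in cols]
        # Guard against zero-width ranges to avoid division by zero.
        self.span = [(max(c) - min(c)) or 1.0 for c in cols]
        return self

    def transform(self, row):
        return [(v - lo) / span for v, lo, span in zip(row, self.lo, self.span)]


class NearestCentroid:
    """Toy base classifier: predict the class with the closest mean vector."""

    def fit(self, rows, labels):
        sums, counts = {}, {}
        for row, label in zip(rows, labels):
            acc = sums.setdefault(label, [0.0] * len(row))
            for i, v in enumerate(row):
                acc[i] += v
            counts[label] = counts.get(label, 0) + 1
        self.centroids = {lab: [s / counts[lab] for s in acc]
                          for lab, acc in sums.items()}
        return self

    def predict(self, row):
        def dist(centroid):
            return sum((a - b) ** 2 for a, b in zip(row, centroid))
        return min(self.centroids, key=lambda lab: dist(self.centroids[lab]))


class FilteredClassifierSketch:
    """Fit the filter on training data only, then classify filtered data."""

    def __init__(self, filt, base):
        self.filt, self.base = filt, base

    def fit(self, rows, labels):
        self.filt.fit(rows)  # filter parameters come from training data only
        self.base.fit([self.filt.transform(r) for r in rows], labels)
        return self

    def predict(self, row):
        # Test instances pass through the already-fitted filter unchanged.
        return self.base.predict(self.filt.transform(row))


# Features: [file size in bytes, path depth]; the class depends on depth.
rows = [[1000, 1], [900000, 1], [2000, 5], [800000, 5]]
labels = ["doc", "doc", "ppt", "ppt"]
model = FilteredClassifierSketch(MinMaxFilter(), NearestCentroid()).fit(rows, labels)
print(model.predict([420000, 1]))  # doc
```

Without the filter, the large file-size values dominate the distance computation and the same base classifier assigns the query to "ppt", which shows why the filter and classifier are wrapped together.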
E. LOGITBOOST
The LogitBoost algorithm is an extension of the AdaBoost algorithm. It replaces the exponential loss of AdaBoost with the conditional Bernoulli likelihood loss. The Weka class of the same name performs additive logistic regression: it carries out classification using a regression scheme as the base learner and can handle multiclass problems.
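To make the idea of additive logistic regression concrete, the following is a minimal Python sketch of two-class LogitBoost in the form given by Friedman, Hastie and Tibshirani, using regression stumps on a single numeric feature as the base learner. It is an illustration under simplifying assumptions (binary 0/1 labels, one feature, a small floor on the weights for numerical stability, a modest number of rounds), not Weka's multiclass implementation. All function names are hypothetical.

```python
import math


def weighted_mean(values, weights):
    total = sum(weights)
    return sum(v * w for v, w in zip(values, weights)) / total if total > 0 else 0.0


def fit_stump(xs, z, w):
    """Weighted least-squares regression stump on one numeric feature.

    Returns (threshold, left_value, right_value)."""
    values = sorted(set(xs))
    # Candidate thresholds: midpoints between consecutive distinct values.
    thresholds = [(a + b) / 2 for a, b in zip(values, values[1:])] or [values[0]]
    best = None
    for t in thresholds:
        left = [i for i, x in enumerate(xs) if x <= t]
        right = [i for i, x in enumerate(xs) if x > t]
        lv = weighted_mean([z[i] for i in left], [w[i] for i in left])
        rv = weighted_mean([z[i] for i in right], [w[i] for i in right])
        err = sum(w[i] * (z[i] - lv) ** 2 for i in left)
        err += sum(w[i] * (z[i] - rv) ** 2 for i in right)
        if best is None or err < best[0]:
            best = (err, t, lv, rv)
    return best[1:]


def stump_value(stump, x):
    t, lv, rv = stump
    return lv if x <= t else rv


def logitboost(xs, ys, rounds=10):
    """Two-class LogitBoost; ys are 0/1 labels, xs one numeric feature."""
    F = [0.0] * len(xs)  # additive model, starts at zero (p = 0.5)
    stumps = []
    for _ in range(rounds):
        p = [1.0 / (1.0 + math.exp(-2.0 * f)) for f in F]
        # Weights p(1 - p), floored to keep the working response finite.
        w = [max(pi * (1.0 - pi), 1e-10) for pi in p]
        # Working response of the Newton step on the Bernoulli log-likelihood.
        z = [(y - pi) / wi for y, pi, wi in zip(ys, p, w)]
        stump = fit_stump(xs, z, w)
        stumps.append(stump)
        F = [f + 0.5 * stump_value(stump, x) for f, x in zip(F, xs)]
    return stumps


def predict_proba(stumps, x):
    """Estimated probability that x belongs to class 1."""
    F = sum(0.5 * stump_value(s, x) for s in stumps)
    return 1.0 / (1.0 + math.exp(-2.0 * F))


xs = [1, 2, 3, 4, 5, 6]
ys = [0, 0, 0, 1, 1, 1]
model = logitboost(xs, ys, rounds=5)
print(predict_proba(model, 1.5) < 0.5, predict_proba(model, 5.5) > 0.5)
```

Each round fits the base regressor to a reweighted working response instead of reweighting misclassified examples as AdaBoost does, which is where the change of loss function shows up in practice.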
EXPERIMENTAL RESULTS
A. ACCURACY AND ERROR RATE
Various measures are used to assess classification accuracy, such as the true positive (TP) rate, precision, F-measure, ROC area and the kappa statistic. The TP rate is the ratio of positive cases predicted correctly to the total number of positive cases. Precision is the proportion of relevant documents among the results returned. The F-measure combines recall and precision into a single measure of performance. The ROC curve is a traditional way of plotting the same information in normalized form, with 1 minus the false negative rate (the true positive rate) plotted against the false positive rate; ROC area is the area under this curve.
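The measures above can be computed directly from lists of actual and predicted labels. The following Python sketch uses the standard definitions; the function names and the small example are illustrative assumptions and do not reproduce the paper's results.

```python
from collections import Counter


def per_class_metrics(actual, predicted, positive):
    """TP rate (recall), precision and F-measure for one class."""
    pairs = list(zip(actual, predicted))
    tp = sum(a == positive and p == positive for a, p in pairs)
    fp = sum(a != positive and p == positive for a, p in pairs)
    fn = sum(a == positive and p != positive for a, p in pairs)
    tp_rate = tp / (tp + fn) if tp + fn else 0.0
    precision = tp / (tp + fp) if tp + fp else 0.0
    denom = precision + tp_rate
    f_measure = 2 * precision * tp_rate / denom if denom else 0.0
    return {"tp_rate": tp_rate, "precision": precision, "f_measure": f_measure}


def accuracy(actual, predicted):
    """Fraction of instances whose predicted label matches the actual one."""
    return sum(a == p for a, p in zip(actual, predicted)) / len(actual)


def kappa(actual, predicted):
    """Cohen's kappa: agreement corrected for agreement expected by chance."""
    n = len(actual)
    a_counts, p_counts = Counter(actual), Counter(predicted)
    expected = sum(a_counts[c] * p_counts[c] for c in a_counts) / (n * n)
    observed = accuracy(actual, predicted)
    return (observed - expected) / (1 - expected) if expected != 1 else 0.0


actual = ["pdf", "pdf", "pdf", "doc", "doc", "doc"]
predicted = ["pdf", "pdf", "doc", "doc", "doc", "pdf"]
print(per_class_metrics(actual, predicted, "pdf"))
print(accuracy(actual, predicted), kappa(actual, predicted))
```

In this example two of the three pdf files are recovered and two of the three pdf predictions are correct, so TP rate, precision and F-measure for the pdf class all equal 2/3.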
The above graph shows that LogitBoost performs better than the other algorithms: it attains the highest accuracy when compared with Attribute Selected Classifier and Filtered Classifier.
B. ERROR RATE
The error measures considered are the mean absolute error (MAE), root mean squared error (RMSE), relative absolute error (RAE) and root relative squared error (RRSE) [10]. The mean absolute error measures how close predictions or forecasts are to the eventual outcomes. The root mean squared error is a frequently used measure of the differences between the values predicted by a model or estimator and the values actually observed. Relative error is a measure of the uncertainty of a measurement compared to the size of the measurement; the relative absolute error expresses the total absolute error relative to that of a simple predictor. The root relative squared error is the square root of the total squared error relative to what it would have been if a simple predictor had been used.
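The four error measures can be written out as short functions. This Python sketch follows the usual textbook definitions and assumes that the simple predictor is the mean of the actual values. Weka reports RAE and RRSE as percentages and, for classifiers, computes the errors from class probability estimates, which this numeric illustration does not attempt to reproduce.

```python
import math


def mae(actual, predicted):
    """Mean absolute error."""
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / len(actual)


def rmse(actual, predicted):
    """Root mean squared error."""
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / len(actual))


def rae(actual, predicted):
    """Relative absolute error: total absolute error divided by that of a
    simple predictor (assumed here to be the mean of the actual values)."""
    mean = sum(actual) / len(actual)
    baseline = sum(abs(a - mean) for a in actual)
    return sum(abs(p - a) for a, p in zip(actual, predicted)) / baseline


def rrse(actual, predicted):
    """Root relative squared error, relative to the same simple predictor."""
    mean = sum(actual) / len(actual)
    baseline = sum((a - mean) ** 2 for a in actual)
    return math.sqrt(sum((p - a) ** 2 for a, p in zip(actual, predicted)) / baseline)


actual = [1, 2, 3, 4]
predicted = [1, 2, 3, 8]
print(mae(actual, predicted))   # 1.0
print(rmse(actual, predicted))  # 2.0
print(rae(actual, predicted))   # 1.0
print(rrse(actual, predicted))  # about 1.789
```

A value of RAE or RRSE below 1 (below 100% in Weka's output) means the model does better than always predicting the mean.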
The above graph shows that LogitBoost performs better than the other algorithms: it attains the lowest error rate when compared with Attribute Selected Classifier and Filtered Classifier.
CONCLUSION
Data mining can be defined as the extraction of useful knowledge from large data repositories. Text mining is a technique that extracts information from both structured and unstructured data and finds patterns that are novel and not previously known. In this paper, classification meta algorithms are used to classify files stored on a computer. Three classification meta algorithms are considered: Attribute Selected Classifier, Filtered Classifier and LogitBoost. Analysis of the experimental results shows that the LogitBoost classification technique yields better results than the other techniques.
Tables at a glance
Table 1, Table 2
Figures at a glance
Figure 1, Figure 2
References
- Abdullah Wahbeh H, Mohammed Al-Kabi, “Comparative Assessment of the Performance of Three WEKA Text Classifiers Applied to Arabic Text”, Vol. 21, No. 1, pp. 15-28, 2012.
- Abdullah Wahbeh H, Qasem Al-Radaideh A, Mohammed Al-Kabi N, Emad Al-Shawakfa M, “A Comparison Study between Data Mining Tools over some Classification Methods”.
- Artur Ferreira, “Survey on Boosting Algorithms for Supervised and Semi-supervised Learning”.
- Christophe Giraud-Carrier, “Meta learning - A Tutorial”.
- Christoph Goller, Joachim Löning, Thilo Will, Werner Wolff, “Automatic Document Classification: A thorough Evaluation of various Methods”.
- Falguni Patel N, Neha Soni R, “Text mining: A Brief survey”, Volume-2, Number-4, Issue December 2012.
- Ian Witten H, Eibe Frank, Mark Hall A, “Data Mining: Practical Machine Learning Tools and Techniques”.
- Kalaiselvi P, Nalini C, “A Comparative Study of Meta Classifier Algorithms on Multiple Dataset”, International Journal of Advanced Research in Computer Science and Software Engineering, Volume 3, Issue 3, March 2013.
- Kaushik Raviya H, Biren Gajjar, “Performance Evaluation of Different Data Mining Classification Algorithm Using WEKA”.
- Mahendra Tiwari, Manu Bhai Jha, Om Prakash Yadav, “Performance analysis of Data Mining algorithms in Weka”, IOSR Journal of Computer Engineering (IOSRJCE), ISSN: 2278-0661, ISBN: 2278-8727, Volume 6, Issue 3, pp. 32-41, Sep-Oct 2012.
- Nikita Bhatt, Amit Thakkar, Amit Ganatra, “A Survey & Current Research Challenges in Meta Learning Approaches based on Dataset Characteristics”, Volume-2, Issue-1, March 2012.
- Sayantani Ghosh, Sudipta Roy, Samir Bandyopadhyay K, “A tutorial review on Text Mining Algorithms”, Vol. 1, Issue 4, June 2012.
- Shaidah Jusoh, Hejab Alfawareh M, “Techniques, Applications and Challenging Issues in Text Mining”, Vol. 9, Issue 6, No 2, November 2012.
- Shilpa Dhanjibhai Serasiya, Neeraj Chaudhary, “Simulation of Various Classifications results using WEKA”, International Journal of Recent Technology and Engineering (IJRTE), ISSN: 2277-3878, Volume-1, Issue-3, August 2012.
- Quan Sun, Pfahringer, “Pairwise meta-rules for better meta-learning-based algorithm ranking”, Machine Learning, Springer US, 93(1):141-161, 2013.
Dr. S. Vijayarani has completed an MCA, M.Phil and PhD in Computer Science. She is working as an Assistant Professor in the School of Computer Science and Engineering, Bharathiar University, Coimbatore. Her fields of research interest are data mining, privacy, security, bioinformatics and data streams. She has published papers in international journals and presented research papers at international and national conferences.
Mrs. M. Muthulakshmi has completed an M.Sc in Computer Science and Information Technology. She is currently pursuing her M.Phil in Computer Science in the School of Computer Science and Engineering, Bharathiar University, Coimbatore. Her fields of interest are data mining, text mining and semantic web mining.