Rule Based Classifier Analysis with Nucleotide
Sequence in Normal Liver Cells and Cancer
Affected Liver Cells

Mayilvaganan M¹, Rajamani R²

Associate Professor, Dept of Computer Science, PSG College of Arts and Science, Coimbatore, Tamil Nadu, India
Assistant Professor, Dept of Computer Science, PSG College of Arts and Science, Coimbatore, Tamil Nadu, India


Visit for more related articles at International Journal of Innovative Research in Computer and Communication Engineering

Abstract

Data mining is the process of finding correlations or patterns among dozens of fields in large relational databases. Clustering algorithms are used to find groups of objects such that the objects in a group are similar (or related) to one another and different from (or unrelated to) the objects in other groups. This paper uses two databases: normal liver cells and cancer-affected liver cells. Each character is assigned a numeric value, and the corresponding pair combinations of the sequence are represented in a graph. An attempt is made to analyze the DNA liver cancer gene dataset and the normal liver cell dataset with association and classification rules based on the FSA-Red algorithm and the Apriori algorithm. The algorithm is applied to find the number of occurrences in the gene dataset; after that, T is replaced by U. Comparisons are made on execution time and memory efficiency in finding frequent patterns. The extracted rules and analyzed results are demonstrated graphically. Performance is analyzed for different numbers of instances and confidence values in the DNA sequence dataset.

Keywords

Association rule, classification, zero rule, FSA-Red algorithm, Apriori algorithm.

INTRODUCTION

In this paper, two techniques are analyzed for searching and mining a very large gene database. Classification is a machine learning discipline inspired by pattern recognition. The data classification process involves two steps: learning and classification. Association rule mining is the discovery of association relationships or correlations among a set of items.
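The two steps of classification, learning and then classifying, can be illustrated with the zero rule listed in the keywords. The following is a minimal Python sketch, not the implementation used in this paper; the function names and the sample labels are hypothetical and chosen only for illustration.

```python
from collections import Counter


def zero_rule_fit(labels):
    """Learning step: return the most frequent class label."""
    if not labels:
        raise ValueError("labels must not be empty")
    return Counter(labels).most_common(1)[0][0]


def zero_rule_predict(majority_label, n_samples):
    """Classification step: predict the majority label for every sample."""
    return [majority_label] * n_samples


# Hypothetical labels for illustration only; not taken from the paper's data.
train_labels = ["normal", "cancer", "cancer", "normal", "cancer"]
majority = zero_rule_fit(train_labels)
predictions = zero_rule_predict(majority, 3)
```

The zero rule ignores all attributes, so it mainly serves as a baseline against which rule-based classifiers can be compared.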

Apriori algorithm

Association rule mining is one of the classical data mining processes, which finds associated item sets from a large number of transactions. Apriori discovers patterns with frequency above the minimum support threshold. Therefore, in order to find associations involving rare events, the algorithm must run with very low minimum support values. The Apriori algorithm calculates rules that express probabilistic relationships between items in frequent item sets [2].
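The level-wise search described above can be sketched in Python as follows. This is a textbook-style Apriori under stated assumptions, not the implementation used in this paper: the transactions are hypothetical sets of nucleotide pairs, and the function names `apriori` and `confidence` are illustrative.

```python
from itertools import combinations


def apriori(transactions, min_support):
    """Return {itemset (frozenset): support} for all frequent itemsets."""
    transactions = [frozenset(t) for t in transactions]
    n = len(transactions)

    def support(itemset):
        # Fraction of transactions that contain every item of the itemset.
        return sum(1 for t in transactions if itemset <= t) / n

    # Level 1: frequent single items.
    items = {item for t in transactions for item in t}
    current = {frozenset([i]) for i in items}
    current = {c for c in current if support(c) >= min_support}

    frequent = {}
    k = 1
    while current:
        for c in current:
            frequent[c] = support(c)
        k += 1
        # Join step: unions of frequent (k-1)-itemsets that have size k.
        candidates = {a | b for a in current for b in current if len(a | b) == k}
        # Prune step: every (k-1)-subset must itself be frequent.
        candidates = {
            c for c in candidates
            if all(frozenset(s) in current for s in combinations(c, k - 1))
        }
        current = {c for c in candidates if support(c) >= min_support}
    return frequent


def confidence(frequent, antecedent, consequent):
    """Confidence of antecedent -> consequent: supp(A and C) / supp(A)."""
    a = frozenset(antecedent)
    return frequent[a | frozenset(consequent)] / frequent[a]


# Hypothetical transactions: nucleotide pairs seen in four short fragments.
transactions = [{"gc", "ct", "tc"}, {"gc", "ct"}, {"gc", "tc"}, {"ct", "tc"}]
frequent = apriori(transactions, min_support=0.5)
rule_conf = confidence(frequent, {"gc"}, {"ct"})
```

With a minimum support of 0.5, the three single pairs and the three two-pair sets are frequent, while the set of all three pairs is not. Lowering `min_support` is what allows rarer patterns to surface, at the cost of more candidates.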

FSA red algorithm

The FSA-Red algorithm is used for data reduction, or preprocessing, to minimize the number of attributes to be analyzed. The goal is to derive strong association rules by applying data mining techniques to the reduced data. Data preprocessing in FSA-Red applies several reduction techniques: row selection, feature selection, and attribute selection. Row selection deletes all signed records related to the attribute to be analyzed. Feature selection removes all unwanted attributes. Attribute selection then eliminates the attributes that carry no value and need not be included.
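The three reduction steps can be sketched generically in Python. This is not the FSA-Red algorithm itself, whose details are not given here; it is an illustration under assumptions: records are dictionaries with identical keys, row selection is interpreted as dropping rows that lack a value for the analyzed attribute, and the name `reduce_table` and the sample rows are hypothetical.

```python
def reduce_table(rows, target, unwanted=()):
    """Reduce a list of dict records before rule mining (illustrative only)."""
    # Row selection (assumed meaning): keep rows with a value for the target.
    kept = [r for r in rows if r.get(target) not in (None, "")]
    # Feature selection: drop attributes the analyst marked as unwanted.
    kept = [{k: v for k, v in r.items() if k not in unwanted} for r in kept]
    if not kept:
        return []

    def informative(attr):
        # An attribute is kept if it is the target or takes more than one value.
        values = {r[attr] for r in kept}
        values.discard(None)
        values.discard("")
        return attr == target or len(values) > 1

    # Attribute selection: drop attributes that carry no usable value.
    attrs = [a for a in kept[0] if informative(a)]
    return [{a: r[a] for a in attrs} for r in kept]


# Hypothetical records for illustration only.
rows = [
    {"id": 1, "base": "a", "next": "c", "source": "liver"},
    {"id": 2, "base": "g", "next": "t", "source": "liver"},
    {"id": 3, "base": "", "next": "t", "source": "liver"},
]
reduced = reduce_table(rows, target="base", unwanted=("id",))
```

In this example the third row is dropped because it has no `base` value, `id` is removed as unwanted, and `source` is removed because it is constant.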

Data for Research

This data set includes the following description:

DEFINITION Homo sapiens occludin (OCLN), transcript variant 1, mRNA.
ACCESSION NM_002538 XM_003118543 XM_936894
VERSION NM_002538.3 GI:327478412
KEYWORDS .
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi; Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini; Catarrhini; Hominidae; Homo.
REFERENCE 1 (bases 1 to 6451)
AUTHORS Al-Sadi,R., Khatib,K., Guo,S., Ye,D., Youssef,M. and Ma,T.
TITLE Occludin regulates macromolecule flux across the intestinal epithelial tight junction barrier
JOURNAL Am. J. Physiol. Gastrointest. Liver Physiol. 300 (6), G1054-G1064 (2011)
PUBMED 21415414
REMARK GeneRIF: Suggest occludin plays a crucial role in the maintenance of tight junction barrier through the large-channel TJ pathway, the pathway responsible for the macromolecule.

Normal liver cells Original Data

1 gcctctctcc atcagacacc ccaaggttcc atccgaagca ggcggagcac cgaacgcaccccggggtggt cagggacccc catccgtgct gccccctagg agcccgcgcc tctcctctgcgccccgcctc tcgggccgca acgtcgcgcg gttcctttaacagcgcgctg gcagggtgtgggaagcagga ccgcgtcctc ccgccccctc ccatccgagt ttcaggtgaa ttggtcaccg gggaggagg ccgacacacc acacctacac tcccgcgtcc acctctccct ccctgcttcc ctggcggag gcggcaggaa ccgagagcca ggtccagagc gccgaggagc cggtctagga gcagcagat tggtttatct tggaagctaa agggcattgc tcatcctgaa gatcagctga

attaacttttg ccccctttca agtcaccctt cactgagttt cttcactatc tttccaaaaa g tgtaaatctt agcacaacag gctgcagctt aaagtccttt agtgactccc cgtagctcag taggatgaggt tctcatttcg gagtatttac agttcttgtc tatctctgtg gcctcgactc cgtccccactct cctccaagcc ccatttcctt gactgggcag cactccttgt tcttcctatt ccttatgctg tttcctgcct ctagccccgt gcgtttgtac ttcccactgc tggaacattc agttctctcctt tccctttccc cgctcctgat ccttcagagt ctaataccca cctctctggg aggccacatg agctcactgg acaggtgctc ctctgtgtgc aaacatcact gtgcatggct gctgttagagt acttcatgcc atgtaatttt tgccccttta ttcatctctc ccctcatttg tctggaaatcc tgtgagggca gcatctgtgt cttgtctaac ttggtatccc tgacacctaa

METHODOLOGY

The proposed methodology uses a gene dataset for mining. By mining frequent patterns, the defects that occur at each node can be easily identified and rectified. In this paper, the Apriori and FSA-Red algorithms are applied to the database using Weka to compare memory efficiency and execution time. Searching is also done with the help of this tool. The proposed system is intended to reproduce the effect of existing algorithms for mining frequent item sets on very large DNA datasets and to validate the new scheme on the dataset. The extracted knowledge is presented in the form of easy-to-understand rules, while details of the process, such as time taken, file size, and memory levels, are recorded and summarized. The tool also displays results through graphical representations such as bar charts and line graphs. Such graphics can help to summarize the knowledge being analyzed by providing a concise conceptualization of the data under scrutiny.
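A comparison of execution time and memory use can be set up with a small measurement helper. The sketch below uses Python's standard `time` and `tracemalloc` modules rather than the tools named in this paper; the helper `profile` and the workload `count_bases` are hypothetical stand-ins for the algorithms being compared.

```python
import time
import tracemalloc


def profile(func, *args, **kwargs):
    """Run func once; return (result, elapsed_seconds, peak_bytes)."""
    tracemalloc.start()
    start = time.perf_counter()
    try:
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start
        # get_traced_memory returns (current, peak) in bytes.
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, elapsed, peak


def count_bases(seq):
    """Stand-in workload: count how often each character occurs."""
    counts = {}
    for ch in seq:
        counts[ch] = counts.get(ch, 0) + 1
    return counts


result, seconds, peak_bytes = profile(count_bases, "gcctctctcc" * 1000)
```

Running each algorithm through the same helper on the same input keeps the comparison consistent; repeated runs would be needed for stable timings.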

IMPLEMENTATION

Implementation is a crucial stage in the life cycle of a newly designed system. It is the process of changing from the old system to the new one. In the proposed research work, association rule mining is performed on gene databases. The Apriori and FSA-Red algorithms are implemented using MATLAB. Preprocessing here means data cleaning: unnecessary information is removed, or the data are reconfigured to ensure a consistent format. Data can be modified or converted into different formats. The gene data are indexed, which makes it easier to generate candidate item sets. The Apriori algorithm uses the indexed data to generate sequence sets, and frequent item sets are identified from the gene database. The flexibility of the FSA-Red algorithm lies in how the attribute is chosen: there is no limitation on which attribute to exclude, meaning any attribute can be chosen as the basis of the reduction process, even one that is not the best compared with the others. This is a benefit of the reduction procedure, which may yield rich association patterns in the data. The count and position of gene sequences are retrieved using the Apriori algorithm. The following table shows the RBC cancer data set with the count of each occurrence, and the data with T replaced by U and its occurrences.
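The cleaning, numeric indexing, and T-to-U replacement steps can be sketched in Python as follows. This is an illustration, not the MATLAB code used in this paper: the numeric codes in `BASE_CODES` are assumed, since the paper does not state which numbers are assigned, and the function names are hypothetical. The sample input is the start of the normal liver cell sequence shown earlier.

```python
from collections import Counter

# Assumed numeric codes; the paper does not state which numbers it assigns.
BASE_CODES = {"a": 1, "c": 2, "g": 3, "t": 4, "u": 4}


def clean(raw):
    """Preprocessing: drop position numbers and spaces, lower-case the bases."""
    return "".join(ch for ch in raw.lower() if ch in "acgtu")


def encode(seq):
    """Index the sequence by mapping each base to its numeric code."""
    return [BASE_CODES[ch] for ch in seq]


def to_rna(seq):
    """Replace every T with U."""
    return seq.replace("t", "u")


raw = "1 gcctctctcc atcagacacc"
dna = clean(raw)
rna = to_rna(dna)
dna_counts = Counter(dna)  # occurrences of each base before replacement
rna_counts = Counter(rna)  # occurrences after T is replaced by U
codes = encode(dna)
```

Because the replacement is one-to-one, the count of U after replacement equals the count of T before it.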

RESULTS AND DISCUSSION

The count and position of gene sequences are retrieved using the Apriori algorithm. Single-, double-, and triple-character searches are done with the Apriori algorithm in MATLAB. Figure 1 shows the double-character search in the gene database.
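A single-, double-, or triple-character search that reports both count and position can be sketched with plain string scanning. The Python sketch below is illustrative and does not reproduce the paper's MATLAB code; the function name `find_pattern` is hypothetical, and 1-based positions with overlapping matches are assumptions.

```python
def find_pattern(seq, pattern):
    """Return 1-based start positions of every (overlapping) match."""
    if not pattern:
        raise ValueError("pattern must not be empty")
    positions = []
    start = seq.find(pattern)
    while start != -1:
        positions.append(start + 1)
        # Advance by one character so overlapping matches are also found.
        start = seq.find(pattern, start + 1)
    return positions


seq = "gcctctctcc"  # first ten bases of the normal liver cell sequence
single = find_pattern(seq, "c")    # [2, 3, 5, 7, 9, 10]
double = find_pattern(seq, "ct")   # [3, 5, 7]
triple = find_pattern(seq, "ctc")  # [3, 5, 7]
```

The count of each pattern is simply the length of the returned list.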

Figure 2 shows the single-character search on liver cancer cells, compared between the FSA-Red algorithm and the Apriori algorithm. In this graph, the x-axis represents the range of data and the y-axis represents the values. The performance of the two algorithms shows that the FSA-Red algorithm achieves lower memory, speed, and accuracy values compared with the Apriori algorithm.

Figure 3 shows cancer-affected liver cells compared between the FSA-Red algorithm and the Apriori algorithm. In this graph, the x-axis represents the range of data and the y-axis represents the values.

Figure 4 shows the rule-based classifier for liver cancer cells with the original nucleotide position of each amino acid. Using the rule-based classifier, the distance between nucleotide positions is estimated.

The performance, speed, accuracy, and storage positions retrieved using the Apriori algorithm are shown in Figure 6. Single-, double-, and triple-character searches are done with the Apriori algorithm in MATLAB.

The nucleotide distance between nodes and the ratio of occurrence of each pair of nodes are estimated using the FSA-Red algorithm and shown in Figure 7.
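One possible reading of these two quantities can be sketched in Python. The definitions are assumptions, since the paper does not define them precisely: distance is taken as the gap between consecutive positions of the same nucleotide, and the ratio of a pair is its share among all overlapping adjacent pairs. The function names are hypothetical, and this is not the FSA-Red algorithm.

```python
from collections import Counter


def position_gaps(seq, base):
    """Distances between consecutive occurrences of a base (assumed definition)."""
    positions = [i + 1 for i, ch in enumerate(seq) if ch == base]
    return [b - a for a, b in zip(positions, positions[1:])]


def pair_ratios(seq):
    """Share of each adjacent pair among all overlapping pairs."""
    pairs = [seq[i:i + 2] for i in range(len(seq) - 1)]
    total = len(pairs)
    if total == 0:
        return {}
    return {pair: n / total for pair, n in Counter(pairs).items()}


seq = "gcctctctcc"  # first ten bases of the normal liver cell sequence
gaps = position_gaps(seq, "c")  # c occurs at positions 2, 3, 5, 7, 9, 10
ratios = pair_ratios(seq)
```

For this fragment, the pairs `ct` and `tc` each account for one third of the nine adjacent pairs.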

CONCLUSION

The proposed tool extracts rules from gene data files using a variety of selectable algorithms and criteria. The program integrates several mining methods that allow efficient extraction of rules, while allowing the thoroughness of the mining to be specified at the user's discretion. The program also displays results through various graphical representations. Such representations can help to summarize the knowledge being analyzed by providing a concise conceptualization of the data under scrutiny. This paper uses the Apriori and FSA-Red algorithms; other algorithms could be used to improve this approach. The approach was applied to a biological application, namely DNA data sets; future work can be carried out in other industries.


References

Sudhir Jagtap, Kodge B. G., Shinde G. N., Devshette P. M., "Role of Association Rule Mining in Numerical Data Analysis".

M. Anandavalli, M. K. Ghose, K. Gauthaman, "Association Rule Mining in Genomics", International Journal of Computer Theory and Engineering, Vol. 2, No. 2, April 2010.

Piatetsky-Shapiro, G. (1991), Discovery, analysis, and presentation of strong rules, in G. Piatetsky-Shapiro & W. J. Frawley, eds, 'Knowledge Discovery in Databases', AAAI/MIT Press, Cambridge, MA.


Bayardo, Roberto J., Jr.; Agrawal, Rakesh; Gunopulos, Dimitrios (2000). "Constraint-based rule mining in large, dense databases". Data Mining and Knowledge Discovery (2): 217–240. doi:10.1023/A:1009895914772.

Webb, Geoffrey I. (2000); Efficient Search for Association Rules, in Ramakrishnan, Raghu; and Stolfo, Sal; eds.; Proceedings of the Sixth ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD-2000), Boston, MA, New York.

http://www.b3intelligence.com/NumericalDataMinig.html

http://en.wikipedia.org/wiki/Numerical_analysis

http://www.saedsayad.com/zeror.html

http://www.cogsys.wiai.unibamberg.de/teaching/ss05/ml/slides/cogsysII-6.pdf

http://www.slideshare.net/totoyou/covering-rulesbased-algorithm


Arun K. Pujari, "Data Mining Techniques", Universities Press (India) Private Limited, 2001.

F. Braz, "A review of the association rules data mining techniques for the analysis of gene expressions".

Douglas Trewartha, "Investigating data mining in MATLAB", Rhodes University, 2006.