
IMPROVING THE CLASSIFICATION ACCURACY USING SUPPORT VECTOR MACHINES (SVMS) WITH NEW KERNEL

Ashraf Afifi, E.A. Zanaty, and Said Ghoniemy
IT Department, College of Computers and IT, Taif University, Taif, Saudi Arabia
Corresponding Author: Ashraf Afifi, E-mail: a.afifi; n.allam@tu.edu.sa

Abstract

In this paper, we introduce a new kernel function, called the polynomial radial basis function (PRBF), that can improve the classification accuracy of support vector machines (SVMs). The proposed kernel function combines the Gauss (RBF) and Polynomial (POLY) kernels and is stated in a general form. It is shown that the proposed kernel converges faster than the Gauss and Polynomial kernels. The accuracy of the proposed algorithm is compared to that of algorithms based on both Gaussian and polynomial kernels by application to a variety of non-separable data sets with several attributes. The proposed kernel gives good classification accuracy on nearly all the data sets, especially those of high dimension.

Keywords

Classification problem, SVMs, kernel functions.

INTRODUCTION

The basic aim of support vector machines (SVMs) is to maximize the distance separating the elements of two different classes [1]. When the classes to which the elements belong are known a priori, the problem is called classification. The set of data used to calculate the boundary between the classes is called the training set, while the data set used to test the efficacy of the method is called the validation set. Initially, classification problems were designed to separate two classes, the binary class problem [2, 3].
Despite the maturity of classification, problems remain, especially in choosing the most appropriate SVM kernel for a particular application. In recent years, support vector machines (SVMs) have received considerable attention because of their superior performance in pattern recognition and regression [1, 4-8]. Choosing different kernel functions produces different SVMs [7, 10, 11] and may result in different performance. Some work has been done on limiting kernels using prior knowledge, but the best choice of a kernel for a given problem is still an open research issue [12, 13]. Comparisons of SVMs with the Gaussian function (GF) to radial basis function (RBF) classifiers can be found in the literature [14-17], but these focus on a subset of techniques and often only on performance accuracy. Tsang et al. [18] described a way to take advantage of the approximations inherent in kernel classifiers, using a minimum enclosing ball algorithm as an alternative means of speeding up training. An alternative approach to selecting an appropriate kernel, proposed in [13], is to use invariance transformations.
The drawback here is that such transformations are mostly appropriate only for linear SVM classifiers. The methods of [19, 20] can both be applied to pre-image applications with a discrete input space, since they do not require the gradient of the objective function. Generally, in implementations of this method, the time and space complexities are very high, because the core of the SVMs is based on approximate minimum enclosing ball algorithms, which are computationally expensive. An exact evaluation of intersection kernel SVMs that is logarithmic in time was presented by Maji et al. [21]. They showed that the method is relatively simple and its classification accuracy is acceptable, but the runtimes increase significantly compared with the established radial basis function (RBF) and polynomial (POLY) kernels, due to the large number of support vectors for each classifier [14, 21]. Zanaty et al. [15-17] combined the GF and RBF functions in one kernel, called the “universal kernel”, to take advantage of their respective strengths. The universal kernels were constructed from the most established kernels, such as the radial basis, Gauss, and polynomial functions, by optimizing their parameters using the training data. These kernels satisfied Mercer’s condition and converged faster than the existing kernels.
Achieving a support vector machine with high classification accuracy therefore requires specifying a high-quality kernel function. This paper addresses the problem of data classification using SVMs. We improve the accuracy of SVMs using a new kernel function, concentrating on non-linearly separable data sets. The proposed kernel function, called the polynomial radial basis function (PRBF), combines the Gauss and Polynomial kernels and is analyzed to prove its advantages over the Gaussian and Polynomial kernels. The SVMs modified with the proposed PRBF kernel are evaluated on different data sets.
The rest of this paper is organized as follows. In section 2, the problem formulation is stated. In section 3, the traditional kernel functions are discussed. A new kernel function is presented and analyzed in section 4. Experimental results are shown in section 5. Finally, section 6 gives our conclusions.

THE PROBLEM FORMULATION

The SVM algorithm [7] has been shown to be one of the most effective machine learning algorithms. It gives very good results in terms of accuracy when the data are linearly or non-linearly separable. When the data are linearly separable, the SVM result is a separating hyperplane that maximizes the margin of separation between classes, measured along a line perpendicular to the hyperplane. If the data are not linearly separable, the algorithm works by mapping the data to a higher-dimensional feature space (where the data become separable) using an appropriate kernel function, and a maximum margin separating hyperplane is found in this space. Thus the weight vector that defines the maximal margin hyperplane is a sufficient statistic for the SVM algorithm (it contains all the information needed for constructing the separating hyperplane). Since this weight vector can be expressed as a weighted sum of a subset of training instances, called support vectors, it follows that the support vectors and the associated weights also constitute sufficient statistics for learning SVMs from centralized data.
Accuracy is usually measured by the proportion of correct classifications. For many data sets, the SVM may not be able to find any separating hyperplane at all (accuracy equal to 0), either because the kernel function is inappropriate for the training data or because the data contain mislabeled examples. The latter problem can be addressed by using a soft margin that accepts some misclassifications of the training examples. A soft margin can be obtained in two different ways. The first is to add a constant factor to the kernel function output whenever the given input vectors are identical. The second is to define a priori an upper bound on the size of the training set weights. In either case, the magnitude of the constant factor added to the kernel, or of the bound on the size of the weights, controls the number of training points that the system misclassifies, and its setting depends on the specific data at hand. Completely specifying a support vector machine therefore requires specifying two parameters: the kernel function and the magnitude of the penalty for violating the soft margin. Hence, in order to improve the accuracy of SVMs, we must select a suitable kernel function; this is the criterion for achieving better results.
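As an illustration of these two choices (the kernel function and the soft-margin penalty), the following minimal Python sketch uses scikit-learn as a stand-in for the toolbox used in this paper; the toy data set and all parameter values are assumptions for the example only, not the paper's settings.

# Illustrative sketch only: a soft-margin SVM whose two free choices are the kernel
# function and the penalty C for violating the soft margin (scikit-learn stand-in).
from sklearn.datasets import make_circles
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# A non-linearly separable toy problem (two concentric circles) -- assumed data.
X, y = make_circles(n_samples=400, factor=0.4, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# The kernel maps the data to a feature space where they become separable;
# C bounds the weights and so controls how many training points may be misclassified.
clf = SVC(kernel='rbf', C=10.0, gamma='scale')
clf.fit(X_train, y_train)

# The support vectors and their weights summarize the learned hyperplane.
print("support vectors per class:", clf.n_support_)
print("test accuracy (%):", 100.0 * clf.score(X_test, y_test))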

Binary Classification Problem:

[The equations of the binary SVM formulation, for both the separable and the non-separable (soft-margin) cases, appear here as images in the original; they include the dual problem whose constraint in equation (20) bounds each αi above by c.]
The difference between the separable and the non-separable case is the added constraint in equation (20): the αi’s now have an upper bound of c.
The support vectors in this solution are not only the training examples that lie on the hyperplane boundary, but also those that either fall between the two hyperplanes H+ and H− or fall on the wrong side of the decision surface.
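For reference, the standard soft-margin dual formulation that this discussion assumes (see, e.g., [1, 7]) can be written as follows; this is a restatement of the well-known form, and its numbering does not follow the original equations:

\begin{aligned}
\max_{\alpha}\quad & \sum_{i=1}^{N}\alpha_i \;-\; \frac{1}{2}\sum_{i=1}^{N}\sum_{j=1}^{N}\alpha_i \alpha_j\, y_i y_j\, K(x_i, x_j) \\
\text{subject to}\quad & \sum_{i=1}^{N}\alpha_i y_i = 0, \qquad 0 \le \alpha_i \le c,\; i = 1, \dots, N,
\end{aligned}

with decision function $f(x)=\operatorname{sign}\big(\sum_{i}\alpha_i y_i K(x_i, x) + b\big)$. In the separable case the only constraint on the multipliers is $\alpha_i \ge 0$; the upper bound $c$ is the added constraint referred to above, and the training examples with $\alpha_i > 0$ are the support vectors.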

Multi-Class Support Vector Machines:

The multi-class problem is the classification problem with more than two classes. Since SVMs are binary classifiers, i.e. they separate two classes, we need techniques to extend them to handle multiple classes. The goal of such a technique is to carry the generalization abilities of the binary classifiers over to the multi-class domain. Multi-class SVMs are usually implemented by combining several two-class SVMs. In the literature, numerous schemes have been proposed to solve this problem; popular methods are the one-versus-all method using a Winner-Takes-All strategy (WTA SVMs), the one-versus-one method implemented by Max-Wins Voting (MWV SVMs), DAG SVMs, and error-correcting codes [25]. Hsu and Lin [26] compared these methods on a number of data sets and found that MWV SVMs and WTA SVMs give similar generalization performance. Hastie and Tibshirani [27] proposed a good general strategy called pairwise coupling for combining the posterior probabilities provided by individual binary classifiers in order to do multi-class classification, which was extended in [28] to speed up multi-class evaluation. Since SVMs do not naturally give out posterior probabilities, they suggested a particular way of generating these probabilities from the binary SVM outputs and then used them together with pairwise coupling to do multi-class classification.
Here, we use a multi-class SVM classifier based on the one-versus-one algorithm (the voting strategy) [26, 27] and use the Support Vector Machine toolbox for MATLAB. The classifier is designed to read two input data files, the training data and the test data.
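As an illustration only (the paper itself uses the SVM toolbox for MATLAB), the following minimal Python sketch shows the Max-Wins Voting idea over the C(C-1)/2 pairwise classifiers, with scikit-learn's SVC assumed as the underlying binary classifier:

# Minimal sketch of the one-versus-one (Max-Wins Voting) strategy; scikit-learn's
# SVC is an assumed stand-in for the MATLAB SVM toolbox used in the paper.
from itertools import combinations
import numpy as np
from sklearn.svm import SVC

def train_one_vs_one(X, y, **svc_params):
    """Train one binary SVM per pair of classes: C(C-1)/2 classifiers in total."""
    classifiers = {}
    for a, b in combinations(np.unique(y), 2):
        mask = (y == a) | (y == b)
        classifiers[(a, b)] = SVC(**svc_params).fit(X[mask], y[mask])
    return classifiers

def predict_max_wins(classifiers, X):
    """Each pairwise classifier casts one vote per sample; the class with most votes wins."""
    classes = sorted({c for pair in classifiers for c in pair})
    index = {c: k for k, c in enumerate(classes)}
    votes = np.zeros((X.shape[0], len(classes)))
    for (a, b), clf in classifiers.items():
        pred = clf.predict(X)
        votes[np.arange(X.shape[0]), [index[p] for p in pred]] += 1
    return np.array(classes)[votes.argmax(axis=1)]

# Example usage (hypothetical arrays X_train, y_train, X_test):
# predictions = predict_max_wins(train_one_vs_one(X_train, y_train, kernel='rbf', C=10.0), X_test)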

KERNEL FUNCTIONS

Kernel functions are used to non-linearly map the input data to a high-dimensional space (the feature space), in which the mapped data become linearly separable [29, 30]. The idea of the kernel function is to enable operations to be performed in the input space rather than in the potentially high-dimensional feature space, so the inner product does not need to be evaluated in the feature space. The mapping is achieved by replacing the inner product (x·y) with Φ(x)·Φ(y); this mapping is defined by the kernel,
K(x, y) = Φ(x)·Φ(y).
In order for the data to be linearly separable, a suitable kernel is chosen. There are many different types of kernels, some of which are listed below. Consider the linear kernel, which just computes the dot product of two vectors, K(x, z) = x·z = ||x|| ||z|| cos(u), where u is the angle between x and z.
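For reference, the standard textbook forms of the two kernels compared throughout this paper (see, e.g., [7, 30, 31]) are, with polynomial degree d and Gaussian width σ:

K_{\mathrm{POLY}}(x, z) = (x \cdot z + 1)^{d}, \qquad K_{\mathrm{RBF}}(x, z) = \exp\!\left(-\frac{\lVert x - z \rVert^{2}}{2\sigma^{2}}\right).

The exact parameterization used in the paper's own equations, including the proposed PRBF that combines these two forms, may differ slightly from these textbook versions.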
[The equations defining the Polynomial (POLY), Gauss (RBF), and proposed PRBF kernels, together with the convergence analysis of the proposed kernel, appear here as images in the original.]
Therefore, the proposed function converges faster than the Gauss function. We normalized the data sets to lie within the interval [-1, 1], because the Polynomial function diverges on large intervals and the proposed function converges faster within this interval. Fig. (2), Fig. (3) and Fig. (4) show the shapes of the Polynomial (POLY), Gauss (RBF) and proposed (PRBF) kernels, respectively, within the interval [-1, 1] and with two-dimensional vectors x1, x2.
[Fig. (2): the Polynomial (POLY) kernel; Fig. (3): the Gauss (RBF) kernel; Fig. (4): the proposed (PRBF) kernel, each plotted over [-1, 1] for two-dimensional vectors x1, x2.]
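The exact PRBF formula is given in the paper's equations and is not reproduced here. Purely as an illustration of how a combined Gauss-Polynomial kernel can be supplied to an SVM, the sketch below uses a hypothetical product of the two standard kernels (a product of Mercer kernels is itself a Mercer kernel); it should not be read as the authors' exact PRBF definition, and the values d = 4 and σ = 3 are simply taken from the experimental settings reported below.

# Illustrative only: a hypothetical Gauss-Polynomial product kernel plugged into an
# SVM as a custom kernel. This is NOT necessarily the paper's exact PRBF definition.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.preprocessing import MinMaxScaler
from sklearn.svm import SVC

def combined_poly_rbf(X, Z, d=4, sigma=3.0):
    """Product of a polynomial kernel and a Gaussian kernel (a valid Mercer kernel,
    since products of Mercer kernels are again Mercer kernels)."""
    poly = (X @ Z.T + 1.0) ** d
    sq_dist = (np.sum(X ** 2, axis=1)[:, None]
               + np.sum(Z ** 2, axis=1)[None, :]
               - 2.0 * X @ Z.T)
    rbf = np.exp(-sq_dist / (2.0 * sigma ** 2))
    return poly * rbf

# Toy data standing in for one of the benchmark sets; attributes are normalized
# to [-1, 1], as the paper does before training.
X, y = make_classification(n_samples=200, n_features=8, n_informative=5, random_state=0)
X = MinMaxScaler(feature_range=(-1, 1)).fit_transform(X)

clf = SVC(kernel=combined_poly_rbf, C=10.0).fit(X, y)
print("training accuracy (%):", 100.0 * clf.score(X, y))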

EXPERIMENTAL RESULTS

Data Sets:

In order to evaluate the performance of the proposed kernel with SVMs, we carried out experiments with different data sets. Table (1) describes these data sets (see Appendix); more details can be found in [33, 34]. We can divide these data sets according to training set size into two types: large data sets (1→4) and small ones (5→8). Fig. ( ) shows some examples of the input data and the output.
[Table (1): description of the data sets.]

Comparative results:

We design a multi-class SVM classifier based on the one-versus-one algorithm (the voting strategy) and use the Support Vector Machine toolbox for MATLAB. The classifier reads two input data files (as shown in Fig. (5)), the training data and the test data. Each file is organized as records, each consisting of a vector of attributes X = [x1, x2, ..., xM] followed by the target Y Є {y1, y2, ..., yC}, where M is the number of attributes and C is the number of classes. The classifier constructs C(C-1)/2 binary classifiers and uses the training data to find the optimum separating hyperplane. Finally, we use the test data to compute the accuracy of the classifier from the equation Acc = (n / N) * 100, where n is the number of correctly classified examples and N is the total number of test examples. Table (2) compares the results of the Polynomial, Gauss and proposed kernels with this classifier.
[Table (2): classification accuracy of the Polynomial (d = 4), Gauss (σ = 3) and proposed (PRBF) kernels on the data sets.]
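A minimal Python sketch of this evaluation procedure follows, assuming scikit-learn in place of the MATLAB SVM toolbox and hypothetical file names 'train.csv' and 'test.csv' for the two comma-separated input files (attributes x1..xM followed by the target y in each record):

# Sketch of the evaluation procedure described above (assumed file names and format).
import numpy as np
from sklearn.svm import SVC

def load_records(path):
    """Each record: attribute vector x1..xM followed by the target y."""
    data = np.loadtxt(path, delimiter=',')
    return data[:, :-1], data[:, -1]

X_train, y_train = load_records('train.csv')   # hypothetical training file
X_test, y_test = load_records('test.csv')      # hypothetical test file

# SVC builds C(C-1)/2 pairwise (one-versus-one) binary classifiers internally.
clf = SVC(kernel='rbf', C=10.0, decision_function_shape='ovo').fit(X_train, y_train)

n = int(np.sum(clf.predict(X_test) == y_test))  # correctly classified test examples
N = y_test.shape[0]                             # total number of test examples
print("Acc = %.2f%%" % (100.0 * n / N))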

Result analysis:

From Table (2) it is obvious that the Gauss radial basis function with σ = 3 achieves better accuracy than the Polynomial function on the small data sets (5→8), while the Polynomial kernel with d = 4 gives better results on the large sets (1→4). Our proposed kernel, PRBF, achieves the best accuracy on nearly all the data sets, and especially on data set (5), which has the largest number of attributes, because the proposed function is more complex and combines the strengths of both its parents, the Gauss and Polynomial functions.
Table (3) presents the mean accuracy obtained with each kernel; it is obvious that the proposed kernel obtains the best mean accuracy compared to the classical Gauss and Polynomial functions. Fig. (6) summarizes the comparison of the performance of the SVMs with the different kernels (POLY, RBF and PRBF); it is clear that the proposed kernel (PRBF) achieves the highest accuracy.
[Table (3): mean accuracy of each kernel; Fig. (6): comparison of SVM performance with the POLY, RBF and PRBF kernels.]

CONCLUSION

In this paper, SVMs have been improved to solve classification problems by mapping the training data into a feature space with the aid of a new kernel function and then separating the data using a large-margin hyperplane.
We have tested the proposed kernel function with data sets of different sizes and different numbers of attributes. It is obvious from the experimental results that RBF gives better accuracy than the Polynomial function on the small data sets, whereas the Polynomial kernel gives better results on the large data sets. Our proposed kernel PRBF obtains the best accuracy on nearly all the data sets, and especially on the data set with the largest number of attributes, because the proposed function combines the strengths of both its parents, the Gauss and Polynomial functions.
Experimental results have been reported to illustrate the validity and effectiveness of the proposed kernel. They show that the proposed kernel function obtains the best accuracy on nearly all the data sets, especially on the data set with the largest number of attributes. Thus the proposed kernel function can be considered a good alternative to the Gaussian and polynomial kernel functions for some specific data sets.

References

  1. Vapnik V. N., “The nature of statistical learning theory”, Springer-Verlag, New York, NY, USA, 1995.
  2. Kim H., Pang S., Je H., Kim D., Bang S.Y., “Constructing support vector machine ensemble”, Pattern Recognition, vol. 36, no. 12, pp. 2757–2767, 2003.
  3. Du P., Peng J., Terlaky T., “Self-adaptive support vector machines: modeling and experiments”, Computational Management Science, vol. 6, no. 1, pp. 41–51, 2009.
  4. Boser B. E., Guyon I. M., Vapnik V. N., “A training algorithm for optimal margin classifiers”, Proc. Fifth Ann. Workshop Computing Learning Theory, pp. 144-152, 1995.
  5. Vapnik V.N., “An overview of statistical learning theory”, IEEE Trans. Neural networks, vol. 10, no. 5, pp. 988-999,1999.
  6. Cortes C., Vapnik V.N., “Support-vector networks”, Machine Learning, vol. 20, pp. 273- 297,1995.
  7. Burges, C., “A tutorial on support vector machines for pattern recognition”, Data Mining and Knowledge Discovery 2 (2), pp. 67-121, 1998.
  8. Vapnik V.N., Golowich S., Smola A., “Support vector method for function approximation, regression estimation and signal processing”, Advances in Neural Information Processing Systems, vol. 9, Cambridge, Mass.: MIT Press, 1997.
  9. Mangasarian O.L., Musicant D.R., “Successive over relaxation for support vector machines”, IEEE Trans. Neural Networks, vol. 10, no. 5, pp. 1032-1037,1999.
  10. Aronszajn N. , “Theory of Reproducing Kernels”, Trans. Am. Math. Soc.,vol. 68, pp. 337-404, 1950.
  11. Shawe-Taylor J., Bartlett P.L., Willianmson R.C., Anthony M., “Structural risk minimization over data-dependent hierarchies”, IEEE Trans. Information Theory, vol. 44, no. 5, pp. 1926-1940, 1998.
  12. Williamson R. C., Smola A., Schölkopf B., “Entropy numbers, operators and support vector kernels”, MA: MIT Press, Cambridge, pp. 44-127, 1999.
  13. Chapelle O. and Schölkopf B., “Incorporating invariances in non-linear support vector machines”, In T. G. Dietterich, S. Becker, and Z. Ghahramani (Eds.), Advances in Neural Information Processing Systems, Cambridge, MA: MIT Press, vol. 14, pp. 594-609, 2002.
  14. Hastie T., Hsu C-W., Lin C-J, “A comparison of methods for multi-class support vector Machines”, IEEE Transactions on Neural Networks, no. 13, pp.415–425, 2005.
  15. Zanaty , E.A. ,and Sultan Aljahdali, “Improving the accuracy of support vector machines”, Proceedings of 23rd International conference on computers and their application April, pp. 75-83, Cancun, Mexico, 2008.
  16. Zanaty E.A, Sultan Aljahdali, R.J. Cripps, “Accurate support vector machines for data classification”, Int. J. Rapid Manufacturing, vol. 1, no. 2, pp. 114-127, 2009.
  17. Zanaty E.A., Ashraf Afifi, “Support vector machines (SVMs) with universal kernels”, International Journal of Artificial Intelligence, vol. 25, pp. 575-589, 2011.
  18. Tsang I. W., Kwok J. T., Cheung P.-M., “Core vector machines: Fast SVMs training on very large data sets”, Journal of Machine Learning Research, vol. 6, pp. 271-363, 2005.
  19. Kwok J. T., Tsang I. W, “The pre-image problem in kernel methods”, IEEE Trans.On Neural Networks, no. 15, pp. 1517–1525, 2004.
  20. Bakir G. H., Weston J., Schölkopf B., “Learning to find pre-images”, Advances in Neural Information Processing Systems, vol. 16, pp. 449–456, 2004.
  21. Maji S., Berg A.C., Malik J., “Classification using intersection kernel support vector machines is efficient”, IEEE Conference on Computer Vision and Pattern Recognition, June pp.1–8, 2008.
  22. Boser B., Guyon I., Vapnik V., “A training algorithm for optimal margin classifiers”, In D. Haussler (Ed.), Annual ACM Workshop on COLT, Pittsburgh, PA, ACM Press, pp. 144-152, 1992.
  23. Osuna, E., R. Freund, and F. Girosi, “An improved training algorithm for support vector machines”, In J. Principe, L. Gile, N. Morgan, and E. Wilson (Eds.), Neural Networks for Signal Processing VII: Proceedings of IEEE Workshop, New York, NY, pp. 191-276, 1997.
  24. Burges C., Schölkopf B., “Improving the accuracy and speed of support vector machines”, Advances in Neural Information Processing Systems, vol. 9, p. 375, 1997.
  25. Rifkin R., Klautau A., “In defense of one-vs-all classification”, Journal of Machine Learning Research, vol. 5, pp. 101-141, 2004.
  26. Hsu C.-W., Lin C.-J., “A comparison of methods for multi-class support vector Machines”, IEEE Transactions on Neural Networks, vol.13, pp. 415-425, 2002.
  27. Hastie T., Tibshirani R., “Classification by pairwise coupling”, Annals of Statistics, vol. 26, no. 2, pp. 451-471, 1998.
  28. Hansheng Lei, Venu Govindaraju, “Speeding up multi-class SVM evaluation by PCA and feature selection”, Center for Unified Biometrics and Sensors (CUBS), 2005.
  29. Gunnar Ratsch, “A brief introduction into machine learning”, Friedrich Miescher Laboratory of the Max Planck Society, Germany, 2005.
  30. Scholkopf B., Smola A.J., “Learning with kernels, support vector machine, Regularization, Optimization and Beyond”, Cambridge, MA.MIT Press, 2002.
  31. Müller K.-R., Mika S., Rätsch G., Tsuda K., Schölkopf B., “An introduction to kernel-based learning algorithms”, IEEE Transactions on Neural Networks, vol. 12, no. 2, pp. 181–201, 2001.
  32. Baudat G., Anouar F., “Kernel-based methods and function approximation”, International Joint Conference on Neural Networks (IJCNN), Washington, IEEE Press, vol. 1, 2001.
  33. http://www.ics.uci.edu/~mlearn/MLRepository.html.
  34. http://www.liacc.up.pt/ML/old/statlog/datasets.html1.