ISSN: Online (2320-9801), Print (2320-9798)
Niti Syal1 and Naresh Kumar Garg2
Published in the International Journal of Innovative Research in Computer and Communication Engineering
With the advancement of digital technology, multimedia databases are growing every day. These databases usually contain images and videos in addition to textual information. Numerous algorithms for text localization in natural images and videos have been reported in the literature. All of these algorithms exploit different properties of text that help to distinguish text regions from other regions in natural scenes. This paper presents a review of text localization, intended to help beginners start their work in the field.
Keywords: Region Based, Texture Based, Support Vector Machine
INTRODUCTION
Text localization and recognition in real-world (scene) images is an open problem that has been receiving significant attention, since it is a critical component in a number of computer vision applications such as searching images by their textual content, reading labels on businesses in map applications (e.g. Google Street View), and assisting the visually impaired, as shown in Figure 1 [18].
Textual information is very useful semantic information because it describes the image or video and can be used to fully understand images and videos. Text localization can be performed on three kinds of images, namely:
1. Document image
2. Scene text image
3. Caption text image
Document images may be in the form of scanned book covers, CD covers or video images. Text in images or videos is classified as scene text and caption text. Scene text is also called graphics text: natural images that contain text are called scene text images. Caption text is also called artificial text; it is text that has been inserted or superimposed on the image [2]. Localizing text in an image is potentially a computationally very expensive task, as in general any of the 2^N subsets of pixels can correspond to text (where N is the number of pixels). Text localization methods deal with this problem in two different ways.
Most published methods for text localization and recognition [3]-[5] are based on a sequential pipeline consisting of three steps: text localization, text segmentation and processing by an OCR engine for printed documents. In such approaches, the overall success rate of the method is the product of the success rates of the individual stages, as there is no possibility to revisit decisions made by previous stages. Some authors have focused on subtasks of the scene text recognition problem, such as text localization [3]-[4], [6]-[7], individual character recognition [8]-[13] or reading text from segmented areas of images [10]. Whilst they achieved promising results on individual subtasks, separating text localization from text recognition inevitably leads to loss of information, which degrades overall text localization and recognition performance. Text localization is useful for many applications including tourism, navigation for the blind, robot guidance, and intelligent transportation systems. In order to be easily seen from the road, most signs contain text printed on a homogeneous background.
The objective of a text localizer is to output one text rectangle per word present in the image, as the ICDAR database [10] is annotated at word level. The output of the text detector is a set of rectangles of varying sizes covering text regions and some non-text regions.
APPROACHES FOR TEXT LOCALIZATION
Two different approaches have been used for text localization in complex images, namely the region based approach and the texture based approach.
A. Region Based Approach
This approach uses the color or gray-scale properties of a text region, or their differences with respect to the background. It is divided into two sub-categories: edge based and connected component (CC) based methods. The edge based method focuses mainly on the high contrast between text and background: text edges are first identified in the image and merged, and then heuristic rules are applied to discard non-text regions. The connected component based method considers text as a set of separate connected components, each having distinct intensity and color distributions. Edge based methods are robust to low contrast and varying text sizes, whereas CC based methods are somewhat simpler to implement, but they fail to localize text in images with complex backgrounds [11].
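The CC based idea above can be sketched in a few lines: label connected components in a binarized image, then discard components whose geometry is implausible for a character. The area and aspect-ratio thresholds below are illustrative assumptions, not values from any cited method.

```python
from collections import deque

def connected_components(binary):
    """4-connected component labeling via BFS on a 2-D 0/1 grid."""
    h, w = len(binary), len(binary[0])
    labels = [[0] * w for _ in range(h)]
    comps = []
    for y in range(h):
        for x in range(w):
            if binary[y][x] and not labels[y][x]:
                comps.append([])
                label = len(comps)
                labels[y][x] = label
                q = deque([(y, x)])
                while q:
                    cy, cx = q.popleft()
                    comps[-1].append((cy, cx))
                    for ny, nx in ((cy-1, cx), (cy+1, cx), (cy, cx-1), (cy, cx+1)):
                        if 0 <= ny < h and 0 <= nx < w and binary[ny][nx] and not labels[ny][nx]:
                            labels[ny][nx] = label
                            q.append((ny, nx))
    return comps

def text_like(comp, min_area=3, max_aspect=4.0):
    """Heuristic filter: keep components with plausible size and aspect ratio."""
    ys = [p[0] for p in comp]
    xs = [p[1] for p in comp]
    hgt = max(ys) - min(ys) + 1
    wid = max(xs) - min(xs) + 1
    aspect = max(hgt, wid) / min(hgt, wid)
    return len(comp) >= min_area and aspect <= max_aspect
```

On a binarized image, `[c for c in connected_components(img) if text_like(c)]` keeps compact, character-sized blobs and rejects long thin structures such as lines or borders.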
B. Texture Based Methods
Text in images has distinct textural properties that can be used to differentiate it from the background and from other non-text regions [12]. This method is based on these textural properties; the Fourier transform, the discrete cosine transform (DCT) and wavelet decomposition are generally used. The main drawback of this method is its high computational complexity but, on the other hand, it is more robust than CC based methods in dealing with complex backgrounds.
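As a minimal illustration of the DCT variant, one can compute the high-frequency (AC) energy of a block: text-bearing blocks, with their dense strokes, produce much more AC energy than flat background blocks. The naive O(N^4) transform below is for demonstration only and is not tied to any specific cited system.

```python
import math

def dct2(block):
    """Naive unnormalized 2-D DCT-II of an N x N block (fine for an 8x8 demo)."""
    n = len(block)
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            s = 0.0
            for y in range(n):
                for x in range(n):
                    s += (block[y][x]
                          * math.cos((2 * y + 1) * u * math.pi / (2 * n))
                          * math.cos((2 * x + 1) * v * math.pi / (2 * n)))
            out[u][v] = s
    return out

def ac_energy(block):
    """Texture measure: sum of squared AC (non-DC) DCT coefficients."""
    c = dct2(block)
    n = len(c)
    return sum(c[u][v] ** 2 for u in range(n) for v in range(n) if (u, v) != (0, 0))
```

A uniform block yields near-zero AC energy, while a high-contrast stroke-like pattern yields a large value, so thresholding `ac_energy` per block gives a crude text/non-text map.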
TEXT LOCALIZATION
A. MSER Detection
Since the original search space induced by all regions R of image I is huge, certain approximations were applied in our approach. Assuming that individual characters are detected as Extremal Regions (ERs), and taking computational complexity into consideration, the search space was limited to the set M of Maximally Stable Extremal Regions (MSERs) [14], which can be computed in linear time in the number of pixels [15]. The set of MSERs detected in certain scalar image projections (intensity, red channel, blue channel, green channel) defines the set of vertices of the graph G, i.e. V(G) = M. The edges of the graph G are not stored explicitly, but are induced on the fly.
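The core of MSER detection can be illustrated on a toy scale: an extremal region is the connected set of pixels at or below a threshold t, and it is "maximally stable" where its area barely changes as t varies. The seed-based sketch below, assuming a single region of interest and a fixed stability margin delta, is a simplification of the real linear-time component-tree algorithm.

```python
from collections import deque

def region_area(img, seed, t):
    """Area of the extremal (intensity <= t) connected region containing seed."""
    h, w = len(img), len(img[0])
    sy, sx = seed
    if img[sy][sx] > t:
        return 0
    seen = {seed}
    q = deque([seed])
    while q:
        y, x = q.popleft()
        for ny, nx in ((y-1, x), (y+1, x), (y, x-1), (y, x+1)):
            if 0 <= ny < h and 0 <= nx < w and (ny, nx) not in seen and img[ny][nx] <= t:
                seen.add((ny, nx))
                q.append((ny, nx))
    return len(seen)

def stability(img, seed, t, delta=5):
    """Relative area change across +/- delta gray levels; small means stable (MSER-like)."""
    a = region_area(img, seed, t)
    if a == 0:
        return float("inf")
    return (region_area(img, seed, t + delta) - region_area(img, seed, t - delta)) / a
```

For a dark character on a bright background, `stability` stays near zero over a wide range of thresholds and spikes when the region suddenly merges with the background, which is exactly the criterion MSER exploits.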
B. Character and Non-Character Classification
In this module, each vertex of the graph G is labeled as a character or a non-character using a trained classifier, which creates an initial hypothesis of text position, because character vertices are likely to be included in some path p representing a text line, as shown in Figure 2 [10].
The features used by the classifier, shown in Table 1, are scale invariant so that characters of all sizes can be detected, but they are not rotation invariant, which implies that characters at different rotations had to be included in the training set.
A standard Support Vector Machine (SVM) [16] classifier with a Radial Basis Function (RBF) kernel [17] was used. The classifier was trained on a set of 1227 characters and 1396 non-characters, obtained by manually annotating MSERs extracted from real-world images downloaded from Flickr. The classification error obtained by cross-validation was 5.6%. The training set was relatively small and certainly does not contain all possible fonts, scripts or even characters, but extending it with more examples did not bring any significant improvement in the classification success rate. This indicates that the features used by the character classifier are insensitive to fonts and alphabets.
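The classifier setup can be sketched with scikit-learn. The two features here (aspect ratio and a stroke-width variation measure) and the synthetic training data are purely hypothetical stand-ins for the paper's actual feature set in Table 1; only the model choice, an SVM with an RBF kernel, follows the text.

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# Hypothetical 2-D features per region: (aspect ratio, stroke-width variation).
# Character-like regions cluster near (1.0, 0.1); background blobs near (3.0, 0.6).
chars = rng.normal([1.0, 0.1], [0.1, 0.03], size=(50, 2))
non_chars = rng.normal([3.0, 0.6], [0.3, 0.1], size=(50, 2))
X = np.vstack([chars, non_chars])
y = np.array([1] * 50 + [0] * 50)  # 1 = character, 0 = non-character

# RBF-kernel SVM, as in the text; gamma="scale" is sklearn's default heuristic.
clf = SVC(kernel="rbf", gamma="scale").fit(X, y)
```

Each MSER would then be mapped to its feature vector and passed to `clf.predict` to label the corresponding graph vertex as character or non-character.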
C. Text Line Hypothesis Formation
In real-world images a font rarely changes inside a word, which implies that certain character measurements (character height, aspect ratio, spacing between characters, stroke width, etc.) are either constant or constrained to a limited interval. Based on this observation, an approximation ĥ(p, v) of the function h(p, v) was implemented using an SVM classifier with a polynomial kernel, whose feature vector is created by comparing the average character measurements of the existing path p to the character measurements of a given vertex v.
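The feature vector construction for ĥ(p, v) can be sketched as follows: for each measurement, compute the relative deviation of the candidate vertex from the running average along the path. The measurement names (`height`, `aspect`, `stroke_width`) are illustrative; the paper does not enumerate the exact feature vector here.

```python
def path_features(path_chars, candidate):
    """Feature vector comparing a candidate character's measurements to the
    running averages of an existing text-line path, as input for the
    h-hat(p, v) classifier. Each element is a relative deviation: values near
    0 mean the candidate matches the line's font; large values mean a mismatch.
    Measurements are dicts with hypothetical keys: height, aspect, stroke_width."""
    keys = ("height", "aspect", "stroke_width")
    n = len(path_chars)
    feats = []
    for k in keys:
        avg = sum(c[k] for c in path_chars) / n
        feats.append(abs(candidate[k] - avg) / avg)
    return feats
```

An SVM with a polynomial kernel trained on such vectors then decides whether the candidate vertex should extend the path, which is how a text line grows one character at a time.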
TEXT LOCALIZATION IN COMPLEX IMAGES
For text localization in complex images the following steps are used. First, a wavelet-based edge extraction scheme is applied to the gray-level image. Second, the gray-level edge image is converted into a binary image using a suitable global threshold. Next, a filter is applied to the binary image to remove noise and non-text areas, and the text locations are determined using a projection profile. Several heuristic methods are applied to improve system performance, and bounding boxes are generated in the two last steps. Figure 3 shows the structure of the system.
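The projection-profile step above can be sketched as follows, assuming the binary edge map has already been computed: sum the edge pixels per row, then group consecutive rows whose sums exceed a threshold into candidate text bands. The threshold here is an illustrative parameter, not a value from the cited system.

```python
def horizontal_profile(binary):
    """Row-wise sum of edge pixels in a binary edge map; text rows show up as peaks."""
    return [sum(row) for row in binary]

def text_bands(profile, thresh):
    """Group consecutive rows whose profile value exceeds thresh into
    (top_row, bottom_row) bands, one per candidate text line."""
    bands, start = [], None
    for i, v in enumerate(profile):
        if v > thresh and start is None:
            start = i
        elif v <= thresh and start is not None:
            bands.append((start, i - 1))
            start = None
    if start is not None:
        bands.append((start, len(profile) - 1))
    return bands
```

Running the same grouping on column-wise sums inside each band yields left and right extents, which together with the bands give the bounding boxes mentioned in the last steps.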
CONCLUSION
Text in images carries useful information for automatic annotation, indexing and structuring of images. Many text localization approaches use features related to the properties of text characters. Edge detection is a common first step in many such algorithms, and texture is another widely used feature for text detection and localization. Some approaches assume that the text is written in the horizontal or vertical direction. The methods discussed above focus on identifying text so that it can be extracted and interpreted. This paper should be very useful for beginners starting their work in the field of text localization.
References