An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition
Abstract
Image-based sequence recognition has been a long-standing research topic in computer vision. In this paper, we investigate the problem of scene text recognition, which is among the most important and challenging tasks in image-based sequence recognition. A novel neural network architecture, which integrates feature extraction, sequence modeling and transcription into a unified framework, is proposed. Compared with previous systems for scene text recognition, the proposed architecture possesses four distinctive properties: (1) It is end-to-end trainable, in contrast to most of the existing algorithms whose components are separately trained and tuned. (2) It naturally handles sequences of arbitrary lengths, involving no character segmentation or horizontal scale normalization. (3) It is not confined to any predefined lexicon and achieves remarkable performance in both lexicon-free and lexicon-based scene text recognition tasks. (4) It generates an effective yet much smaller model, which is more practical for real-world application scenarios. The experiments on standard benchmarks, including the IIIT-5K, Street View Text and ICDAR datasets, demonstrate the superiority of the proposed algorithm over the prior arts. Moreover, the proposed algorithm performs well in the task of image-based music score recognition, which evidently verifies its generality.
1. Introduction
Recently, the community has seen a strong revival of neural networks, which is mainly stimulated by the great success of deep neural network models, specifically Deep Convolutional Neural Networks (DCNN), in various vision tasks. However, the majority of recent works related to deep neural networks have been devoted to the detection or classification of object categories [12, 25]. In this paper, we are concerned with a classic problem in computer vision: image-based sequence recognition. In the real world, a variety of visual objects, such as scene text, handwriting and musical scores, tend to occur in the form of sequences, not in isolation. Unlike general object recognition, recognizing such sequence-like objects often requires the system to predict a series of object labels, instead of a single label. Therefore, the recognition of such objects can be naturally cast as a sequence recognition problem. Another unique property of sequence-like objects is that their lengths may vary drastically. For instance, English words can consist of 2 characters, such as "OK", or 15 characters, such as "congratulations". Consequently, the most popular deep models, such as DCNN [25, 26], cannot be directly applied to sequence prediction, since DCNN models often operate on inputs and outputs with fixed dimensions and are thus incapable of producing a variable-length label sequence.
Some attempts have been made to address this problem for specific sequence-like objects (e.g. scene text). For example, the algorithms in [35, 8] first detect individual characters and then recognize these detected characters with DCNN models, which are trained using labeled character images. Such methods often require training a strong character detector for accurately detecting and cropping each character out from the original word image. Some other approaches (such as [22]) treat scene text recognition as an image classification problem, and assign a class label to each English word (90K words in total). This results in a large trained model with a huge number of classes, which is difficult to generalize to other types of sequence-like objects, such as Chinese texts, musical scores, etc., because the number of basic combinations of such sequences can be greater than 1 million. In summary, current systems based on DCNN cannot be directly used for image-based sequence recognition.
Recurrent neural network (RNN) models, another important branch of the deep neural network family, were mainly designed for handling sequences. One of the advantages of RNN is that it does not need the position of each element in a sequence object image in either training or testing. However, a preprocessing step that converts an input object image into a sequence of image features is usually essential. For example, Graves et al. [16] extract a set of geometrical or image features from handwritten texts, while Su and Lu [33] convert word images into sequential HOG features. The preprocessing step is independent of the subsequent components in the pipeline, so the existing systems based on RNN cannot be trained and optimized in an end-to-end fashion.
Several conventional scene text recognition methods that are not based on neural networks have also brought insightful ideas and novel representations into this field. For example, Almazán et al. [5] and Rodriguez-Serrano et al. [30] proposed to embed word images and text strings in a common vectorial subspace, converting word recognition into a retrieval problem. Yao et al. [36] and Gordo et al. [14] used mid-level features for scene text recognition. Though they achieved promising performance on standard benchmarks, these methods are generally outperformed by previous algorithms based on neural networks [8, 22], as well as by the approach proposed in this paper.
The main contribution of this paper is a novel neural network model whose architecture is specifically designed for recognizing sequence-like objects in images. The proposed neural network model is named Convolutional Recurrent Neural Network (CRNN), since it is a combination of DCNN and RNN. For sequence-like objects, CRNN possesses several distinctive advantages over conventional neural network models: 1) It can be directly learned from sequence labels (for instance, words), requiring no detailed annotations (for instance, characters); 2) It has the same property as DCNN of learning informative representations directly from image data, requiring neither hand-crafted features nor preprocessing steps such as binarization/segmentation, component localization, etc.; 3) It has the same property as RNN of being able to produce a sequence of labels; 4) It is unconstrained by the lengths of sequence-like objects, requiring only height normalization in both training and testing phases; 5) It achieves better or highly competitive performance on scene texts (word recognition) than the prior arts [23, 8]; 6) It contains far fewer parameters than a standard DCNN model, consuming less storage space.
2. The Proposed Network Architecture
The network architecture of CRNN, as shown in Fig. 1, consists of three components, including the convolutional layers, the recurrent layers, and a transcription layer, from bottom to top.
At the bottom of CRNN, the convolutional layers automatically extract a feature sequence from each input image. On top of the convolutional network, a recurrent network is built for making predictions for each frame of the feature sequence output by the convolutional layers. The transcription layer at the top of CRNN is adopted to translate the per-frame predictions made by the recurrent layers into a label sequence. Though CRNN is composed of different kinds of network architectures (e.g. CNN and RNN), it can be jointly trained with one loss function.
Figure 1. The network architecture. The architecture consists of three parts: 1) convolutional layers, which extract a feature sequence from the input image; 2) recurrent layers, which predict a label distribution for each frame; 3) transcription layer, which translates the per-frame predictions into the final label sequence.
2.1. Feature Sequence Extraction
In the CRNN model, the component of convolutional layers is constructed by taking the convolutional and max-pooling layers from a standard CNN model (the fully-connected layers are removed). Such a component is used to extract a sequential feature representation from an input image. Before being fed into the network, all the images need to be scaled to the same height. Then a sequence of feature vectors is extracted from the feature maps produced by the component of convolutional layers, which is the input to the recurrent layers. Specifically, each feature vector of the feature sequence is generated from left to right on the feature maps by column. This means the i-th feature vector is the concatenation of the i-th columns of all the maps. The width of each column in our settings is fixed to a single pixel.
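As an illustration, the following minimal NumPy sketch (ours, not the paper's implementation; the shapes are hypothetical but typical for CRNN) converts a stack of feature maps into such a column-wise feature sequence:

```python
import numpy as np

# Hypothetical conv output: 512 maps of size 1 x 26, as in a typical
# CRNN configuration where pooling reduces the feature-map height to 1.
feature_maps = np.random.randn(512, 1, 26).astype(np.float32)

C, H, W = feature_maps.shape
# The i-th feature vector concatenates the i-th column of all C maps,
# giving a sequence of length T = W with C * H dimensions per frame.
feature_sequence = [feature_maps[:, :, i].reshape(C * H) for i in range(W)]

assert len(feature_sequence) == W
assert feature_sequence[0].shape == (C * H,)
```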
As the layers of convolution, max-pooling, and element-wise activation operate on local regions, they are translation invariant. Therefore, each column of the feature maps corresponds to a rectangular region of the original image (termed the receptive field), and such rectangular regions are in the same order, from left to right, as their corresponding columns on the feature maps. As illustrated in Fig. 2, each vector in the feature sequence is associated with a receptive field, and can be considered as the image descriptor for that region.
Figure 2. The receptive field. Each vector in the extracted feature sequence is associated with a receptive field on the input image, and can be considered as the feature vector of that field.
Being robust, rich and trainable, deep convolutional features have been widely adopted for different kinds of visual recognition tasks [25, 12]. Some previous approaches have employed CNN to learn a robust representation for sequence-like objects such as scene text [22]. However, these approaches usually extract a holistic representation of the whole image by CNN, then collect local deep features for recognizing each component of a sequence-like object. Since CNN requires the input images to be scaled to a fixed size in order to satisfy its fixed input dimension, it is not appropriate for sequence-like objects due to their large length variation. In CRNN, we convey deep features into sequential representations in order to be invariant to the length variation of sequence-like objects.
2.2. Sequence Labeling
A deep bidirectional Recurrent Neural Network is built on top of the convolutional layers, as the recurrent layers. The recurrent layers predict a label distribution $y_t$ for each frame $x_t$ in the feature sequence $x = x_1, \dots, x_T$. The advantages of the recurrent layers are three-fold. Firstly, RNN has a strong capability of capturing contextual information within a sequence. Using contextual cues for image-based sequence recognition is more stable and helpful than treating each symbol independently. Taking scene text recognition as an example, wide characters may require several successive frames to fully describe (refer to Fig. 2). Besides, some ambiguous characters are easier to distinguish when observing their contexts, e.g. it is easier to recognize "il" by contrasting the character heights than by recognizing each of them separately. Secondly, RNN can back-propagate error differentials to its input, i.e. the convolutional layer, allowing us to jointly train the recurrent layers and the convolutional layers in a unified network. Thirdly, RNN is able to operate on sequences of arbitrary lengths, traversing from start to end.
Figure 3. (a) The structure of a basic LSTM unit. An LSTM consists of a cell module and three gates, namely the input gate, the output gate and the forget gate. (b) The structure of the deep bidirectional LSTM we use in our paper. Combining a forward (left to right) and a backward (right to left) LSTM results in a bidirectional LSTM. Stacking multiple bidirectional LSTMs results in a deep bidirectional LSTM.
A traditional RNN unit has a self-connected hidden layer between its input and output layers. Each time it receives a frame $x_t$ in the sequence, it updates its internal state $h_t$ with a non-linear function that takes both the current input $x_t$ and the past state $h_{t-1}$ as its inputs: $h_t = g(x_t, h_{t-1})$. Then the prediction $y_t$ is made based on $h_t$. In this way, past contexts $\{x_{t'}\}_{t' < t}$ are captured and utilized for prediction. The traditional RNN unit, however, suffers from the vanishing gradient problem [7], which limits the range of context it can store and adds burden to the training process. Long Short-Term Memory [18, 11] (LSTM) is a type of RNN unit that is specially designed to address this problem. An LSTM (illustrated in Fig. 3) consists of a memory cell and three multiplicative gates, namely the input, output and forget gates. Conceptually, the memory cell stores the past contexts, and the input and output gates allow the cell to store contexts for a long period of time. Meanwhile, the memory in the cell can be cleared by the forget gate. The special design of LSTM allows it to capture long-range dependencies, which often occur in image-based sequences.
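For reference, one common formulation of the LSTM update, with $\sigma$ denoting the sigmoid function and $\odot$ element-wise multiplication (these equations follow the standard literature rather than being spelled out in this paper), is:

```latex
\begin{aligned}
i_t &= \sigma(W_{xi} x_t + W_{hi} h_{t-1} + b_i) & \text{(input gate)} \\
f_t &= \sigma(W_{xf} x_t + W_{hf} h_{t-1} + b_f) & \text{(forget gate)} \\
o_t &= \sigma(W_{xo} x_t + W_{ho} h_{t-1} + b_o) & \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot \tanh(W_{xc} x_t + W_{hc} h_{t-1} + b_c) & \text{(memory cell)} \\
h_t &= o_t \odot \tanh(c_t) & \text{(hidden state)}
\end{aligned}
```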
LSTM is directional; it only uses past contexts. However, in image-based sequences, contexts from both directions are useful and complementary to each other. Therefore, we follow [17] and combine two LSTMs, one forward and one backward, into a bidirectional LSTM. Furthermore, multiple bidirectional LSTMs can be stacked, resulting in a deep bidirectional LSTM as illustrated in Fig. 3.b. The deep structure allows a higher level of abstraction than a shallow one, and has achieved significant performance improvements in the task of speech recognition [17].
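As an illustration, such a stack with per-frame classification could be sketched as follows (a minimal sketch assuming PyTorch; the hidden size of 256 and label set size of 37 are illustrative choices, not specifications from this paper):

```python
import torch
import torch.nn as nn

# Two stacked bidirectional LSTM layers over the feature sequence.
# input_size = dimension of each frame (e.g. 512 conv channels);
# hidden_size = 256 per direction, so each output frame has 512 dims.
rnn = nn.LSTM(input_size=512, hidden_size=256, num_layers=2,
              bidirectional=True)
# Project each frame onto the label set L' (hypothetically 37 classes:
# 26 letters + 10 digits + 1 'blank').
classifier = nn.Linear(2 * 256, 37)

x = torch.randn(26, 1, 512)        # (T, batch, features) frames
h, _ = rnn(x)                      # (T, batch, 2 * hidden)
y = classifier(h).log_softmax(-1)  # per-frame label distributions
print(y.shape)                     # torch.Size([26, 1, 37])
```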
In the recurrent layers, error differentials are propagated in the directions opposite to the arrows shown in Fig. 3.b, i.e. Back-Propagation Through Time (BPTT). At the bottom of the recurrent layers, the sequence of propagated differentials is concatenated into maps, inverting the operation of converting feature maps into feature sequences, and fed back to the convolutional layers. In practice, we create a custom network layer, called "Map-to-Sequence", as the bridge between the convolutional layers and the recurrent layers.
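In a modern autograd framework, such a bridge reduces to a reshape whose backward pass (the concatenation of differentials back into maps) is derived automatically; the sketch below is our illustration under the same PyTorch assumption as above, not the paper's actual layer:

```python
import torch
import torch.nn as nn

class MapToSequence(nn.Module):
    """Bridge between conv and recurrent layers: turns feature maps of
    shape (batch, C, H, W) into a frame sequence of shape (W, batch, C*H).
    Autograd inverts the reshape when back-propagating differentials."""
    def forward(self, maps: torch.Tensor) -> torch.Tensor:
        b, c, h, w = maps.shape
        return maps.permute(3, 0, 1, 2).reshape(w, b, c * h)

seq = MapToSequence()(torch.randn(1, 512, 1, 26))
print(seq.shape)  # torch.Size([26, 1, 512])
```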
2.3. Transcription
Transcription is the process of converting the per-frame predictions made by the RNN into a label sequence. Mathematically, transcription is to find the label sequence with the highest probability conditioned on the per-frame predictions. In practice, there exist two modes of transcription, namely lexicon-free and lexicon-based transcription. A lexicon is a set of label sequences that the prediction is constrained to, e.g. a spell-checking dictionary. In lexicon-free mode, predictions are made without any lexicon. In lexicon-based mode, predictions are made by choosing the label sequence in the lexicon that has the highest probability.
2.3.1 Probability of label sequence
We adopt the conditional probability defined in the Connectionist Temporal Classification (CTC) layer proposed by Graves et al. [15]. The probability is defined for a label sequence $\mathbf{l}$ conditioned on the per-frame predictions $y = y_1, \dots, y_T$, and it ignores the position where each label in $\mathbf{l}$ is located. Consequently, when we use the negative log-likelihood of this probability as the objective to train the network, we only need images and their corresponding label sequences, avoiding the labor of labeling the positions of individual characters.
The formulation of the conditional probability is briefly described as follows. The input is a sequence $y = y_1, \dots, y_T$, where $T$ is the sequence length. Here, each $y_t \in \mathbb{R}^{|L'|}$ is a probability distribution over the set $L' = L \cup \{\text{-}\}$, where $L$ contains all labels in the task (e.g. all English characters) and '-' denotes the 'blank' label. A sequence-to-sequence mapping function $\mathcal{B}$ is defined on sequences $\pi \in L'^{T}$, where $T$ is the length. $\mathcal{B}$ maps $\pi$ onto $\mathbf{l}$ by first removing the repeated labels, then removing the 'blank's. For example, $\mathcal{B}$ maps "--hh-e-l-ll-oo--" ('-' represents 'blank') onto "hello". Then, the conditional probability is defined as the sum of the probabilities of all $\pi$ that are mapped by $\mathcal{B}$ onto $\mathbf{l}$:
$$p(\mathbf{l} \mid y) = \sum_{\pi : \mathcal{B}(\pi) = \mathbf{l}} p(\pi \mid y) \qquad (1)$$
where the probability of $\pi$ is defined as $p(\pi \mid y) = \prod_{t=1}^{T} y^{t}_{\pi_t}$, and $y^{t}_{\pi_t}$ is the probability of having label $\pi_t$ at time stamp $t$. Directly computing Eq. 1 would be computationally infeasible due to the exponentially large number of summation items. However, Eq. 1 can be efficiently computed using the forward-backward algorithm described in [15].
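To make the definition concrete, the following minimal Python sketch (our illustration, not code from the paper) implements the mapping $\mathcal{B}$ and evaluates Eq. 1 by brute-force enumeration, which is feasible only for tiny $T$:

```python
import itertools
import numpy as np

BLANK = "-"

def B(pi: str) -> str:
    """Collapse repeated labels, then remove blanks, e.g. '--hh-e' -> 'he'."""
    collapsed = [c for i, c in enumerate(pi) if i == 0 or c != pi[i - 1]]
    return "".join(c for c in collapsed if c != BLANK)

def ctc_prob(label: str, y: np.ndarray, alphabet: str) -> float:
    """Brute-force Eq. 1: sum p(pi|y) over all pi with B(pi) == label.
    y has shape (T, |L'|); columns are ordered like `alphabet`.
    Exponential in T -- the forward-backward algorithm of [15] is the
    practical alternative."""
    T = y.shape[0]
    total = 0.0
    for pi in itertools.product(range(len(alphabet)), repeat=T):
        if B("".join(alphabet[k] for k in pi)) == label:
            total += float(np.prod([y[t, k] for t, k in enumerate(pi)]))
    return total

assert B("--hh-e-l-ll-oo--") == "hello"
```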
2.3.2 Lexicon-free transcription
In this mode, the sequence $\mathbf{l}^*$ that has the highest probability as defined in Eq. 1 is taken as the prediction. Since there exists no tractable algorithm to precisely find the solution, we use the strategy adopted in [15]: the sequence $\mathbf{l}^*$ is approximately found by $\mathbf{l}^* \approx \mathcal{B}(\arg\max_{\pi} p(\pi \mid y))$, i.e. taking the most probable label $\pi_t$ at each time stamp $t$, and mapping the resulting sequence onto $\mathbf{l}^*$.
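In code, this best-path decoding is a per-frame argmax followed by $\mathcal{B}$; the sketch below (again ours, with a hypothetical three-symbol alphabet) illustrates it:

```python
import numpy as np

ALPHABET = "-he"  # hypothetical tiny label set; index 0 is 'blank'

def best_path_decode(y: np.ndarray) -> str:
    """Take the most probable label at each time stamp, then apply B."""
    pi = "".join(ALPHABET[k] for k in y.argmax(axis=1))
    collapsed = [c for i, c in enumerate(pi) if i == 0 or c != pi[i - 1]]
    return "".join(c for c in collapsed if c != "-")

y = np.array([[0.1, 0.8, 0.1],   # 'h'
              [0.8, 0.1, 0.1],   # '-'
              [0.1, 0.1, 0.8],   # 'e'
              [0.1, 0.1, 0.8]])  # 'e' (repeat collapses)
print(best_path_decode(y))       # "he"
```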
2.3.3 Lexicon-based transcription
In lexicon-based mode, each test sample is associated with a lexicon $\mathcal{D}$. Basically, the label sequence is recognized by choosing the sequence in the lexicon that has the highest conditional probability defined in Eq. 1, i.e. $\mathbf{l}^* = \arg\max_{\mathbf{l} \in \mathcal{D}} p(\mathbf{l} \mid y)$. However, for large lexicons, e.g. the 50k-word Hunspell spell-checking dictionary [1], it would be very time-consuming to perform an exhaustive search over the lexicon, i.e. to compute Equation 1 for all sequences in the lexicon and choose the one with the highest probability. To solve this problem, we observe that the label sequences predicted via lexicon-free transcription, described in 2.3.2, are often close to the ground truth under the edit distance metric. This indicates that we can limit our search to the nearest-neighbor candidates $\mathcal{N}_\delta(\mathbf{l}')$, where $\delta$ is the maximal edit distance and $\mathbf{l}'$ is the sequence transcribed from $y$ in lexicon-free mode:
$$\mathbf{l}^* = \arg\max_{\mathbf{l} \in \mathcal{N}_\delta(\mathbf{l}')} p(\mathbf{l} \mid y) \qquad (2)$$
The candidates $\mathcal{N}_\delta(\mathbf{l}')$ can be found efficiently with the BK-tree data structure [9], which is a metric tree specifically adapted to discrete metric spaces. The search time complexity of a BK-tree is $O(\log |\mathcal{D}|)$, where $|\mathcal{D}|$ is the lexicon size, so this scheme readily extends to very large lexicons. In our approach, a BK-tree is constructed offline for a lexicon. Then we perform fast online search with the tree, by finding sequences that have an edit distance of at most $\delta$ to the query sequence.
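The following sketch (our illustration; the paper gives no code) builds a BK-tree keyed on Levenshtein distance and retrieves all lexicon entries within edit distance $\delta$ of a query:

```python
def edit_distance(a: str, b: str) -> int:
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1,
                           prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]

class BKTree:
    """Metric tree over a discrete metric: children are indexed by
    their distance to the node's word."""
    def __init__(self, word: str):
        self.word, self.children = word, {}

    def insert(self, word: str):
        d = edit_distance(word, self.word)
        if d in self.children:
            self.children[d].insert(word)
        else:
            self.children[d] = BKTree(word)

    def query(self, word: str, delta: int):
        """All stored words within edit distance delta of `word`."""
        d = edit_distance(word, self.word)
        hits = [self.word] if d <= delta else []
        # Triangle inequality: only subtrees with |d - k| <= delta can match.
        for k in range(d - delta, d + delta + 1):
            if k in self.children:
                hits.extend(self.children[k].query(word, delta))
        return hits

lexicon = ["hello", "help", "hallo", "world"]  # toy stand-in for D
tree = BKTree(lexicon[0])
for w in lexicon[1:]:
    tree.insert(w)
print(tree.query("hella", 1))  # ['hello']
```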
2.4. Network Training
Denote the training dataset by $\mathcal{X} = \{I_i, \mathbf{l}_i\}_i$, where $I_i$ is a training image and $\mathbf{l}_i$ is its ground-truth label sequence. The objective is to minimize the negative log-likelihood of the conditional probability of the ground truth:
$$\mathcal{O} = -\sum_{(I_i, \mathbf{l}_i) \in \mathcal{X}} \log p(\mathbf{l}_i \mid y_i) \qquad (3)$$
where $y_i$ is the sequence produced by the recurrent and convolutional layers from $I_i$. This objective function calculates a cost value directly from an image and its ground-truth label sequence. Therefore, the network can be end-to-end trained on pairs of images and sequences, eliminating the procedure of manually labeling all individual components in training images.
The network is trained with stochastic gradient descent (SGD). Gradients are calculated by the back-propagation algorithm. In particular, in the transcription layer, error differentials are back-propagated with the forward-backward algorithm, as described in [15]. In the recurrent layers, Back-Propagation Through Time (BPTT) is applied to calculate the error differentials.
For optimization, we use ADADELTA [37] to automatically calculate per-dimension learning rates. Compared with the conventional momentum [31] method, ADADELTA requires no manual setting of a learning rate. More importantly, we find that optimization using ADADELTA converges faster than the momentum method.
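Putting the pieces together, a single training step might look like the sketch below, assuming PyTorch; `crnn` is a hypothetical stand-in for the full convolutional-recurrent model, and `nn.CTCLoss` computes the negative log-likelihood of Eq. 3 via the forward-backward algorithm:

```python
import torch
import torch.nn as nn

# Hypothetical stand-in producing per-frame log-probabilities of shape
# (T, batch, |L'|); a placeholder for the architecture of Sec. 2.
crnn = nn.Sequential(nn.Linear(512, 37), nn.LogSoftmax(dim=-1))
ctc_loss = nn.CTCLoss(blank=0)               # Eq. 3 via forward-backward
optimizer = torch.optim.Adadelta(crnn.parameters())  # no manual LR setting

frames = torch.randn(26, 4, 512)             # (T, batch, features)
targets = torch.randint(1, 37, (4, 5))       # label sequences (no blanks)
input_lengths = torch.full((4,), 26, dtype=torch.long)
target_lengths = torch.full((4,), 5, dtype=torch.long)

optimizer.zero_grad()
log_probs = crnn(frames)
loss = ctc_loss(log_probs, targets, input_lengths, target_lengths)
loss.backward()                              # BPTT + CTC backward pass
optimizer.step()
print(float(loss))
```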
3. Experiments
To evaluate the effectiveness of the proposed CRNN model, we conducted experiments on standard benchmarks for scene text recognition and musical score recognition, which are both challenging vision tasks. The datasets and settings for training and testing are given in Sec. 3.1, the detailed settings of CRNN for scene text images are provided in Sec. 3.2, and the results with comprehensive comparisons are reported in Sec. 3.3. To further demonstrate the generality of CRNN, we verify the proposed algorithm on a music score recognition task in Sec. 3.4.
3.1. Datasets
For all the experiments on scene text recognition, we use the synthetic dataset (Synth) released by Jaderberg et al. [20] as the training data. The dataset contains 8 million training images and their corresponding ground-truth words. Such images are generated by a synthetic text engine and are highly realistic. Our network is trained on the synthetic data once, and tested on all other real-world test datasets without any fine-tuning on their training data. Even though the CRNN model is purely trained with synthetic text data, it works well on real images from standard text recognition benchmarks.
Four popular benchmarks for scene text recognition are used for performance evaluation, namely ICDAR 2003 (IC03), ICDAR 2013 (IC13), IIIT 5k-word (IIIT5k), and Street View Text (SVT).
The IC03 [27] test dataset contains 251 scene images with labeled text bounding boxes. Following Wang et al. [34], we ignore images that either contain non-alphanumeric characters or have fewer than three characters, and get a test set with 860 cropped text images. Each test image is associated with a 50-word lexicon defined by Wang et al. [34]. A full lexicon is built by combining all the per-image lexicons. In addition, we use a 50k-word lexicon consisting of the words in the Hunspell spell-checking dictionary [1].
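That filtering rule can be sketched as follows (our illustration; the function name is hypothetical):

```python
import re

def keep_word(word: str) -> bool:
    """Keep only alphanumeric ground-truth words of length >= 3,
    following the IC03 protocol of Wang et al. [34]."""
    return re.fullmatch(r"[0-9A-Za-z]{3,}", word) is not None

assert keep_word("hello") and not keep_word("on") and not keep_word("50%")
```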