Figure 5. (a) Clean musical score images collected from [2]. (b) Synthesized musical score images. (c) Real-world score images taken with a mobile phone camera.
Since we have limited training data, we use a simplified CRNN configuration in order to reduce model capacity. Different from the configuration specified in Tab. 1, the 4th and 6th convolution layers are removed, and the 2-layer bidirectional LSTM is replaced by a 2-layer single-directional LSTM. The network is trained on pairs of images and their corresponding label sequences. Two measures are used to evaluate recognition performance: 1) fragment accuracy, i.e. the percentage of score fragments correctly recognized; 2) average edit distance, i.e. the average edit distance between predicted pitch sequences and the ground truths. For comparison, we evaluate two commercial OMR engines, namely Capella Scan [3] and PhotoScore [4].
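The two measures above are straightforward to compute once predictions and ground truths are available as pitch-label sequences. Below is a minimal Python sketch of both metrics; the function names, variable names, and toy pitch sequences are illustrative and not taken from the paper's code.

```python
def edit_distance(pred, gt):
    """Levenshtein distance between two label sequences (dynamic programming)."""
    m, n = len(pred), len(gt)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        dp[i][0] = i
    for j in range(n + 1):
        dp[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == gt[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution
    return dp[m][n]

def evaluate(predictions, ground_truths):
    """Returns (fragment accuracy, average edit distance) over a dataset."""
    n = len(ground_truths)
    fragment_accuracy = sum(p == g for p, g in zip(predictions, ground_truths)) / n
    avg_edit_distance = sum(edit_distance(p, g)
                            for p, g in zip(predictions, ground_truths)) / n
    return fragment_accuracy, avg_edit_distance

# Example usage with toy pitch sequences (hypothetical labels):
preds = [["C4", "E4", "G4"], ["D4", "F4"]]
gts   = [["C4", "E4", "G4"], ["D4", "F4", "A4"]]
print(evaluate(preds, gts))  # (0.5, 0.5)
```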
Table 4. Comparison of pitch recognition accuracies among CRNN and two commercial OMR systems on the three datasets we collected. Performance is reported as fragment accuracy and average edit distance ("fragment accuracy / average edit distance").
Tab. 4 summarizes the results. The CRNN outperforms the two commercial systems by a large margin. The Capella Scan and PhotoScore systems perform reasonably well on the Clean dataset, but their performance drops significantly on synthesized and real-world data. The main reason is that they rely on robust binarization to detect staff lines and notes, and this binarization step often fails on synthesized and real-world data due to poor lighting conditions, noise corruption, and cluttered backgrounds. The CRNN, on the other hand, uses convolutional features that are highly robust to noise and distortions. Besides, the recurrent layers in CRNN can utilize contextual information in the score: each note is recognized not only by itself but also with the help of nearby notes. Consequently, some notes can be recognized by comparing them with nearby notes, e.g. by contrasting their vertical positions.
The results demonstrate the generality of CRNN, in that it can be readily applied to other image-based sequence recognition problems while requiring minimal domain knowledge. Compared with Capella Scan and PhotoScore, our CRNN-based system is still preliminary and lacks many functionalities, but it provides a new scheme for OMR and has shown promising capabilities in pitch recognition.
4. Conclusion
In this paper, we have presented a novel neural network architecture, called the Convolutional Recurrent Neural Network (CRNN), which integrates the advantages of both Convolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN). CRNN is able to take input images of varying dimensions and produce predictions of different lengths. It directly runs on coarse-level labels (e.g. words), requiring no detailed annotations for each individual element (e.g. characters) during the training phase. Moreover, since CRNN abandons the fully connected layers used in conventional neural networks, it results in a much more compact and efficient model. All these properties make CRNN an excellent approach for image-based sequence recognition.
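To make the variable-width / variable-length property concrete, the following PyTorch sketch shows the general CRNN idea: convolutional feature maps of a variable-width image are sliced column-wise into a feature sequence, passed through recurrent layers, and scored per frame for CTC training on word-level labels. The layer sizes, the `TinyCRNN` class, and the height-collapsing pooling are illustrative assumptions, not the exact configuration from the paper.

```python
import torch
import torch.nn as nn

class TinyCRNN(nn.Module):
    def __init__(self, num_classes):
        super().__init__()
        # Convolutional layers only (no fully connected layers), so the image
        # width, and hence the output sequence length, may vary per input.
        self.cnn = nn.Sequential(
            nn.Conv2d(1, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2),
            nn.Conv2d(128, 256, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, None)),  # collapse height, keep width
        )
        self.rnn = nn.LSTM(256, 256, num_layers=2, bidirectional=True)
        self.fc = nn.Linear(512, num_classes)  # per-frame scores (incl. CTC blank)

    def forward(self, images):
        # images: (batch, 1, H, W); W may differ between batches
        feats = self.cnn(images)                 # (batch, 256, 1, W')
        seq = feats.squeeze(2).permute(2, 0, 1)  # (W', batch, 256): one frame per column
        seq, _ = self.rnn(seq)                   # (W', batch, 512)
        return self.fc(seq)                      # (W', batch, num_classes) -> nn.CTCLoss

model = TinyCRNN(num_classes=37)               # e.g. 36 character classes + CTC blank
logits = model(torch.randn(4, 1, 32, 100))     # a wider image yields a longer sequence
print(logits.shape)                            # torch.Size([25, 4, 37])
```

Because the per-frame outputs feed a CTC loss, only the word-level transcription is needed as supervision, matching the coarse-level labeling described above.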
The experiments on the scene text recognition benchmarks demonstrate that CRNN achieves superior or highly competitive performance compared with conventional methods as well as other CNN- and RNN-based algorithms. This confirms the advantages of the proposed algorithm. In addition, CRNN significantly outperforms other competitors on a benchmark for Optical Music Recognition (OMR), which verifies the generality of CRNN.
In fact, CRNN is a general framework, and thus it can be applied to other domains and problems that involve sequence prediction in images, such as Chinese character recognition. Further speeding up CRNN and making it more practical in real-world applications is another direction worth exploring in the future.
Source: An End-to-End Trainable Neural Network for Image-based Sequence Recognition and Its Application to Scene Text Recognition (arXiv: 1507.05717)