Swahili-text: CUHK Launches Text Detection and Recognition Dataset for African Language Scenes

The paper presents a dataset dedicated to scene text detection and recognition in Kiswahili, a language that remains under-explored in current research. The dataset consists of 976 annotated scene images for text detection and 8,284 cropped images for recognition.

Source: Xiaofei's Algorithm Engineering Notes (WeChat public account)

Paper: The First Swahili Language Scene Text Detection and Recognition Dataset

Paper address: /abs/2405.11437
Paper code: /FadilaW/Swahili-STR-Dataset

Introduction

Today, communication relies heavily on textual content. Text is an excellent means of communication, and its influence is long-lasting. Scene text is widely available and carries a considerable amount of semantic information that helps make sense of the real world. Organizations such as newspapers, hospitals, financial services, insurers and legal institutions are increasingly digitizing their documents for practical applications. Applications such as automotive assistance, industrial automation, robot navigation, real-time scene translation, fraud detection, image retrieval and product search all rely on scene text recognition, and they continue to evolve. Understanding and interpreting the textual content in images has therefore become crucial. Moreover, text is ubiquitous in natural scenes: road signs, advertisements, posters, streets, restaurants, stores, and so on.

In recent years, researchers have made significant progress in modeling the detection and recognition of text in challenging scenarios that include blurred images, non-traditional backgrounds, changing lighting conditions, curved text, or images captured in harsh environments. However, most research has focused on widely spoken languages such as English and Chinese, with less attention and resources given to other languages in resource-limited areas such as rural India and Africa. As a result, many world languages lack appropriate datasets and tailored models, which makes it difficult to effectively address the challenges of text detection and recognition in scene images in these languages.

Swahili, also known as Kiswahili, is one of the most widely spoken languages on the African continent. More than 100 million people speak Kiswahili in several African countries, including Tanzania, Uganda, the Democratic Republic of the Congo, Burundi and Kenya. It is the official language of Tanzania and Kenya and is widely used in public administration, education and the media. Kiswahili has borrowed many words from foreign languages, including Arabic (which accounts for about 40%), Persian, Portuguese, English and German. Nevertheless, Swahili is still categorized as a low-resource language, and natural language processing tasks are limited by the scarcity of annotated data.

Although Kiswahili uses the Latin alphabet, most large Latin-script datasets focus on languages with different linguistic features, such as English. As a result, Kiswahili, a language spoken by millions of people, has no dedicated resources for optimizing and fine-tuning text detection and recognition models to fit its unique features. Table 1 lists some of the features of the language compared to English.

The main goal of this paper is to develop a comprehensive scene text dataset for Swahili: Swahili-text. This image collection is designed to meet the need for a specialized dataset, provide benchmarks for evaluating existing models, and help the research community develop new state-of-the-art methods for text detection and recognition in Swahili scenes. Swahili-text contains 976 images, mostly collected in Tanzanian cities, with the rest taken from social media. These images include store labels, advertising banners, posters and street names. Each image was manually annotated at the word level. To the best of the authors' knowledge, Swahili-text is the first comprehensive dataset developed specifically for text detection and recognition in Kiswahili scenes.

Related Work

Swahili Language Datasets for Natural Language Processing

Kiswahili is still categorized as a low-resource language, and natural language processing tasks are limited by the scarcity of annotated data. However, with the development of deep learning and language models, a growing number of datasets now support language modeling tasks. Among them, the Helsinki dataset is one of the most commonly used datasets dedicated to the linguistic study of Kiswahili. It provides both unannotated and annotated versions of a collection of Kiswahili texts and is intended to support language analysis, corpus linguistics and various research efforts related to natural language processing in Kiswahili.

Gelas et al. developed an annotated dataset for language modeling tasks. The dataset contains sentences from different Kiswahili online media platforms covering a wide range of domains such as sports, general news, family, politics and religion. In total, it contains 512,000 unique words. Shikali et al. combined this dataset with the Swahili syllabic alphabet and adapted the English word analogy dataset proposed by Mikolov et al. Barack W et al. developed the Kencorpus Swahili question answering dataset (KenSwQuAD), which aims to address the scarcity of question answering datasets in low-resource languages, especially Kiswahili, and to improve machine understanding of natural language, with applications to tasks such as Internet search and dialog systems for Kiswahili speakers. Alexander R et al. instead focused on the lack of speech datasets in low-resource languages such as Kiswahili, particularly for spoken digit recognition. Their study developed a spoken digit dataset for Kiswahili and investigated the impact of cross-lingual and multilingual pre-training methods on spoken digit recognition.

These datasets are intended to facilitate research in Swahili language modeling and natural language processing. However, no comprehensive dataset of annotated Swahili scene text images currently exists for scene text detection and recognition.

Latin Script Scene Text Datasets

The field of scene text recognition is influenced by standard datasets that allow researchers to save a lot of time and effort in collecting and annotating data. The popular datasets related to Latin alphabet scene text recognition are the following:

The ICDAR datasets are very popular in the field of document analysis and recognition. The ICDAR 2013 dataset contains 462 high-resolution images of natural scenes such as outdoor scenes, signs, and posters. It introduces the challenges of multi-oriented text, different lighting conditions, and mixed fonts and text sizes to facilitate the development of robust text recognition algorithms. The ICDAR 2015 incidental scene text dataset contains 1,670 images captured with Google Glass. It includes incidental scene text with non-traditional text shapes, curved text, and text in different languages.
The Total-text dataset was proposed for multi-oriented and curved text problems. It contains images with text in different orientations, mainly curved text.
The MSRA-TD500 dataset combines English and Chinese text and is also very popular. It contains 500 arbitrarily oriented images of real-world scenes, annotated at the sentence level.

In addition to these Latin-script datasets, several multilingual datasets have been proposed for text recognition in multilingual scenes.

However, most of these datasets do not include Kiswahili. To the best of our knowledge, no public dataset has been created for text detection and recognition of Kiswahili scenes. While some datasets for English can be used because they use the same alphabet, they are not as effective as a dataset that is specific to Kiswahili.

Scene Text Detection and Recognition Methods

The rapid rise of deep learning has significantly impacted the field of scene text detection and recognition, making it possible to extract more powerful and discriminative features from text images.

Text detection and text recognition can be seen as two separate tasks. In the text detection phase, the goal is to identify and localize the regions of the input image where text is present. Three main approaches exist: regression-based, part-based and segmentation-based. Regression-based methods directly regress the bounding box. By casting text detection as a regression problem, the model learns to estimate the spatial extent of text instances, which makes it well suited to scenarios where text regions need to be precisely located. Part-based methods identify text parts and associate them with word bounding boxes. Segmentation-based methods combine pixel-level prediction with post-processing, using techniques such as semantic segmentation and MSER-based algorithms to detect text instances.

Text recognition involves converting detected text regions into character sequences, and there are two main approaches: connectionist temporal classification (CTC) models and attention-based models. A CTC model uses a recurrent neural network to compute the conditional probability of a label sequence from per-frame predictions. It consists of three steps: extracting features from the text region with a convolutional network, predicting the label distribution at each frame with a recurrent neural network, and a post-processing step that converts the per-frame predictions into the final label sequence.
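The post-processing step can be illustrated with greedy (best-path) CTC decoding, one common way to turn per-frame distributions into a label sequence: take the most likely class at each frame, merge consecutive repeats, then drop the blank symbol. This is a minimal sketch, not the decoder used in the paper; the function name, the class list and the toy frame probabilities are assumptions made for illustration.

```python
def ctc_greedy_decode(frame_probs, classes, blank=0):
    """Greedy CTC decoding: argmax per frame, merge repeats, drop blanks.

    frame_probs: list of per-frame probability lists, one value per class.
    classes:     list of class symbols; classes[blank] is the CTC blank.
    """
    # Most likely class index at each frame.
    best_path = [max(range(len(p)), key=p.__getitem__) for p in frame_probs]

    decoded = []
    previous = None
    for index in best_path:
        # Emit a symbol only when it differs from the previous frame
        # and is not the blank.
        if index != previous and index != blank:
            decoded.append(classes[index])
        previous = index
    return "".join(decoded)


# Toy example (assumed values): 8 frames over a 5-class alphabet.
CLASSES = ["<blank>", "a", "d", "k", "u"]
path = [2, 2, 0, 4, 3, 3, 0, 1]  # d d - u k k - a
frames = [[0.9 if i == j else 0.025 for i in range(len(CLASSES))] for j in path]
print(ctc_greedy_decode(frames, CLASSES))  # prints "duka"
```

Note that a doubled letter is only recovered when a blank separates the two occurrences, which is why the blank class is needed in the first place.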

Attention mechanisms have achieved remarkable results in computer vision, including scene text recognition. An attention mechanism focuses on the relevant part of the input, leading to more accurate character recognition in complex or changing environments. This approach uses an encoder to extract feature vectors from text regions and a decoder to generate the character sequence. Xiao et al. addressed the problem of irrelevant information produced by the attention mechanism and proposed a method for evaluating the relevance between the attention result and the query. Integrating the Attention on Attention (AoA) mechanism into a text recognition framework eliminates irrelevant attention and thus improves recognition accuracy.
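The gating idea behind AoA can be sketched in a few lines. In its general formulation, the attention result and the query are combined into an information vector and a sigmoid gate, and the output is their element-wise product, so a gate value near zero suppresses an irrelevant attention result. The sketch below follows that general formulation with plain Python lists; the function names, weight layout and toy values are assumptions, and it is not the implementation of Xiao et al.

```python
import math


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, keys, values):
    """Scaled dot-product attention for a single query vector."""
    scale = math.sqrt(len(query))
    weights = softmax([dot(query, k) / scale for k in keys])
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(dim)]


def linear(weight, bias, x):
    """Affine map: one output per weight row."""
    return [dot(row, x) + b for row, b in zip(weight, bias)]


def attention_on_attention(query, keys, values, w_info, b_info, w_gate, b_gate):
    """Gate the attention result by its relevance to the query."""
    attended = attend(query, keys, values)
    joint = query + attended  # concatenation of query and attention result
    info = linear(w_info, b_info, joint)
    gate = [1.0 / (1.0 + math.exp(-g)) for g in linear(w_gate, b_gate, joint)]
    return [i * g for i, g in zip(info, gate)]


# Toy values (assumed): the information vector copies the attention result,
# and the gate bias alone decides how much of it passes through.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[2.0, 0.0], [0.0, 2.0]]
w_info = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.0, 1.0]]
w_gate = [[0.0] * 4, [0.0] * 4]
print(attention_on_attention(query, keys, values, w_info, [0.0, 0.0], w_gate, [0.0, 0.0]))
```

In a trained model the gate weights are learned, so the gate depends on both the query and the attention result rather than on a fixed bias as in this toy setup.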

Despite significant progress in scene text detection and recognition, the lack of labeled training data remains an obstacle. Deep learning algorithms struggle to generalize to real-world scenes when large-scale datasets of annotated scene text images are scarce, especially for low-resource languages or languages that have not yet been studied.

Swahili Text Dataset

Dataset Description

The Kiswahili scene text detection and recognition dataset contains 976 images of natural scenes. The data were collected by researchers specializing in computer vision and natural language processing, as well as native Kiswahili speakers. Images came from a variety of sources, including the internet and photographs taken directly in Tanzanian cities with mobile phone cameras. This ensured a representative collection of scenes from areas where Kiswahili is spoken.

Strict quality control measures were implemented to ensure the accuracy and relevance of the collected images, with special attention paid to eliminating poorly lit and blurred images. The dataset underwent a preprocessing step to remove low-quality images, and instances with incomplete data were corrected or excluded to maintain data integrity. Each image in the dataset is stored in JPEG format.

The Kiswahili text dataset contains images depicting natural scenes with Kiswahili text elements such as street signs, street names, advertisements, store names, banners, and other identifiers commonly found in areas where Kiswahili is spoken. To facilitate the task of scene text detection and recognition, the dataset has been annotated, and the annotation process has been carried out by domain experts to ensure accurate annotation of text areas.

Figure 1 shows some of the images contained in the dataset. For the recognition task, the images in the dataset were cropped into 8,284 images. Figure 2 shows statistics for the cropped images in Swahili-text, including the number of images grouped by text length and the distribution of character occurrences.
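Statistics of this kind can be computed directly from the transcriptions of the cropped images. The following is a minimal sketch, assuming the transcriptions are already loaded as a list of strings; the function name is hypothetical and the sample words are illustrative only, not taken from the dataset.

```python
from collections import Counter


def transcription_stats(transcriptions):
    """Return (images per text length, occurrences per character)."""
    length_counts = Counter(len(text) for text in transcriptions)
    # Case is folded here; drop .lower() to count upper and lower case separately.
    char_counts = Counter(ch for text in transcriptions for ch in text.lower())
    return length_counts, char_counts


# Illustrative words only (assumption), standing in for real transcriptions.
sample = ["duka", "karibu", "dawa", "shule"]
lengths, chars = transcription_stats(sample)
for length in sorted(lengths):
    print(length, lengths[length])
print(chars.most_common(3))
```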

Annotation

To ensure the accuracy of text detection and recognition, and to evaluate the performance of the system, accurate annotation of text instances is essential. Therefore, a meticulous manual annotation method was used for the Swahili text dataset. Each text region in each image was annotated with a single bounding box to ensure accurate annotation when dealing with the various shapes and locations of the Swahili text.

The text instance annotations for each image were collected into a separate file. This file contains the bounding box coordinates of the words and the corresponding text transcriptions. A bounding box is a polygon with n points, each point having a horizontal position (for example x1) and a vertical position (for example y1). Unreadable text instances are marked with a bounding box only, so that they contribute to detection but do not take part in text recognition. Figure 3 shows a sample image with annotations.
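To make the annotation structure concrete, here is a minimal parsing sketch. The article does not specify the on-disk layout, so everything about the format below is an assumption: one instance per line, comma-separated polygon coordinates, a tab, then the transcription, with "###" as a hypothetical marker for unreadable text. Check the dataset repository for the real layout before relying on it.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

UNREADABLE = "###"  # hypothetical marker for unreadable instances


@dataclass
class TextInstance:
    polygon: List[Tuple[int, int]]  # n (x, y) points
    text: Optional[str]             # None when the text is unreadable


def parse_annotation_line(line: str) -> TextInstance:
    """Parse 'x1,y1,...,xn,yn<TAB>transcription' (assumed layout)."""
    coord_part, _, text = line.rstrip("\n").partition("\t")
    numbers = [int(value) for value in coord_part.split(",")]
    if len(numbers) < 6 or len(numbers) % 2:
        raise ValueError("expected at least three (x, y) pairs")
    polygon = list(zip(numbers[0::2], numbers[1::2]))
    return TextInstance(polygon, None if text == UNREADABLE else text)


def recognition_samples(instances: List[TextInstance]) -> List[TextInstance]:
    """Keep only readable instances; detection still uses all of them."""
    return [inst for inst in instances if inst.text is not None]


# Made-up lines for illustration.
lines = ["10,20,110,20,110,60,10,60\tDUKA", "5,5,50,5,50,30,5,30\t###"]
instances = [parse_annotation_line(line) for line in lines]
print(len(instances), len(recognition_samples(instances)))
```

Keeping unreadable instances in the parsed list while filtering them only for recognition mirrors the split described above: every box counts for detection, but only transcribed boxes become recognition samples.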

Experiments

Text Detection Experiment

Text Recognition Experiment
