Deep Learning-Driven OCR System for Brahui Printed Text: Bridging the Digital Gap in Low-Resource Language Processing

Authors

  • Saba Gull
  • Nooruddin
  • Naseer Ahmed
  • Shakil Ahmed Sheikh

Abstract

Optical Character Recognition (OCR) is crucial for digitizing printed documents, yet low-resource languages such as Brahui remain underserved. Brahui, a Dravidian language spoken in Balochistan, Pakistan, is printed in the cursive Noori Nastaleeq script, which poses particular challenges for recognition: ligature dependency, positional character shaping, and diacritic complexity. This research addresses the digital gap by developing a deep learning-driven OCR framework tailored to Brahui printed text. A custom dataset of 1,000 line images was created, preprocessed, and annotated for supervised learning. A hybrid CNN–BiLSTM–CTC architecture was designed to capture spatial and sequential dependencies without requiring explicit character segmentation. The model was trained with a CTC loss function and evaluated at the character and word level, achieving 91.3% character accuracy and 86.5% word accuracy, corresponding to an 8.7% Character Error Rate (CER) and a 13.5% Word Error Rate (WER). Error analysis identified ligature confusion and diacritic misrecognition as the primary sources of error. This study establishes the first Brahui OCR corpus and baseline system, providing a foundation for language digitization, preservation, and further research in low-resource script recognition. The proposed framework demonstrates the feasibility of automated Brahui OCR and sets the stage for future extensions, including larger datasets, transformer-based architectures, and multilingual integration.

Keywords: Brahui OCR, Low-Resource Languages, Noori Nastaleeq Script, CNN–BiLSTM–CTC, Character Recognition, Word Accuracy, Digital Preservation
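
The abstract reports CER and WER alongside character and word accuracy, and the figures are complementary (91.3% + 8.7% and 86.5% + 13.5% each sum to 100%). The abstract does not state how the rates were computed, so the following is only a minimal sketch under the common assumption that each error rate is the total Levenshtein edit distance divided by the total reference length, with accuracy taken as one minus the error rate. The function names (edit_distance, error_rate, cer, wer) and the sample strings are hypothetical and not taken from the paper.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))          # distances from an empty reference
    for i, r in enumerate(ref, start=1):
        curr = [i]                             # cost of deleting i reference items
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,       # deletion
                            curr[j - 1] + 1,   # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def error_rate(refs, hyps, tokenize):
    """Corpus-level rate: summed edit distance over summed reference length."""
    errors = 0
    total = 0
    for ref, hyp in zip(refs, hyps):
        r, h = tokenize(ref), tokenize(hyp)
        errors += edit_distance(r, h)
        total += len(r)
    return errors / total if total else 0.0


def cer(refs, hyps):
    """Character Error Rate (assumed definition): units are single characters."""
    return error_rate(refs, hyps, list)


def wer(refs, hyps):
    """Word Error Rate (assumed definition): units are whitespace-separated words."""
    return error_rate(refs, hyps, str.split)


if __name__ == "__main__":
    # Hypothetical ground-truth lines and OCR outputs, for illustration only.
    references = ["abcd efgh", "ijkl"]
    predictions = ["abxd efgh", "ijkl"]
    c = cer(references, predictions)
    w = wer(references, predictions)
    print(f"CER={c:.3f}  character accuracy={1 - c:.3f}")
    print(f"WER={w:.3f}  word accuracy={1 - w:.3f}")
```

Because the distance works on any sequence, the same routine serves both metrics; only the tokenizer changes. For a cursive script, the choice of character unit (for example, whether diacritics count as separate code points) affects the CER and would need to match the convention used by the authors.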

Published

2025-09-04