COMPUTATIONAL FRAMEWORK FOR THE DESCRIPTIVE ANALYSIS AND DIGITAL PRESERVATION OF CLASSICAL URDU POETRY
Keywords:
Cultural preservation, Urdu poetry, Descriptive text analytics, Computational philology, Low resource languages
Abstract
Digital Humanities (DH) research has concentrated largely on digitally privileged texts, most of them in European languages, and very little work applies computational methods to classical literature in non-Latin scripts. Urdu is a case in point: despite its extensive poetic tradition and cultural value within South Asia, Urdu poetry has received little attention in computational literary studies. The main challenges for computational analysis of Urdu poetry are the correct handling of a right-to-left writing system, inconsistent encoding practices, and a lack of digital resources. This article presents a framework for verifiable and reproducible descriptive analysis and digital preservation of classical Urdu poetry, using selected stanzas from the Diwan-e-Ghalib of Mirza Ghalib (1797-1869) as a case study. The framework relies on methods that are transparent, reproducible, and usable in low-resource settings. Its core components are: (1) verification that all Urdu characters are encoded in UTF-8, (2) processing of the text as right-to-left, (3) rule-based tokenization, (4) lexical frequency analysis, and (5) a corpus pipeline implemented in the Python programming language.
This study analyses traditional Urdu poetry (here, the works of Ghalib) using computational methods without advanced machine learning models. This makes the approach accessible to researchers working in low-resource environments and allows them to analyse literary works effectively while supporting the long-term preservation of the Urdu poetic tradition. The findings also suggest that even elementary forms of computational analysis can be applied to other works whose themes and styles resemble those of Urdu poetry. The framework offers a model for extending Digital Humanities research beyond Western-language literary corpora.
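The five components named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the NFC normalization step, the tokenization rule (keep only Unicode letters and combining marks), and the sample couplet are all assumptions made for illustration, and the couplet is not necessarily one of the stanzas used in the study. Only the Python standard library is used.

```python
import unicodedata
from collections import Counter


def verify_utf8(raw: bytes) -> str:
    """Decode strictly as UTF-8 (raises UnicodeDecodeError on invalid bytes).

    NFC normalization is an added assumption so that visually identical
    words share one code-point sequence.
    """
    return unicodedata.normalize("NFC", raw.decode("utf-8", errors="strict"))


def is_rtl(line: str) -> bool:
    """True if the line has right-to-left letters and no left-to-right ones."""
    classes = {unicodedata.bidirectional(ch) for ch in line}
    return bool(classes & {"R", "AL"}) and "L" not in classes


def tokenize(line: str) -> list:
    """Hypothetical rule: keep letters (L*) and marks (M*), split on the rest."""
    cleaned = "".join(
        ch if unicodedata.category(ch)[0] in "LM" else " " for ch in line
    )
    return cleaned.split()


def lexical_frequencies(lines) -> Counter:
    """Count token occurrences across all lines."""
    counts = Counter()
    for line in lines:
        counts.update(tokenize(line))
    return counts


def run_pipeline(raw: bytes) -> Counter:
    """Chain the steps: verify encoding, keep RTL lines, tokenize, count."""
    text = verify_utf8(raw)
    lines = [ln for ln in text.splitlines() if ln.strip() and is_rtl(ln)]
    return lexical_frequencies(lines)


# Placeholder input: a well-known couplet attributed to Ghalib.
sample_lines = [
    "ہزاروں خواہشیں ایسی کہ ہر خواہش پہ دم نکلے",
    "بہت نکلے مرے ارمان لیکن پھر بھی کم نکلے",
]
freq = run_pipeline("\n".join(sample_lines).encode("utf-8"))

# Show the five most frequent tokens with their counts.
for token, count in freq.most_common(5):
    print(token, count)
```

The text stays in logical (storage) order throughout. Right-to-left handling here is limited to checking each character's Unicode bidirectional class, and visual rendering is left to the display layer. A fuller pipeline would probably also need rules for the zero-width non-joiner and for Urdu-specific punctuation, which this sketch treats as token separators.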
