Closing the Gap in Corpus Linguistics: Introducing a Groundbreaking Multimodal Corpus of Informal Digital English, Encompassing Texts, Emojis, and Voice Notes
Abstract
Mainstream English corpora, such as the British National Corpus (BNC) and the Corpus of Contemporary American English (COCA), focus predominantly on edited prose and formal speech. This focus leaves a gap in our understanding of everyday informal digital English, particularly the interaction between text, emojis, and short voice notes. This paper addresses the gap by proposing the design of an ethically sourced, balanced, and richly annotated multimodal corpus of private-by-default digital conversations in English, particularly those in messaging apps and group chats. It reviews existing corpus-linguistic methodologies and sets out design principles for sampling, anonymization, and multimodal annotation. It also introduces analytical pipelines for examining lexis, pragmatics, sentiment, discourse moves, and phonetic-pragmatic features in voice notes. The study formulates research questions and hypotheses, presents a reproducible methodology that includes data governance and Institutional Review Board (IRB)-ready protocols, and outlines an evaluation plan. The proposed corpus, EDDE (English Digital Discourse & Emoji), is designed to: (i) model emojis and text as co-expressive units; (ii) operationalize pragmatic functions such as stance, politeness, and mitigation; (iii) capture prosodic correlates of stance in voice notes; and (iv) provide insights relevant to English as a Foreign Language (EFL) and English as a Second Language (ESL) learners. The project aims to advance both fundamental and applied corpus linguistics by helping to close the multimodality gap in the field.
Keywords: corpus linguistics, English digital discourse, emoji, voice notes, text, digital corpus design, informal English.