Language Modeling for Speech Analytics in Under-Resourced Languages

Wills, Simone; Uys, Pieter; Heerden, Charl van; Barnard, Etienne

doi:10.21437/Interspeech.2020-1586

Different language modeling approaches are evaluated on two under-resourced, agglutinative South African languages: Sesotho and isiZulu. The two languages present different challenges to language modeling because of their respective orthographies: isiZulu is written conjunctively, whereas Sesotho is written disjunctively. Two subword modeling approaches are evaluated and shown to reduce the out-of-vocabulary (OOV) rate for isiZulu. For Sesotho, a multi-word approach is evaluated for improving automatic speech recognition (ASR) accuracy, with limited success. RNNs are also evaluated and shown to slightly improve ASR accuracy, despite the relatively small text corpora.
