ISCA Archive Eurospeech 2003

Large vocabulary ASR for spontaneous Czech in the MALACH project

Josef Psutka, Pavel Ircing, J.V. Psutka, Vlasta Radova, William J. Byrne, Jan Hajic, Jiri Mirovsky, Samuel Gustman

This paper describes large vocabulary continuous speech recognition (LVCSR) research into the automatic transcription of spontaneous Czech speech in the MALACH (Multilingual Access to Large Spoken Archives) project. The project aims to improve access to the large multilingual spoken archives collected by the Survivors of the Shoah Visual History Foundation (VHF) (www.vhf.org) by advancing the state of the art in automatic speech recognition (ASR). We describe a baseline ASR system and discuss the language modeling problems that arise because Czech is a highly inflectional language that also exhibits diglossia between its written and spontaneous forms. The task is made harder by heavily accented, emotional, and disfluent speech, along with frequent switching between languages. To overcome the limited amount of relevant language model data, we use statistical techniques to select an appropriate training corpus from a large unstructured text collection, which yields significant reductions in word error rate.
