
Unsupervised approach for Audio Categorization for Low-Resource Languages

Joint Webinar on World Summit on Automotive and Autonomous Systems and International Conference and Exhibition on Mechanical and Aerospace Engineering

September 17, 2021 | Webinar

Basanta Joshi

Tribhuvan University, Nepal

ScientificTracks Abstracts: RRJET


Low-resource languages are languages that are less studied, resource-scarce, and less computerized; in practice, they are those with relatively little data available for training conversational artificial intelligence systems. Audio samples without transcriptions are available all over the Internet and from various other sources. The dominant audio recognition and categorization systems today rely on supervised learning, which requires tens of thousands of hours of audio and manual transcription, an effort that must be repeated for each language. An unsupervised model that extracts speech representations makes it possible to learn from unlabeled data, which bodes well for resource-limited languages like Hindi, Telugu, Mandarin, Tagalog, Malay, Vietnamese, and Nepali. This work uses a model trained with the wav2vec framework, developed and open-sourced by Facebook AI Research, on raw audio across 23 Indic languages. The focus of this work is limited to the Nepali language: several hours of unlabeled Nepali audio, along with a few hours of transcribed audio, were collected for developing the model. A properly categorized news corpus was also used for training and for categorizing the recordings. Conclusion: This unsupervised approach to audio recording categorization is highly applicable to low-resource languages. Here, one of the current state-of-the-art models for automatic speech recognition has been used. First, the model is pre-trained on unlabeled data, which is more accessible; then the model can be fine-tuned on a Nepali dataset consisting of the collected transcribed audio and the news corpus. Although the accuracy of the captions generated with this approach is not yet high in its present form, the categorization achieves better accuracy.
So, with this approach, even if users are at a remote location, as long as they are connected to the Internet, autonomous vehicle control can work with voice instructions in low-resource languages.
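
The abstract does not say which classifier is trained on the categorized news corpus, so the following is only a minimal sketch under stated assumptions: a bag-of-words naive Bayes categorizer is fitted on a toy labeled news corpus and then applied to a noisy transcript. The class name, the toy English sentences, and the misspelled words that mimic recognition errors are all hypothetical. It illustrates one plausible reason categorization can remain accurate when captions are imperfect: a few correctly recognized topical words can be enough to pick the category.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace; a stand-in for a real Nepali tokenizer."""
    return text.lower().split()


class NaiveBayesCategorizer:
    """Multinomial naive Bayes with add-one smoothing over bag-of-words counts."""

    def fit(self, documents, labels):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter(labels)        # label -> number of documents
        self.vocabulary = set()
        for text, label in zip(documents, labels):
            tokens = tokenize(text)
            self.word_counts[label].update(tokens)
            self.vocabulary.update(tokens)
        return self

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        vocab_size = len(self.vocabulary)
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            total_words = sum(counts.values())
            score = math.log(n_docs / total_docs)  # log prior
            for token in tokenize(text):
                if token not in self.vocabulary:
                    continue  # ignore unseen words, e.g. recognition errors
                score += math.log((counts[token] + 1) / (total_words + vocab_size))
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical toy news corpus (English placeholders, for illustration only).
news_texts = [
    "team wins football match in final",
    "player scores goal in cricket match",
    "parliament passes budget bill",
    "minister announces election date",
]
news_labels = ["sports", "sports", "politics", "politics"]

categorizer = NaiveBayesCategorizer().fit(news_texts, news_labels)

# Hypothetical noisy transcript: "fotball" and "gol" mimic recognition errors.
noisy_transcript = "the team fotball match ended with a late gol"
predicted = categorizer.predict(noisy_transcript)
print(predicted)
```

In this toy run only "team" and "match" survive from the noisy transcript, yet both occur only in the sports articles, so the sports category scores higher. A real system would need a Nepali tokenizer and a much larger corpus.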


Basanta Joshi received a Doctor of Engineering from Osaka Sangyo University, Japan, in 2013. He earned both a Bachelor's degree in Electronics and Communication Engineering and a Master of Science in Information and Communication Engineering from the Institute of Engineering (IOE), Tribhuvan University (TU), Nepal. Currently, he is working as an Assistant Professor at the Department of Electronics and Computer Engineering, Pulchowk Campus, IOE, TU.
He is also associated with IOE as a Member of the Laboratory for ICT Research and Development. Formerly, he worked as coordinator of the Master's in Information and Communication Engineering program at IOE, as a Senior Software Engineer at D2Hawkeye, and as a Research Consultant at LogPoint. He is also actively involved in research in the field of machine learning and its application to big data, especially images and speech. He has been actively publishing national and international research papers. He is a member of NEC, NEA, IEEE, ISCA speech, and AEHIN.
