AsoSoft Speech Corpus

Speech Recognition for Kurdish

AsoSoft is the first company to work on speech recognition for the Kurdish language. We develop speech recognition, speaker recognition, and speech command software and tools for Kurdish using artificial intelligence and signal processing.

Kurdish speech data and its related resources, such as tags, are among the most important language resources required for NLP research and applications such as automatic speech recognition and speaker recognition. In this project, speech data for Central Kurdish was designed and collected for use in automatic speech recognition, speaker recognition, phonology research, dialect analysis, and similar tasks. So far, approximately 43.68 hours of speech have been recorded and transcribed to produce this corpus.

Metadata

Speaker information is provided in a table with the following fields:

Microphone type: USB/Philips/Jack/Laptop
Noise level
Gender (this label can be used for gender identification tasks): Female/Male
Dialect/city (this label can be used for phonetic analysis of the various Kurdish dialects and for dialect identification)
Age
Education
Length (total and average)

Files

For each recording, the dataset provides three files:

.wav: wave file recorded at 22.05 kHz, 16-bit, mono
.wrd: transcription in Kurdish alphabet
.phn: phonetic transcription in ASCII format
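
As a quick sanity check after downloading, the audio format stated above (22.05 kHz, 16-bit, mono) can be verified with Python's standard `wave` module. This is a minimal sketch, not part of the corpus tooling; the function names `describe_wav` and `matches_documented_format` are hypothetical helpers introduced here for illustration.

```python
import wave

# Format stated in this README: 22.05 kHz, 16-bit, mono.
EXPECTED_RATE_HZ = 22050
EXPECTED_SAMPLE_WIDTH_BYTES = 2  # 16 bits = 2 bytes per sample
EXPECTED_CHANNELS = 1


def describe_wav(path):
    """Return (sample_rate, sample_width_bytes, channels, duration_seconds)."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        width = wav.getsampwidth()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return rate, width, channels, duration


def matches_documented_format(path):
    """True if the file header matches the format documented above."""
    rate, width, channels, _ = describe_wav(path)
    return (rate, width, channels) == (
        EXPECTED_RATE_HZ,
        EXPECTED_SAMPLE_WIDTH_BYTES,
        EXPECTED_CHANNELS,
    )
```

Summing the durations returned by `describe_wav` over a folder of recordings gives the total length of that folder in seconds.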

The file name format is as follows:
SpeakerID(3digits) + Gender + RecordingDevice(Laptop/PC/Mobile) + Mic + SentenceID(3digits)
For example, 001MLU001:
SpeakerID=001, Gender=Male, RecordingDevice=Laptop, Mic=USB, and SentenceID=001
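
The naming scheme above can be split into its fields with a short regular expression. The sketch below is illustrative only. The README confirms just the codes in the example (M = Male, L = Laptop, U = USB); the other entries in the lookup tables are assumptions derived from the first letters of the listed values and should be checked against the metadata table. Unknown codes are returned unchanged.

```python
import os
import re
from typing import NamedTuple

# SpeakerID (3 digits) + Gender + RecordingDevice + Mic + SentenceID (3 digits)
NAME_PATTERN = re.compile(r"^(\d{3})([A-Z])([A-Z])([A-Z])(\d{3})$")

# ASSUMPTION: single-letter codes are the initials of the values listed in
# this README. Only M (Male), L (Laptop) and U (USB) appear in its example.
GENDER = {"M": "Male", "F": "Female"}
DEVICE = {"L": "Laptop", "P": "PC", "M": "Mobile"}
MIC = {"U": "USB", "P": "Philips", "J": "Jack", "L": "Laptop"}


class RecordingName(NamedTuple):
    speaker_id: str
    gender: str
    device: str
    mic: str
    sentence_id: str


def parse_recording_name(file_name):
    """Parse e.g. '001MLU001' or 'path/001MLU001.wav' into its fields."""
    stem = os.path.splitext(os.path.basename(file_name))[0]
    match = NAME_PATTERN.match(stem)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name!r}")
    speaker, gender, device, mic, sentence = match.groups()
    return RecordingName(
        speaker_id=speaker,
        gender=GENDER.get(gender, gender),  # fall back to the raw code
        device=DEVICE.get(device, device),
        mic=MIC.get(mic, mic),
        sentence_id=sentence,
    )
```

Because the extension is stripped before matching, the same function works for the .wav, .wrd, and .phn files of a recording.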

Download

A subset of the AsoSoft Speech Corpus is available for research and non-commercial use from AsoSoft's repository on GitHub. The subset can be used for spoken language processing tasks in Central Kurdish such as speech recognition, speaker recognition, gender identification, and phonetic analysis.

The subset includes 45 speakers, each of whom uttered the same 72 sentences: the first 70 sentences of the AsoSoft Speech Corpus (sentences 1 to 70) and the last two (sentences 699 and 700). Each of the last two sentences covers all Central Kurdish phonemes. The full corpus contains 700 sentences per speaker. The sentences were manually designed to represent the phonetic characteristics of Central Kurdish. The dataset was recorded in 2016.
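
The figures above imply how many recordings the subset should contain, which can be useful for checking a download. The following sketch works through the arithmetic. It assumes every speaker has a complete set of recordings and that each recording ships with all three files, which this README does not explicitly state.

```python
SPEAKERS_IN_SUBSET = 45
FILES_PER_RECORDING = 3  # .wav, .wrd, .phn

# Sentences 1 to 70, plus the two phoneme-covering sentences 699 and 700.
subset_sentence_ids = list(range(1, 71)) + [699, 700]

# SentenceID is written with three digits in file names, e.g. "001".
sentence_id_codes = [f"{sid:03d}" for sid in subset_sentence_ids]

# ASSUMPTION: no speaker is missing any sentence.
expected_recordings = SPEAKERS_IN_SUBSET * len(subset_sentence_ids)
expected_files = expected_recordings * FILES_PER_RECORDING

print(len(subset_sentence_ids), expected_recordings, expected_files)
```

A lower count in a local copy would point either to an incomplete download or to speakers with missing sentences.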

Cite

If you use this corpus, please cite the following reference:

@article{veisi2021jira,
    title={Jira: a Kurdish Speech Recognition System Designing and Building Speech Corpus and Pronunciation Lexicon},
    author={Veisi, Hadi and Hosseini, Hawre and Mohammadamini, Mohammad and Fathy, Wirya and Mahmudi, Aso},
    journal={arXiv preprint arXiv:2102.07412},
    year={2021}
}