AsoSoft Text Corpus

A text corpus is a large, structured collection of text documents, some or all of which are annotated. It is a significant language resource used in a variety of NLP research areas and applications. For instance, it can be used in information retrieval systems, or to extract the language model and lexicon used in an automatic speech recognition system. The AsoSoft text corpus is the largest Kurdish text corpus so far. The first version of the AsoSoft text corpus contains 190 million tokens in 458 thousand documents. The sources of the collected documents include, but are not limited to, websites, books, and magazines. The documents in the corpus have been converted into the standard TEI format.

Applications

The AsoSoft text corpus is applicable to the following research areas:

Linguistics
Lexicography
NLP and Speech Processing:

Extracting language models
Word vector representation
Topic identification
Extracting computational lexicons
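
As an illustration of the lexicon and language-model items above, here is a minimal Python sketch that builds a frequency lexicon and a maximum-likelihood bigram model from lines of text. It assumes tokens are separated by whitespace, which may not match the tokenization used for the corpus. The function names `build_counts` and `bigram_probability` are hypothetical and are not part of any AsoSoft tool.

```python
from collections import Counter, defaultdict
from typing import Dict, Iterable, Tuple


def build_counts(lines: Iterable[str]) -> Tuple[Counter, Dict[str, Counter]]:
    """Return (lexicon, following) for whitespace-tokenized lines.

    lexicon[w] is the frequency of token w.
    following[p][w] is how often token w directly follows token p in a line.
    """
    lexicon: Counter = Counter()
    following: Dict[str, Counter] = defaultdict(Counter)
    for line in lines:
        tokens = line.split()  # assumption: whitespace-separated tokens
        lexicon.update(tokens)
        for prev, word in zip(tokens, tokens[1:]):
            following[prev][word] += 1
    return lexicon, following


def bigram_probability(following: Dict[str, Counter], prev: str, word: str) -> float:
    """Maximum-likelihood estimate of P(word | prev); 0.0 if prev has no successors."""
    successors = following.get(prev)
    if not successors:
        return 0.0
    return successors[word] / sum(successors.values())
```

This sketch applies no smoothing, so unseen bigrams get probability zero. A practical language model would normally add smoothing or use a dedicated toolkit.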

How to Use

A large portion of our text corpus is publicly available for research (non-commercial use). The AsoSoft text corpus repository on GitHub includes:

AsoSoft Text Corpus Large Version: This file contains 75 million tokens.
AsoSoft Text Corpus Small Version: This file contains 5 million tokens.
AsoSoft topic annotated dataset.

Common editors for working with large text files include EmEditor, TlCorpus, and TextPad.
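
Large files can also be processed programmatically without opening them in an editor. The Python sketch below reads a file line by line, so the whole file never has to fit in memory, and counts its tokens. It assumes the file is plain UTF-8 text with whitespace-separated tokens, which should be checked against the downloaded files. The function name `stream_token_counts` is hypothetical.

```python
from collections import Counter
from typing import Tuple


def stream_token_counts(path: str, encoding: str = "utf-8") -> Tuple[int, Counter]:
    """Return (total token count, per-token frequencies) for a text file.

    The file is read one line at a time, so memory use depends on the
    vocabulary size rather than on the file size.
    """
    total = 0
    counts: Counter = Counter()
    with open(path, encoding=encoding) as handle:
        for line in handle:
            tokens = line.split()  # assumption: whitespace-separated tokens
            total += len(tokens)
            counts.update(tokens)
    return total, counts
```

For example, calling `stream_token_counts("corpus.txt")` and then `counts.most_common(10)` on the returned counter lists the ten most frequent tokens. Here `corpus.txt` is a placeholder, not the actual name of a file in the repository.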

Cite:

If you use our text corpus, please cite us.

@article{10.1093/llc/fqy074,
    author = {Veisi, Hadi and MohammadAmini, Mohammad and Hosseini, Hawre},
    title = "{Toward Kurdish language processing: Experiments in collecting and processing the AsoSoft text corpus}",
    journal = {Digital Scholarship in the Humanities},
    volume = {35},
    number = {1},
    pages = {176-193},
    year = {2019},
    month = {02},
    issn = {2055-7671},
    doi = {10.1093/llc/fqy074},
    url = {https://doi.org/10.1093/llc/fqy074}
}