Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell

Abstract

We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect. The treebank comprises 1,500 sentences, fully annotated for morpho-syntax and Universal Dependency syntax and fully translated at both the word and the sentence level, and it is made freely available. It is supplemented with 50k unlabeled sentences collected from Common Crawl and web-crawled data using intensive data-mining techniques. Preliminary experiments demonstrate its usefulness for POS tagging and dependency parsing.

Publication
In The 58th Annual Meeting of the Association for Computational Linguistics
Pedro Javier Ortiz Suárez
PhD student

I am a PhD student in Computer Science at Sorbonne Université and a member of the ALMAnaCH research team at Inria.

Related