Pedro Ortiz Suarez
Ungoliant: An Optimized Pipeline for the Generation of a Very Large-Scale Multilingual Web Corpus
We propose a new pipeline that is faster, modular, parameterizable, and well documented. We use it to create a corpus similar to OSCAR but larger and based on recent data.
Julien Abadji, Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
PDF | Cite | Source code | Data | DOI | CMLC-9 | Website | HAL
SinNer@CLEF-HIPE2020: Sinful Adaptation of SotA models for Named Entity Recognition in Historical French and German Newspapers
In this article we present the approaches developed by the Sorbonne-INRIA for NER (SinNer) team for the CLEF-HIPE 2020 challenge on Named Entity Processing on old newspapers.
Pedro Ortiz Suarez, Yoann Dupont, Gaël Lejeune, Tian Tian
PDF | Cite | Video | CEUR-WS | CLEF-HIPE-2020 | CLEF-2020 | HAL
A Monolingual Approach to Contextualized Word Embeddings for Mid-Resource Languages
We explore the impact of the training corpus on contextualized word embeddings in five mid-resource languages.
Pedro Ortiz Suarez, Laurent Romary, Benoît Sagot
PDF | Cite | Data | Project | Video | DOI | ACL Anthology | ACL 2020 | HAL | arXiv
CamemBERT: a Tasty French Language Model
We explore the impact of the training data size on a French version of RoBERTa.
Louis Martin, Benjamin Muller, Pedro Ortiz Suarez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Djamé Seddah, Benoît Sagot
PDF | Cite | Data | Project | Video | DOI | ACL Anthology | arXiv | Website | ACL 2020 | HAL
Building a User-Generated Content North-African Arabizi Treebank: Tackling Hell
We introduce the first treebank for a romanized user-generated content variety of Algerian, a North-African Arabic dialect.
Djamé Seddah, Farah Essaidi, Amal Fethi, Matthieu Futeral, Benjamin Muller, Pedro Ortiz Suarez, Benoît Sagot, Abhishek Srivastava
PDF | Cite | Video | DOI | ACL Anthology | ACL 2020
Les modèles de langue contextuels Camembert pour le Français : impact de la taille et de l'hétérogénéité des données d'entrainement
We explore the impact of the training data size and heterogeneity on French language modeling.
Louis Martin, Benjamin Muller, Pedro Ortiz Suarez, Yoann Dupont, Laurent Romary, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
PDF | Cite | Data | Project | TALN 2020 | HAL | Website
Establishing a New State-of-the-Art for French Named Entity Recognition
We convert the NER annotations of the French TreeBank to a more user-friendly format and establish a new state of the art for French NER.
Pedro Ortiz Suarez, Yoann Dupont, Benjamin Muller, Laurent Romary, Benoît Sagot
PDF | Cite | LREC 2020 | HAL | arXiv | ACL Anthology
French Contextualized Word-Embeddings with a sip of CaBeRnet: a New French Balanced Reference Corpus
We investigate the impact of different types and sizes of training corpora on language models.
Murielle Popa-Fabre, Pedro Ortiz Suarez, Benoît Sagot, Éric de la Clergerie
PDF | Cite | CMLC-8 | ACL Anthology | HAL
How OCR Performance can Impact on the Automatic Extraction of Dictionary Content Structures
We explore the impact of OCR quality on GROBID-Dictionaries models.
Mohamed Khemakhem, Ioana Galleron, Geoffrey Williams, Laurent Romary, Pedro Ortiz Suarez
PDF | Cite | Project | TEI 2019 | HAL
Asynchronous Pipeline for Processing Huge Corpora on Medium to Low Resource Infrastructures
We propose a new pipeline to filter, clean, and classify Common Crawl by language, and we publish the resulting corpus under the name OSCAR.
Pedro Ortiz Suarez, Benoît Sagot, Laurent Romary
PDF | Cite | Source code | Data | Project | Slides | DOI | CMLC-7 | Website | HAL