Crowdsourcing Dialect Characterization through Twitter

Bruno Goncalves; David Sanchez

doi:10.1371/journal.pone.0112074

Journal article, PLoS ONE, Year: 2014


Bruno Goncalves

Role: Author
PersonId : 1663
IdHAL : bgoncalves
ORCID : 0000-0001-5644-3749

Centre de Physique Théorique - UMR 7332

David Sanchez

Role: Author

IFISC

Abstract

We perform a large-scale analysis of diatopic language variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably enough, we find that the Spanish language is split into two superdialects, namely, an urban speech used across major American and Spanish cities and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.

Keywords

Big Data; Machine Learning; Linguistics; Dialectology; language variation; microblogging datasets

Domains

Physics [physics]

Main file

fetchObject.pdf (1.18 MB)

Origin: Publisher files allowed on an open archive


https://amu.hal.science/hal-01242109

Submitted on: Friday, December 11, 2015, 14:41:40

Last modified on: Tuesday, December 5, 2023, 18:08:07

Long-term archiving on: Saturday, March 12, 2016, 14:00:16

Dates and versions

hal-01242109, version 1 (11-12-2015)

Identifiers

HAL Id: hal-01242109, version 1
DOI: 10.1371/journal.pone.0112074

Cite

Bruno Goncalves, David Sanchez. Crowdsourcing Dialect Characterization through Twitter. PLoS ONE, 2014, 9 (e112074), ⟨10.1371/journal.pone.0112074⟩. ⟨hal-01242109⟩


Collections

UNIV-TLN; CNRS; UNIV-AMU

95 views

115 downloads
