Crowdsourcing Dialect Characterization through Twitter

Abstract: We perform a large-scale analysis of diatopic language variation using geotagged microblogging datasets. By collecting all Twitter messages written in Spanish over more than two years, we build a corpus from which a carefully selected list of concepts allows us to characterize Spanish varieties on a global scale. A cluster analysis proves the existence of well-defined macroregions sharing common lexical properties. Remarkably, we find that the Spanish language is split into two superdialects: an urban speech used across major American and Spanish cities, and a diverse form that encompasses rural areas and small towns. The latter can be further clustered into smaller varieties with a stronger regional character.
Document type: Journal article

Cited literature: 24 references

https://hal-amu.archives-ouvertes.fr/hal-01242109
Contributor: Administrateur Hal Amu
Submitted on: Friday, December 11, 2015 - 2:41:40 PM
Last modification on: Monday, September 23, 2019 - 3:06:02 PM
Long-term archiving on: Saturday, March 12, 2016 - 2:00:16 PM

File

fetchObject.pdf
Publisher files allowed on an open archive

Citation

Bruno Goncalves, David Sanchez. Crowdsourcing Dialect Characterization through Twitter. PLoS ONE, Public Library of Science, 2014, 9 (e112074), ⟨10.1371/journal.pone.0112074⟩. ⟨hal-01242109⟩
