The UMass Global English on Twitter Dataset

Description

Identifying the language a tweet is written in can be difficult. Tweets are very short, and they often include code-switching, where the user mixes two or more languages, or names borrowed from a different language. This dataset contains 10,502 tweets, randomly sampled from all publicly available geotagged Twitter messages and sent from 130 different countries. Each tweet is annotated for whether it is in English, not in English, or ambiguous, whether it contains code-switching, whether its language is ambiguous because of named entities, and whether it was generated automatically. The file all_annotated.tsv contains the dataset of 10,502 tweets used in the paper. Text is encoded as UTF-8. The column headings (also given in the .tsv file) are: tweet ID, ISO country code, tweet date, tweet text, definitely English, ambiguous, definitely not English, code-switched, ambiguous due to named entities, and automatically generated tweets.
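As a rough illustration of working with the file, the sketch below reads tab-separated rows and counts how many tweets carry each annotation. It rests on several assumptions that are not confirmed by this record: the columns appear in the order listed above, the short column names (tweet_id, country, and so on) are hypothetical stand-ins for the real heading strings, and a set label is stored as the string "1". The sample rows are made up for the example and are not taken from the dataset.

```python
import csv
import io
from collections import Counter

# Hypothetical short names for the ten columns, in the order the
# description lists them. The real heading strings may differ, so
# columns are addressed by position rather than by header text.
COLUMNS = [
    "tweet_id", "country", "date", "text",
    "definitely_english", "ambiguous", "definitely_not_english",
    "code_switched", "ambiguous_named_entities", "automatically_generated",
]
LABELS = COLUMNS[4:]  # the six annotation columns


def read_rows(handle, has_header=True):
    """Yield one dict per tweet from a tab-separated text handle."""
    # QUOTE_NONE: tweet text may contain stray quote characters.
    reader = csv.reader(handle, delimiter="\t", quoting=csv.QUOTE_NONE)
    if has_header:
        next(reader, None)  # skip the heading row
    for fields in reader:
        if len(fields) != len(COLUMNS):
            continue  # skip malformed lines rather than guess
        yield dict(zip(COLUMNS, fields))


def label_counts(rows, positive="1"):
    """Count tweets per annotation label.

    Assumes a set label is encoded as the string "1" (an assumption).
    """
    counts = Counter()
    for row in rows:
        for label in LABELS:
            if row[label].strip() == positive:
                counts[label] += 1
    return counts


# Made-up sample rows, only to show the expected shape of the input.
SAMPLE = (
    "\t".join(COLUMNS) + "\n"
    "1\tUS\t2013-01-01\thello world\t1\t0\t0\t0\t0\t0\n"
    "2\tMX\t2013-01-02\thola amigos\t0\t0\t1\t0\t0\t0\n"
    "3\tIN\t2013-01-03\tyaar this is so good\t1\t0\t0\t1\t0\t0\n"
)

counts = label_counts(read_rows(io.StringIO(SAMPLE)))
print(dict(counts))
```

To run the same count on the real file, one could pass an open handle instead of the sample, for example open("all_annotated.tsv", encoding="utf-8", newline=""), after checking the actual header row and label encoding against the assumptions above.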

Resource Fields

Resource Type:

dataset

Submitted By:

Eva Bacas and Matt Lavin

Date Submitted:

2020-04-24 14:54:12


Project Open Data Required Fields (version 1.1)

Modified

[No data]

Publisher

[No data]

Contact Name

[No data]

Unique Identifier

[No data]

Public Access Level

[No data]

Project Open Data Additional Fields (version 1.0)

Contact email

[No Data]

Endpoint

https://www.kaggle.com/rtatman/the-umass-global-english-on-twitter-dataset

Format

tsv

Project Open Data Required-if-Applicable Fields (version 1.1)

Access Level Comment

[No Data]

Bureau Code

[No Data]

Program Code

[No Data]

License

[No Data]

Rights

Blodgett, Su Lin, Johnny Wei, and Brendan O'Connor. "A Dataset and Classifier for Recognizing Social Media English." Proceedings

Spatial

[No Data]

Temporal

[No Data]