Humanities Data

Description

Spanish is the second most widely spoken language on Earth; more than one in 20 people alive today is a native speaker. This medium-sized corpus contains 120 million words of modern Spanish taken from the Spanish-language Wikipedia in 2010. The dataset is made up of 57 text files, each containing multiple Wikipedia articles in an XML format. The text of each article is surrounded by tags. The opening tag also carries metadata about the article, including the article’s id and title. The text “ENDOFARTICLE.” appears at the end of each article, before the closing tag.
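The file layout described above can be read with a short script. The following is a minimal sketch, not the dataset's official loader. This record does not name the tag or spell out the attribute syntax, so the tag name `doc` and the quoted `id="..."` and `title="..."` attributes are assumptions, as are the helper names `Article` and `iter_articles`. A regular expression is used instead of an XML parser because a file holding several articles one after another has no single root element.

```python
import re
from typing import Iterator, NamedTuple


class Article(NamedTuple):
    article_id: str
    title: str
    text: str


# Assumption: each article is wrapped in <doc ...> ... </doc>, with quoted
# id="..." and title="..." attributes on the opening tag. The tag name is
# not given in this record; adjust DOC_RE if the files differ.
DOC_RE = re.compile(r"<doc\b([^>]*)>(.*?)</doc>", re.DOTALL)
ATTR_RE = re.compile(r'(\w+)="([^"]*)"')
END_MARKER = "ENDOFARTICLE."


def iter_articles(raw: str) -> Iterator[Article]:
    """Yield one Article per tagged block found in the raw file contents."""
    for match in DOC_RE.finditer(raw):
        attrs = dict(ATTR_RE.findall(match.group(1)))
        body = match.group(2).strip()
        # Drop the end-of-article marker that precedes the closing tag.
        if body.endswith(END_MARKER):
            body = body[: -len(END_MARKER)].rstrip()
        yield Article(attrs.get("id", ""), attrs.get("title", ""), body)


# Tiny made-up sample in the assumed layout (not taken from the corpus).
sample = (
    '<doc id="1" title="Ejemplo">\nHola mundo.\nENDOFARTICLE.\n</doc>\n'
    '<doc id="2" title="Otro">\nSegundo texto.\nENDOFARTICLE.\n</doc>\n'
)
articles = list(iter_articles(sample))
for article in articles:
    print(article.article_id, article.title, len(article.text.split()))
```

To run this against one of the 57 files, read the file into a string first and pass it to `iter_articles`. The file encoding is not stated in this record, so check it before opening the file.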

Resource Fields

Resource Type:

dataset

Submitted By:

Eva Bacas and Matt Lavin

Date Submitted:

2020-04-24 14:54:12

Access URL:

https://www.kaggle.com/rtatman/120-million-word-spanish-corpus

Project Open Data Required Fields (version 1.1)

Modified

[No data]

Publisher

[No data]

Contact Name

[No data]

Unique Identifier

[No data]

Public Access Level

[No data]

Project Open Data Additional Fields (version 1.0)

Contact email

[No Data]

Endpoint

[No Data]

Format

xml

Project Open Data Required-if-Applicable Fields (version 1.1)

Access Level Comment

[No Data]

Bureau Code

[No Data]

Program Code

[No Data]

License

[No Data]

Rights

Samuel Reese, Gemma Boleda, Montse Cuadros, Lluís Padró, German Rigau. Wikicorpus: A Word-Sense Disambiguated Multilingual Wikip

Spatial

[No Data]

Temporal

[No Data]

120 Million Word Spanish Corpus
