Urdu-Nepali Parallel Corpus

Description

Pakistan has a rich multilingual and multicultural heritage, with about 70 spoken languages drawn from the Indo-Aryan, Indo-Iranian, Sino-Tibetan and Dravidian language families. More than half of these languages also have a written form, predominantly in the Perso-Arabic Nastalique and Arabic Naskh writing styles. Some communities use the Gujarati, Gurmukhi and Tibetan scripts, while others are still defining their writing systems. These languages exhibit a diverse set of sounds and underlying linguistic structures that are both linguistically and computationally challenging. Most of them are not well studied or well modeled, and they offer a vast training ground for researchers in linguistics and computer science.

This dataset provides resources for two languages spoken in Pakistan: Nepali and Urdu. Urdu is the national language of Pakistan, while Nepali is spoken mainly within a small immigrant community. The corpus consists of two documents, one in Nepali and one in Urdu. Each document is available with and without part-of-speech tags. Both are parallel to a common 100,000-word English source from the Penn Treebank corpus, available through the Linguistic Data Consortium (LDC). The part-of-speech tags are those used in the Penn Treebank, and additional information can be found in the included .csv file.

Resource Fields

Resource Type:

dataset

Submitted By:

Eva Bacas and Matt Lavin

Date Submitted:

2020-04-24 14:54:12


Project Open Data Required Fields (version 1.1)

Modified

[No Data]

Publisher

[No Data]

Contact Name

[No Data]

Unique Identifier

[No Data]

Public Access Level

[No Data]

Project Open Data Additional Fields (version 1.0)

Contact email

[No Data]

Endpoint

[No Data]

Format

csv,txt

Project Open Data Required-if-Applicable Fields (version 1.1)

Access Level Comment

[No Data]

Bureau Code

[No Data]

Program Code

[No Data]

License

[No Data]

Rights

This dataset was collected and made available by the Center for Language Engineering at the University of Engineering and Techno

Spatial

[No Data]

Temporal

[No Data]