HTRC Word Frequencies in English-Language Literature, 1700-1922

Description

Many of the questions scholars want to ask about large collections of text can be posed using simplified representations: for instance, a list of the words in each volume, together with their frequencies. This dataset is a first attempt to provide that information for English-language fiction, drama, and poetry published between 1700 and 1922 and contained in the HathiTrust Digital Library. The project combines two sources of information. The word counts themselves come from the HathiTrust Research Center (HTRC), which has tabulated them at the page level in 4.8 million public-domain volumes. Information about genre comes from a parallel project led by Ted Underwood and supported by the National Endowment for the Humanities and the American Council of Learned Societies. That project applied machine learning to recognize genre at the page level in 854,476 English-language volumes. Mapping genre at the page level is important because genres are almost always mixed within volumes. Volumes of poetry can have long nonfiction introductions; volumes of fiction can be followed by many pages of publishers' advertisements. Fortunately, text categories of this broad kind (fiction/nonfiction/poetry/drama/paratext) can be identified fairly accurately by statistical models.
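The idea of combining page-level word counts with page-level genre labels can be illustrated with a minimal Python sketch. It is not the code used to build the dataset: the page structure, the field names `genre` and `counts`, and the helper `volume_counts` are all hypothetical, chosen only to show how counts from pages in one genre might be summed into a volume-level word list while pages in other categories (such as paratext) are left out.

```python
from collections import Counter


def volume_counts(pages, genre):
    """Sum word counts over the pages of one volume labeled with `genre`.

    `pages` is a list of dicts with hypothetical keys "genre" and "counts";
    this layout is an assumption for illustration, not the dataset's schema.
    """
    total = Counter()
    for page in pages:
        if page["genre"] == genre:
            # Counter.update adds the counts from the mapping to the running total.
            total.update(page["counts"])
    return total


# Toy volume (invented data): a title page, two pages of fiction, and a page of ads.
pages = [
    {"genre": "paratext", "counts": {"published": 1, "london": 1}},
    {"genre": "fiction", "counts": {"the": 3, "sea": 1}},
    {"genre": "fiction", "counts": {"the": 2, "ship": 1}},
    {"genre": "paratext", "counts": {"advertisements": 1, "the": 1}},
]

fiction = volume_counts(pages, "fiction")
print(fiction.most_common())
```

In this toy example, the word "the" on the advertisement page does not contribute to the fiction total, which is the practical reason page-level genre mapping matters for volume-level frequencies.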

Resource Fields

Resource Type:

dataset

Submitted By:

Matt Lavin

Date Submitted:

2016-12-15 16:06:57


Project Open Data Required Fields (version 1.1)

Modified

November 2016

Publisher

HathiTrust Research Center

Contact Name

Ted Underwood

Unique Identifier

doi:10.13012/J8JW8BSJ

Public Access Level

[No Data]

Project Open Data Additional Fields (version 1.0)

Contact email

tunder@illinois.edu

Endpoint

[No Data]

Format

tar.gz, csv

Project Open Data Required-if-Applicable Fields (version 1.1)

Access Level Comment

[No Data]

Bureau Code

[No Data]

Program Code

[No Data]

License

[No Data]

Rights

Boris Capitanu, Ted Underwood, Peter Organisciak, Timothy Cole, Maria Janina Sarol, J. Stephen Downie (2016). The HathiTrust Res

Spatial

[No Data]

Temporal

1700-1922