"A BLAST-based, Language-agnostic Text Reuse Algorithm" data

Description

Code and sample corpus used for this article, which introduces a BLAST-based text reuse algorithm optimized for Chinese corpora. The code in this repository is under active development. It assumes the Anaconda distribution of Python 3.6 or later, with the python-Levenshtein library installed. The sample corpus comes from Christian Wittern's Kanseki repository and is used under the CC-BY-SA 4.0 license (included in the corpus.zip file). It contains material from the "histories (__)" section. The algorithm itself has been incorporated into the MARKUS online research platform.

Resource Fields

Resource Type:

dataset

Submitted By:

Eva Bacas

Date Submitted:

2020-04-17 10:22:56


Project Open Data Required Fields (version 1.1)

Modified

[No Data]

Publisher

[No Data]

Contact Name

[No Data]

Unique Identifier

[No Data]

Public Access Level

[No Data]

Project Open Data Additional Fields (version 1.0)

Contact email

[No Data]

Endpoint

[No Data]

Format

zip,py,js,md,html

Project Open Data Required-if-Applicable Fields (version 1.1)

Access Level Comment

[No Data]

Bureau Code

[No Data]

Program Code

[No Data]

License

[No Data]

Rights

Please use the data citation generated by the Dataverse.

Spatial

[No Data]

Temporal

[No Data]