CLDF dataset derived from Ugarte et al.'s "NorthPeruLex - A Lexical Dataset of Small Language Families and Isolates from Northern Peru" (forthcoming).
If you use these data please cite
- the original source
  > Ugarte, Carlos and Blum, Frederic and Ingunza, Adriano and Gonzales, Rosa and Peña, Jaime. Forthcoming. NorthPeruLex - A Lexical Dataset of Small Language Families and Isolates from Northern Peru.
- the derived dataset using the DOI of the particular released version you were using
This dataset brings together lexical data from isolates and small language families from northern Peru to investigate their historical relations.
This dataset is licensed under a CC-BY-4.0 license.
Conceptlists in Concepticon:
The first step to access all the contents of the dataset is to clone the repository and install the necessary requirements:

```
git clone https://github.com/lexibank/northperulex.git
cd northperulex
pip install -e .
```
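To quickly verify that the toolchain is available after installation, one option (a sanity check of our own, not part of the documented workflow) is:

```
cldfbench --help
```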
This includes all packages used for the conversion to CLDF (Cross-Linguistic Data Formats: https://cldf.clld.org). The NorthPeruLex dataset can also be downloaded directly as a ZIP file from this GitHub repository or from Zenodo (10.5281/zenodo.13269802). If the user wishes to perform the CLDF conversion, they can run the following command:
```
cldfbench lexibank.makecldf lexibank_northperulex.py --concepticon-version=v3.4.0 --glottolog-version=v5.2.1 --clts-version=v2.3.0
```
This command uses the cldfbench package (https://pypi.org/project/cldfbench/) with the pylexibank plug-in (https://pypi.org/project/pylexibank/) to automatically convert the data in the raw folder to CLDF, using the latest versions (at the time of publication of this dataset) of the reference catalogs: Concepticon (https://concepticon.clld.org/) for concept glosses, Glottolog (https://glottolog.org/) for language names, and CLTS (https://clts.clld.org/) for phonetic transcriptions.
The converted data is located in the cldf folder.
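Once converted, the data can be checked against the CLDF specification with the validator that ships with the pycldf package (a suggested sanity check, not part of the dataset's own workflow):

```
cldf validate cldf/cldf-metadata.json
```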
All data in the dataset is stored in tabular (CSV) files, so it can be read and manually inspected on a wide range of platforms and environments.
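As a minimal sketch of such programmatic access, the forms can be read with the pycldf package; treat this as an illustration, with the CLDF property names (languageReference, parameterReference, form) resolved via iter_rows:

```python
from pycldf import Dataset

# Load the wordlist via its CLDF metadata file.
ds = Dataset.from_metadata('cldf/cldf-metadata.json')

# iter_rows makes columns accessible by CLDF property name,
# independently of how the columns are named locally.
rows = ds.iter_rows('FormTable', 'languageReference', 'parameterReference', 'form')
for i, row in enumerate(rows):
    print(row['languageReference'], row['parameterReference'], row['form'])
    if i == 9:  # only show the first ten forms
        break
```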
We provide an analysis/Makefile that creates a wordlist as a TSV file, which can be used to manually inspect the data with the help of the EDICTOR web tool (https://edictor.org/).
To produce the file, please run the following commands:

```
cd analysis
pip install -r requirements.txt
make wordlist
```
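The resulting file can also be inspected programmatically; the sketch below uses the lingpy package and assumes that npl_data.tsv follows the standard lingpy/EDICTOR wordlist layout:

```python
from lingpy import Wordlist

# Load the wordlist produced by `make wordlist`.
wl = Wordlist('npl_data.tsv')

# Basic overview: concepts (height), languages (width), and total rows.
print(wl.height, 'concepts,', wl.width, 'languages,', len(wl), 'entries')
```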
In addition to yielding the word list file (npl_data.tsv), the Makefile also runs a script that performs the multiple sequence alignment and an automatic recognition of sound correspondence patterns. To do so, please type the following:

```
make analysis
```

The results of both processes are stored in the files npl_msaligned and npl_patterns.tsv.
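Both output files are plain TSV, so they can be skimmed without special tooling; a minimal sketch with Python's standard csv module (the exact column layout is an assumption to be checked against the actual files):

```python
import csv

# Print the header and the first five rows of the correspondence patterns.
with open('npl_patterns.tsv', encoding='utf-8') as f:
    for i, row in enumerate(csv.reader(f, delimiter='\t')):
        print(row)
        if i == 5:
            break
```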
- Varieties: 35 (linked to 35 different Glottocodes)
- Concepts: 200 (linked to 200 different Concepticon concept sets)
- Lexemes: 4,986
- Sources: 21
- Synonymy: 1.12
- Cognacy: 4,986 cognates in 3,660 cognate sets (2,905 singletons)
- Cognate Diversity: 0.72 (see the note after this list)
- Invalid lexemes: 0
- Tokens: 29,488
- Segments: 185 (0 BIPA errors, 0 CLTS sound class errors, 184 CLTS modified)
- Inventory size (avg): 30.00
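As a point of reference, cognate diversity in Lexibank datasets is conventionally derived from the counts above; under that assumption (our gloss, not a definition stated by the dataset itself), with $C$ cognate sets, $M$ concepts, and $N$ lexemes:

$$D = \frac{C - M}{N - M} = \frac{3660 - 200}{4986 - 200} \approx 0.72$$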
| Name | GitHub user | Description | Role |
|---|---|---|---|
| Carlos Ugarte | @CMUgarte | Data collector, CLDF conversion and annotation | Author, Editor |
| Frederic Blum | @FredericBlum | CLDF conversion and annotation | Author, Editor |
| Adriano Ingunza | @BadBatched | Data collector and annotation | Author |
| Rosa Gonzales | @rosalgm | Data collector and annotation | Author |
| Jaime Peña | @JaimePenat | Data collector and annotation | Author |
The following CLDF datasets are available in cldf:
- CLDF Wordlist at cldf/cldf-metadata.json