How to handle unglossed words? #158

@fmatter

Quite often, people do not gloss words such as personal names, place names, or unparsable words, so some words are present only in Primary_Text and not in Analyzed_Word or Gloss.

The most transparent way to store an example like that in CLDF is to have an empty list item in these two columns:

Primary_Text: "x y Person z"
Analyzed_Word: "x\ty\t\tz" (["x","y",None,"z"] once read by pycldf)
Gloss: "xg\tyg\t\tzg" (["xg","yg",None,"zg"])
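To make the shape of that data concrete, here is a minimal sketch in plain Python. It only mimics the reading step and does not call pycldf; `split_items` is a hypothetical helper, and the mapping of empty items to None is taken from the lists shown above.

```python
def split_items(cell, separator="\t"):
    """Split a cell on the separator, mapping empty items to None.

    Hypothetical helper: it imitates the lists shown above,
    it is not pycldf's own reading code.
    """
    return [item if item else None for item in cell.split(separator)]


primary_text = "x y Person z"
analyzed_word = split_items("x\ty\t\tz")
gloss = split_items("xg\tyg\t\tzg")

# The empty slot keeps both columns aligned with the four words
# of the primary text.
aligned = len(analyzed_word) == len(gloss) == len(primary_text.split())
```

With the empty item in place, position 2 in both lists still corresponds to "Person" in the primary text.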

This passes validation, but cldf createdb, for example, fails with TypeError: sequence item 1: expected str instance, NoneType found. As a workaround, I've been doing things like ex["Analyzed_Word"] = ["" if x is None else x for x in ex["Analyzed_Word"]] in initializedb.py scripts.
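For illustration, this is the kind of error that Python's str.join raises when a list contains None. Whether cldf createdb fails at exactly such a join is an assumption on my part. The sketch below reproduces the error in plain Python and wraps the workaround in a hypothetical helper, `fill_empty`.

```python
words = ["x", "y", None, "z"]

# str.join only accepts strings, so a None item raises TypeError.
try:
    "\t".join(words)
    message = None
except TypeError as err:
    message = str(err)


def fill_empty(items, filler=""):
    """Replace None items with a filler string (hypothetical helper)."""
    return [filler if item is None else item for item in items]


# After filling, the list can be joined back into the stored form.
joined = "\t".join(fill_empty(words))
```

Filling with "" round-trips to the same tab-separated string that was stored, so the workaround does not change the data on disk.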

Should empty items in a gloss column raise an error upon validation? If so, is the way to handle unglossed words simply to leave them out (i.e. "x\ty\tz", read as ["x","y","z"])? Or, if empty items are allowed, would it be OK for pycldf to yield "" instead of None (i.e. "x\ty\t\tz", read as ["x","y","","z"])?
