Separate repository for managing data for gendered nouns?

Right now, the implementation reads the data describing the gendered versions of every noun from a separate repository maintained by another person (https://github.com/ecmonsen/gendered_words). Because that data format is somewhat burdensome and bloated for our purposes, we read the data from a fork and convert it into a different format before use. The converted data then runs through an algorithm that "fixes" it, because it contains many problems, such as:
* links to words that are not part of the database
* words that end with a gender indicator (-men, -woman, -aunt etc.), yet have no versions for other genders or for gender-neutral individuals
* links that are one-sided
* words that have no neutral version listed, but an algorithm can easily determine one, and words that are wrongly listed as male rather than neutral
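As a rough illustration of two of these fixes (links to missing words and one-sided links), here is a minimal sketch. The dictionary layout (`gender` plus `links` per word) and both function names are assumptions made up for this example; they are neither the upstream data format nor the actual `gender_nouns` code.

```python
import copy

# Hypothetical, simplified layout: word -> its gender and its links
# to the versions of the word for other genders.
SAMPLE = {
    "actor": {"gender": "m", "links": {"f": "actress"}},
    "actress": {"gender": "f", "links": {}},          # link back is missing
    "king": {"gender": "m", "links": {"f": "queen"}},  # "queen" is not in the data
}


def drop_dangling_links(data):
    """Remove links that point to words which are not part of the data."""
    result = copy.deepcopy(data)
    for entry in result.values():
        entry["links"] = {
            gender: target
            for gender, target in entry["links"].items()
            if target in result
        }
    return result


def add_back_links(data):
    """If A links to B, make sure B links back to A (under A's gender)."""
    result = copy.deepcopy(data)
    for word, entry in data.items():
        for target in entry["links"].values():
            if target in result:
                # setdefault keeps a link that already exists untouched.
                result[target]["links"].setdefault(entry["gender"], word)
    return result


fixed = add_back_links(drop_dangling_links(SAMPLE))
```

Both helpers return new dictionaries instead of mutating their input, so the output of each step can be compared against the previous state when reviewing what a step changed.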

For further information, see [this issue I raised](https://github.com/ecmonsen/gendered_words/issues/1) as well as the `gender_nouns`-submodule of gender*render, which implements the algorithms above.

The way this is handled right now (reading from a different repository and converting to a different data format) comes with its pros and cons:
* **pro**: The repository is maintained, so we don't need to worry about maintaining its data ourselves (or so I thought, at least; I am not entirely sure about it anymore).
* **pro**: The data from the repository is exactly what we need, so there is no need to make a second repository with the same data (this did not hold true: the data seems to be made for a [very specific purpose](https://github.com/ecmonsen/regender) and does not really attempt to suit other purposes, though it obviously partially does).
* **contra**: We need to convert the data from the repository to our own data format (and I have to maintain the code for that), run it through a complicated pipeline, and make sure the result is always shipped with the implementation and always up to date with the data from the repository (we have all of this, and it works, but it is still less than ideal and somewhat bloated).
* **contra**: The data has holes that need to be fixed by algorithm (as in, 500 words for each step of the pipeline, not just 3 or 4 inconsistencies), and these fixes are unreviewed; for example, the female version of "manager" should definitely not be "womanager". The solution would be to manually go over all changes applied by the pipeline, adding those that seem correct to the original data via pull request, and adding the correct alternatives to the data in cases where the pipeline was wrong. This does, however, come with problems:
  * The maintainer of the original project might not want a "-person" version for all "-man" words, since the original project's vision is less focused on non-binary issues than this project's is; so we would need to move [the fork used by this project](https://github.com/phseiff/gendered_words) away from the state of the original data, which makes the first pro argument somewhat obsolete.
  * The data format we read from is much more cumbersome than the one we actually use, so applying changes to it is harder than applying them to data kept in the format we convert to.
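To make the "womanager" problem concrete, here is a small sketch of how a purely string-based heuristic can produce such a result, and how its proposals could be collected for manual review. The heuristic and the names `naive_female_form` and `proposals_for_review` are hypothetical; the actual pipeline steps in `gender_nouns` may work differently.

```python
def naive_female_form(word):
    """Guess a female form by swapping a leading or trailing "man" for "woman".

    Hypothetical heuristic for illustration only; returns None if the
    word contains no such marker.
    """
    if word.startswith("man"):
        return "woman" + word[len("man"):]
    if word.endswith("man"):
        return word[: -len("man")] + "woman"
    return None


def proposals_for_review(words):
    """Collect (word, proposed form) pairs so a human can check each one."""
    proposals = []
    for word in words:
        guess = naive_female_form(word)
        if guess is not None:
            proposals.append((word, guess))
    return proposals


for word, guess in proposals_for_review(["chairman", "manager", "teacher"]):
    print(f"{word} -> {guess}")
# prints:
# chairman -> chairwoman
# manager -> womanager
```

The heuristic is right for "chairman" and wrong for "manager", and nothing in the string itself tells the two cases apart, which is why every automatically applied change needs a human look before it goes into the data.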

My preferred approach would be transitioning to a different way of managing the noun data we read from, but whether this transition is necessary depends highly on whether the repository I read the data from turns out to be maintained, and what its vision and perspective is; I should eventually raise an issue asking about this.

If a change of concept turns out to be necessary here, I would prefer the following approach:
* Create a `phseiff/gender-nouns` repository (named analogously to the submodule of gender-render).
* This repository should contain only the README file and the gendered noun data in our preferred data format, with all safe steps (steps that cannot misjudge) of our pipeline applied to it.
* The code of `gender_render.gender_nouns` would be changed to read its data from the new repository, so the data format conversion step would fall away.
* The new repository's README should reference `phseiff/gender-render` and explain how to use the `gender_nouns` submodule outside the context of rendering templates (since there are many other valid use cases for it), what the data format of the gendered noun data looks like, why one needs to run the data through the pipeline before using it, and so on. It should also prominently ask people to go through the automatic changes applied by the pipeline and create pull requests with corrected versions of the changes they deem incorrect, and contain detailed information on how to do this efficiently (e.g. by turning on advanced logging for the pipeline).

This would make it easier to maintain the data, to fix its shortcomings, and for other people to participate in both; as a side effect, it would also make it easier to use gender*render for noun gendering alone (without the rest). I feel like this change would also go well with creating an extension specification for the format of the gendered noun data.
