One nagging problem with countrycode (e.g., #182, #180) is that the current approach to codelist strictly requires bidirectional one-to-one mappings.
This is problematic in cases where we want:
Russia -> RUS (iso)
USSR -> RUS (iso)
RUS -> Russia
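
To make the conflict concrete, here is a toy sketch. The data frames and column names below are made up for illustration and are not countrycode's actual codelist. It shows that one table can serve the name -> iso direction, while the reverse direction needs its own table with a single canonical name.

```r
# Hypothetical toy table (NOT the real codelist): two names share one ISO code
toy <- data.frame(
  country = c("Russia", "USSR"),
  iso3c   = c("RUS", "RUS"),
  stringsAsFactors = FALSE
)

# name -> iso3c works: both names resolve to RUS
to_iso <- toy$iso3c[match(c("Russia", "USSR"), toy$country)]

# iso3c -> name is ambiguous in the same table: RUS matches two rows
n_matches <- sum(toy$iso3c == "RUS")

# A separate (hypothetical) destination table picks one canonical name for RUS
iso_to_name <- data.frame(
  iso3c   = "RUS",
  country = "Russia",
  stringsAsFactors = FALSE
)
from_iso <- iso_to_name$country[match("RUS", iso_to_name$iso3c)]
```
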
I have been trying to find a solution forever without much result. Today, I pushed a (nearly working) branch with a potential path forward: https://github.com/vincentarelbundock/countrycode/tree/manytoone
The concept:
- A unique regex identifies every single geographic unit covered by any of the schemes in countrycode. This means, for example, that we need different regexes for Russia and USSR because Correlates of War treats them separately.
- Each destination code must be associated with one and only one regex: many-to-one
- Origin codes can be associated with more than one regex: many-to-one
- This requires that we keep separate lists of origin and destination codes. The differences between origin and destination codes are handled explicitly in a centralized location: dictionary/merge.R
- Instead of using codelist internally, we use codelist_map, which is a list of lists of data.frames. For example, if we want to convert from cowc to iso3c, we use codelist_map$cowc$iso3c, which is a data.frame with only two columns.
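
A minimal sketch of what that nested structure could look like, assuming the shape described above (origin name, then destination name, then a two-column data.frame). The contents and the `convert()` helper are hypothetical and only illustrate the lookup; the real object is built in the manytoone branch.

```r
# Hypothetical contents for illustration; only the nesting mirrors the description
codelist_map <- list(
  cowc = list(
    iso3c = data.frame(
      cowc  = c("GMY", "RUS"),   # illustrative origin codes
      iso3c = c("DEU", "RUS"),   # illustrative destination codes
      stringsAsFactors = FALSE
    )
  )
)

# Hypothetical helper: pick the two-column table for an origin/destination
# pair, then do a plain match() lookup; unmatched inputs come back as NA
convert <- function(x, origin, destination, map = codelist_map) {
  tab <- map[[origin]][[destination]]
  tab[[destination]][match(x, tab[[origin]])]
}

result <- convert(c("GMY", "RUS", "XYZ"), "cowc", "iso3c")
```

Because each origin/destination pair has its own table, the name -> iso table can hold several rows pointing at RUS while the iso -> name table holds exactly one.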
One key point, for me, is the fourth item above (reconciling origin and destination codes in one centralized place). Right now, too much still happens in the get_* functions. The get functions should just be scrapers, and users should have access to a well-documented script to see how we reconciled origin vs. destination codes.
Curious what @cjyetman thinks of this.