This implements all of the major pieces of a 5.5.1-to-7.0 converter. Some tests were performed during development, but not enough to give confidence that it is free of bugs.
Some parts are ported directly from the C converter (such as the ANSEL Charset and date and age parsing) while others are built from the ground up. The hope is that having two somewhat-separate implementations will allow me to use the two to test one another, a hope that has already resulted in a few bug fixes in the C version.
Missing but potentially desirable functionality:
- fix common 5.5.1 error of `INDI.ALIA` meaning `INDI.NAME.TYPE ALIA`
- handle 5.5's base64-encoded OBJE, generating GEDZip files
- put common extensions into a `SCHMA`

The directory ged5to7/config/ contains copies of the TSV files
from https://github.com/FamilySearch/GEDCOM/,
https://github.com/fhiso/legacy-format/,
and https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry.
These can be updated by running

    javac DownloadDefinitions.java
    java DownloadDefinitions

The above will overwrite the files in ged5to7/config/ with updated versions.
DownloadDefinitions.java is otherwise unneeded, and should not be included in distributions of the ged5to7 package.

- Detect character encodings, as documented in ELF Serialisation.
- Convert to UTF-8
- Normalize line whitespace, including stripping leading spaces
- Remove `CONC`
- Fix `@` usage
- Limit character set of cross-reference identifiers
- Normalize case of tags
- Convert `DATE`
  - replace date_phrase with `PHRASE` structure
  - replace calendar escapes with calendar tags
  - change `BC` and `B.C.` to `BCE` and remove if found in unsupported calendars
  - replace dual years with single years and `PHRASE`s
  - replace just-year dual years in unqualified date with `BET`/`AND`
- Convert `AGE`
  - change age words to canonical forms (stillborn as `0y`, child as `< 8y`, infant as `< 1y`) with `PHRASE`s
  - Normalize spacing in `AGE` payloads
  - add missing `y`
- change `SOUR` with text payload into pointer to `SOUR` with `NOTE`
- change `OBJE` with no payload to pointer to new `OBJE` record
- change `NOTE` record or `NOTE` with pointer payload into `SNOTE`
  - use heuristic to change some pointer-`NOTE` to nested-`NOTE` instead of `SNOTE`
- Convert `LANG` payloads to BCP 47 tags, using FHISO's mapping
- tag renaming, including
  - `EMAI`, `_EMAIL` → `EMAIL`
  - `FORM.TYPE` → `FORM.MEDI`
  - (deferred) `_SDATE` → `SDATE` -- `_SDATE` is also used as "accessed at" date for web resources by some applications so this change is not universally correct
  - `_UID` → `UID`
  - `_ASSO` → `ASSO`
  - `_CRE`, `_CREAT` → `CREA`
  - `_DATE` → `DATE`
  - `ASSO.RELA` → `ASSO.ROLE`
  - other?
- Enumerated values
  - Normalize case
  - Convert user-text to `PHRASE`s
- change `RFN`, `RIN`, and `AFN` to `EXID`
- change `_FSFTID`, `_APID` to `EXID`
- Convert `MEDI.FORM` payloads to media types
- Convert `FONE` and `ROMN` to `TRAN` and their `TYPE`s to BCP-47 `LANG`s
- change `FILE` payloads into URLs
  - Windows-style `\` becomes `/`
  - Windows drive letter `C:\WINDOWS` becomes `file:///c:/WINDOWS`
  - POSIX-style `/User/foo` becomes `file:///User/foo`
- remove `SUBN`, `HEAD.FILE`, `HEAD.CHAR`
- update the `GEDC.VERS` to `7.0`
- Change any illegal tag `XYZ` into `_EXT_XYZ`
  - or to `_XYZ` and add a SCHMA entry for it
  - leave unchanged under extensions
- (extra) change string-valued `INDI.ALIA` into `NAME` with `TYPE` `AKA`
- (5.5) change base64-encoded OBJE into GEDZIP
- add `SCHMA` for all used known extensions
  - add URIs (or standard tags) for all extensions from https://wiki-de.genealogy.net/GEDCOM/_Nutzerdef-Tag and http://www.gencom.org.nz/GEDCOM_tags.html
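
To illustrate the `FILE` payload rules listed above, here is a minimal sketch of how the three cases could be handled. It is not the converter's actual code: the class name `FilePayloadSketch` and the method `toUrl` are hypothetical. Leaving payloads that already carry a URL scheme untouched, and doing no percent-encoding, are simplifying assumptions of this sketch.

```java
import java.util.Locale;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the ged5to7 package.
class FilePayloadSketch {
    // A drive letter, a colon, then a slash or backslash, e.g. C:\ or d:/
    private static final Pattern DRIVE =
        Pattern.compile("^([A-Za-z]):[\\\\/](.*)$");
    // Assumption: a scheme of two or more characters (http:, file:, ...)
    // means the payload is already a URL and is left alone.
    private static final Pattern SCHEME =
        Pattern.compile("^[A-Za-z][A-Za-z0-9+.-]+:.*$");

    public static String toUrl(String payload) {
        Matcher drive = DRIVE.matcher(payload);
        if (drive.matches()) {
            // C:\WINDOWS becomes file:///c:/WINDOWS
            String rest = drive.group(2).replace('\\', '/');
            String letter = drive.group(1).toLowerCase(Locale.ROOT);
            return "file:///" + letter + ":/" + rest;
        }
        if (SCHEME.matcher(payload).matches()) {
            return payload;
        }
        // Windows-style \ becomes /
        String slashed = payload.replace('\\', '/');
        if (slashed.startsWith("/")) {
            // POSIX-style /User/foo becomes file:///User/foo
            return "file://" + slashed;
        }
        // Relative paths stay relative
        return slashed;
    }

    public static void main(String[] args) {
        System.out.println(toUrl("C:\\WINDOWS"));    // file:///c:/WINDOWS
        System.out.println(toUrl("/User/foo"));      // file:///User/foo
        System.out.println(toUrl("photos\\me.jpg")); // photos/me.jpg
    }
}
```

The drive-letter check runs first so that `C:` is never mistaken for a URL scheme.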
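
The `AGE` word conversion in the list above can be sketched in a similar way. This is only an illustration under stated assumptions, not the converter's implementation: the names `AgeWordSketch`, `Age`, and `convert` are hypothetical. Reading "add missing `y`" as "a bare number is a count of years" and using the original word as the `PHRASE` text are guesses made for the example.

```java
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of the ged5to7 package.
class AgeWordSketch {
    /** Converted payload plus optional PHRASE text (null when none is needed). */
    static final class Age {
        final String payload;
        final String phrase;
        Age(String payload, String phrase) {
            this.payload = payload;
            this.phrase = phrase;
        }
    }

    // Word-to-age mapping taken from the list above.
    private static final Map<String, String> WORDS = Map.of(
        "STILLBORN", "0y",
        "CHILD", "< 8y",
        "INFANT", "< 1y");

    public static Age convert(String raw) {
        String trimmed = raw.trim();
        String canonical = WORDS.get(trimmed.toUpperCase(Locale.ROOT));
        if (canonical != null) {
            // Assumption: the original word is kept as the PHRASE text.
            return new Age(canonical, trimmed);
        }
        if (trimmed.matches("[0-9]+")) {
            // Assumption: a bare number is a number of years.
            return new Age(trimmed + "y", null);
        }
        // Anything else is passed through unchanged in this sketch.
        return new Age(trimmed, null);
    }

    public static void main(String[] args) {
        Age a = convert("CHILD");
        System.out.println("2 AGE " + a.payload);
        if (a.phrase != null) {
            System.out.println("3 PHRASE " + a.phrase);
        }
    }
}
```

Spacing normalization of multi-part ages such as years plus months is left out here to keep the example short.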