gedcom7code/java-converter

Feature-complete but limited testing

This implements all of the major pieces of a 5.5.1-to-7.0 converter. Some tests were performed during development, but not enough to give confidence that it is bug-free.

Some parts are ported directly from the C converter (such as the ANSEL Charset and date and age parsing) while others are built from the ground up. The hope is that having two somewhat-separate implementations will allow me to use the two to test one another, a hope that has already resulted in a few bug fixes in the C version.

Missing but potentially desirable functionality:

  • fix common 5.5.1 error of INDI.ALIA meaning INDI.NAME.TYPE ALIA
  • handle 5.5's base64-encoded OBJE, generating GEDZip files
  • put common extensions into a SCHMA

Updating to new versions of GEDCOM

The directory ged5to7/config/ contains copies of the TSV files from https://github.com/FamilySearch/GEDCOM/, https://github.com/fhiso/legacy-format/, and https://www.iana.org/assignments/language-subtag-registry/language-subtag-registry. These can be updated by running

javac DownloadDefinitions.java
java DownloadDefinitions

The above will overwrite the files in ged5to7/config/ with updated versions.

DownloadDefinitions.java is otherwise unneeded, and should not be included in distributions of the ged5to7 package.

Current status

  • Detect character encodings, as documented in ELF Serialisation.
  • Convert to UTF-8
  • Normalize line whitespace, including stripping leading spaces
  • Remove CONC
  • Fix @ usage
  • Limit character set of cross-reference identifiers
  • Normalize case of tags
  • Convert DATE
    • replace date_phrase with PHRASE structure
    • replace calendar escapes with calendar tags
    • change BC and B.C. to BCE and remove if found in unsupported calendars
    • replace dual years with single years and PHRASEs
    • replace just-year dual years in unqualified date with BET/AND
  • Convert AGE
    • change age words to canonical forms (stillborn as 0y, child as < 8y, infant as < 1y) with PHRASEs
    • Normalize spacing in AGE payloads
    • add missing y
  • change SOUR with text payload into pointer to SOUR with NOTE
  • change OBJE with no payload to pointer to new OBJE record
  • change NOTE record, or NOTE with pointer payload, into SNOTE
    • use heuristic to change some pointer-NOTE to nested-NOTE instead of SNOTE
  • Convert LANG payloads to BCP 47 tags, using FHISO's mapping
  • tag renaming, including
    • EMAI, _EMAIL → EMAIL
    • FORM.TYPE → FORM.MEDI
    • (deferred) _SDATE → SDATE -- _SDATE is also used as "accessed at" date for web resources by some applications so this change is not universally correct
    • _UID → UID
    • _ASSO → ASSO
    • _CRE, _CREAT → CREA
    • _DATE → DATE
    • ASSO.RELA → ASSO.ROLE
    • other?
  • Enumerated values
    • Normalize case
    • Convert user-text to PHRASEs
  • change RFN, RIN, and AFN to EXID
  • change _FSFTID, _APID to EXID
  • Convert MEDI.FORM payloads to media types
  • Convert FONE and ROMN to TRAN and their TYPEs to BCP-47 LANGs
  • change FILE payloads into URLs
    • Windows-style \ becomes /
    • Windows drive letter C:\WINDOWS becomes file:///c:/WINDOWS
    • POSIX-style /User/foo becomes file:///User/foo
  • remove SUBN, HEAD.FILE, HEAD.CHAR
  • update the GEDC.VERS to 7.0
  • Change any illegal tag XYZ into _EXT_XYZ
    • or to _XYZ and add a SCHMA entry for it
    • leave unchanged under extensions
  • (extra) change string-valued INDI.ALIA into NAME with TYPE AKA
  • (5.5) change base64-encoded OBJE into GEDZIP
  • add SCHMA for all used known extensions
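
As an illustration of the FILE payload rules listed above, here is a minimal sketch in Java. It is not the converter's actual code: the class name FilePayloadSketch and the method toUrl are hypothetical, and the sketch assumes that payloads which are neither drive-letter paths nor absolute POSIX paths (relative paths, existing URLs) are left alone apart from the backslash replacement. It does not percent-encode spaces or other reserved characters, which a real implementation would likely need to do.

```java
public class FilePayloadSketch {

    // Hypothetical helper: turn a 5.5.1 FILE payload into a URL-style string.
    static String toUrl(String payload) {
        // Windows-style \ becomes /
        String p = payload.replace('\\', '/');

        // Windows drive letter: C:/WINDOWS becomes file:///c:/WINDOWS
        if (p.length() >= 3 && isAsciiLetter(p.charAt(0))
                && p.charAt(1) == ':' && p.charAt(2) == '/') {
            return "file:///" + Character.toLowerCase(p.charAt(0)) + p.substring(1);
        }

        // POSIX-style absolute path: /User/foo becomes file:///User/foo
        if (p.startsWith("/")) {
            return "file://" + p;
        }

        // Assumption: anything else (relative path, existing URL) is kept as-is.
        return p;
    }

    private static boolean isAsciiLetter(char c) {
        return (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z');
    }

    public static void main(String[] args) {
        System.out.println(toUrl("C:\\WINDOWS"));     // file:///c:/WINDOWS
        System.out.println(toUrl("/User/foo"));       // file:///User/foo
        System.out.println(toUrl("photos\\a.jpg"));   // photos/a.jpg
    }
}
```

The drive letter is lowercased only because the example in the list above shows it that way; whether the converter does so in every case is not stated here.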
