Skip to content

[AI] Introduce compact DAWG v3 binary format#1

Merged
sochalewski merged 1 commit into
mainfrom
dawg-v3
Jun 1, 2026
Merged

[AI] Introduce compact DAWG v3 binary format#1
sochalewski merged 1 commit into
mainfrom
dawg-v3

Conversation

@sochalewski
Copy link
Copy Markdown
Owner

Summary

This MR upgrades the generated DAWG dictionary format from v2 to v3 and updates both the DAWG generator and the app-side reader to use a more compact binary representation.

The new format adds an explicit alphabet table and stores edge labels as alphabet indexes instead of full Unicode scalar values. It also packs word-node metadata into the node edge-count field and stores edge targets as 24-bit integers. This reduces per-node and per-edge storage while preserving exact lookup, tile search, and pattern search behavior.

Technical Details

DAWG format v3

A shared DAWGFormat definition was introduced for generator and reader format constants:

  • magic: unchanged DAWG magic value
  • version: bumped to 3
  • header size: 24 bytes
  • node size: reduced from 8 to 6 bytes
  • edge size: reduced from 6 to 4 bytes
  • word flag: stored in the high bit of the packed edge-count field
  • edge-count mask: used to recover the actual outgoing edge count

The file header now stores alphabetCount in the previously reserved header field.

Alphabet table

Generated .dawg files now include a compact alphabet table immediately after the header.

The generator:

  • collects all edge scalar keys from the compact DAWG,
  • deduplicates and sorts them,
  • validates that the alphabet fits in UInt8 indexes,
  • writes each alphabet scalar as UInt16,
  • stores edge labels as alphabet indexes.

The runtime reader:

  • reads the alphabet table before node and edge tables,
  • builds a scalar → alphabet index lookup,
  • converts stored edge indexes back to scalars when reconstructing result strings.

This keeps traversal compact while still supporting Unicode scalar output.

Node encoding

Node records now store:

  • firstEdge as UInt32,
  • packedEdgeCount as UInt16.

The high bit of packedEdgeCount represents whether the node terminates a word. The remaining bits store the outgoing edge count.

This replaces the previous separate edgeCount and isWord fields, reducing node size from 8 bytes to 6 bytes.

Edge encoding

Edge records now store:

  • alphabet index as UInt8,
  • target node index as a 24-bit little-endian integer.

This replaces the previous UInt16 scalar key and UInt32 target pair, reducing edge size from 6 bytes to 4 bytes.

Format validation

The generator validates the v3 storage constraints explicitly:

  • outgoing edge counts must fit in UInt16 during compaction,
  • packed node edge counts must fit under DAWGFormat.packedEdgeCountMask because the high bit is reserved for wordFlag,
  • target node indexes must fit in 24 bits,
  • the alphabet must fit in UInt8 indexes, allowing up to 256 distinct scalar values,
  • Unicode scalars must still fit in UInt16.

Lookup behavior

The app-side DAWG reader was updated to work with alphabet indexes internally:

  • exact word lookup maps Unicode scalars to alphabet indexes before traversing edges,
  • tile search builds LetterCounter from alphabet indexes,
  • pattern search maps literal scalars to alphabet indexes,
  • ? continues to work as a single-character wildcard,
  • generated result strings are reconstructed through the alphabet table.

Words containing scalars outside the DAWG alphabet now fail early instead of attempting traversal with raw scalar values.

Dictionary data

The .dictionaries submodule was updated from 89f22c5000032ae2cbc44b9864b1cbb81862d850 to e3feadd6404cfe7f97ebca884a4eae0b334b4019.

The referenced submodule commit is matching the new reader and generator format.

Project configuration

The Xcode project was updated to include DAWGFormat.swift in the DAWGWizard target and to keep target membership exclusions consistent for the synchronized DAWGWizard folder.

Documentation

The README was updated to mention the alphabet table as part of the generated .dawg binary layout.

@sochalewski sochalewski merged commit 634f439 into main Jun 1, 2026
2 checks passed
@sochalewski sochalewski deleted the dawg-v3 branch June 1, 2026 10:52
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant