[AI] Introduce compact DAWG v3 binary format#1
Merged
Merged
Conversation
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
This MR upgrades the generated DAWG dictionary format from v2 to v3 and updates both the DAWG generator and the app-side reader to use a more compact binary representation.
The new format adds an explicit alphabet table and stores edge labels as alphabet indexes instead of full Unicode scalar values. It also packs word-node metadata into the node edge-count field and stores edge targets as 24-bit integers. This reduces per-node and per-edge storage while preserving exact lookup, tile search, and pattern search behavior.
Technical Details
DAWG format v3
A shared
DAWGFormatdefinition was introduced for generator and reader format constants:324bytes8to6bytes6to4bytesThe file header now stores
alphabetCountin the previously reserved header field.Alphabet table
Generated
.dawgfiles now include a compact alphabet table immediately after the header.The generator:
UInt8indexes,UInt16,The runtime reader:
scalar → alphabet indexlookup,This keeps traversal compact while still supporting Unicode scalar output.
Node encoding
Node records now store:
firstEdgeasUInt32,packedEdgeCountasUInt16.The high bit of
packedEdgeCountrepresents whether the node terminates a word. The remaining bits store the outgoing edge count.This replaces the previous separate
edgeCountandisWordfields, reducing node size from8bytes to6bytes.Edge encoding
Edge records now store:
UInt8,This replaces the previous
UInt16scalar key andUInt32target pair, reducing edge size from6bytes to4bytes.Format validation
The generator validates the v3 storage constraints explicitly:
UInt16during compaction,DAWGFormat.packedEdgeCountMaskbecause the high bit is reserved forwordFlag,UInt8indexes, allowing up to 256 distinct scalar values,UInt16.Lookup behavior
The app-side
DAWGreader was updated to work with alphabet indexes internally:LetterCounterfrom alphabet indexes,?continues to work as a single-character wildcard,Words containing scalars outside the DAWG alphabet now fail early instead of attempting traversal with raw scalar values.
Dictionary data
The
.dictionariessubmodule was updated from89f22c5000032ae2cbc44b9864b1cbb81862d850toe3feadd6404cfe7f97ebca884a4eae0b334b4019.The referenced submodule commit is matching the new reader and generator format.
Project configuration
The Xcode project was updated to include
DAWGFormat.swiftin theDAWGWizardtarget and to keep target membership exclusions consistent for the synchronizedDAWGWizardfolder.Documentation
The README was updated to mention the alphabet table as part of the generated
.dawgbinary layout.