Use templating to reduce the size of files metadata #1264

Open
jswelling wants to merge 17 commits into devel from welling/files_templates

@jswelling jswelling commented Apr 23, 2026

sample_templated_list_pretty.json
This PR adds airflow/dags/template_utils.py, which contains algorithms that automatically find templating opportunities in the long list of dicts that make up the files: metadata. It introduces a class, TemplateBuilder, which can transform a list of file metadata dicts into a smaller structure in which dict terms have been replaced by templates. For example, the key "description" would be replaced by "${k0}". The value of the description would be replaced by an equivalent string with templates substituted in. The dict of templates is prepended to the updated list of dicts, resulting in an overall smaller structure.

TemplateBuilder also includes a static method, expand(), which undoes the transformation.
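
To make the round trip concrete, here is a minimal toy sketch of the idea. It is not the code in template_utils.py: it templates dict keys only, the names toy_apply and toy_expand are hypothetical, and it assumes the "${k0}" placeholders follow Python's string.Template syntax.

```python
from string import Template


def toy_apply(files: list[dict[str, str]]) -> list[dict[str, str]]:
    """Replace each distinct key with a "${kN}" placeholder (keys only)."""
    distinct_keys = sorted({key for entry in files for key in entry})
    # Template dict, e.g. {"k0": "description", "k1": "rel_path"}.
    templates = {f"k{i}": key for i, key in enumerate(distinct_keys)}
    # Reverse lookup from the original key to its placeholder string.
    placeholder = {key: "${" + name + "}" for name, key in templates.items()}
    templated = [{placeholder[k]: v for k, v in entry.items()} for entry in files]
    # The template dict is prepended to the rewritten entries.
    return [templates, *templated]


def toy_expand(templated: list[dict[str, str]]) -> list[dict[str, str]]:
    """Undo toy_apply by substituting the templates back into each key."""
    templates, *entries = templated
    return [
        {Template(k).substitute(templates): v for k, v in entry.items()}
        for entry in entries
    ]


files = [
    {"description": "Raw image", "rel_path": "imgs/a.tiff"},
    {"description": "Raw image", "rel_path": "imgs/b.tiff"},
]
templated = toy_apply(files)
restored = toy_expand(templated)
```

With only two entries the templated form is not actually smaller; the saving appears when the same long keys and value fragments repeat across many entries.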

utils.make_send_status_msg_function() is modified to apply TemplateBuilder as the files metadata is constructed.

See the attached file for a sample of the template dict and a few templated files entries.

@jswelling jswelling marked this pull request as draft April 23, 2026 17:40

gesinaphillips commented Apr 23, 2026

I'm still parsing the actual logic but here are some quick things I noticed:

template_utils.py

  • Unused imports: json, pprint, defaultdict
  • base_ct param in keygen is unused
  • Typing for generate_files_template return is incorrect (should be tuple[<existing_typing>])
  • Afaik type annotating a Callable that has keyword-only arguments is not supported. To accommodate the typical_tok_len kwarg, I believe you need to use typing.Protocol like so instead of defining the custom type MappingFunc:
class MappingFunc(Protocol):
    def __call__(
        self,
        starting_list: DictList,
        key_gen: Generator[str, None, None],
        selector: str,
        typical_tok_len: int = 6,
    ) -> OrderedDict[str, str]: ...

(this was an interesting little puzzle to solve...I'm 95% sure this solution is canonical)

  • List and Dict are deprecated in favor of lowercase list and dict respectively.

@jswelling jswelling marked this pull request as ready for review April 23, 2026 18:54
@jswelling

  • Unused imports: json, pprint, defaultdict

fixed

  • base_ct param in keygen is unused

I will argue that the base_ct parameter is potentially useful as the template-finding algorithm continues to evolve.

  • Typing for generate_files_template return is incorrect (should be tuple[<existing_typing>])

fixed.

  • List and Dict are deprecated in favor of lowercase list and dict respectively.

fixed.

  • Afaik type annotating a Callable that has keyword-only arguments is not supported. To accommodate the typical_tok_len kwarg, I believe you need to use typing.Protocol like so instead of defining the custom type MappingFunc:
class MappingFunc(Protocol):
    def __call__(
        self,
        starting_list: DictList,
        key_gen: Generator[str, None, None],
        selector: str,
        typical_tok_len: int = 6,
    ) -> OrderedDict[str, str]: ...

(this was an interesting little puzzle to solve...I'm 95% sure this solution is canonical)

I'm going to have to think about this one, but it sounds easier to just make typical_tok_len required or remove it entirely.

@gesinaphillips gesinaphillips left a comment

The suffix tree approach is new to me, so I can't comment much on that, but I walked through the process end-to-end with the example data and the apply process makes sense to me.

I think there might be an issue with expand if there are $s present in the original data. I think catching any $$s in fully_template before trying to perform substitution would work, as would using safe_substitute (although that seems riskier--more surface area for bad data to creep in). Suggestions are just to try to illustrate where I see the issue--you might have a better solution in mind!
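
A small sketch of the failure mode and of the escaping fix, assuming expand relies on Python's string.Template (the mention of safe_substitute suggests this, but it is an assumption). The sample value and the single-entry template dict are made up for illustration.

```python
from string import Template

templates = {"k0": "description"}
raw_value = "costs $5 per description"  # original data containing a literal "$"

# Templating without escaping leaves a stray "$5" in the templated string.
naive = raw_value.replace("description", "${k0}")
try:
    Template(naive).substitute(templates)
    naive_failed = False
except ValueError:
    # "$5" is not a valid placeholder, so substitute() rejects the string.
    naive_failed = True

# safe_substitute() tolerates the stray "$" but also hides genuinely bad input.
lenient = Template(naive).safe_substitute(templates)

# Escaping "$" as "$$" before inserting placeholders keeps substitute() strict.
escaped = raw_value.replace("$", "$$").replace("description", "${k0}")
restored = Template(escaped).substitute(templates)
```

Escaping up front keeps the strict substitute() call, so a missing or malformed template key still raises instead of passing through silently.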

@gesinaphillips

  • base_ct param in keygen is unused

I will argue that the base_ct parameter is potentially useful as the template-finding algorithm continues to evolve.

This is a philosophical difference rather than a functional one, so I defer ;)

  • Afaik type annotating a Callable that has keyword-only arguments is not supported. To accommodate the typical_tok_len kwarg, I believe you need to use typing.Protocol like so instead of defining the custom type MappingFunc[...]

I'm going to have to think about this one, but it sounds easier to just make typical_tok_len required or remove it entirely.

Since everything currently uses that param I think making it positional/required is perfectly fine (as is removing it, since the value is uniform).
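
For reference, a sketch of what the positional/required option could look like with a plain Callable alias. The definition of DictList here is an assumption (the real alias lives in template_utils.py), and example_mapper and keygen are hypothetical stand-ins, not the PR's functions.

```python
from collections import OrderedDict
from collections.abc import Callable, Generator
from typing import Any

# Assumed shape of the PR's DictList alias.
DictList = list[dict[str, Any]]

# With typical_tok_len required and positional, Callable can express the signature.
MappingFunc = Callable[
    [DictList, Generator[str, None, None], str, int],
    OrderedDict[str, str],
]


def keygen() -> Generator[str, None, None]:
    """Hypothetical key generator yielding k0, k1, ..."""
    n = 0
    while True:
        yield f"k{n}"
        n += 1


def example_mapper(
    starting_list: DictList,
    key_gen: Generator[str, None, None],
    selector: str,
    typical_tok_len: int,
) -> OrderedDict[str, str]:
    """Map each distinct, sufficiently long value under `selector` to a fresh key."""
    mapping: OrderedDict[str, str] = OrderedDict()
    seen: set[str] = set()
    for entry in starting_list:
        value = str(entry.get(selector, ""))
        if len(value) >= typical_tok_len and value not in seen:
            seen.add(value)
            mapping[next(key_gen)] = value
    return mapping


mapper: MappingFunc = example_mapper
result = mapper(
    [{"type": "ome.tiff"}, {"type": "csv"}, {"type": "ome.tiff"}],
    keygen(),
    "type",
    6,
)
```

The trade-off is that every caller must now pass typical_tok_len explicitly, which matches the observation that all current call sites already do.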
