
Support JWST and Roman #25

Open
icweaver wants to merge 6 commits into main from strict

Conversation

@icweaver (Member) commented Mar 21, 2026

Hey folks, these are a few toggles I added to try to support some of the quirks I ran into "out in the wild". I've split them into the following three commits:

  1. Don't choke on unknown tags: dc40329
    1. This just provides a fallback for STScI's many asdf tags. Could be replaced by specific converters in the future.
    2. Also adds a metadata tag and a newer ndarray-1.1.0 tag that Roman seems to use now.
    3. This feature is disabled by default (ASDF.load(<filename.asdf>; extensions = false)) to preserve current behavior; pass extensions = true to opt in. Similar to the extensions kwarg in the Python impl.
  2. Checksum or not to checksum: 56a8c33
    1. Apparently some Roman data products store their file's checksum based on the uncompressed file instead of compressed (ASDF.jl's default). This breaks things.
    2. Added a flag to get around this, similarly to the Python impl.
  3. Blocks vs Frames: 0d425a0
    1. Roman seems to use the block layout for their Lz4 compression, instead of frame. This adds support for both, and automatically handles it using magic numbers.
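The magic-number dispatch in item 3 boils down to inspecting the first four bytes of the block payload. A minimal sketch of that check (the helper name here is mine, not the PR's):

```julia
# The LZ4 frame format begins with the magic number 0x184D2204, stored
# little-endian, i.e. the byte sequence 0x04 0x22 0x4D 0x18. Raw LZ4 blocks
# (which Python's asdf appears to write for Roman files) carry no such prefix,
# so the presence of the magic distinguishes the two layouts.
is_lz4_frame(data::AbstractVector{UInt8}) =
    length(data) >= 4 &&
    data[1] == 0x04 && data[2] == 0x22 &&
    data[3] == 0x4D && data[4] == 0x18
```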

Usage examples:
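An illustrative sketch of item 1, based on the REPL session later in this thread (the checksum flag from commit 56a8c33 has its own kwarg, whose exact name I'm omitting here):

```julia
using ASDF

# Item 1: enable the fallback for STScI's extension tags (disabled by default)
af = ASDF.load_file("docs/data/roman.asdf"; extensions = true)

# With the fallback active, tagged trees load as plain metadata and ndarrays resolve as usual
ndarray = af.metadata["roman"]["data"]
```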

Does this seem reasonable to folks? I'd especially appreciate feedback on the last point, which I have little experience with and relied heavily on Claude to come up with the needed incantations.

To-do

  • Add tests for item 3 if we end up sticking with an impl like this
  • Add Lz4 frame/block write option

@icweaver mentioned this pull request Mar 21, 2026
codecov bot commented Mar 21, 2026

Codecov Report

❌ Patch coverage is 98.11321% with 1 line in your changes missing coverage. Please review.
✅ Project coverage is 95.54%. Comparing base (c6f56b0) to head (e5b6bd4).

Files with missing lines | Patch % | Lines
src/ASDF.jl | 98.11% | 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main      #25      +/-   ##
==========================================
+ Coverage   95.28%   95.54%   +0.26%     
==========================================
  Files           1        1              
  Lines         318      359      +41     
==========================================
+ Hits          303      343      +40     
- Misses         15       16       +1     


data[3] == 0x4D && data[4] == 0x18)
return decode(LZ4FrameCodec(), data)
else
# If the data was originally created from Python's ASDF, then it will be in block instead of frame layout,
Collaborator
Can we also produce the block layout? Should we? Can Python handle the frame layout?

Member Author
Thanks for reminding me, will add that in. Looks like they can, but as a plug-in https://github.com/asdf-format/asdf-compression

Member Author

Added in e012b21

There is a layout option for NDArrayWrapper now to flip between frame and block. Does that seem like a reasonable place to control things? I guess it doesn't really make sense for other compression schemes, so I just kept the existing frame layout as the default.

From a maintainability standpoint, I don't love that there is now a hand-rolled Lz4-specific encode and decode path to accommodate Python's asdf scheme. There's also the matter of compatibility with Lz4 frame support on the Python side, but since that's still an experimental plugin, maybe that can be a problem for future us to deal with.

I am now squarely out of my comfort zone and would gladly accept any suggestions for simplifying things, haha. Thanks again for taking a look!

Collaborator

If this is only for lz4 then I would call it lz4_layout, with values :frame and :block. I am sure that other compression schemes will also want to have options in the future, e.g. specifying the compression level.

Member Author

Sounds good, just renamed here e5b6bd4

@eschnett (Collaborator)
The checksum bit sounds weird. There is a standard for checksums, and THE major player in the field gets the implementation wrong? Does this make checksums useless in practice? Well, the world is what it is...

@icweaver (Member Author) commented Mar 21, 2026

Yea, the checksum bit was bothering me too. Here's what I am seeing for the sample Roman data I tried:

julia> using ASDF, MD5

julia> af = ASDF.load_file("docs/data/roman.asdf"; extensions = true);

julia> ndarray = af.metadata["roman"]["data"];

julia> header = ndarray.lazy_block_headers.block_headers[ndarray.source + 1]
ASDF.BlockHeader(IOStream(<file docs/data/roman.asdf>), 176347, UInt8[0xd3, 0x42, 0x4c, 0x4b], 0x0030, 0x00000000, UInt8[0x6c, 0x7a, 0x34, 0x00], 0x0000000003ffe53b, 0x0000000003ffe53b, 0x0000000003fc0100, UInt8[0xef, 0x4e, 0x63, 0x45, 0xc4, 0xd6, 0xcd, 0xa0, 0xed, 0x4d, 0x14, 0x27, 0x43, 0xa7, 0xb2, 0xbc], true)

julia> block_data_start = header.position + 6 + header.header_size
176401

julia> seek(header.io, block_data_start)
IOStream(<file docs/data/roman.asdf>)

julia> data = Array{UInt8}(undef, header.used_size);

julia> nb = readbytes!(header.io, data)
67102011

julia> data_decompressed = ASDF.decode_Lz4(data);

julia> md5(data_decompressed) == header.checksum
true
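As a side note, the block_data_start arithmetic in the transcript can be sanity-checked: assuming the ASDF block layout of a 4-byte block magic (the 0xd3 'B' 'L' 'K' shown in the BlockHeader above) followed by a 2-byte header-size field, the data begins 6 bytes plus the header itself past header.position.

```julia
# Values taken from the BlockHeader in the transcript above.
position    = 176347     # header.position
header_size = 0x0030     # 48-byte header

# 4-byte block magic + 2-byte header-size field + the header itself
block_data_start = position + 4 + 2 + Int(header_size)
# → 176401, matching the REPL output
```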

I wonder if this behavior could just be an uncompressed data streaming thing for this specific instance with Roman's data products. I don't work with AWS S3, but it seems that this is a selling point for them. fwiw, the JWST data I tried behaves as expected (it's just downloaded as a regular file). Here are worked examples for both

Doc previews:

