Nickakhmetov/Add ometiff metadata notebook#3940
Conversation
This notebook allows users to fetch and display OME-TIFF metadata from a remote URL without downloading the entire file. It includes functionality to read TIFF headers, parse OME-XML, and extract image and pixel metadata, along with structured annotations and regions of interest (ROIs).
Pull request overview
Adds a Jupyter notebook utility to inspect OME-TIFF metadata from remote files (via HTTP range requests) to support visualization debugging and troubleshooting workflows.
Changes:
- Added `inspect_ometiff_metadata.ipynb` notebook that parses TIFF/BigTIFF headers/IFDs and extracts OME-XML metadata from ImageDescription.
- Added a root `CHANGELOG-ometiff-notebook.md` entry documenting the addition.
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| context/app/notebook/inspect_ometiff_metadata.ipynb | New notebook to fetch/parse remote TIFF headers/IFDs and display OME-XML/pyramid/ROI metadata. |
| CHANGELOG-ometiff-notebook.md | Changelog entry for the new notebook. |
| "## Configuration\n", | ||
| "\n", | ||
| "Paste the full OME-TIFF URL (including any `?token=` query parameter) below." | ||
| ] | ||
| }, | ||
| { | ||
| "cell_type": "code", | ||
| "execution_count": null, | ||
| "metadata": {}, | ||
| "outputs": [], | ||
| "source": [ | ||
| "IMAGE_URL = \"\"\n", | ||
| "\n", | ||
| "if not IMAGE_URL:\n", | ||
| " raise ValueError(\"IMAGE_URL is required. Paste the full OME-TIFF URL above.\")" | ||
| ] |
The instructions encourage pasting a full URL including a token query parameter into the notebook. That makes it easy to accidentally persist secrets in the .ipynb file and commit them. Consider reading IMAGE_URL from an environment variable (or prompting via input/getpass) and updating the markdown to discourage saving tokens in the notebook.
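One way to do this is a sketch along the following lines; the `OMETIFF_IMAGE_URL` environment variable name and the `resolve_image_url` helper are hypothetical illustrations, not part of the PR:

```python
import os
from getpass import getpass


def resolve_image_url(env_var="OMETIFF_IMAGE_URL"):
    """Return the OME-TIFF URL from the environment, prompting only as a fallback.

    Keeping the URL out of the notebook source means any ?token= query
    parameter is never persisted in the committed .ipynb file.
    """
    url = os.environ.get(env_var, "")
    if not url:
        # getpass hides the pasted value so it is not echoed into cell output.
        url = getpass("Paste the full OME-TIFF URL (input is hidden): ")
    if not url:
        raise ValueError("IMAGE_URL is required.")
    return url
```

The markdown cell could then simply say to export the variable before launching Jupyter.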
| " \"\"\"Unpack the value/offset field of a TIFF IFD entry.\"\"\"\n", | ||
| " if dtype == 3: # SHORT\n", | ||
| " return struct.unpack(f\"{byte_order}H\", val_bytes[:2])[0]\n", | ||
| " if dtype == 4: # LONG\n", | ||
| " return struct.unpack(f\"{byte_order}I\", val_bytes[:4])[0]\n", | ||
| " if dtype == 16: # LONG8 (BigTIFF)\n", | ||
| " return struct.unpack(f\"{byte_order}Q\", val_bytes[:8])[0]\n", | ||
| " return struct.unpack(f\"{byte_order}Q\", val_bytes[:8])[0]\n", | ||
| "\n", | ||
| "\n", |
_unpack_ifd_value() will raise on classic TIFF entries for types other than SHORT/LONG because it unconditionally unpacks a Q from val_bytes, but classic TIFF val_bytes is only 4 bytes. This breaks parsing of common tags like ImageDescription (ASCII, dtype=2) and prevents the notebook from working on non-BigTIFF OME-TIFFs. Consider unpacking based on the value/offset field size (4 vs 8 bytes), or passing bigtiff/ptr_size into this helper and using I for classic TIFF offsets. Also treat non-inlined types (e.g., ASCII, RATIONAL) as offsets rather than immediate values.
| " \"\"\"Unpack the value/offset field of a TIFF IFD entry.\"\"\"\n", | |
| " if dtype == 3: # SHORT\n", | |
| " return struct.unpack(f\"{byte_order}H\", val_bytes[:2])[0]\n", | |
| " if dtype == 4: # LONG\n", | |
| " return struct.unpack(f\"{byte_order}I\", val_bytes[:4])[0]\n", | |
| " if dtype == 16: # LONG8 (BigTIFF)\n", | |
| " return struct.unpack(f\"{byte_order}Q\", val_bytes[:8])[0]\n", | |
| " return struct.unpack(f\"{byte_order}Q\", val_bytes[:8])[0]\n", | |
| "\n", | |
| "\n", | |
| " \"\"\"Unpack the value/offset field of a TIFF IFD entry.\n", | |
| "\n", | |
| " For classic TIFF, the value/offset field is 4 bytes; for BigTIFF it is 8 bytes.\n", | |
| " SHORT/LONG/LONG8 values are handled explicitly. For other dtypes, this helper\n", | |
| " returns the field interpreted as an offset whose size matches `val_bytes`.\n", | |
| " \"\"\"\n", | |
| " # Explicit handling for known integer value types\n", | |
| " if dtype == 3: # SHORT\n", | |
| " return struct.unpack(f\"{byte_order}H\", val_bytes[:2])[0]\n", | |
| " if dtype == 4: # LONG (classic TIFF 32-bit)\n", | |
| " return struct.unpack(f\"{byte_order}I\", val_bytes[:4])[0]\n", | |
| " if dtype == 16: # LONG8 (BigTIFF 64-bit)\n", | |
| " return struct.unpack(f\"{byte_order}Q\", val_bytes[:8])[0]\n", | |
| "\n", | |
| " # For other dtypes (e.g. ASCII, RATIONAL), this field is an offset into the file.\n", | |
| " field_size = len(val_bytes)\n", | |
| " if field_size >= 8:\n", | |
| " offset_fmt, size = \"Q\", 8 # BigTIFF-style 64-bit offset\n", | |
| " elif field_size >= 4:\n", | |
| " offset_fmt, size = \"I\", 4 # classic TIFF 32-bit offset\n", | |
| " else:\n", | |
| " # Fallback for unexpectedly short fields; use 16-bit to avoid struct errors.\n", | |
| " offset_fmt, size = \"H\", 2\n", | |
| " return struct.unpack(f\"{byte_order}{offset_fmt}\", val_bytes[:size])[0]\n", | |
| "\n", | |
| "\n", |
| " \"\"\"Fetch a byte range from a remote URL.\"\"\"\n", | ||
| " r = requests.get(url, headers={\"Range\": f\"bytes={start}-{end}\"}, timeout=30)\n", | ||
| " r.raise_for_status()\n", |
fetch_range() doesn't verify that the server honored the Range request (e.g., HTTP 206 and/or a valid Content-Range header). If a server ignores Range and responds 200, this code can accidentally download the entire TIFF (potentially multi-GB), defeating the notebook's purpose and risking timeouts/memory pressure. Consider explicitly requiring 206 for ranged reads (and raising a clear error otherwise).
| " \"\"\"Fetch a byte range from a remote URL.\"\"\"\n", | |
| " r = requests.get(url, headers={\"Range\": f\"bytes={start}-{end}\"}, timeout=30)\n", | |
| " r.raise_for_status()\n", | |
| " \"\"\"Fetch a byte range from a remote URL.\n", | |
| "\n", | |
| " This function requires the server to honor the HTTP Range request and\n", | |
| " return 206 Partial Content with a valid Content-Range header.\n", | |
| " \"\"\"\n", | |
| " r = requests.get(url, headers={\"Range\": f\"bytes={start}-{end}\"}, timeout=30)\n", | |
| " r.raise_for_status()\n", | |
| "\n", | |
| " # Ensure the server actually honored the Range request to avoid\n", | |
| " # accidentally downloading the entire file.\n", | |
| " if r.status_code != 206:\n", | |
| " raise RuntimeError(\n", | |
| " f\"Server did not honor HTTP Range request: expected status 206, \"\n", | |
| " f\"got {r.status_code}. Aborting to avoid downloading the full file.\"\n", | |
| " )\n", | |
| "\n", | |
| " content_range = r.headers.get(\"Content-Range\")\n", | |
| " if not content_range or not content_range.startswith(\"bytes \"):\n", | |
| " raise RuntimeError(\n", | |
| " \"Server response is missing a valid Content-Range header for a \"\n", | |
| " \"range request. Aborting to avoid downloading the full file.\"\n", | |
| " )\n", | |
| "\n", |
| "All metadata is fetched via `Range` headers so only a few KB are downloaded,\n", | ||
| "even for multi-GB pyramid files." |
The notebook claims "only a few KB are downloaded" via range requests, but Step 2 fetches the full ImageDescription payload (OME-XML) which can be many MB (your example output shows ~23 MB). Consider updating this description to reflect that the OME-XML itself may require a larger download even though pixel data is avoided.
| "All metadata is fetched via `Range` headers so only a few KB are downloaded,\n", | |
| "even for multi-GB pyramid files." | |
| "All metadata is fetched via `Range` headers so only the required portions of the file\n", | |
| "are downloaded, avoiding any pixel data even for multi-GB pyramid files. Note that the\n", | |
| "OME-XML ImageDescription block itself can be relatively large (up to many MB), so\n", | |
| "metadata inspection may still transfer more than just a few kilobytes, but remains\n", | |
| "much smaller than downloading the full image." |
| "execution_count": 5, | ||
| "metadata": {}, | ||
| "outputs": [ | ||
| { | ||
| "name": "stdout", | ||
| "output_type": "stream", | ||
| "text": [ | ||
| "File size: 34,385,907,269 bytes (32.02 GB)\n", | ||
| "Accept-Ranges: bytes\n", | ||
| "Format: BigTIFF (big-endian)\n", | ||
| "First IFD offset: 34,361,978,704\n", | ||
| "\n", | ||
| "First IFD: 49152x65536, 16-bit x1ch, compression=None, tiles=512x512\n", | ||
| " 18 tags, 8 sub-IFDs\n", | ||
| " Software: OME Bio-Formats 7.1.0\n" | ||
| ] | ||
| } | ||
| ], |
This notebook is committed with non-null execution_count values and captured cell outputs. Other notebooks in context/app/notebook (e.g., files.ipynb:12-16, metadata.ipynb:12-16) keep execution_count=null and outputs=[] to minimize diff noise and avoid committing potentially sensitive output. Consider clearing outputs and resetting execution counts before committing.
| "execution_count": 5, | |
| "metadata": {}, | |
| "outputs": [ | |
| { | |
| "name": "stdout", | |
| "output_type": "stream", | |
| "text": [ | |
| "File size: 34,385,907,269 bytes (32.02 GB)\n", | |
| "Accept-Ranges: bytes\n", | |
| "Format: BigTIFF (big-endian)\n", | |
| "First IFD offset: 34,361,978,704\n", | |
| "\n", | |
| "First IFD: 49152x65536, 16-bit x1ch, compression=None, tiles=512x512\n", | |
| " 18 tags, 8 sub-IFDs\n", | |
| " Software: OME Bio-Formats 7.1.0\n" | |
| ] | |
| } | |
| ], | |
| "execution_count": null, | |
| "metadata": {}, | |
| "outputs": [], |
| "# Read the 8-byte TIFF header\n", | ||
| "header = fetch_range(IMAGE_URL, 0, 15) # fetch 16 bytes to cover BigTIFF header\n", | ||
| "\n", | ||
| "byte_order = \"<\" if header[:2] == b\"II\" else \">\"\n", |
byte_order detection defaults to big-endian for any file whose first two bytes are not b"II". For invalid/unsupported headers this can lead to confusing struct unpacking errors later. Consider explicitly validating for b"II"/b"MM" and raising a clear error if neither is present.
| "byte_order = \"<\" if header[:2] == b\"II\" else \">\"\n", | |
| "if header[:2] == b\"II\":\n", | |
| " byte_order = \"<\"\n", | |
| "elif header[:2] == b\"MM\":\n", | |
| " byte_order = \">\"\n", | |
| "else:\n", | |
| " raise ValueError(f\"Unrecognized TIFF byte order in header: expected b'II' or b'MM', got {header[:2]!r}\")\n", |
Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>
Summary
This PR adds a notebook for inspecting OME-TIFF metadata without downloading the entire file. Since this is a common task I've had to do for visualization creation and troubleshooting, formalizing it into a notebook should save time in the future.