v0.0.2: document support, kind filter, dir sorting by thepushkarp · Pull Request #2 · thepushkarp/mls

thepushkarp · 2026-03-04T08:15:05Z

Summary

- Document support: Pure-Rust metadata extraction for PDF, DOCX, XLSX, PPTX, ODF, CSV, TXT, and more via new document.rs module. No external tools required. New MediaKind::Document variant, DocumentInfo type, filter paths (media.doc.*, pages/author aliases), kind filter key 5, sort by pages.
- Multi-select kind filter: Keys 2-5 toggle individual kinds (video/audio/image/doc) instead of exclusive select. 1 resets to show all.
- Navigate-back selection restore: Pressing h/left to go to parent dir now lands the cursor on the directory you came from.
- Sortable folders: Folders respect the active sort key (name, size, modified) while staying above media entries. Media-only keys (duration, resolution, codec, bitrate, pages) fall back to name sort for directories.

Test plan

- cargo clippy --all-targets --all-features -- -D warnings: zero warnings
- cargo test: 300 unit + 19 integration tests pass
- cargo fmt --check: clean
- Manual: cargo run -- tui ~/some-dir, then verify that document files show with the "D" icon, kind filter toggles work, sort applies to folders, and navigate-back restores selection

Add MediaKind::Document as a new file category that bypasses ffprobe entirely, using pure Rust crates for metadata extraction. Supported formats: PDF, DOCX/DOC, XLSX/XLS, PPTX/PPT, ODT/ODS/ODP, CSV, TSV, TXT, MD.

- New document.rs module with format-specific extractors (lopdf, zip, quick-xml, cfb)
- DocumentInfo struct with page_count, word_count, line_count, sheet_count, author, title, and more
- Filter support: media.doc.* field paths + pages/author aliases
- Sort support: pages/page_count sort key
- TUI: KindFilter::Document (key 5), "D" icon, document metadata panel
- Bump package version to 0.0.2, schema version to 0.2.0
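As a rough illustration of the extension-based dispatch described above, the sketch below maps a path to a format family. The `DocFormat` enum and `classify_document` function are hypothetical names invented for this example; the actual `probe_document` in document.rs may group formats differently.

```rust
use std::path::Path;

/// Hypothetical grouping of the supported formats by container type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DocFormat {
    Pdf,
    Ooxml,     // DOCX, XLSX, PPTX (zip archive of XML parts)
    Ole2,      // legacy DOC, XLS, PPT (compound file)
    Odf,       // ODT, ODS, ODP
    TextTable, // CSV, TSV
    Text,      // TXT, MD
}

/// Picks a format family from the file extension, case-insensitively.
/// Returns None for files without an extension or with an unknown one.
pub fn classify_document(path: &Path) -> Option<DocFormat> {
    let ext = path.extension()?.to_str()?.to_ascii_lowercase();
    let format = match ext.as_str() {
        "pdf" => DocFormat::Pdf,
        "docx" | "xlsx" | "pptx" => DocFormat::Ooxml,
        "doc" | "xls" | "ppt" => DocFormat::Ole2,
        "odt" | "ods" | "odp" => DocFormat::Odf,
        "csv" | "tsv" => DocFormat::TextTable,
        "txt" | "md" => DocFormat::Text,
        _ => return None,
    };
    Some(format)
}
```

Each family would then be handed to its own extractor, so no external tool such as ffprobe is involved for these files.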

Replace exclusive KindFilter enum with a struct of per-kind booleans. Keys 2-5 now toggle individual kinds on/off; key 1 resets to show all. Footer shows checkbox indicators for each kind's active state.
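A minimal sketch of what a struct-of-booleans filter could look like. The field names, the reduced `MediaKind` enum, the `handle_key` helper, and the assignment of keys 2 to 5 in the order video, audio, image, document are assumptions based on the description; only `KindFilter::ALL` and the key semantics come from the PR text.

```rust
/// Reduced stand-in for the project's MediaKind (assumed variants).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MediaKind {
    Video,
    Audio,
    Image,
    Document,
}

/// One boolean per kind, so any combination can be active at once.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct KindFilter {
    pub video: bool,
    pub audio: bool,
    pub image: bool,
    pub document: bool,
}

impl KindFilter {
    pub const ALL: KindFilter = KindFilter {
        video: true,
        audio: true,
        image: true,
        document: true,
    };

    /// Flips the flag for a single kind, leaving the others untouched.
    pub fn toggle(&mut self, kind: MediaKind) {
        let flag = match kind {
            MediaKind::Video => &mut self.video,
            MediaKind::Audio => &mut self.audio,
            MediaKind::Image => &mut self.image,
            MediaKind::Document => &mut self.document,
        };
        *flag = !*flag;
    }

    /// True when entries of this kind should be shown.
    pub fn matches(&self, kind: MediaKind) -> bool {
        match kind {
            MediaKind::Video => self.video,
            MediaKind::Audio => self.audio,
            MediaKind::Image => self.image,
            MediaKind::Document => self.document,
        }
    }

    /// '1' resets to show all; '2'..='5' toggle one kind (assumed order).
    /// Returns false when the key is not a filter key.
    pub fn handle_key(&mut self, key: char) -> bool {
        match key {
            '1' => *self = Self::ALL,
            '2' => self.toggle(MediaKind::Video),
            '3' => self.toggle(MediaKind::Audio),
            '4' => self.toggle(MediaKind::Image),
            '5' => self.toggle(MediaKind::Document),
            _ => return false,
        }
        true
    }
}
```

Compared with an exclusive enum, this representation makes "video and documents only" expressible without adding a variant per combination.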

Save current_dir before navigating to parent, then restore cursor to the child directory in the new dir_items list. Also fix apply_filter clamping to use visible_count() (dirs + media) instead of filtered_indices.len() (media only).
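The two fixes in this commit can be sketched as a pair of small helpers. Both function names and signatures are hypothetical; the real code operates on the `App` state rather than on plain slices and counts.

```rust
use std::path::{Path, PathBuf};

/// After moving to the parent, find the directory we came from in the
/// parent's listing. Falls back to the first row if it is not present.
pub fn cursor_after_going_up(parent_dirs: &[PathBuf], came_from: &Path) -> usize {
    parent_dirs
        .iter()
        .position(|d| d.as_path() == came_from)
        .unwrap_or(0)
}

/// Clamp the cursor against every visible row (directories plus media),
/// not just the filtered media rows.
pub fn clamp_cursor(cursor: usize, dir_count: usize, media_count: usize) -> usize {
    let visible = dir_count + media_count;
    cursor.min(visible.saturating_sub(1))
}
```

With three directories and no media passing the filter, clamping against the media count alone would force the cursor to row 0; clamping against the visible count keeps it anywhere in rows 0 to 2.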

Folders now respect the user's sort key (name, size, modified) instead of always being alphabetical. Media-only keys (duration, resolution, codec, bitrate, pages) fall back to name sort for directories. Introduce DirItem struct to carry cached metadata (name, size, modified_at) alongside directory paths, avoiding re-stat on every sort. Decouple list_sibling_dirs from list_subdirs so the parent pane stays alphabetical while the current pane sorts dynamically.
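The following sketch shows one way the cached-metadata sort could work. The field types on `DirItem`, the ascending order, and the case-insensitive name tie-break are assumptions; the PR text only names the struct, its fields, `applies_to_dirs`, and `sort_dir_items`.

```rust
use std::cmp::Ordering;
use std::path::PathBuf;
use std::time::SystemTime;

/// Directory entry with metadata captured once at listing time,
/// so sorting never has to stat the filesystem again.
#[derive(Debug, Clone)]
pub struct DirItem {
    pub path: PathBuf,
    pub name: String,
    pub size: u64,
    pub modified_at: Option<SystemTime>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum SortKey {
    Name,
    Size,
    Modified,
    Duration,
    Resolution,
    Codec,
    Bitrate,
    Pages,
}

impl SortKey {
    /// Only keys backed by filesystem metadata make sense for folders.
    pub fn applies_to_dirs(self) -> bool {
        matches!(self, SortKey::Name | SortKey::Size | SortKey::Modified)
    }
}

/// Sorts folders by the active key; media-only keys fall back to name.
pub fn sort_dir_items(items: &mut [DirItem], key: SortKey) {
    let key = if key.applies_to_dirs() { key } else { SortKey::Name };
    items.sort_by(|a, b| {
        let primary = match key {
            SortKey::Size => a.size.cmp(&b.size),
            SortKey::Modified => a.modified_at.cmp(&b.modified_at),
            _ => Ordering::Equal,
        };
        // Name is both the default order and the tie-break.
        primary.then_with(|| a.name.to_lowercase().cmp(&b.name.to_lowercase()))
    });
}
```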

thepushkarp · 2026-03-04T08:15:37Z

@codex review

gemini-code-assist · 2026-03-04T08:15:40Z

Summary of Changes

Hello, I'm Gemini Code Assist¹! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the mls utility's capabilities by integrating comprehensive document support, enabling it to recognize, categorize, and extract metadata from a wide array of document file formats using native Rust implementations. Alongside this, the user interface has been enhanced with a more flexible multi-select kind filtering system and improved directory navigation, ensuring a more intuitive and powerful browsing experience for all supported file types.

Highlights

- Document Support: Introduced pure-Rust metadata extraction for various document types including PDF, DOCX, XLSX, PPTX, ODF, CSV, TXT, and Markdown. This adds a new MediaKind::Document variant and DocumentInfo type, allowing filtering by document-specific paths like media.doc.* and sorting by properties such as page count.
- Multi-Select Kind Filter: The TUI now supports multi-selection for media kind filters. Keys 2 through 5 toggle individual kinds (video, audio, image, document) on or off, rather than exclusively selecting one. Key 1 resets the filter to show all media kinds.
- Navigate-Back Selection Restore: When navigating back to a parent directory using h or the left arrow key, the cursor now automatically lands on the directory from which the user originated, improving navigation flow.
- Sortable Folders: Folders within the file list now respect the active sort key (name, size, modified date) while consistently appearing above media entries. Media-specific sort keys (e.g., duration, resolution, pages) will cause directories to fall back to sorting by name.

Changelog

CLAUDE.md
- Updated project description to include 'document file browser'.
- Added document.rs to the src/ directory structure description.
- Revised scan.rs description to mention routing documents to native probe.
- Updated MediaEntry description to reflect the new Document variant and doc: Option<DocumentInfo> field.
- Incremented JSON schema version to 0.2.0.
- Added media.doc.* filter paths and pages/author aliases to filter documentation.
- Noted that documents bypass ffprobe and are handled by document.rs.
- Documented TUI kind filter key 5 for documents and the 'D' icon.
Cargo.lock
- Added numerous new dependencies required for document parsing, including aes, block-padding, bytecount, byteorder, cbc, cfb, cipher, ecb, encoding_rs, inout, lopdf, md-5, nom, nom_locate, quick-xml, rangemap, stringprep, tinyvec, tinyvec_macros, ttf-parser, typed-path, unicode-bidi, unicode-normalization, unicode-properties, zip, zlib-rs, and zopfli.
- Updated media-ls version to 0.0.2.
- Updated rand_chacha dependency version.
Cargo.toml
- Updated package version to 0.0.2.
- Added lopdf, zip, quick-xml, and cfb as dependencies.
src/document.rs
- Added new module for pure-Rust document metadata extraction.
- Implemented probe_document to dispatch probing based on file extension.
- Included functions for probing PDF (probe_pdf), OOXML (probe_ooxml_doc, probe_ooxml_spreadsheet, probe_ooxml_presentation), ODF (probe_odf), OLE2 (probe_ole2), and text-based formats (probe_text_table, probe_text).
- Provided XML helper functions local_name and count_xml_elements.
src/filter.rs
- Extended resolve_field_typed to support new media.doc.* filter paths for document metadata fields (format, author, title, subject, creator_app, creation_date, modification_date, page_count, word_count, line_count, sheet_count).
- Added pages and author as convenience aliases for document-related filters.
- Updated test media entry creation to include doc: None.
src/main.rs
- Imported the new document module.
- Modified run_info to conditionally use probe::probe_document_file for document extensions, otherwise falling back to probe::probe_file.
src/output.rs
- Updated SCHEMA_VERSION constant to 0.2.0.
- Updated test media entry creation to include doc: None.
src/probe.rs
- Added probe_document_file function for native document metadata extraction.
- Modified build_media_info to initialize doc field as None.
src/scan.rs
- Imported is_document_extension for file type identification.
- Modified probe_files to use probe::probe_document_file for document types.
- Updated test to correctly count .txt files as recognized media.
src/sort.rs
- Updated module comment to include 'directory items'.
- Imported DirItem struct.
- Added pages and page_count as valid sort keys in parse_sort_spec.
- Implemented sort_dir_items to sort directory items, falling back to name sort for media-only keys.
- Modified compare_by_key to handle SortKey::Pages for document page counts.
- Updated test media entry creation to include doc: None.
- Added new unit tests for sort_dir_items functionality.
src/tui/layout.rs
- Adjusted directory item rendering to use the name field from DirItem.
- Added MediaKind::Document to the kind label logic, displaying 'D' for documents.
- Implemented rendering of document-specific metadata (format, pages, words, lines, sheets, author, title) in the metadata panel.
- Updated the footer to display the multi-select kind filter status using checkmarks.
- Revised the help overlay to reflect the new kind filter key assignments (1 for all, 2-5 for toggling).
src/tui/mod.rs
- Imported sort_dir_items and DirItem.
- Refactored KindFilter from an enum to a struct to support multi-selection.
- Updated App struct to store dir_items as Vec<DirItem>.
- Initialized kind_filter to KindFilter::ALL.
- Modified selected_dir to return the path from the selected DirItem.
- Updated matches_kind to check the boolean flags in the KindFilter struct.
- Ensured apply_sort sorts dir_items before media entries.
- Modified navigate_to_dir to sort dir_items and restore the cursor position to the previously selected directory.
- Updated list_subdirs to return Vec<DirItem> with cached metadata.
- Modified list_sibling_dirs to return Vec<PathBuf> for parent pane rendering.
- Updated handle_key logic for kind filtering: 1 selects all, 2-5 toggle specific kinds.
- Updated tests to reflect changes in KindFilter and DirItem usage.
src/tui/triage.rs
- Updated test media entry creation to include doc: None.
src/types.rs
- Added MediaKind::Document enum variant.
- Defined DocumentInfo struct to hold document-specific metadata.
- Included doc: Option<DocumentInfo> in the MediaInfo struct.
- Introduced DirItem struct to cache directory metadata for sorting.
- Added SortKey::Pages for sorting by document page count.
- Implemented applies_to_dirs method for SortKey to determine if a key is applicable to directories.
- Updated the SortKey::next cycle to include Pages.
- Defined DOCUMENT_EXTENSIONS constant for recognized document file types.
- Added is_document_extension function.
- Updated is_media_extension to include document types.
- Extended unit tests for MediaKind and SortKey.
tests/cli.rs
- Updated json_output_valid_schema to expect 0.2.0 schema version and 6 media files (including .txt).
- Updated ndjson_has_header_and_footer to expect 0.2.0 schema version.
- Added new tests to verify document support: json_documents_have_kind_document, json_filter_kind_document_returns_only_documents, json_document_has_line_count, and json_sort_by_pages.

¹ Review the Privacy Notices, Generative AI Prohibited Use Policy, Terms of Service, and learn how to configure Gemini Code Assist in GitHub here. Gemini can make mistakes, so double check it and use code with caution.

gemini-code-assist

Code Review

This pull request introduces significant feature updates, including support for document metadata extraction via the new document.rs module, a more flexible multi-select kind filter in the TUI, and improved directory sorting. A critical security concern is the potential for Denial of Service (OOM) due to unbounded resource consumption when processing malicious or extremely large ZIP entries and text files without size limits. Additionally, the code review identified areas with code duplication (now explicitly linked to a rule) and magic numbers that could be refactored for better maintainability.

src/document.rs

gemini-code-assist · 2026-03-04T08:19:31Z

src/document.rs

+/// The stream uses MS-OLEPS binary format with property sets.
+/// We extract string properties by well-known IDs
+/// (2=Title, 4=Author, 5=Subject, 18=`AppName`).
+fn parse_summary_info(data: &[u8], info: &mut DocumentInfo) {


The parse_summary_info function uses several magic numbers for offsets, signatures, and property IDs when parsing the OLE2 SummaryInformation stream. This makes the code hard to read and maintain. These values should be defined as constants with descriptive names to improve clarity and make it easier to understand the structure of the binary format being parsed.

For example:

    const OLEPS_SIGNATURE: u16 = 0xFFFE;
    const SECTION_OFFSET_POS: usize = 44;
    // Property IDs
    const PID_TITLE: u32 = 2;
    const PID_AUTHOR: u32 = 4;
    // ... etc

    // In the function
    if data.len() < 48 || read_u16_le(data, 0) != OLEPS_SIGNATURE {
        return;
    }
    // ...
    match prop_id {
        PID_TITLE => info.title = Some(s),
        PID_AUTHOR => info.author = Some(s),
        // ... etc
    }

Fixed in 6fad9c5. Added named constants: OLEPS_BYTE_ORDER_LE, OLEPS_HEADER_MIN_LEN, OLEPS_SECTION_OFFSET_POS, OLEPS_MAX_PROPS, OLEPS_SECTION_HEADER_SIZE, OLEPS_PROP_ENTRY_SIZE, VT_I4, VT_LPSTR, PIDSI_TITLE, PIDSI_AUTHOR, PIDSI_SUBJECT, PIDSI_PAGECOUNT, PIDSI_WORDCOUNT, PIDSI_APPNAME. All inline literals in parse_summary_info replaced.
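For illustration, a header check built on named constants might look like the sketch below. The constant names are taken from the list above, but pairing them with the values 0xFFFE, 48, and 44 from the review suggestion is an assumption, as are the `Option`-returning readers and the `looks_like_property_set` helper.

```rust
// Assumed values, taken from the literals in the review suggestion.
const OLEPS_BYTE_ORDER_LE: u16 = 0xFFFE;
const OLEPS_HEADER_MIN_LEN: usize = 48;
const OLEPS_SECTION_OFFSET_POS: usize = 44;

/// Reads a little-endian u16, or None if the slice is too short.
fn read_u16_le(data: &[u8], offset: usize) -> Option<u16> {
    let bytes = data.get(offset..offset.checked_add(2)?)?;
    Some(u16::from_le_bytes([bytes[0], bytes[1]]))
}

/// Reads a little-endian u32, or None if the slice is too short.
fn read_u32_le(data: &[u8], offset: usize) -> Option<u32> {
    let bytes = data.get(offset..offset.checked_add(4)?)?;
    Some(u32::from_le_bytes([bytes[0], bytes[1], bytes[2], bytes[3]]))
}

/// Hypothetical helper: true when the stream is long enough and starts
/// with the little-endian byte-order marker.
fn looks_like_property_set(data: &[u8]) -> bool {
    data.len() >= OLEPS_HEADER_MIN_LEN && read_u16_le(data, 0) == Some(OLEPS_BYTE_ORDER_LE)
}

/// Hypothetical helper: offset of the first section, if readable.
fn first_section_offset(data: &[u8]) -> Option<usize> {
    read_u32_le(data, OLEPS_SECTION_OFFSET_POS).map(|v| v as usize)
}
```

Bounds-checked readers keep a truncated or hostile stream from causing a panic while the named constants document what each offset means.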

src/tui/mod.rs

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87793a7a7f

chatgpt-codex-connector · 2026-03-04T08:22:06Z

src/scan.rs

+            let result = if is_doc {
+                probe::probe_document_file(&file).await
+            } else {
+                probe::probe_file(&file, timeout_ms).await


Enforce timeout for native document probes

Document files bypass the timeout_ms path and call probe_document_file directly, so native probes can run indefinitely while the rest of scanning waits. This is observable with document extensions on blocking special files (for example, a .txt FIFO): unlike probe_file, there is no tokio::time::timeout guard, so one file can stall the entire scan. Please apply timeout handling (or regular-file checks) to document probing as well.

Fixed in 408746e. Wrapped probe_document_file in tokio::time::timeout(Duration::from_millis(timeout_ms), ...) at the call site in scan.rs, matching the timeout behavior of probe_file. On timeout, logs at debug level and returns an error.
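The fix itself uses `tokio::time::timeout`. To stay free of external crates, the sketch below swaps in a standard-library thread and channel to show the same idea: bound how long the caller waits for a probe. The `run_with_timeout` name and the string error type are invented for this example.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Runs `probe` on a worker thread and waits at most `timeout_ms` for it.
/// Note: on timeout the worker is not cancelled; it is simply abandoned.
pub fn run_with_timeout<T, F>(timeout_ms: u64, probe: F) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out.
        let _ = tx.send(probe());
    });
    match rx.recv_timeout(Duration::from_millis(timeout_ms)) {
        Ok(value) => Ok(value),
        Err(mpsc::RecvTimeoutError::Timeout) => {
            Err(format!("probe timed out after {timeout_ms} ms"))
        }
        Err(mpsc::RecvTimeoutError::Disconnected) => {
            Err("probe thread exited without a result".to_string())
        }
    }
}
```

Either way, a single stuck file (such as the FIFO case raised in the review) costs at most `timeout_ms` instead of stalling the whole scan.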

src/tui/mod.rs

- Cap zip XML reads at 4 MiB via `take(MAX_XML_BYTES)` in `read_xml_from_zip`
- Cap text/CSV scanning at 256 MiB via `take(MAX_TEXT_SCAN_BYTES)`
- Replace `BufReader::lines()` with byte-level counting in `probe_text_table`
- Replace `BufReader::lines()` with capped `read_line` in `probe_text`

Addresses security review comments about unbounded reads.
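A minimal sketch of capped, byte-level line counting using `std::io::Read::take`. The `count_lines_capped` function is hypothetical; only the `MAX_TEXT_SCAN_BYTES` name and its 256 MiB value come from the commit message.

```rust
use std::io::{self, Read};

/// Upper bound on how many bytes of a text file are scanned.
pub const MAX_TEXT_SCAN_BYTES: u64 = 256 * 1024 * 1024;

/// Counts lines in at most `cap` bytes of input without ever holding
/// more than one fixed-size chunk in memory.
pub fn count_lines_capped<R: Read>(reader: R, cap: u64) -> io::Result<u64> {
    let mut limited = reader.take(cap);
    let mut buf = [0u8; 8192];
    let mut newlines = 0u64;
    let mut last_byte = None;
    loop {
        let n = limited.read(&mut buf)?;
        if n == 0 {
            break;
        }
        newlines += buf[..n].iter().filter(|&&b| b == b'\n').count() as u64;
        last_byte = Some(buf[n - 1]);
    }
    // A final line without a trailing newline still counts as a line.
    Ok(match last_byte {
        Some(b) if b != b'\n' => newlines + 1,
        _ => newlines,
    })
}
```

Unlike `BufReader::lines()`, this never allocates a `String` per line, so a file consisting of one enormous line cannot exhaust memory.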

thepushkarp added 4 commits March 3, 2026 13:10

feat(tui): make kind filter multi-select with toggle keys

63b90f5

Replace exclusive KindFilter enum with a struct of per-kind booleans. Keys 2-5 now toggle individual kinds on/off; key 1 resets to show all. Footer shows checkbox indicators for each kind's active state.

thepushkarp self-assigned this Mar 4, 2026

gemini-code-assist bot reviewed Mar 4, 2026

chatgpt-codex-connector bot reviewed Mar 4, 2026

thepushkarp added 6 commits March 4, 2026 14:02

fix(scan): wrap document probes in timeout to match media probe behavior

408746e

Document probes had no timeout, unlike ffprobe-based media probes. A stuck document read (e.g., network-mounted PDF) would block the scan task indefinitely.

fix(tui): follow symlinks in directory listings

72744d8

Replace `DirEntry::file_type().is_dir()` with `Path::is_dir()` in both `list_subdirs` and `list_sibling_dirs`. The former doesn't follow symlinks, causing symlinked directories to disappear from the TUI.
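A simplified sketch of a symlink-following directory listing. It returns plain paths for brevity, whereas the PR's `list_subdirs` returns `Vec<DirItem>`; the sorting step is also an assumption.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// Lists subdirectories of `dir`, including symlinks that point at
/// directories.
pub fn list_subdirs(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut dirs = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Path::is_dir resolves symlinks before checking the type.
        // DirEntry::file_type reports the link itself, so a symlinked
        // directory would be classified as a symlink and skipped.
        if path.is_dir() {
            dirs.push(path);
        }
    }
    dirs.sort();
    Ok(dirs)
}
```

The trade-off is one extra metadata lookup per entry, since `Path::is_dir` queries the filesystem while `DirEntry::file_type` can often reuse information from the directory read.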

refactor(document): extract parse_xml_text_fields helper

5459c06

Deduplicates the identical XML event-loop boilerplate shared by parse_core_xml and parse_app_xml. Each is now a thin wrapper that passes a closure for tag-to-field dispatch.
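The shape of this refactor can be sketched with a simplified event type standing in for the events a pull parser such as quick-xml yields. `XmlEvent`, the reduced `DocumentInfo`, and the event-slice signatures are assumptions for illustration; the real helpers read from XML bytes.

```rust
/// Simplified stand-in for XML pull-parser events.
pub enum XmlEvent<'a> {
    Start(&'a str),
    Text(&'a str),
    End,
}

/// Reduced stand-in for the project's DocumentInfo.
#[derive(Debug, Default, PartialEq)]
pub struct DocumentInfo {
    pub title: Option<String>,
    pub author: Option<String>,
    pub page_count: Option<u32>,
}

/// Strips a namespace prefix: "dc:title" becomes "title".
pub fn local_name(qualified: &str) -> &str {
    qualified.rsplit(':').next().unwrap_or(qualified)
}

/// Shared event loop: tracks the open tag and hands each text node to
/// the caller's closure together with that tag's local name.
pub fn parse_xml_text_fields(events: &[XmlEvent], mut on_field: impl FnMut(&str, &str)) {
    let mut current: Option<&str> = None;
    for event in events {
        match event {
            XmlEvent::Start(tag) => current = Some(local_name(tag)),
            XmlEvent::Text(text) => {
                if let Some(tag) = current {
                    on_field(tag, text);
                }
            }
            XmlEvent::End => current = None,
        }
    }
}

/// Thin wrapper: core properties (title, creator).
pub fn parse_core_xml(events: &[XmlEvent], info: &mut DocumentInfo) {
    parse_xml_text_fields(events, |tag, text| match tag {
        "title" => info.title = Some(text.to_string()),
        "creator" => info.author = Some(text.to_string()),
        _ => {}
    });
}

/// Thin wrapper: application properties (page count).
pub fn parse_app_xml(events: &[XmlEvent], info: &mut DocumentInfo) {
    parse_xml_text_fields(events, |tag, text| match tag {
        "Pages" => info.page_count = text.trim().parse().ok(),
        _ => {}
    });
}
```

Each wrapper now contains only its tag-to-field mapping, which is the part that actually differs between the two files.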

refactor(document): replace OLE2 magic numbers with named constants

6fad9c5

Adds OLEPS_*, VT_*, and PIDSI_* constants for the MS-OLEPS binary format values used in parse_summary_info. Makes the binary format parsing self-documenting.

refactor(tui): extract toggle_kind_filter method

35cfb2e

Reduces the four kind-filter match arms (keys 2-5) to one-liners by extracting the shared toggle-apply-check-status pattern into a method on App.

thepushkarp merged commit ae6b8d4 into main Mar 4, 2026
3 checks passed

thepushkarp merged 10 commits into main from v0.0.2
