v0.0.2: document support, kind filter, dir sorting #2

Merged
thepushkarp merged 10 commits into main from v0.0.2 on Mar 4, 2026

@thepushkarp
Owner

Summary

  • Document support: Pure-Rust metadata extraction for PDF, DOCX, XLSX, PPTX, ODF, CSV, TXT, and more via new document.rs module. No external tools required. New MediaKind::Document variant, DocumentInfo type, filter paths (media.doc.*, pages/author aliases), kind filter key 5, sort by pages.
  • Multi-select kind filter: Keys 2-5 toggle individual kinds (video/audio/image/doc) instead of exclusive select. 1 resets to show all.
  • Navigate-back selection restore: Pressing h/left to go to parent dir now lands the cursor on the directory you came from.
  • Sortable folders: Folders respect the active sort key (name, size, modified) while staying above media entries. Media-only keys (duration, resolution, codec, bitrate, pages) fall back to name sort for directories.

Test plan

  • cargo clippy --all-targets --all-features -- -D warnings — zero warnings
  • cargo test — 300 unit + 19 integration tests pass
  • cargo fmt --check — clean
  • Manual: cargo run -- tui ~/some-dir — verify document files show with "D" icon, kind filter toggles work, sort applies to folders, navigate-back restores selection

Add MediaKind::Document as a new file category that bypasses ffprobe
entirely, using pure Rust crates for metadata extraction.

Supported formats: PDF, DOCX/DOC, XLSX/XLS, PPTX/PPT, ODT/ODS/ODP,
CSV, TSV, TXT, MD.

- New document.rs module with format-specific extractors (lopdf, zip,
  quick-xml, cfb)
- DocumentInfo struct with page_count, word_count, line_count,
  sheet_count, author, title, and more
- Filter support: media.doc.* field paths + pages/author aliases
- Sort support: pages/page_count sort key
- TUI: KindFilter::Document (key 5), "D" icon, document metadata panel
- Bump package version to 0.0.2, schema version to 0.2.0
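As a rough illustration of how documents could be routed away from ffprobe, here is a minimal sketch built from the format list above. The names `DOCUMENT_EXTENSIONS` and `is_document_extension` appear in this PR, but their exact signatures are not shown here, so the shapes below, the case-insensitive matching, and the `probe_route` helper are assumptions.

```rust
/// Extension table taken from the supported formats listed above (assumed layout).
const DOCUMENT_EXTENSIONS: &[&str] = &[
    "pdf", "docx", "doc", "xlsx", "xls", "pptx", "ppt",
    "odt", "ods", "odp", "csv", "tsv", "txt", "md",
];

/// True when `ext` (without the leading dot) names a document format.
/// Case-insensitive matching is an assumption of this sketch.
fn is_document_extension(ext: &str) -> bool {
    DOCUMENT_EXTENSIONS
        .iter()
        .any(|known| known.eq_ignore_ascii_case(ext))
}

/// Hypothetical router: documents go to the native probe, everything else to ffprobe.
fn probe_route(path: &std::path::Path) -> &'static str {
    match path.extension().and_then(|e| e.to_str()) {
        Some(ext) if is_document_extension(ext) => "native",
        _ => "ffprobe",
    }
}
```

Files without an extension fall through to the ffprobe route in this sketch; the real scanner may filter them out earlier.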

Replace exclusive KindFilter enum with a struct of per-kind booleans.
Keys 2-5 now toggle individual kinds on/off; key 1 resets to show all.
Footer shows checkbox indicators for each kind's active state.
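The enum-to-struct change described above might look roughly like the following. `KindFilter::ALL` and the key assignments come from this PR; the field names, the `handle_key` and `footer` helpers, and the checkbox glyphs are assumptions made for illustration.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum MediaKind {
    Video,
    Audio,
    Image,
    Document,
}

/// One boolean per kind, so any combination can be visible at once.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
struct KindFilter {
    video: bool,
    audio: bool,
    image: bool,
    document: bool,
}

impl KindFilter {
    const ALL: KindFilter = KindFilter { video: true, audio: true, image: true, document: true };

    fn flag_mut(&mut self, kind: MediaKind) -> &mut bool {
        match kind {
            MediaKind::Video => &mut self.video,
            MediaKind::Audio => &mut self.audio,
            MediaKind::Image => &mut self.image,
            MediaKind::Document => &mut self.document,
        }
    }

    fn toggle(&mut self, kind: MediaKind) {
        let flag = self.flag_mut(kind);
        *flag = !*flag;
    }

    fn matches(mut self, kind: MediaKind) -> bool {
        // `self` is a copy, so borrowing it mutably here changes nothing outside.
        *self.flag_mut(kind)
    }

    /// '1' resets to all kinds, '2'..='5' toggle one kind. Returns false for other keys.
    fn handle_key(&mut self, key: char) -> bool {
        match key {
            '1' => *self = KindFilter::ALL,
            '2' => self.toggle(MediaKind::Video),
            '3' => self.toggle(MediaKind::Audio),
            '4' => self.toggle(MediaKind::Image),
            '5' => self.toggle(MediaKind::Document),
            _ => return false,
        }
        true
    }

    /// Footer text with one checkbox per kind (glyphs are hypothetical).
    fn footer(&self) -> String {
        let mark = |on: bool| if on { "x" } else { " " };
        format!(
            "[{}] video [{}] audio [{}] image [{}] doc",
            mark(self.video), mark(self.audio), mark(self.image), mark(self.document)
        )
    }
}
```

A struct of booleans keeps every combination representable, which an exclusive enum cannot do without one variant per subset.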

Save current_dir before navigating to parent, then restore cursor
to the child directory in the new dir_items list. Also fix
apply_filter clamping to use visible_count() (dirs + media) instead
of filtered_indices.len() (media only).
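A minimal sketch of the two pieces described above, assuming the directory list is a slice of paths and the cursor is a plain index. The function names `restore_cursor` and `clamp_cursor` are hypothetical and not taken from the PR.

```rust
use std::path::{Path, PathBuf};

/// Index of `child` in the parent's directory listing, or 0 if it is no longer there.
fn restore_cursor(dir_items: &[PathBuf], child: &Path) -> usize {
    dir_items
        .iter()
        .position(|p| p.as_path() == child)
        .unwrap_or(0)
}

/// Clamp the cursor against every visible row (directories plus media),
/// not just the filtered media rows.
fn clamp_cursor(cursor: usize, dir_count: usize, media_count: usize) -> usize {
    let visible_count = dir_count + media_count;
    cursor.min(visible_count.saturating_sub(1))
}
```

With three directories and a filter that hides all media, clamping against the media count alone would force the cursor to row 0; clamping against the visible count leaves a cursor on the third directory where it was.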

Folders now respect the user's sort key (name, size, modified) instead
of always being alphabetical. Media-only keys (duration, resolution,
codec, bitrate, pages) fall back to name sort for directories.

Introduce DirItem struct to carry cached metadata (name, size,
modified_at) alongside directory paths, avoiding re-stat on every sort.
Decouple list_sibling_dirs from list_subdirs so the parent pane stays
alphabetical while the current pane sorts dynamically.
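The fallback rule can be sketched as follows. `DirItem`, `SortKey`, `applies_to_dirs`, and `sort_dir_items` are names from this PR, but the field types, the variant list, and the case-insensitive name comparison are assumptions of this sketch.

```rust
use std::path::PathBuf;
use std::time::SystemTime;

/// Directory entry with metadata cached at listing time, so sorting never re-stats.
#[derive(Debug, Clone)]
struct DirItem {
    path: PathBuf,
    name: String,
    size: u64,
    modified_at: Option<SystemTime>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum SortKey {
    Name,
    Size,
    Modified,
    Duration,
    Resolution,
    Codec,
    Bitrate,
    Pages,
}

impl SortKey {
    /// Only keys backed by filesystem metadata make sense for directories.
    fn applies_to_dirs(self) -> bool {
        matches!(self, SortKey::Name | SortKey::Size | SortKey::Modified)
    }
}

fn sort_dir_items(items: &mut [DirItem], key: SortKey) {
    // Media-only keys fall back to name order for directories.
    let key = if key.applies_to_dirs() { key } else { SortKey::Name };
    items.sort_by(|a, b| {
        let primary = match key {
            SortKey::Size => a.size.cmp(&b.size),
            SortKey::Modified => a.modified_at.cmp(&b.modified_at),
            _ => a.name.to_lowercase().cmp(&b.name.to_lowercase()),
        };
        // Tie-break on the raw name so the order is deterministic.
        primary.then_with(|| a.name.cmp(&b.name))
    });
}
```

Caching `size` and `modified_at` in `DirItem` means cycling the sort key only reorders data already in memory.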
@thepushkarp
Owner Author

@codex review

@gemini-code-assist

Summary of Changes

Hello, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request significantly expands the mls utility's capabilities by integrating comprehensive document support, enabling it to recognize, categorize, and extract metadata from a wide array of document file formats using native Rust implementations. Alongside this, the user interface has been enhanced with a more flexible multi-select kind filtering system and improved directory navigation, ensuring a more intuitive and powerful browsing experience for all supported file types.

Highlights

  • Document Support: Introduced pure-Rust metadata extraction for various document types including PDF, DOCX, XLSX, PPTX, ODF, CSV, TXT, and Markdown. This adds a new MediaKind::Document variant and DocumentInfo type, allowing filtering by document-specific paths like media.doc.* and sorting by properties such as page count.
  • Multi-Select Kind Filter: The TUI now supports multi-selection for media kind filters. Keys 2 through 5 toggle individual kinds (video, audio, image, document) on or off, rather than exclusively selecting one. Key 1 resets the filter to show all media kinds.
  • Navigate-Back Selection Restore: When navigating back to a parent directory using h or the left arrow key, the cursor now automatically lands on the directory from which the user originated, improving navigation flow.
  • Sortable Folders: Folders within the file list now respect the active sort key (name, size, modified date) while consistently appearing above media entries. Media-specific sort keys (e.g., duration, resolution, pages) will cause directories to fall back to sorting by name.
Changelog
  • CLAUDE.md
    • Updated project description to include 'document file browser'.
    • Added document.rs to the src/ directory structure description.
    • Revised scan.rs description to mention routing documents to native probe.
    • Updated MediaEntry description to reflect the new Document variant and doc: Option<DocumentInfo> field.
    • Incremented JSON schema version to 0.2.0.
    • Added media.doc.* filter paths and pages/author aliases to filter documentation.
    • Noted that documents bypass ffprobe and are handled by document.rs.
    • Documented TUI kind filter key 5 for documents and the 'D' icon.
  • Cargo.lock
    • Added numerous new dependencies required for document parsing, including aes, block-padding, bytecount, byteorder, cbc, cfb, cipher, ecb, encoding_rs, inout, lopdf, md-5, nom, nom_locate, quick-xml, rangemap, stringprep, tinyvec, tinyvec_macros, ttf-parser, typed-path, unicode-bidi, unicode-normalization, unicode-properties, zip, zlib-rs, and zopfli.
    • Updated media-ls version to 0.0.2.
    • Updated rand_chacha dependency version.
  • Cargo.toml
    • Updated package version to 0.0.2.
    • Added lopdf, zip, quick-xml, and cfb as dependencies.
  • src/document.rs
    • Added new module for pure-Rust document metadata extraction.
    • Implemented probe_document to dispatch probing based on file extension.
    • Included functions for probing PDF (probe_pdf), OOXML (probe_ooxml_doc, probe_ooxml_spreadsheet, probe_ooxml_presentation), ODF (probe_odf), OLE2 (probe_ole2), and text-based formats (probe_text_table, probe_text).
    • Provided XML helper functions local_name and count_xml_elements.
  • src/filter.rs
    • Extended resolve_field_typed to support new media.doc.* filter paths for document metadata fields (format, author, title, subject, creator_app, creation_date, modification_date, page_count, word_count, line_count, sheet_count).
    • Added pages and author as convenience aliases for document-related filters.
    • Updated test media entry creation to include doc: None.
  • src/main.rs
    • Imported the new document module.
    • Modified run_info to conditionally use probe::probe_document_file for document extensions, otherwise falling back to probe::probe_file.
  • src/output.rs
    • Updated SCHEMA_VERSION constant to 0.2.0.
    • Updated test media entry creation to include doc: None.
  • src/probe.rs
    • Added probe_document_file function for native document metadata extraction.
    • Modified build_media_info to initialize doc field as None.
  • src/scan.rs
    • Imported is_document_extension for file type identification.
    • Modified probe_files to use probe::probe_document_file for document types.
    • Updated test to correctly count .txt files as recognized media.
  • src/sort.rs
    • Updated module comment to include 'directory items'.
    • Imported DirItem struct.
    • Added pages and page_count as valid sort keys in parse_sort_spec.
    • Implemented sort_dir_items to sort directory items, falling back to name sort for media-only keys.
    • Modified compare_by_key to handle SortKey::Pages for document page counts.
    • Updated test media entry creation to include doc: None.
    • Added new unit tests for sort_dir_items functionality.
  • src/tui/layout.rs
    • Adjusted directory item rendering to use the name field from DirItem.
    • Added MediaKind::Document to the kind label logic, displaying 'D' for documents.
    • Implemented rendering of document-specific metadata (format, pages, words, lines, sheets, author, title) in the metadata panel.
    • Updated the footer to display the multi-select kind filter status using checkmarks.
    • Revised the help overlay to reflect the new kind filter key assignments (1 for all, 2-5 for toggling).
  • src/tui/mod.rs
    • Imported sort_dir_items and DirItem.
    • Refactored KindFilter from an enum to a struct to support multi-selection.
    • Updated App struct to store dir_items as Vec<DirItem>.
    • Initialized kind_filter to KindFilter::ALL.
    • Modified selected_dir to return the path from the selected DirItem.
    • Updated matches_kind to check the boolean flags in the KindFilter struct.
    • Ensured apply_sort sorts dir_items before media entries.
    • Modified navigate_to_dir to sort dir_items and restore the cursor position to the previously selected directory.
    • Updated list_subdirs to return Vec<DirItem> with cached metadata.
    • Modified list_sibling_dirs to return Vec<PathBuf> for parent pane rendering.
    • Updated handle_key logic for kind filtering: 1 selects all, 2-5 toggle specific kinds.
    • Updated tests to reflect changes in KindFilter and DirItem usage.
  • src/tui/triage.rs
    • Updated test media entry creation to include doc: None.
  • src/types.rs
    • Added MediaKind::Document enum variant.
    • Defined DocumentInfo struct to hold document-specific metadata.
    • Included doc: Option<DocumentInfo> in the MediaInfo struct.
    • Introduced DirItem struct to cache directory metadata for sorting.
    • Added SortKey::Pages for sorting by document page count.
    • Implemented applies_to_dirs method for SortKey to determine if a key is applicable to directories.
    • Updated the SortKey::next cycle to include Pages.
    • Defined DOCUMENT_EXTENSIONS constant for recognized document file types.
    • Added is_document_extension function.
    • Updated is_media_extension to include document types.
    • Extended unit tests for MediaKind and SortKey.
  • tests/cli.rs
    • Updated json_output_valid_schema to expect 0.2.0 schema version and 6 media files (including .txt).
    • Updated ndjson_has_header_and_footer to expect 0.2.0 schema version.
    • Added new tests to verify document support: json_documents_have_kind_document, json_filter_kind_document_returns_only_documents, json_document_has_line_count, and json_sort_by_pages.

@thepushkarp thepushkarp self-assigned this Mar 4, 2026

@gemini-code-assist gemini-code-assist bot left a comment

Code Review

This pull request introduces significant feature updates, including support for document metadata extraction via the new document.rs module, a more flexible multi-select kind filter in the TUI, and improved directory sorting. A critical security concern is the potential for Denial of Service (OOM) due to unbounded resource consumption when processing malicious or extremely large ZIP entries and text files without size limits. Additionally, the code review identified areas with code duplication (now explicitly linked to a rule) and magic numbers that could be refactored for better maintainability.

/// The stream uses MS-OLEPS binary format with property sets.
/// We extract string properties by well-known IDs
/// (2=Title, 4=Author, 5=Subject, 18=`AppName`).
fn parse_summary_info(data: &[u8], info: &mut DocumentInfo) {

medium

The parse_summary_info function uses several magic numbers for offsets, signatures, and property IDs when parsing the OLE2 SummaryInformation stream. This makes the code hard to read and maintain. These values should be defined as constants with descriptive names to improve clarity and make it easier to understand the structure of the binary format being parsed.

For example:

const OLEPS_SIGNATURE: u16 = 0xFFFE;
const SECTION_OFFSET_POS: usize = 44;

// Property IDs
const PID_TITLE: u32 = 2;
const PID_AUTHOR: u32 = 4;
// ... etc

// In the function
if data.len() < 48 || read_u16_le(data, 0) != OLEPS_SIGNATURE {
    return;
}
// ...
match prop_id {
    PID_TITLE => info.title = Some(s),
    PID_AUTHOR => info.author = Some(s),
    // ... etc
}

Owner Author

Fixed in 6fad9c5. Added named constants: OLEPS_BYTE_ORDER_LE, OLEPS_HEADER_MIN_LEN, OLEPS_SECTION_OFFSET_POS, OLEPS_MAX_PROPS, OLEPS_SECTION_HEADER_SIZE, OLEPS_PROP_ENTRY_SIZE, VT_I4, VT_LPSTR, PIDSI_TITLE, PIDSI_AUTHOR, PIDSI_SUBJECT, PIDSI_PAGECOUNT, PIDSI_WORDCOUNT, PIDSI_APPNAME. All inline literals in parse_summary_info replaced.
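For readers unfamiliar with the stream layout, here is a hedged sketch of how named constants can drive the parse. The constant names mirror those listed in the reply; the values for the byte-order mark, header length, section offset position, and the title and author IDs come from the review comment, while `VT_LPSTR`, the header and entry sizes, and the `OLEPS_MAX_PROPS` cap are this sketch's reading of the MS-OLEPS layout and may differ from the merged code. The function name, the two-field `DocumentInfo`, and the lossy UTF-8 decoding are assumptions.

```rust
const OLEPS_BYTE_ORDER_LE: u16 = 0xFFFE;
const OLEPS_HEADER_MIN_LEN: usize = 48;
const OLEPS_SECTION_OFFSET_POS: usize = 44;
const OLEPS_SECTION_HEADER_SIZE: usize = 8; // section size + property count
const OLEPS_PROP_ENTRY_SIZE: usize = 8; // property id + value offset
const OLEPS_MAX_PROPS: usize = 64; // assumed cap, guards against hostile counts
const VT_LPSTR: u16 = 0x001E;
const PIDSI_TITLE: u32 = 2;
const PIDSI_AUTHOR: u32 = 4;

#[derive(Debug, Default, PartialEq)]
struct DocumentInfo {
    title: Option<String>,
    author: Option<String>,
}

fn read_u16_le(d: &[u8], at: usize) -> Option<u16> {
    let b = d.get(at..at.checked_add(2)?)?;
    Some(u16::from_le_bytes([b[0], b[1]]))
}

fn read_u32_le(d: &[u8], at: usize) -> Option<u32> {
    let b = d.get(at..at.checked_add(4)?)?;
    Some(u32::from_le_bytes([b[0], b[1], b[2], b[3]]))
}

/// Returns None when the stream is too short or malformed.
fn parse_summary_info_sketch(data: &[u8], info: &mut DocumentInfo) -> Option<()> {
    if data.len() < OLEPS_HEADER_MIN_LEN || read_u16_le(data, 0)? != OLEPS_BYTE_ORDER_LE {
        return None;
    }
    let section = read_u32_le(data, OLEPS_SECTION_OFFSET_POS)? as usize;
    let count = (read_u32_le(data, section.checked_add(4)?)? as usize).min(OLEPS_MAX_PROPS);
    for i in 0..count {
        let entry = section.checked_add(OLEPS_SECTION_HEADER_SIZE + i * OLEPS_PROP_ENTRY_SIZE)?;
        let prop_id = read_u32_le(data, entry)?;
        let value = section.checked_add(read_u32_le(data, entry + 4)? as usize)?;
        // The type tag is 2 bytes followed by 2 bytes of padding.
        if read_u16_le(data, value)? != VT_LPSTR {
            continue;
        }
        let len = read_u32_le(data, value + 4)? as usize;
        let start = value + 8;
        let bytes = data.get(start..start.checked_add(len)?)?;
        // Real streams carry a code page; lossy UTF-8 is a simplification.
        let text = String::from_utf8_lossy(bytes).trim_end_matches('\0').to_string();
        match prop_id {
            PIDSI_TITLE => info.title = Some(text),
            PIDSI_AUTHOR => info.author = Some(text),
            _ => {}
        }
    }
    Some(())
}
```

Every read goes through a bounds-checked helper, so a truncated or hostile stream yields `None` instead of a panic.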

@chatgpt-codex-connector chatgpt-codex-connector bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 87793a7a7f

Comment on lines +127 to +130
let result = if is_doc {
    probe::probe_document_file(&file).await
} else {
    probe::probe_file(&file, timeout_ms).await

P1: Enforce timeout for native document probes

Document files bypass the timeout_ms path and call probe_document_file directly, so native probes can run indefinitely while the rest of scanning waits. This is observable with document extensions on blocking special files (for example, a .txt FIFO): unlike probe_file, there is no tokio::time::timeout guard, so one file can stall the entire scan. Please apply timeout handling (or regular-file checks) to document probing as well.

Owner Author

Fixed in 408746e. Wrapped probe_document_file in tokio::time::timeout(Duration::from_millis(timeout_ms), ...) at the call site in scan.rs, matching the timeout behavior of probe_file. On timeout, logs at debug level and returns an error.
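The fix itself uses `tokio::time::timeout`. To keep the illustration free of external crates, the sketch below swaps in a different technique with the same intent: a worker thread plus `mpsc::Receiver::recv_timeout` from the standard library. The function name `probe_with_timeout` and the error strings are hypothetical.

```rust
use std::sync::mpsc;
use std::thread;
use std::time::Duration;

/// Run `probe` on a worker thread and stop waiting after `timeout_ms`.
/// The worker is not cancelled: it keeps running until `probe` returns,
/// only the caller moves on.
fn probe_with_timeout<T, F>(probe: F, timeout_ms: u64) -> Result<T, String>
where
    T: Send + 'static,
    F: FnOnce() -> T + Send + 'static,
{
    let (tx, rx) = mpsc::channel();
    thread::spawn(move || {
        // The receiver may already be gone if the caller timed out.
        let _ = tx.send(probe());
    });
    rx.recv_timeout(Duration::from_millis(timeout_ms))
        .map_err(|e| match e {
            mpsc::RecvTimeoutError::Timeout => {
                format!("document probe timed out after {timeout_ms} ms")
            }
            mpsc::RecvTimeoutError::Disconnected => {
                "document probe ended without a result".to_string()
            }
        })
}
```

Either way, the point is the same as in the review: one slow or blocking file should cost at most `timeout_ms`, not the whole scan.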

- Cap zip XML reads at 4 MiB via `take(MAX_XML_BYTES)` in `read_xml_from_zip`
- Cap text/CSV scanning at 256 MiB via `take(MAX_TEXT_SCAN_BYTES)`
- Replace `BufReader::lines()` with byte-level counting in `probe_text_table`
- Replace `BufReader::lines()` with capped `read_line` in `probe_text`

Addresses security review comments about unbounded reads.
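A minimal sketch of the capping idea, using only the standard library. `MAX_XML_BYTES` and `MAX_TEXT_SCAN_BYTES` are named in the commit message with the sizes shown; the helper functions `read_capped` and `count_lines_capped` are hypothetical and stand in for whatever `read_xml_from_zip` and `probe_text_table` actually do.

```rust
use std::io::{self, BufRead, BufReader, Read};

const MAX_XML_BYTES: u64 = 4 * 1024 * 1024; // 4 MiB
const MAX_TEXT_SCAN_BYTES: u64 = 256 * 1024 * 1024; // 256 MiB

/// Read at most `cap` bytes, however large the source claims to be.
fn read_capped<R: Read>(reader: R, cap: u64) -> io::Result<Vec<u8>> {
    let mut buf = Vec::new();
    reader.take(cap).read_to_end(&mut buf)?;
    Ok(buf)
}

/// Count lines by scanning buffered chunks for b'\n', so a file made of one
/// enormous line never has to be held in memory as a single String.
fn count_lines_capped<R: Read>(reader: R, cap: u64) -> io::Result<u64> {
    let mut reader = BufReader::new(reader.take(cap));
    let mut lines = 0u64;
    let mut last_byte = None;
    loop {
        let chunk = reader.fill_buf()?;
        if chunk.is_empty() {
            break;
        }
        lines += chunk.iter().filter(|&&b| b == b'\n').count() as u64;
        last_byte = chunk.last().copied();
        let consumed = chunk.len();
        reader.consume(consumed);
    }
    // A final line without a trailing newline still counts.
    if matches!(last_byte, Some(b) if b != b'\n') {
        lines += 1;
    }
    Ok(lines)
}
```

`BufReader::lines()` allocates a `String` per line with no upper bound, which is why a byte-level scan behind `take` is the safer shape for untrusted input.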

Document probes had no timeout, unlike ffprobe-based media probes.
A stuck document read (e.g., network-mounted PDF) would block
the scan task indefinitely.

Replace `DirEntry::file_type().is_dir()` with `Path::is_dir()` in both
`list_subdirs` and `list_sibling_dirs`. The former doesn't follow
symlinks, causing symlinked directories to disappear from the TUI.
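The difference between the two checks can be sketched like this. The real `list_subdirs` returns `Vec<DirItem>`; the simplified `list_subdirs_sketch` below returns plain paths and is only meant to show the symlink-following check.

```rust
use std::fs;
use std::io;
use std::path::{Path, PathBuf};

/// List subdirectories of `dir`, counting symlinks that point at directories.
fn list_subdirs_sketch(dir: &Path) -> io::Result<Vec<PathBuf>> {
    let mut out = Vec::new();
    for entry in fs::read_dir(dir)? {
        let path = entry?.path();
        // Path::is_dir() stats the target, so it follows symlinks.
        // DirEntry::file_type() describes the link itself and would report
        // a symlinked directory as "not a directory".
        if path.is_dir() {
            out.push(path);
        }
    }
    out.sort();
    Ok(out)
}
```

The trade-off is one extra `stat` per entry, and a broken symlink simply reports `false` and is skipped.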

Deduplicates the identical XML event-loop boilerplate shared by
parse_core_xml and parse_app_xml. Each is now a thin wrapper that
passes a closure for tag-to-field dispatch.

Adds OLEPS_*, VT_*, and PIDSI_* constants for the MS-OLEPS binary
format values used in parse_summary_info. Makes the binary format
parsing self-documenting.

Reduces the four kind-filter match arms (keys 2-5) to one-liners by
extracting the shared toggle-apply-check-status pattern into a method
on App.

@thepushkarp thepushkarp merged commit ae6b8d4 into main Mar 4, 2026
3 checks passed