v0.0.2: document support, kind filter, dir sorting#2
Conversation
Add MediaKind::Document as a new file category that bypasses ffprobe entirely, using pure Rust crates for metadata extraction. Supported formats: PDF, DOCX/DOC, XLSX/XLS, PPTX/PPT, ODT/ODS/ODP, CSV, TSV, TXT, MD. - New document.rs module with format-specific extractors (lopdf, zip, quick-xml, cfb) - DocumentInfo struct with page_count, word_count, line_count, sheet_count, author, title, and more - Filter support: media.doc.* field paths + pages/author aliases - Sort support: pages/page_count sort key - TUI: KindFilter::Document (key 5), "D" icon, document metadata panel - Bump package version to 0.0.2, schema version to 0.2.0
Replace exclusive KindFilter enum with a struct of per-kind booleans. Keys 2-5 now toggle individual kinds on/off; key 1 resets to show all. Footer shows checkbox indicators for each kind's active state.
Save current_dir before navigating to parent, then restore cursor to the child directory in the new dir_items list. Also fix apply_filter clamping to use visible_count() (dirs + media) instead of filtered_indices.len() (media only).
Folders now respect the user's sort key (name, size, modified) instead of always being alphabetical. Media-only keys (duration, resolution, codec, bitrate, pages) fall back to name sort for directories. Introduce DirItem struct to carry cached metadata (name, size, modified_at) alongside directory paths, avoiding re-stat on every sort. Decouple list_sibling_dirs from list_subdirs so the parent pane stays alphabetical while the current pane sorts dynamically.
|
@codex review |
Summary of ChangesHello, I'm Gemini Code Assist1! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed! This pull request significantly expands the Highlights
Changelog
Using Gemini Code AssistThe full guide for Gemini Code Assist can be found on our documentation page, here are some quick tips. Invoking Gemini You can request assistance from Gemini at any point by creating a comment using either
Customization To customize Gemini Code Assist for GitHub experience, repository maintainers can create a configuration file and/or provide a custom code review style guide (such as PEP-8 for Python) by creating and adding files to a Limitations & Feedback Gemini Code Assist may make mistakes. Please leave feedback on any instances where its feedback is incorrect or counter productive. You can react with 👍 and 👎 on @gemini-code-assist comments. If you're interested in giving your feedback about your experience with Gemini Code Assist for Github and other Google products, sign up here. You can also get AI-powered code generation, chat, as well as code reviews directly in the IDE at no cost with the Gemini Code Assist IDE Extension. Footnotes
|
There was a problem hiding this comment.
Code Review
This pull request introduces significant feature updates, including support for document metadata extraction via the new document.rs module, a more flexible multi-select kind filter in the TUI, and improved directory sorting. A critical security concern is the potential for Denial of Service (OOM) due to unbounded resource consumption when processing malicious or extremely large ZIP entries and text files without size limits. Additionally, the code review identified areas with code duplication (now explicitly linked to a rule) and magic numbers that could be refactored for better maintainability.
| /// The stream uses MS-OLEPS binary format with property sets. | ||
| /// We extract string properties by well-known IDs | ||
| /// (2=Title, 4=Author, 5=Subject, 18=`AppName`). | ||
| fn parse_summary_info(data: &[u8], info: &mut DocumentInfo) { |
There was a problem hiding this comment.
The parse_summary_info function uses several magic numbers for offsets, signatures, and property IDs when parsing the OLE2 SummaryInformation stream. This makes the code hard to read and maintain. These values should be defined as constants with descriptive names to improve clarity and make it easier to understand the structure of the binary format being parsed.
For example:
const OLEPS_SIGNATURE: u16 = 0xFFFE;
const SECTION_OFFSET_POS: usize = 44;
// Property IDs
const PID_TITLE: u32 = 2;
const PID_AUTHOR: u32 = 4;
// ... etc
// In the function
if data.len() < 48 || read_u16_le(data, 0) != OLEPS_SIGNATURE {
return;
}
// ...
match prop_id {
PID_TITLE => info.title = Some(s),
PID_AUTHOR => info.author = Some(s),
// ... etc
}There was a problem hiding this comment.
Fixed in 6fad9c5. Added named constants: OLEPS_BYTE_ORDER_LE, OLEPS_HEADER_MIN_LEN, OLEPS_SECTION_OFFSET_POS, OLEPS_MAX_PROPS, OLEPS_SECTION_HEADER_SIZE, OLEPS_PROP_ENTRY_SIZE, VT_I4, VT_LPSTR, PIDSI_TITLE, PIDSI_AUTHOR, PIDSI_SUBJECT, PIDSI_PAGECOUNT, PIDSI_WORDCOUNT, PIDSI_APPNAME. All inline literals in parse_summary_info replaced.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 87793a7a7f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| let result = if is_doc { | ||
| probe::probe_document_file(&file).await | ||
| } else { | ||
| probe::probe_file(&file, timeout_ms).await |
There was a problem hiding this comment.
Enforce timeout for native document probes
Document files bypass the timeout_ms path and call probe_document_file directly, so native probes can run indefinitely while the rest of scanning waits. This is observable with document extensions on blocking special files (for example, a .txt FIFO): unlike probe_file, there is no tokio::time::timeout guard, so one file can stall the entire scan. Please apply timeout handling (or regular-file checks) to document probing as well.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
Fixed in 408746e. Wrapped probe_document_file in tokio::time::timeout(Duration::from_millis(timeout_ms), ...) at the call site in scan.rs, matching the timeout behavior of probe_file. On timeout, logs at debug level and returns an error.
- Cap zip XML reads at 4 MiB via `take(MAX_XML_BYTES)` in `read_xml_from_zip` - Cap text/CSV scanning at 256 MiB via `take(MAX_TEXT_SCAN_BYTES)` - Replace `BufReader::lines()` with byte-level counting in `probe_text_table` - Replace `BufReader::lines()` with capped `read_line` in `probe_text` Addresses security review comments about unbounded reads.
Document probes had no timeout, unlike ffprobe-based media probes. A stuck document read (e.g., network-mounted PDF) would block the scan task indefinitely.
Replace `DirEntry::file_type().is_dir()` with `Path::is_dir()` in both `list_subdirs` and `list_sibling_dirs`. The former doesn't follow symlinks, causing symlinked directories to disappear from the TUI.
Deduplicates the identical XML event-loop boilerplate shared by parse_core_xml and parse_app_xml. Each is now a thin wrapper that passes a closure for tag-to-field dispatch.
Adds OLEPS_*, VT_*, and PIDSI_* constants for the MS-OLEPS binary format values used in parse_summary_info. Makes the binary format parsing self-documenting.
Reduces the four kind-filter match arms (keys 2-5) to one-liners by extracting the shared toggle-apply-check-status pattern into a method on App.
Summary
document.rsmodule. No external tools required. NewMediaKind::Documentvariant,DocumentInfotype, filter paths (media.doc.*,pages/authoraliases), kind filter key5, sort by pages.2-5toggle individual kinds (video/audio/image/doc) instead of exclusive select.1resets to show all.h/left to go to parent dir now lands the cursor on the directory you came from.Test plan
cargo clippy --all-targets --all-features -- -D warnings— zero warningscargo test— 300 unit + 19 integration tests passcargo fmt --check— cleancargo run -- tui ~/some-dir— verify document files show with "D" icon, kind filter toggles work, sort applies to folders, navigate-back restores selection