feat(hfh):SP-4181 implement raw output format for folder hashing#200
feat(hfh):SP-4181 implement raw output format for folder hashing#200agustingroh wants to merge 1 commit intomainfrom
Conversation
SCANOSS SCAN Completed 🚀
View more details on SCANOSS Action Summary |
ed33fe3 to
e1c914c
Compare
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdded a Changes
Sequence DiagramsequenceDiagram
actor User
participant CLI as CLI Handler
participant HFH as HFH Scanner
participant Presenter as HFH Presenter
participant FS as File System
User->>CLI: folder-scan --format raw
CLI->>HFH: run scan (raw)
HFH->>HFH: perform HFH scan -> results
HFH->>Presenter: _format_raw_output(results, scan_root)
Presenter->>Presenter: _extract_best_components(results)
loop per path_id (deepest-first)
Presenter->>FS: list files under scan_root matching path_id
FS-->>Presenter: file list
loop per matched file
Presenter->>FS: read file bytes
FS-->>Presenter: file bytes
Presenter->>Presenter: _compute_file_md5(bytes)
Presenter->>Presenter: _build_snippet_entry(file, component, version, md5)
end
end
Presenter-->>HFH: aggregated JSON (relpath -> entries)
HFH-->>User: output raw JSON
Estimated code review effort🎯 4 (Complex) | ⏱️ ~45 minutes Possibly related PRs
Suggested labels
Suggested reviewers
Poem
🚥 Pre-merge checks | ✅ 3✅ Passed checks (3 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
SCANOSS SCAN Completed 🚀
View more details on SCANOSS Action Summary |
There was a problem hiding this comment.
🧹 Nitpick comments (1)
src/scanoss/scanners/scanner_hfh.py (1)
347-364: Consider addingusedforsecurity=Falsefor FIPS-compliant environments.On FIPS-enabled systems (Python 3.9+),
hashlib.md5()withoutusedforsecurity=Falsecan raise aValueError. Since this hash is used for file identification (not cryptographic security), explicitly marking it as such improves compatibility.♻️ Suggested change
try: - return hashlib.md5(file_path.read_bytes()).hexdigest() + return hashlib.md5(file_path.read_bytes(), usedforsecurity=False).hexdigest() except (OSError, IOError): return ''🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scanoss/scanners/scanner_hfh.py` around lines 347 - 364, Change _compute_file_md5 to call hashlib.md5 with usedforsecurity=False when constructing the hasher and ensure ValueError is handled: replace the current hashlib.md5(file_path.read_bytes()).hexdigest() call with hashlib.md5(file_path.read_bytes(), usedforsecurity=False).hexdigest() and include ValueError in the except clause alongside OSError/IOError so that FIPS-enabled Python 3.9+ environments won't raise uncaught exceptions and the function still returns '' on failure.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/scanoss/scanners/scanner_hfh.py`:
- Around line 347-364: Change _compute_file_md5 to call hashlib.md5 with
usedforsecurity=False when constructing the hasher and ensure ValueError is
handled: replace the current hashlib.md5(file_path.read_bytes()).hexdigest()
call with hashlib.md5(file_path.read_bytes(), usedforsecurity=False).hexdigest()
and include ValueError in the except clause alongside OSError/IOError so that
FIPS-enabled Python 3.9+ environments won't raise uncaught exceptions and the
function still returns '' on failure.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: bf2e7adb-2d6d-456c-9d53-a5baf55ddc80
📒 Files selected for processing (2)
src/scanoss/cli.pysrc/scanoss/scanners/scanner_hfh.py
e1c914c to
23636aa
Compare
SCANOSS SCAN Completed 🚀
View more details on SCANOSS Action Summary |
23636aa to
159dda6
Compare
SCANOSS SCAN Completed 🚀
View more details on SCANOSS Action Summary |
There was a problem hiding this comment.
Actionable comments posted: 2
🧹 Nitpick comments (1)
src/scanoss/scanners/scanner_hfh.py (1)
359-376: Optional: Consider chunked reading for very large filesThe current implementation matches
winnowing.py's approach of computing MD5 on full file content at once. If large files are a concern, chunked reading is worth considering for memory efficiency:♻️ Optional: chunked MD5 computation
`@staticmethod` def _compute_file_md5(file_path: Path) -> str: try: - return hashlib.md5(file_path.read_bytes()).hexdigest() + md5_hash = hashlib.md5() + with open(file_path, 'rb') as f: + for chunk in iter(lambda: f.read(8192), b''): + md5_hash.update(chunk) + return md5_hash.hexdigest() except (OSError, IOError): return ''🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scanoss/scanners/scanner_hfh.py` around lines 359 - 376, The _compute_file_md5 function currently reads the entire file into memory with file_path.read_bytes(), which can OOM on very large files; change it to perform chunked reads using file_path.open('rb') and iteratively update a hashlib.md5() object (e.g., read in 64KB or 1MB chunks and call md5.update(chunk)) so behavior matches winnowing.py but is memory-efficient, and keep the existing exception handling that returns '' on OSError/IOError; reference _compute_file_md5 and the winnowing.py MD5 approach when applying the change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scanoss/scanners/scanner_hfh.py`:
- Around line 339-357: The matching logic in _file_matches_path_id uses os.sep
and can fail when path_id uses forward slashes from the API; update the function
to normalize and compare using pathlib.PurePath: create PurePath objects for
file_path and path_id (e.g., file_p = PurePath(file_path), path_p =
PurePath(path_id)) and return True if path_id is '.' or file_p == path_p or
path_p in file_p.parents, ensuring cross-platform correctness without relying on
os.sep.
- Around line 285-289: The sorting and matching use os.sep so paths from HFH
(which use forward slashes) are mis-counted and don't match; update the sorting
of sorted_path_ids (the sorted(..., key=lambda p: ...)) to compute depth by
counting both '/' and os.sep or normalize path_id to use '/' consistently, and
also change the file-matching logic that checks path_id + os.sep (and references
all_files) to use the same normalized separator (e.g., '/' ) or normalize
all_files entries before comparison so path_components keys and file paths use
the same separator format; ensure you update both the lambda key used for
sorted_path_ids and the matching condition that builds path prefix checks so
they are consistent cross-platform.
---
Nitpick comments:
In `@src/scanoss/scanners/scanner_hfh.py`:
- Around line 359-376: The _compute_file_md5 function currently reads the entire
file into memory with file_path.read_bytes(), which can OOM on very large files;
change it to perform chunked reads using file_path.open('rb') and iteratively
update a hashlib.md5() object (e.g., read in 64KB or 1MB chunks and call
md5.update(chunk)) so behavior matches winnowing.py but is memory-efficient, and
keep the existing exception handling that returns '' on OSError/IOError;
reference _compute_file_md5 and the winnowing.py MD5 approach when applying the
change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: f000dc68-5256-40b8-aeb8-9ea7063086f4
📒 Files selected for processing (3)
CHANGELOG.mdsrc/scanoss/cli.pysrc/scanoss/scanners/scanner_hfh.py
✅ Files skipped from review due to trivial changes (2)
- CHANGELOG.md
- src/scanoss/cli.py
159dda6 to
3ef207e
Compare
SCANOSS SCAN Completed 🚀
View more details on SCANOSS Action Summary |
There was a problem hiding this comment.
Actionable comments posted: 2
♻️ Duplicate comments (2)
src/scanoss/scanners/scanner_hfh.py (2)
369-372:⚠️ Potential issue | 🟠 MajorSame cross-platform path separator issue in matching logic.
Line 372 uses
os.sepfor the prefix check, which fails whenpath_iduses forward slashes (from API) butfile_pathuses backslashes (Windows filesystem).🐛 Proposed fix using pathlib
`@staticmethod` def _file_matches_path_id(file_path: str, path_id: str) -> bool: - """ - Check if a file path belongs under a given path_id directory. - - Both file_path and path_id are relative to the scan root directory. - A path_id of '.' matches all files (root directory). - - Args: - file_path (str): Relative file path from the scan root. - path_id (str): Relative directory path from the HFH result. - - Returns: - bool: True if the file is under the given path_id directory. - """ + """Check if a file path belongs under a given path_id directory.""" if path_id == '.': return True - # file_path and path_id are both relative to scan_dir - return file_path == path_id or file_path.startswith(path_id + os.sep) + # Normalize to forward slashes for consistent comparison + norm_file = file_path.replace(os.sep, '/') + norm_path = path_id.replace(os.sep, '/') + return norm_file == norm_path or norm_file.startswith(norm_path + '/')🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scanoss/scanners/scanner_hfh.py` around lines 369 - 372, The matching uses os.sep which breaks on Windows when path_id comes from the API with forward slashes; to fix, normalize both file_path and path_id to path objects and compare by path parts instead of string prefixes: convert file_path and path_id to pathlib.Path (or PurePath) relative to scan_dir and then return True if path_parts == id_parts or path_parts startswith id_parts (i.e. file_parts[:len(id_parts)] == id_parts); keep the existing special-case for path_id == '.' and update the code around the file_path/path_id comparison to use these Path-based comparisons.
300-304:⚠️ Potential issue | 🟠 MajorFix cross-platform path separator mismatch in sorting.
The HFH API returns
path_idwith forward slashes (e.g.,'project-1.0/src/lib'), but the sorting usesos.sepwhich is\\on Windows. This causes incorrect depth calculation on Windows, breaking the "deepest first" sort order.🐛 Proposed fix
sorted_path_ids = sorted( path_components.keys(), - key=lambda p: (-1, 0) if p == '.' else (p.count(os.sep), len(p)), + key=lambda p: (-1, 0) if p == '.' else (p.count('/'), len(p)), reverse=True, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scanoss/scanners/scanner_hfh.py` around lines 300 - 304, The sorting of path_components into sorted_path_ids uses os.sep to count path depth, which fails on Windows because HFH path_id uses forward slashes; update the key in the sorted call (the lambda used to build the sort key for sorted_path_ids) to compute depth using the HFH path separator '/' (or use posixpath functions) instead of os.sep, keeping the special-case for '.' and preserving the "deepest first" ordering and tie-breaker on length.
🧹 Nitpick comments (1)
src/scanoss/scanners/scanner_hfh.py (1)
170-181: Remove unusedconfigparameter or store it if needed.The
configparameter is accepted but never assigned toself. The docstring claims it's used for "API base URL for constructing file_url", but line 321 actually usesself.scanner.client.orig_urlinstead. Either remove the unused parameter or store it if it's intended for future use.🧹 Proposed fix to remove unused parameter
- def __init__(self, scanner: ScannerHFH, config: ScannerConfig = None, **kwargs): + def __init__(self, scanner: ScannerHFH, **kwargs): """ Initialize the presenter. Args: scanner (ScannerHFH): The HFH scanner instance containing scan results and file filters. - config (ScannerConfig, optional): Scanner configuration, used to access the API base URL - for constructing file_url in raw output format. **kwargs: Additional arguments passed to AbstractPresenter (debug, trace, quiet, etc.). """ super().__init__(**kwargs) self.scanner = scannerAnd update the instantiation at line 90-96:
self.presenter = ScannerHFHPresenter( self, debug=config.debug, trace=config.trace, quiet=config.quiet, - config=config, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scanoss/scanners/scanner_hfh.py` around lines 170 - 181, The __init__ of the presenter accepts a config param but never stores or uses it; either remove the unused parameter from the signature (and update all instantiation sites of the presenter) or persist it (self.config = config) and replace uses of self.scanner.client.orig_url when constructing file_url with the intended config value (e.g., self.config.api_base_url) so the docstring is accurate; update the constructor in the class __init__ and any call sites that pass config accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@src/scanoss/scanners/scanner_hfh.py`:
- Around line 393-436: The _build_snippet_entry function currently has an overly
long line; an unused variable score; and hardcodes 'matched' to '100%'.
Shorten/reflow any lines >120 chars (split long docstring lines or wrap long
dict value expressions) to satisfy the line-length rule, and use the existing
score variable to populate the matched field (e.g., set matched to a string
derived from score like f"{score}%" or otherwise format score appropriately) so
score is no longer unused; ensure you reference _build_snippet_entry and the
variables score and matched when making the change.
- Around line 318-322: Guard against a None scanner client and wrap long line:
when constructing entry in the scanner_hfh logic (around claimed_files,
self._compute_file_md5 and self._build_snippet_entry), replace the direct access
to self.scanner.client.orig_url with a safe expression that uses a fallback
(e.g., (self.scanner.client.orig_url if self.scanner.client else '')) and break
the long statement into multiple shorter lines so it stays under 120 characters;
ensure you still pass the derived file_hash from _compute_file_md5 and the
component/best_version to _build_snippet_entry.
---
Duplicate comments:
In `@src/scanoss/scanners/scanner_hfh.py`:
- Around line 369-372: The matching uses os.sep which breaks on Windows when
path_id comes from the API with forward slashes; to fix, normalize both
file_path and path_id to path objects and compare by path parts instead of
string prefixes: convert file_path and path_id to pathlib.Path (or PurePath)
relative to scan_dir and then return True if path_parts == id_parts or
path_parts startswith id_parts (i.e. file_parts[:len(id_parts)] == id_parts);
keep the existing special-case for path_id == '.' and update the code around the
file_path/path_id comparison to use these Path-based comparisons.
- Around line 300-304: The sorting of path_components into sorted_path_ids uses
os.sep to count path depth, which fails on Windows because HFH path_id uses
forward slashes; update the key in the sorted call (the lambda used to build the
sort key for sorted_path_ids) to compute depth using the HFH path separator '/'
(or use posixpath functions) instead of os.sep, keeping the special-case for '.'
and preserving the "deepest first" ordering and tie-breaker on length.
---
Nitpick comments:
In `@src/scanoss/scanners/scanner_hfh.py`:
- Around line 170-181: The __init__ of the presenter accepts a config param but
never stores or uses it; either remove the unused parameter from the signature
(and update all instantiation sites of the presenter) or persist it (self.config
= config) and replace uses of self.scanner.client.orig_url when constructing
file_url with the intended config value (e.g., self.config.api_base_url) so the
docstring is accurate; update the constructor in the class __init__ and any call
sites that pass config accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: 325cc9e5-6356-4ce0-92ec-baf2f33a0212
📒 Files selected for processing (3)
CHANGELOG.mdsrc/scanoss/cli.pysrc/scanoss/scanners/scanner_hfh.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
3ef207e to
46158c4
Compare
SCANOSS SCAN Completed 🚀
View more details on SCANOSS Action Summary |
There was a problem hiding this comment.
🧹 Nitpick comments (2)
src/scanoss/scanners/scanner_hfh.py (2)
394-438: Consider using the actual HFH score for thematchedfield.The
matchedfield is hardcoded to'100%'(line 421), butbest_versioncontains the actual HFH score that would provide more accurate match information to downstream consumers.♻️ Proposed enhancement
def _build_snippet_entry( component: Dict, best_version: Dict, file_path: str, file_hash: str, base_url: str, ) -> Dict: ... purl = component.get('purl', '') + score = best_version.get('score', 1.0) version = best_version.get('version', '') url = purl2url.get_repo_url(purl) if purl else '' return { 'id': 'file', - 'matched': '100%', + 'matched': f'{int(score * 100)}%', 'purl': [purl], ... }🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scanoss/scanners/scanner_hfh.py` around lines 394 - 438, In _build_snippet_entry, replace the hardcoded 'matched': '100%' value with the actual HFH score from best_version (e.g., best_version.get('score')) formatted as a percent string (e.g., f"{score}%" or similar), falling back to '100%' when score is missing or not numeric; ensure you reference the best_version dict and coerce the score to a string with a trailing '%' before returning the dict so downstream consumers receive the real HFH match percentage.
170-181: Remove or use the unusedconfigparameter.The
configparameter is accepted but never stored or used. Line 181 only storesself.scanner, and_format_raw_output()at line 321 accessesself.scanner.client.orig_urlrather than anything fromconfig.Either store
configas an instance attribute if needed for future use, or remove it to avoid confusion:♻️ Option 1: Remove unused parameter
- def __init__(self, scanner: ScannerHFH, config: ScannerConfig = None, **kwargs): + def __init__(self, scanner: ScannerHFH, **kwargs): """ Initialize the presenter. Args: scanner (ScannerHFH): The HFH scanner instance containing scan results and file filters. - config (ScannerConfig, optional): Scanner configuration, used to access the API base URL - for constructing file_url in raw output format. **kwargs: Additional arguments passed to AbstractPresenter (debug, trace, quiet, etc.). """ super().__init__(**kwargs) self.scanner = scannerAnd update the caller at line 95:
self.presenter = ScannerHFHPresenter( self, debug=config.debug, trace=config.trace, quiet=config.quiet, - config=config, )🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@src/scanoss/scanners/scanner_hfh.py` around lines 170 - 181, The __init__ signature for the presenter currently accepts a config: ScannerConfig parameter that is never used; either remove the unused parameter from __init__ (and update callers that pass config) or persist it as an instance attribute (self.config = config) so future methods can rely on it; if choosing to keep it, also update any references that should use config (e.g. in _format_raw_output) instead of accessing self.scanner.client.orig_url, and ensure the ScannerHFH constructor/call sites are updated accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Nitpick comments:
In `@src/scanoss/scanners/scanner_hfh.py`:
- Around line 394-438: In _build_snippet_entry, replace the hardcoded 'matched':
'100%' value with the actual HFH score from best_version (e.g.,
best_version.get('score')) formatted as a percent string (e.g., f"{score}%" or
similar), falling back to '100%' when score is missing or not numeric; ensure
you reference the best_version dict and coerce the score to a string with a
trailing '%' before returning the dict so downstream consumers receive the real
HFH match percentage.
- Around line 170-181: The __init__ signature for the presenter currently
accepts a config: ScannerConfig parameter that is never used; either remove the
unused parameter from __init__ (and update callers that pass config) or persist
it as an instance attribute (self.config = config) so future methods can rely on
it; if choosing to keep it, also update any references that should use config
(e.g. in _format_raw_output) instead of accessing self.scanner.client.orig_url,
and ensure the ScannerHFH constructor/call sites are updated accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: defaults
Review profile: CHILL
Plan: Pro
Run ID: d146b545-8398-42e3-beb6-0f390ff5249b
📒 Files selected for processing (3)
CHANGELOG.mdsrc/scanoss/cli.pysrc/scanoss/scanners/scanner_hfh.py
✅ Files skipped from review due to trivial changes (1)
- CHANGELOG.md
Summary by CodeRabbit
New Features
rawoutput format for the folder-scan command (in addition tojsonandcyclonedx).rawemits per-file snippet-compatible JSON, expands directory-level matches into file-level entries, picks the most specific match per file, and includes a computed MD5 file hash. Default output remainsjson.Documentation
rawformat and its per-file output.