Releases: SemClone/src2purl

1.3.4

28 Oct 06:13
31d3075

Fixed

1.3.3

28 Oct 05:34
e34dadf

[1.3.3] - 2025-10-27

Fixed

  • Updated oslili package dependency: Changed from semantic-copycat-oslili to osslili to reflect upstream
    package rebranding
  • Updated oslili import statements: Changed import from semantic_copycat_oslili to osslili in integration
    module
  • Maintained backward compatibility: License enhancement functionality preserved with graceful fallback
    handling

Technical

  • All existing functionality remains unchanged
  • Integration properly handles ImportError when osslili package not available
  • No breaking changes to API or core package identification workflow
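
The graceful fallback described above can be sketched as an optional import guard. This is a minimal illustration, not the project's actual integration module: only the module name osslili comes from these notes, and the helper license_detection_available() is hypothetical.

```python
# Optional dependency guard: the tool keeps working when osslili is absent.
try:
    import osslili  # module name taken from the release notes above
except ImportError:
    osslili = None  # license enhancement is simply skipped


def license_detection_available() -> bool:
    """Hypothetical helper: report whether the optional backend was imported."""
    return osslili is not None


if __name__ == "__main__":
    if license_detection_available():
        print("osslili found: license enhancement enabled")
    else:
        print("osslili missing: continuing without license enhancement")
```

Callers can branch on the helper instead of wrapping every use of the package in its own try/except.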

1.3.2

28 Oct 05:22
06fbfff

Changelog for v1.3.2 - 2025-10-27

Changed (Project Rebranding)

  • Renamed project from semantic-copycat-src2id to src2purl: Better reflects the tool's focus on Package URL (PURL) generation
  • Updated CLI commands: Changed from src2id to src2purl and src2id -validate to src2purl -validate
  • Updated project metadata:
    • Package name: semantic-copycat-src2id → src2purl
    • Repository URLs updated to point to new src2purl repository
    • Project description updated to emphasize PURL generation capabilities
  • Updated documentation: README, CHANGELOG, and all references updated to reflect new branding
  • Updated CLI display: Version string now shows src2purl v{version} instead of src2id v{version}

Technical

  • All package imports and module structure remain unchanged (src2id.*)
  • Backward compatibility maintained for existing integrations
  • No breaking changes to API or functionality

Additional Enhancements

  • Added UPMEX as a hard dependency: Universal Package Metadata Extractor now guaranteed to be available
  • Enhanced manifest parsing integration: UPMEX is now properly integrated into the Phase 2 identification workflow
  • Updated README documentation: Reflects the actual 2-phase discovery process (Hash-based + Manifest-based)
  • Improved package identification: UPMEX provides authoritative manifest data that validates and enhances hash-based findings

1.3.1

20 Oct 06:55
e1cd019

Release v1.3.1

Added

Hybrid Discovery Strategy

  • Primary: Hash-based discovery (SWHIDs + Software Heritage) for comprehensive source inventory
  • Secondary: Manifest parsing (UPMEX integration) to validate findings and add missing packages
  • Tertiary: GitHub API and SCANOSS fingerprinting for additional coverage
  • Individual testing capabilities for each discovery method

Comprehensive Manifest Parsing

Direct manifest file analysis supporting 15+ ecosystems:

  • Python: setup.py, pyproject.toml, setup.cfg, requirements.txt, Pipfile
  • JavaScript/Node.js: package.json, yarn.lock, package-lock.json
  • Java/Maven/Gradle: pom.xml, build.gradle
  • Go: go.mod, go.sum
  • Rust: Cargo.toml, Cargo.lock
  • Ruby: Gemfile, gemspec files
  • PHP: composer.json, composer.lock
  • .NET: *.csproj, packages.config, *.nuspec

Enhanced Package Metadata Extraction

  • License detection with confidence scoring
  • Version extraction from multiple sources (tags, releases, manifest files)
  • PURL (Package URL) generation across ecosystems
  • Official organization detection and prioritization
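
As a rough illustration of PURL generation, the sketch below assembles a Package URL of the form pkg:type/namespace/name@version. The function make_purl() is hypothetical and not part of src2purl's API. It also omits the type-specific normalization rules, qualifiers, and subpaths defined by the Package URL specification.

```python
from urllib.parse import quote


def make_purl(pkg_type, name, version=None, namespace=None):
    """Hypothetical helper: build a basic PURL string (no qualifiers or subpath)."""
    parts = [pkg_type.lower()]
    if namespace:
        # Each namespace segment is percent-encoded separately.
        parts.extend(quote(segment, safe="") for segment in namespace.split("/"))
    parts.append(quote(name, safe=""))
    purl = "pkg:" + "/".join(parts)
    if version:
        purl += "@" + quote(version, safe="")
    return purl


if __name__ == "__main__":
    print(make_purl("pypi", "requests", "2.31.0"))        # pkg:pypi/requests@2.31.0
    print(make_purl("npm", "core", "17.0.0", "@angular"))  # pkg:npm/%40angular/core@17.0.0
    print(make_purl("generic", "darktable"))               # pkg:generic/darktable
```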

Security

Fixed URL Substring Sanitization Vulnerabilities

  • Replaced vulnerable substring-based URL validation with proper URL parsing
  • Enhanced hostname validation using urlparse() for accurate domain matching
  • Prevents URL validation bypass attacks (e.g., evil.com/github.com/fake)
  • Addresses CodeQL security alerts #1-#25 for incomplete URL substring sanitization
  • Applied across all modules: extractor.py, orchestrator.py, purl.py, providers.py
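
The difference between substring matching and hostname matching can be sketched as follows. Both function names are illustrative assumptions. Only the use of urlparse() and the bypass example come from the notes above.

```python
from urllib.parse import urlparse


def is_github_url_naive(url: str) -> bool:
    # Vulnerable pattern: the domain may appear anywhere in the string.
    return "github.com" in url


def is_github_url(url: str) -> bool:
    # Safer pattern: compare the parsed hostname, not the raw string.
    host = (urlparse(url).hostname or "").lower()
    return host == "github.com" or host.endswith(".github.com")


if __name__ == "__main__":
    bypass = "https://evil.com/github.com/fake"
    print(is_github_url_naive(bypass))  # True: the substring check is fooled
    print(is_github_url(bypass))        # False: the hostname is evil.com
```

Matching on the exact host or a dot-prefixed suffix also rejects look-alike hosts such as github.com.evil.com.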

Improved

Performance Optimization

  • Software Heritage made optional by default
  • Use --use-swh flag to enable Software Heritage integration
  • Prevents timeout issues on large codebases
  • Faster execution for most common use cases

Enhanced Documentation

  • 4-tier discovery strategy explanation
  • API key setup instructions for GitHub, SCANOSS, SerpAPI
  • Emphasis on "no API keys required" approach for basic functionality
  • Updated usage examples and configuration options

Fixed

Python Version Compatibility

  • Added fallback support for Python < 3.11
  • Graceful handling of tomllib import (Python 3.11+ only)
  • Fallback to tomli library for older Python versions
  • Maintains backward compatibility with Python 3.9+
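
The tomllib/tomli fallback usually follows the pattern below. This is a sketch under the assumption that tomli is installed on interpreters older than 3.11. The helper read_project_name() is hypothetical and only demonstrates that both modules expose the same loads() call.

```python
try:
    import tomllib  # standard library from Python 3.11 onward
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport with a compatible API


def read_project_name(toml_text: str):
    """Hypothetical helper: return [project].name from pyproject-style TOML."""
    data = tomllib.loads(toml_text)
    return data.get("project", {}).get("name")


if __name__ == "__main__":
    print(read_project_name('[project]\nname = "demo"\n'))  # demo
```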

UPMEX Integration Issues

  • Resolved import and dependency conflicts
  • Created a direct manifest parser to replace the archive-based UPMEX tool
  • Handles raw manifest files in source directories
  • Supports recursive manifest discovery with depth control
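
Recursive manifest discovery with depth control could look roughly like this. The function find_manifests(), its max_depth parameter, and the short MANIFEST_NAMES set are illustrative assumptions, not the project's actual parser.

```python
import os

# Illustrative subset of the manifest file names listed in these notes.
MANIFEST_NAMES = {
    "setup.py", "pyproject.toml", "package.json", "pom.xml",
    "go.mod", "Cargo.toml", "Gemfile", "composer.json",
}


def find_manifests(root, max_depth=3):
    """Return manifest paths at most max_depth directory levels below root."""
    root = os.path.abspath(root)
    base_depth = root.rstrip(os.sep).count(os.sep)
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        depth = dirpath.rstrip(os.sep).count(os.sep) - base_depth
        if depth >= max_depth:
            dirnames[:] = []  # prune in place so os.walk stops descending
        for name in sorted(filenames):
            if name in MANIFEST_NAMES:
                found.append(os.path.join(dirpath, name))
    return found


if __name__ == "__main__":
    for path in find_manifests(".", max_depth=2):
        print(path)
```

Pruning dirnames in place is what bounds the walk, so deep vendored trees are never visited at all.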

Changed

Discovery Method Prioritization

  • Implemented a proper hybrid approach
  • Hash-based discovery prioritized over manifest parsing
  • Manifest parsing is used for validation and enhancement of hash-based results
  • Software Heritage integration is now optional (disabled by default)
  • Intelligent result merging and deduplication
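
One way to picture the merging step: hash-based hits are kept as the primary record, and manifest hits either confirm an existing entry, fill in missing fields, or add a package that hashing missed. The record layout (dicts keyed by "purl") and the function merge_results() are assumptions made for this sketch.

```python
def merge_results(hash_hits, manifest_hits):
    """Merge two lists of package dicts, deduplicated by their 'purl' key."""
    merged = {}
    for hit in hash_hits:
        # Hash-based results take priority; the first occurrence wins.
        merged.setdefault(hit["purl"], {**hit, "sources": ["hash"]})
    for hit in manifest_hits:
        entry = merged.get(hit["purl"])
        if entry is None:
            # Package only seen in a manifest: add it.
            merged[hit["purl"]] = {**hit, "sources": ["manifest"]}
            continue
        if "manifest" not in entry["sources"]:
            entry["sources"].append("manifest")  # manifest confirms the hash hit
        for key, value in hit.items():
            if entry.get(key) is None:
                entry[key] = value  # fill gaps, never overwrite hash data
    return list(merged.values())


if __name__ == "__main__":
    hashes = [{"purl": "pkg:pypi/requests@2.31.0", "license": None}]
    manifests = [
        {"purl": "pkg:pypi/requests@2.31.0", "license": "Apache-2.0"},
        {"purl": "pkg:pypi/idna@3.6", "license": "BSD-3-Clause"},
    ]
    for row in merge_results(hashes, manifests):
        print(row)
```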

1.2.2

30 Aug 06:46
83bee26

Release Notes - v1.2.2

Code Quality and Maintenance Release

This maintenance release focuses on code cleanup and dependency updates for the oslili license detection integration.

Dependencies

Updated

  • semantic-copycat-oslili from >=1.3.2 to >=1.3.3
    • Improved license detection confidence scores
    • Enhanced compatibility and bug fixes
    • No implementation changes required

Code Quality Improvements

Cleanup

  • Removed unused Set import from typing module
  • Removed dead code: find_license_files() method (33 lines) that was never called
  • Extracted license deduplication logic to private _deduplicate_licenses() helper method
  • Removed extra blank lines for cleaner code formatting

Benefits

  • Reduced codebase by 31 lines while maintaining full functionality
  • Eliminated code duplication with centralized deduplication helper
  • Cleaner, more maintainable code following DRY principles
  • Better code organization and readability

Performance

Enhanced License Detection

  • License detection confidence improved to 94.9% on test data (up from ~88%)
  • More accurate license categorization (declared/detected/referenced)
  • Better copyright holder extraction and mapping

Testing

Verified Compatibility

  • All existing functionality preserved
  • License detection is working correctly across all test directories
  • Integration tests passing with improved confidence scores
  • No breaking changes or functionality regression

Technical Details

File Changes

  • pyproject.toml: Version bump and dependency update
  • src2id/integrations/oslili.py: Code cleanup and optimization

Backward Compatibility

  • Fully backward compatible with existing integrations
  • No API changes or breaking modifications
  • All existing method signatures are maintained

1.1.2

20 Aug 21:25

src2id v1.1.2 - First Public Release

What is src2id?

src2id (Source Code to ID) identifies package coordinates (name, version, license, PURL) from source code directories. It helps you understand "what is this code?" when dealing with unknown dependencies or analyzing open source components.

Key Features

  • Multiple identification strategies: Hash search, web search (GitHub/Google), SCANOSS fingerprinting, and optional Software Heritage archive
  • Subcomponent detection: Identifies multiple packages within monorepos and complex projects
  • License detection: Integrated with oslili for accurate license identification
  • Smart ordering: Optimized strategy order minimizes API calls (30x faster than single-strategy approaches)
  • Package URLs (PURLs): Generates standard package identifiers
  • Confidence scoring: Multi-factor scoring for match reliability
  • Persistent caching: 24-hour cache to avoid API rate limits
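
The 24-hour persistent cache can be pictured as a small JSON file with per-entry timestamps. This is only a sketch of the idea. The file layout and the cache_get() and cache_put() helpers are hypothetical and do not describe src2id's real cache.

```python
import json
import time

CACHE_TTL_SECONDS = 24 * 60 * 60  # the 24-hour lifetime mentioned above


def _load(path):
    try:
        with open(path, encoding="utf-8") as fh:
            return json.load(fh)
    except (FileNotFoundError, json.JSONDecodeError):
        return {}  # a missing or corrupt cache behaves like an empty one


def cache_put(path, key, value, now=None):
    """Store a JSON-serializable value together with its timestamp."""
    store = _load(path)
    store[key] = {"stored_at": time.time() if now is None else now, "value": value}
    with open(path, "w", encoding="utf-8") as fh:
        json.dump(store, fh)


def cache_get(path, key, now=None):
    """Return the cached value, or None if it is absent or older than the TTL."""
    now = time.time() if now is None else now
    entry = _load(path).get(key)
    if entry is None or now - entry["stored_at"] > CACHE_TTL_SECONDS:
        return None
    return entry["value"]
```

Passing now explicitly keeps the expiry logic easy to test without waiting a day.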

Installation

git clone https://github.com/oscarvalenzuelab/semantic-copycat-src2id.git
cd semantic-copycat-src2id
pip install -e .

Quick Start

Basic package identification

src2id /path/to/unknown/code

With subcomponent detection for monorepos

src2id /path/to/monorepo --detect-subcomponents

Include Software Heritage archive search

src2id /path/to/code --use-swh

JSON output for automation

src2id /path/to/code --output-format json

Example Output

src2id v1.1.2
Analyzing: test_data/darktable

Local Source Analysis
✓ Licenses detected: GPL-3.0, BSD-3-Clause, MIT and 4 more
Confidence: 94.1%

Name       Confidence  Method  PURL
darktable  0.80        fuzzy   pkg:generic/darktable

License

GNU Affero General Public License v3.0 (AGPL-3.0)

Acknowledgments

Built on top of excellent open source projects:

  • Software Heritage for archive access and SWHID generation
  • SCANOSS for code fingerprinting