Releases: SemClone/src2purl
1.3.4
Fixed
- Updated GitHub repository URLs: Changed project URLs in pyproject.toml from oscarvalenzuelab/src2purl to the official SemClone/src2purl repository
- Homepage URL updated to https://github.com/SemClone/src2purl
- Bug Tracker URL updated to https://github.com/SemClone/src2purl/issues
- Source Code URL updated to https://github.com/SemClone/src2purl
1.3.3
[1.3.3] - 2025-10-27
Fixed
- Updated oslili package dependency: Changed from semantic-copycat-oslili to osslili to reflect upstream package rebranding
- Updated oslili import statements: Changed import from semantic_copycat_oslili to osslili in integration module
- Maintained backward compatibility: License enhancement functionality preserved with graceful fallback handling
Technical
- All existing functionality remains unchanged
- Integration properly handles ImportError when the osslili package is not available
- No breaking changes to API or core package identification workflow
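The graceful fallback described above can be sketched as an optional import. This is a minimal illustration, not the project's actual integration code: the helper `load_optional` and the flag `LICENSE_ENHANCEMENT_AVAILABLE` are hypothetical names, and only the module name `osslili` comes from these notes.

```python
import importlib
from types import ModuleType
from typing import Optional


def load_optional(module_name: str) -> Optional[ModuleType]:
    """Return the imported module, or None when it is not installed."""
    try:
        return importlib.import_module(module_name)
    except ImportError:
        # Missing optional dependency: the caller decides how to degrade.
        return None


# "osslili" is the renamed dependency; the flag name below is hypothetical.
osslili = load_optional("osslili")
LICENSE_ENHANCEMENT_AVAILABLE = osslili is not None
```

Code that depends on license enhancement can check the flag and skip that step when the package is absent, leaving core identification unaffected.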
1.3.2
Changelog for v1.3.2 - 2025-10-27
Changed (Project Rebranding)
- Renamed project from semantic-copycat-src2id to src2purl: Better reflects the tool's focus on Package URL (PURL) generation
- Updated CLI commands: Changed from src2id to src2purl and src2id -validate to src2purl -validate
- Updated project metadata:
- Package name: semantic-copycat-src2id → src2purl
- Repository URLs updated to point to new src2purl repository
- Project description updated to emphasize PURL generation capabilities
- Updated documentation: README, CHANGELOG, and all references updated to reflect new branding
- Updated CLI display: Version string now shows src2purl v{version} instead of src2id v{version}
Technical
- All package imports and module structure remain unchanged (src2id.*)
- Backward compatibility maintained for existing integrations
- No breaking changes to API or functionality
Additional Enhancements
- Added UPMEX as a hard dependency: Universal Package Metadata Extractor now guaranteed to be available
- Enhanced manifest parsing integration: UPMEX is now properly integrated into the Phase 2 identification workflow
- Updated README documentation: Reflects the actual 2-phase discovery process (Hash-based + Manifest-based)
- Improved package identification: UPMEX provides authoritative manifest data that validates and enhances hash-based findings
1.3.1
Release v1.3.1
Added
Hybrid Discovery Strategy
- Primary: Hash-based discovery (SWHIDs + Software Heritage) for comprehensive source inventory
- Secondary: Manifest parsing (UPMEX integration) to validate findings and add missing packages
- Tertiary: GitHub API and SCANOSS fingerprinting for additional coverage
- Individual testing capabilities for each discovery method
Comprehensive Manifest Parsing
Direct manifest file analysis supporting 15+ ecosystems:
- Python: setup.py, pyproject.toml, setup.cfg, requirements.txt, Pipfile
- JavaScript/Node.js: package.json, yarn.lock, package-lock.json
- Java/Maven: pom.xml, build.gradle
- Go: go.mod, go.sum
- Rust: Cargo.toml, Cargo.lock
- Ruby: Gemfile, gemspec files
- PHP: composer.json, composer.lock
- .NET: *.csproj, packages.config, *.nuspec
Enhanced Package Metadata Extraction
- License detection with confidence scoring
- Version extraction from multiple sources (tags, releases, manifest files)
- PURL (Package URL) generation across ecosystems
- Official organization detection and prioritization
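For readers unfamiliar with PURLs, the sketch below assembles a Package URL string of the form `pkg:type/namespace/name@version`. It is an illustration only: `make_purl` is a hypothetical helper, not the function src2purl uses, and it omits qualifiers, subpaths, and the type-specific normalization rules defined by the PURL specification.

```python
from typing import Optional
from urllib.parse import quote


def make_purl(
    pkg_type: str,
    name: str,
    version: Optional[str] = None,
    namespace: Optional[str] = None,
) -> str:
    """Build a basic PURL; hypothetical helper without type-specific rules."""
    parts = [pkg_type.lower()]
    if namespace:
        # Each namespace segment is percent-encoded separately.
        parts.extend(quote(seg, safe="") for seg in namespace.strip("/").split("/"))
    parts.append(quote(name, safe=""))
    purl = "pkg:" + "/".join(parts)
    if version:
        purl += "@" + quote(version, safe="")
    return purl


print(make_purl("pypi", "src2purl", "1.3.4"))
print(make_purl("maven", "commons-lang3", "3.12.0", "org.apache.commons"))
```

For production use, a maintained library such as packageurl-python handles the full specification.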
Security
Fixed URL Substring Sanitization Vulnerabilities
- Replaced vulnerable substring-based URL validation with proper URL parsing
- Enhanced hostname validation using urlparse() for accurate domain matching
- Prevents URL validation bypass attacks (e.g., evil.com/github.com/fake)
- Addresses CodeQL security alerts #1-#25 for incomplete URL substring sanitization
- Applied across all modules: extractor.py, orchestrator.py, purl.py, providers.py
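To show why substring checks are unsafe, here is a minimal sketch of hostname validation with `urllib.parse.urlparse`. The function name `is_github_url` and the allow-list are assumptions made for illustration; the actual checks in extractor.py, orchestrator.py, purl.py, and providers.py may differ.

```python
from urllib.parse import urlparse

# Illustrative allow-list; the project's real list is not given in these notes.
ALLOWED_HOSTS = {"github.com", "www.github.com"}


def is_github_url(url: str) -> bool:
    """Accept only http(s) URLs whose parsed hostname is an allowed host."""
    try:
        parsed = urlparse(url)
    except ValueError:
        return False  # e.g. malformed IPv6 literal
    if parsed.scheme not in ("http", "https"):
        return False
    # .hostname is lowercased and excludes port and userinfo.
    return parsed.hostname in ALLOWED_HOSTS


# A substring test such as `"github.com" in url` would accept both of these.
print(is_github_url("https://evil.com/github.com/fake"))
print(is_github_url("https://github.com.evil.com/repo"))
print(is_github_url("https://github.com/SemClone/src2purl"))
```

Comparing the parsed hostname against an exact allow-list rejects URLs that merely contain the trusted domain in their path or as a prefix of another domain.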
Improved
Performance Optimization
- Software Heritage made optional by default
- Use the --use-swh flag to enable Software Heritage integration
- Prevents timeout issues on large codebases
- Faster execution for most common use cases
Enhanced Documentation
- 4-tier discovery strategy explanation
- API key setup instructions for GitHub, SCANOSS, SerpAPI
- Emphasis on "no API keys required" approach for basic functionality
- Updated usage examples and configuration options
Fixed
Python Version Compatibility
- Added fallback support for Python < 3.11
- Graceful handling of tomllib import (Python 3.11+ only)
- Fallback to tomli library for older Python versions
- Maintains backward compatibility with Python 3.9+
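The usual way to express this fallback is a guarded import, sketched below. The helper `read_project_name` is hypothetical and exists only to exercise the import; on Python older than 3.11 the sketch assumes the tomli backport is installed.

```python
try:
    import tomllib  # standard library on Python 3.11+
except ModuleNotFoundError:
    import tomli as tomllib  # third-party backport exposing the same API


def read_project_name(toml_text: str):
    """Return [project].name from pyproject-style TOML text, or None."""
    data = tomllib.loads(toml_text)
    return data.get("project", {}).get("name")


print(read_project_name('[project]\nname = "src2purl"\n'))
```

Because tomli mirrors the tomllib API, the rest of the code can use the `tomllib` name without further version checks.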
UPMEX Integration Issues
- Resolved import and dependency conflicts
- Created a direct manifest parser instead of relying on the archive-based UPMEX tool
- Handles raw manifest files in source directories
- Supports recursive manifest discovery with depth control
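Recursive discovery with depth control can be sketched with `os.walk` by pruning the directory list once a depth limit is reached. The names `find_manifests`, `MANIFEST_NAMES`, and `max_depth` are assumptions for illustration and do not reflect the parser's real interface.

```python
import os
from pathlib import Path
from typing import List

# Small illustrative subset of manifest file names.
MANIFEST_NAMES = {
    "pyproject.toml", "package.json", "pom.xml",
    "go.mod", "Cargo.toml", "composer.json",
}


def find_manifests(root: str, max_depth: int = 3) -> List[Path]:
    """Collect manifest files at most max_depth directory levels below root."""
    root_path = Path(root).resolve()
    found: List[Path] = []
    for dirpath, dirnames, filenames in os.walk(root_path):
        depth = len(Path(dirpath).relative_to(root_path).parts)
        if depth >= max_depth:
            dirnames[:] = []  # prune in place so os.walk stops descending
        found.extend(Path(dirpath) / n for n in filenames if n in MANIFEST_NAMES)
    return sorted(found)
```

With `max_depth=0` only the root directory is inspected; larger values trade scan time for coverage of nested packages.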
Changed
Discovery Method Prioritization
- Implemented a proper hybrid approach
- Hash-based discovery prioritized over manifest parsing
- Manifest parsing is used for validation and enhancement of hash-based results
- Software Heritage integration is now optional (disabled by default)
- Intelligent result merging and deduplication
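One possible reading of the prioritization above is sketched here: hash-based matches form the base set, manifest matches confirm existing entries or add missing ones, and duplicates collapse by PURL. The `Match` type, the `merge_matches` function, and the rule of keeping the higher confidence are all assumptions; the notes do not describe the real merging logic.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Match:
    purl: str
    confidence: float
    method: str  # e.g. "hash" or "manifest"


def merge_matches(
    hash_matches: Iterable[Match], manifest_matches: Iterable[Match]
) -> List[Match]:
    """Merge results keyed by PURL, giving hash-based matches priority."""
    merged: Dict[str, Match] = {}
    for m in hash_matches:
        best = merged.get(m.purl)
        if best is None or m.confidence > best.confidence:
            merged[m.purl] = m  # deduplicate, keeping the strongest hash match
    for m in manifest_matches:
        existing = merged.get(m.purl)
        if existing is None:
            merged[m.purl] = m  # package only the manifest knew about
        else:
            # Manifest confirms the hash match: keep its method label,
            # take the higher of the two confidence values (assumed rule).
            merged[m.purl] = Match(
                existing.purl, max(existing.confidence, m.confidence), existing.method
            )
    return sorted(merged.values(), key=lambda m: m.confidence, reverse=True)
```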
1.2.2
Release Notes - v1.2.2
Code Quality and Maintenance Release
This maintenance release focuses on code cleanup and dependency updates for the oslili license detection integration.
Dependencies
Updated
- semantic-copycat-oslili from >=1.3.2 to >=1.3.3
- Improved license detection confidence scores
- Enhanced compatibility and bug fixes
- No implementation changes required
Code Quality Improvements
Cleanup
- Removed unused Set import from typing module
- Removed dead code: find_license_files() method (33 lines) that was never called
- Extracted license deduplication logic to private _deduplicate_licenses() helper method
- Removed extra blank lines for cleaner code formatting
Benefits
- Reduced codebase by 31 lines while maintaining full functionality
- Eliminated code duplication with centralized deduplication helper
- Cleaner, more maintainable code following DRY principles
- Better code organization and readability
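A centralized deduplication helper of the kind described could look like the sketch below. It is not the project's `_deduplicate_licenses()` method, whose signature and rules are not shown in these notes; the tuple layout and the keep-highest-confidence rule are assumptions.

```python
from typing import Dict, Iterable, List, Tuple


def deduplicate_licenses(
    licenses: Iterable[Tuple[str, float]]
) -> List[Tuple[str, float]]:
    """Keep one entry per license ID, retaining the highest confidence seen.

    Entries stay in first-seen order because dicts preserve insertion order.
    """
    best: Dict[str, float] = {}
    for spdx_id, confidence in licenses:
        if spdx_id not in best or confidence > best[spdx_id]:
            best[spdx_id] = confidence
    return list(best.items())
```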
Performance
Enhanced License Detection
- License detection confidence improved to 94.9% on test data (up from ~88%)
- More accurate license categorization (declared/detected/referenced)
- Better copyright holder extraction and mapping
Testing
Verified Compatibility
- All existing functionality preserved
- License detection is working correctly across all test directories
- Integration tests passing with improved confidence scores
- No breaking changes or functionality regression
Technical Details
File Changes
- pyproject.toml: Version bump and dependency update
- src2id/integrations/oslili.py: Code cleanup and optimization
Backward Compatibility
- Fully backward compatible with existing integrations
- No API changes or breaking modifications
- All existing method signatures are maintained
1.1.2
src2id v1.1.2 - First Public Release
What is src2id?
src2id (Source Code to ID) identifies package coordinates (name, version, license, PURL) from source code directories. It helps you understand "what is this code?" when dealing with unknown dependencies or analyzing open source components.
Key Features
- Multiple identification strategies: Hash search, web search (GitHub/Google), SCANOSS fingerprinting, and optional Software Heritage archive
- Subcomponent detection: Identifies multiple packages within monorepos and complex projects
- License detection: Integrated with oslili for accurate license identification
- Smart ordering: Optimized strategy order minimizes API calls (30x faster than single-strategy approaches)
- Package URLs (PURLs): Generates standard package identifiers
- Confidence scoring: Multi-factor scoring for match reliability
- Persistent caching: 24-hour cache to avoid API rate limits
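The 24-hour persistent cache can be pictured as a small file-backed store with a time-to-live. The sketch below is a hypothetical illustration: the class `FileCache`, its JSON file layout, and the `now` parameter (added to make expiry testable) are assumptions, not src2id's cache implementation.

```python
import json
import time
from pathlib import Path
from typing import Any, Optional

CACHE_TTL_SECONDS = 24 * 60 * 60  # 24 hours, as stated in the feature list


class FileCache:
    """Tiny JSON-file cache; entries expire after ttl seconds."""

    def __init__(self, path: str, ttl: float = CACHE_TTL_SECONDS):
        self.path = Path(path)
        self.ttl = ttl

    def _load(self) -> dict:
        try:
            return json.loads(self.path.read_text())
        except (FileNotFoundError, json.JSONDecodeError):
            return {}  # missing or corrupt cache behaves like an empty one

    def get(self, key: str, now: Optional[float] = None) -> Any:
        now = time.time() if now is None else now
        entry = self._load().get(key)
        if entry is None or now - entry["stored_at"] > self.ttl:
            return None
        return entry["value"]

    def set(self, key: str, value: Any, now: Optional[float] = None) -> None:
        now = time.time() if now is None else now
        data = self._load()
        data[key] = {"stored_at": now, "value": value}
        self.path.write_text(json.dumps(data))
```

Serving repeated lookups from such a cache is what keeps API usage below rate limits between runs.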
Installation
git clone https://github.com/oscarvalenzuelab/semantic-copycat-src2id.git
cd semantic-copycat-src2id
pip install -e .
Quick Start
Basic package identification
src2id /path/to/unknown/code
With subcomponent detection for monorepos
src2id /path/to/monorepo --detect-subcomponents
Include Software Heritage archive search
src2id /path/to/code --use-swh
JSON output for automation
src2id /path/to/code --output-format json
Example Output
src2id v1.1.2
Analyzing: test_data/darktable
Local Source Analysis
✓ Licenses detected: GPL-3.0, BSD-3-Clause, MIT and 4 more
Confidence: 94.1%
| Name | Confidence | Method | PURL |
|---|---|---|---|
| darktable | 0.80 | fuzzy | pkg:generic/darktable |
License
GNU Affero General Public License v3.0 (AGPL-3.0)
Acknowledgments
Built on top of excellent open source projects:
- Software Heritage for archive access and SWHID generation
- SCANOSS for code fingerprinting