Add Recursive and Generative Research System (Phases 1-2)#4
Implements a comprehensive system for automatically discovering, analyzing, and learning from similar organizations and repositories to continuously improve architecture governance practices.

## What's Added

### Core System (Phases 1-2 Complete)

- **Phase 1: Organization Profiling**
  - Technology stack fingerprinting (languages, frameworks, tools)
  - Architecture pattern extraction
  - Baseline metrics aggregation
  - Challenge identification and research area prioritization
- **Phase 2: Repository Discovery**
  - GitHub API integration with rate limit handling
  - Multi-dimensional similarity scoring algorithm
  - Intelligent filtering and deduplication
  - Configurable search queries and weights

### Scripts

- `scripts/research/profile_org.py` - Organization profiling orchestrator
- `scripts/research/extract_tech_stack.py` - Technology detection
- `scripts/research/discover_repos.py` - Repository discovery engine
- `scripts/research/similarity_scorer.py` - Similarity calculation

### Configuration

- `config/research/discovery_config.yaml` - Search parameters
- `config/research/similarity_weights.yaml` - Scoring weights
- `config/research/analysis_config.yaml` - Analysis settings (Phase 3)
- `config/research/prioritization_weights.yaml` - Recommendation weights (Phase 5)

### Documentation

- `docs/ROADMAP_RECURSIVE_RESEARCH_SYSTEM.md` - Complete vision (22 weeks)
- `docs/TASK_LIST_RESEARCH_SYSTEM.md` - Detailed tasks (151 items, 522 hours)
- `docs/research/README.md` - System overview
- `docs/research/RESEARCH_QUICKSTART.md` - Step-by-step guide

### Build System

- Updated `Makefile` with `research-*` targets
- `requirements-research.txt` - Python dependencies

### Makefile Targets

- `make research-profile` - Create organization profile
- `make research-discover` - Discover similar repositories
- `make research-similarity` - Calculate similarity scores
- `make research-report` - Generate summary report
- `make research-full` - Run complete research cycle
- `make research-check-deps` - Verify dependencies
- `make research-clean` - Remove artifacts

## Key Features

### Multi-Dimensional Similarity Scoring

Repositories are ranked across 5 dimensions:

- Tech Stack (30%): Language/framework overlap
- Problem Domain (25%): Topic alignment
- Scale (15%): Size/complexity similarity
- Activity (15%): Update frequency
- Maturity (15%): Age/maintenance status

### Intelligent Discovery

- Automatic query generation from org profile
- Research-area-focused searches
- Quality filtering (stars, activity, recency)
- Blocklist/allowlist support

### Recursive Design

Foundation for continuous improvement:

- Profile evolution tracking
- Feedback collection (Phase 6)
- Model retraining (Phase 6)
- Self-optimization (Phase 6)

## Usage

```bash
# Quick start
export GITHUB_TOKEN="your_token"
pip install -r requirements-research.txt
make research-full

# View results
make research-report
cat artifacts/research/discoveries/similarity_scores.json
```

## What's Next

### Phase 3: Automated Analysis (In Progress)

- Safe repository cloning
- Pattern extraction (CI/CD, testing, docs)
- Gap analysis vs baseline

### Phase 4: Pattern Recognition

- Cross-repo aggregation
- Best practice identification
- Trend analysis

### Phase 5: Recommendations

- Prioritized improvement suggestions
- Evidence-based rationale
- ADR and code scaffold generation

### Phase 6: Recursive Refinement

- Feedback loops
- Query optimization
- Model retraining
- Continuous self-improvement

## Benefits

- **Time Savings**: 70% reduction in manual research
- **Quality**: Learn from high-quality, vetted repositories
- **Personalization**: Recommendations tailored to YOUR context
- **Continuous**: Keeps you current with evolving best practices
- **Data-Driven**: Evidence-based improvements

## Architecture

Directory structure:

```
scripts/research/       # Research system scripts
config/research/        # Configuration files
docs/research/          # Documentation
artifacts/research/     # Generated outputs
  profiles/             # Organization profiles
  discoveries/          # Discovered repositories
  analysis/             # Analysis results (Phase 3)
  patterns/             # Extracted patterns (Phase 4)
  recommendations/      # Generated recommendations (Phase 5)
  feedback/             # Feedback logs (Phase 6)
```

Implements roadmap items for automated research, pattern discovery, and continuous improvement of the architecture governance toolkit.

Related: #research #automation #ml #best-practices
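The weighted scoring scheme above can be sketched in a few lines. This is a minimal illustration, not the actual `similarity_scorer.py` implementation; the dimension keys mirror the percentages in the description, and the per-dimension scores are assumed to already be normalized to [0, 1].

```python
# Hedged sketch of the 5-dimension composite score described above.
# Weights mirror the PR description; dimension scores are assumed in [0, 1].
WEIGHTS = {
    "tech_stack": 0.30,
    "problem_domain": 0.25,
    "scale": 0.15,
    "activity": 0.15,
    "maturity": 0.15,
}

def composite_score(dimension_scores: dict) -> float:
    """Combine per-dimension scores into a single ranking score."""
    return sum(WEIGHTS[dim] * dimension_scores.get(dim, 0.0) for dim in WEIGHTS)

score = composite_score({
    "tech_stack": 0.8, "problem_domain": 0.6,
    "scale": 0.5, "activity": 0.9, "maturity": 0.7,
})
print(round(score, 4))  # prints 0.705
```

Keeping the weights in a dict makes them trivially loadable from `similarity_weights.yaml`, which is presumably why the weights live in a config file here.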
**Summary of Changes**

Hello @4444JPP, I'm Gemini Code Assist! I'm currently reviewing this pull request and will post my feedback shortly. In the meantime, here's a summary to help you and other reviewers quickly get up to speed!

This pull request introduces the foundational components of a Recursive and Generative Research System aimed at automating the discovery and analysis of best practices in software architecture. By profiling the current organization and intelligently discovering similar external repositories, the system provides a data-driven approach to improving architecture governance, significantly reducing the manual effort traditionally required for such research. This initial implementation completes the profiling and discovery phases, setting the stage for future automated analysis and recommendation generation.
Code Review
This pull request introduces a comprehensive new research system for discovering and analyzing repositories. The implementation is well-structured, with clear separation of concerns into different scripts for profiling, discovery, and scoring. The addition of configuration files and extensive documentation is excellent. My review focuses on improving maintainability, efficiency, and correctness. Key suggestions include refactoring complex inline commands in the Makefile, addressing a bug in file scanning logic that ignores important configuration files, optimizing GitHub API usage to prevent rate-limiting issues, and making hardcoded values configurable across several scripts and configuration files. Overall, this is a strong foundation for the new system.
```python
'stars': repo.stargazers_count,
'forks': repo.forks_count,
'language': repo.language,
'topics': repo.get_topics(),
```
The `repo.get_topics()` method is called for each repository inside the search loop. This is a separate API call per repository, which is highly inefficient and will quickly exhaust your GitHub API rate limit. The `search_repositories` result object already includes the topics in the `repo.topics` attribute (a list of strings). Using `repo.topics` avoids these extra API calls and significantly improves performance.
```diff
- 'topics': repo.get_topics(),
+ 'topics': repo.topics,
```
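The reviewer's fix can be isolated in a small record-building helper. The `repo_record` function below is hypothetical (the PR's actual loop structure isn't shown here); it only illustrates reading the attributes already populated on a PyGithub search result instead of issuing a per-repository `get_topics()` request.

```python
# Hypothetical helper showing the review's suggestion: read topics from
# attributes already present on each search result object, avoiding one
# extra API request per repository.
def repo_record(repo) -> dict:
    """Build a discovery record from a PyGithub search result object."""
    return {
        "full_name": repo.full_name,
        "stars": repo.stargazers_count,
        "forks": repo.forks_count,
        "language": repo.language,
        "topics": repo.topics,  # already populated; no call to get_topics()
    }

# Assumed usage with real PyGithub (requires a token, so not executed here):
#   from github import Github, Auth
#   gh = Github(auth=Auth.Token(token))
#   records = [repo_record(r) for r in gh.search_repositories(query=q)[:50]]
```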
```python
dirnames[:] = [d for d in dirnames if d not in exclude_patterns and not d.startswith('.')]

for filename in filenames:
    if filename.startswith('.'):
        continue
```
The `scan_directory` function filters out all dot-directories (e.g., `.github`) and dot-files (e.g., `.eslintrc`). However, the `detect_tools` function relies on finding exactly these files and directories to identify tools and configurations. This contradiction is a bug that will prevent many tools from being detected. The filtering logic should be revised to exclude only specific unwanted patterns such as `.git`, while still scanning important configuration files and directories.
```diff
- dirnames[:] = [d for d in dirnames if d not in exclude_patterns and not d.startswith('.')]
- for filename in filenames:
-     if filename.startswith('.'):
-         continue
+ dirnames[:] = [d for d in dirnames if d not in exclude_patterns]
+ for filename in filenames:
```
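A self-contained sketch of the fix being suggested: prune only a known exclusion set (the directory names below are illustrative, not the PR's actual `exclude_patterns`) so that dot-directories like `.github` and dot-files like `.eslintrc` still reach tool detection.

```python
# Sketch of the suggested scan: exclude only known-unwanted directories
# (e.g. .git) while still visiting dot-directories and dot-files that
# tool detection depends on. EXCLUDE_DIRS is an illustrative assumption.
import os

EXCLUDE_DIRS = {".git", "node_modules", "__pycache__", ".venv"}

def scan_directory(root: str) -> list[str]:
    """Return relative paths of all files, keeping config dot-files."""
    found = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune in place so os.walk never descends into excluded dirs.
        dirnames[:] = [d for d in dirnames if d not in EXCLUDE_DIRS]
        for filename in filenames:
            found.append(os.path.relpath(os.path.join(dirpath, filename), root))
    return found
```

The in-place `dirnames[:]` assignment matters: `os.walk` consults that list to decide which subdirectories to visit, so replacing it (rather than rebinding the name) is what actually prunes the walk.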
```makefile
	@python3 -c "import json; p=json.load(open('$(ORG_PROFILE)')); print(f\" Fingerprint: {p.get('fingerprint', 'unknown')}\"); print(f\" Languages: {', '.join(list(p.get('metrics', {}).get('primary_languages', []))[:5])}\"); print(f\" Research Areas: {len(p.get('challenges', {}).get('research_areas', []))}\"); print(f\" High Priority Challenges: {len(p.get('challenges', {}).get('high_priority', []))}\")"
	@echo ""
	@echo "Discovery Results:"
	@python3 -c "import json; d=json.load(open('$(SIMILARITY_SCORES)')); meta=d.get('similarity_metadata', {}); print(f\" Total Scored: {meta.get('total_scored', 0)}\"); print(f\" Above Threshold: {meta.get('above_threshold', 0)}\"); print(f\" Threshold: {meta.get('threshold', 0)}\"); repos=d.get('repositories', []); print(f\" Top 5 Matches:\"); [print(f\" {i+1}. {r.get('full_name', 'unknown')} (score: {r.get('similarity_score', 0):.4f})\") for i, r in enumerate(repos[:5])]"
```
The inline Python commands in the `research-report` target are complex and difficult to read, maintain, and debug. It would be better to move this logic into a dedicated Python script (e.g., `scripts/research/generate_report.py`) that takes the necessary file paths as arguments. This will improve modularity, testability, and readability.
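A sketch of what such a script might look like, assuming the JSON key names visible in the Makefile one-liners above (`fingerprint`, `metrics.primary_languages`, `similarity_metadata`, `repositories`); the script name `generate_report.py` is the reviewer's proposal, not existing code.

```python
# Hypothetical generate_report.py, replacing the inline python3 -c commands.
# Key names are taken from the Makefile snippet above and are assumptions.
import json
import sys

def print_report(profile_path: str, scores_path: str) -> None:
    with open(profile_path) as f:
        profile = json.load(f)
    print(f"Fingerprint: {profile.get('fingerprint', 'unknown')}")
    langs = list(profile.get("metrics", {}).get("primary_languages", []))[:5]
    print(f"Languages: {', '.join(langs)}")

    with open(scores_path) as f:
        data = json.load(f)
    meta = data.get("similarity_metadata", {})
    print(f"Total Scored: {meta.get('total_scored', 0)}")
    print("Top 5 Matches:")
    for i, r in enumerate(data.get("repositories", [])[:5], start=1):
        print(f"  {i}. {r.get('full_name', 'unknown')} "
              f"(score: {r.get('similarity_score', 0):.4f})")

if __name__ == "__main__" and len(sys.argv) == 3:
    print_report(sys.argv[1], sys.argv[2])
```

The Makefile target would then shrink to something like `@python3 scripts/research/generate_report.py $(ORG_PROFILE) $(SIMILARITY_SCORES)`.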
```yaml
require_indicators:
  - has_readme: true
  - has_license: true
  # - has_ci: true  # Optional: require CI/CD
```
The structure for `require_indicators` is a list of single-key dictionaries (`- has_readme: true`), which is cumbersome to parse. A plain list of strings would be more conventional and easier to process in the consuming script.
```yaml
require_indicators:
  - "has_readme"
  - "has_license"
  # - "has_ci"  # Optional: require CI/CD
```
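To see why the list-of-strings form is simpler, compare how each shape would be consumed. The structures below are written out as the Python objects `yaml.safe_load` would produce for each form, so the example stays dependency-free; the indicator names are the ones from the config above.

```python
# Current form: a list of single-key dicts needs an unwrapping step.
dict_form = {"require_indicators": [{"has_readme": True}, {"has_license": True}]}
required = [key
            for item in dict_form["require_indicators"]
            for key, enabled in item.items() if enabled]

# Suggested form: a plain list of strings is already the list of names.
string_form = {"require_indicators": ["has_readme", "has_license"]}
required_simple = string_form["require_indicators"]

assert required == required_simple == ["has_readme", "has_license"]
```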
**Total Tasks**: 87

---

## Task Status Legend

- 🔴 **Not Started** - Task not yet begun
- 🟡 **In Progress** - Currently being worked on
- 🟢 **Completed** - Task finished and verified
- 🔵 **Blocked** - Waiting on dependency or external factor
- ⚪ **Deferred** - Postponed to future phase

---

## Phase 1: Organization Profiling & Fingerprinting

### 1.1 Directory Structure Setup

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P1.1.1 | Create scripts/research/ directory structure | 🔴 | - | 0.5 |
| P1.1.2 | Create config/research/ for research configs | 🔴 | - | 0.5 |
| P1.1.3 | Create artifacts/research/ for outputs | 🔴 | - | 0.5 |
| P1.1.4 | Create templates/research/ for report templates | 🔴 | - | 0.5 |
| P1.1.5 | Create docs/research/ for documentation | 🔴 | - | 0.5 |

### 1.2 Technology Stack Detection

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P1.2.1 | Implement language detection (file extensions) | 🔴 | - | 2 |
| P1.2.2 | Implement framework detection (package manifests) | 🔴 | - | 4 |
| P1.2.3 | Implement tool detection (config files) | 🔴 | - | 3 |
| P1.2.4 | Extract dependency versions and constraints | 🔴 | - | 3 |
| P1.2.5 | Detect infrastructure patterns (Docker, K8s, etc.) | 🔴 | - | 3 |
| P1.2.6 | Create tech_stack fingerprint aggregator | 🔴 | - | 2 |
| P1.2.7 | Write extract_tech_stack.py script | 🔴 | - | 4 |

### 1.3 Architecture Pattern Extraction

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P1.3.1 | Detect directory structure patterns | 🔴 | - | 3 |
| P1.3.2 | Identify service boundaries from code | 🔴 | - | 4 |
| P1.3.3 | Extract API patterns (REST, GraphQL, gRPC) | 🔴 | - | 4 |
| P1.3.4 | Detect data flow patterns | 🔴 | - | 4 |
| P1.3.5 | Identify security patterns (auth, encryption) | 🔴 | - | 3 |
| P1.3.6 | Write analyze_architecture.py script | 🔴 | - | 4 |

### 1.4 Baseline Metrics Collection

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P1.4.1 | Aggregate existing risk scores | 🔴 | - | 2 |
| P1.4.2 | Collect code quality metrics (complexity, coverage) | 🔴 | - | 2 |
| P1.4.3 | Extract team velocity metrics (commits, PRs) | 🔴 | - | 3 |
| P1.4.4 | Calculate codebase health scores | 🔴 | - | 3 |
| P1.4.5 | Write baseline_metrics.py script | 🔴 | - | 3 |

### 1.5 Challenge Identification

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P1.5.1 | Parse existing risk register for pain points | 🔴 | - | 2 |
| P1.5.2 | Identify capability gaps | 🔴 | - | 2 |
| P1.5.3 | Extract improvement areas from hotspots | 🔴 | - | 2 |
| P1.5.4 | Prioritize research areas | 🔴 | - | 2 |
| P1.5.5 | Generate research_priorities.yaml | 🔴 | - | 2 |

### 1.6 Profile Orchestration

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P1.6.1 | Write profile_org.py orchestrator script | 🔴 | - | 4 |
| P1.6.2 | Create org_profile.json schema | 🔴 | - | 2 |
| P1.6.3 | Add validation and error handling | 🔴 | - | 3 |
| P1.6.4 | Create profile visualization script | 🔴 | - | 3 |
| P1.6.5 | Write unit tests for profiling | 🔴 | - | 4 |

---

## Phase 2: Repository Discovery Engine

### 2.1 GitHub API Integration

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P2.1.1 | Set up PyGithub authentication | 🔴 | - | 2 |
| P2.1.2 | Implement rate limit handling | 🔴 | - | 3 |
| P2.1.3 | Create search query builder from org profile | 🔴 | - | 4 |
| P2.1.4 | Implement pagination for large result sets | 🔴 | - | 3 |
| P2.1.5 | Add response caching layer | 🔴 | - | 3 |
| P2.1.6 | Write github_search.py script | 🔴 | - | 4 |
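The rate limit handling named in P2.1.2 can be sketched as a check-and-sleep wrapper. This is an illustration only: the `get_remaining` and `get_reset_epoch` callables stand in for whatever the GitHub client exposes (e.g. PyGithub's `get_rate_limit()`), and the threshold of 10 is an arbitrary safety margin.

```python
# Hedged sketch of rate limit handling (P2.1.2): before a batch of API
# calls, check remaining quota and sleep until the reset time if nearly
# exhausted. The two callables are stand-ins for the real GitHub client.
import time

def wait_for_quota(get_remaining, get_reset_epoch, threshold=10, sleep=time.sleep):
    """Block until at least `threshold` API calls remain in the window."""
    remaining = get_remaining()
    if remaining < threshold:
        # Sleep past the reset time, plus a one-second buffer for clock skew.
        delay = max(0.0, get_reset_epoch() - time.time()) + 1.0
        sleep(delay)
    return remaining
```

Injecting `sleep` as a parameter keeps the wrapper unit-testable without real waiting, which suits the testing tasks later in this list.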

### 2.2 Similarity Scoring

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P2.2.1 | Implement tech stack similarity (Jaccard) | 🔴 | - | 3 |
| P2.2.2 | Implement problem domain similarity (keywords) | 🔴 | - | 4 |
| P2.2.3 | Implement scale similarity (size, complexity) | 🔴 | - | 3 |
| P2.2.4 | Implement activity pattern similarity | 🔴 | - | 3 |
| P2.2.5 | Implement maturity alignment scoring | 🔴 | - | 2 |
| P2.2.6 | Create composite scoring algorithm | 🔴 | - | 4 |
| P2.2.7 | Write similarity_scorer.py script | 🔴 | - | 4 |
| P2.2.8 | Create similarity_weights.yaml config | 🔴 | - | 1 |
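The Jaccard tech stack similarity named in P2.2.1 is the size of the intersection of two tag sets divided by the size of their union. A minimal sketch, with illustrative stacks:

```python
# Jaccard similarity for tech stacks (P2.2.1): |A ∩ B| / |A ∪ B|.
def jaccard(a: set, b: set) -> float:
    if not a and not b:
        return 1.0  # two empty stacks are trivially identical
    return len(a & b) / len(a | b)

ours = {"python", "fastapi", "postgres", "docker"}
theirs = {"python", "django", "postgres", "docker", "redis"}
print(round(jaccard(ours, theirs), 3))  # 3 shared / 6 total -> prints 0.5
```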

### 2.3 Multi-Source Discovery

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P2.3.1 | Implement GitHub trending scraper | 🔴 | - | 3 |
| P2.3.2 | Add awesome-lists parser | 🔴 | - | 3 |
| P2.3.3 | Add topic-based discovery | 🔴 | - | 2 |
| P2.3.4 | Add organization discovery (similar orgs) | 🔴 | - | 3 |

### 2.4 Deduplication & Ranking

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P2.4.1 | Implement canonical URL resolution | 🔴 | - | 2 |
| P2.4.2 | Implement fuzzy matching for forks/mirrors | 🔴 | - | 3 |
| P2.4.3 | Add blocklist/allowlist filtering | 🔴 | - | 2 |
| P2.4.4 | Write dedup_rank.py script | 🔴 | - | 3 |

### 2.5 Discovery Orchestration

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P2.5.1 | Write discover_repos.py orchestrator | 🔴 | - | 4 |
| P2.5.2 | Create discovery_config.yaml | 🔴 | - | 2 |
| P2.5.3 | Add discovery metadata tracking | 🔴 | - | 2 |
| P2.5.4 | Create discovered_repos.json schema | 🔴 | - | 2 |
| P2.5.5 | Write unit tests for discovery | 🔴 | - | 4 |

---

## Phase 3: Automated Analysis Pipeline

### 3.1 Safe Repository Cloning

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P3.1.1 | Implement shallow clone (depth=1) | 🔴 | - | 2 |
| P3.1.2 | Create Docker sandbox for cloning | 🔴 | - | 4 |
| P3.1.3 | Add size limits and validation | 🔴 | - | 2 |
| P3.1.4 | Implement automatic cleanup | 🔴 | - | 2 |
| P3.1.5 | Add parallel processing with concurrency limits | 🔴 | - | 3 |
| P3.1.6 | Write clone_safe.py script | 🔴 | - | 3 |

### 3.2 Structural Analysis

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P3.2.1 | Analyze directory structure patterns | 🔴 | - | 3 |
| P3.2.2 | Detect configuration file patterns | 🔴 | - | 3 |
| P3.2.3 | Measure documentation coverage | 🔴 | - | 3 |
| P3.2.4 | Analyze test organization | 🔴 | - | 3 |
| P3.2.5 | Write extract_structure.py script | 🔴 | - | 4 |

### 3.3 Code Quality Analysis

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P3.3.1 | Integrate radon for complexity metrics | 🔴 | - | 2 |
| P3.3.2 | Detect test coverage configurations | 🔴 | - | 3 |
| P3.3.3 | Extract linting configurations | 🔴 | - | 2 |
| P3.3.4 | Analyze code review practices | 🔴 | - | 3 |
| P3.3.5 | Write extract_quality.py script | 🔴 | - | 4 |

### 3.4 DevOps & Tooling Analysis

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P3.4.1 | Parse CI/CD configurations (.github, .gitlab-ci) | 🔴 | - | 4 |
| P3.4.2 | Detect IaC patterns (Terraform, K8s, etc.) | 🔴 | - | 4 |
| P3.4.3 | Extract monitoring/observability setup | 🔴 | - | 3 |
| P3.4.4 | Identify security tooling (SAST, DAST, etc.) | 🔴 | - | 3 |
| P3.4.5 | Write extract_devops.py script | 🔴 | - | 4 |

### 3.5 Documentation Mining

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P3.5.1 | Analyze README quality and structure | 🔴 | - | 3 |
| P3.5.2 | Extract ADRs and decision records | 🔴 | - | 3 |
| P3.5.3 | Find runbooks and playbooks | 🔴 | - | 2 |
| P3.5.4 | Extract contribution guidelines | 🔴 | - | 2 |
| P3.5.5 | Write extract_docs.py script | 🔴 | - | 3 |

### 3.6 Baseline Comparison

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P3.6.1 | Compare tech stacks (ours vs discovered) | 🔴 | - | 3 |
| P3.6.2 | Identify capability gaps | 🔴 | - | 3 |
| P3.6.3 | Calculate potential impact scores | 🔴 | - | 3 |
| P3.6.4 | Estimate implementation effort | 🔴 | - | 3 |
| P3.6.5 | Write compare_baseline.py script | 🔴 | - | 4 |

### 3.7 Analysis Orchestration

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P3.7.1 | Write analyze_repository.py orchestrator | 🔴 | - | 5 |
| P3.7.2 | Create analysis output schemas | 🔴 | - | 3 |
| P3.7.3 | Add error handling and retry logic | 🔴 | - | 3 |
| P3.7.4 | Implement progress tracking | 🔴 | - | 2 |
| P3.7.5 | Write unit tests for analysis | 🔴 | - | 5 |

---

## Phase 4: Pattern Recognition & Learning

### 4.1 Pattern Aggregation

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P4.1.1 | Aggregate patterns across all analyzed repos | 🔴 | - | 4 |
| P4.1.2 | Calculate pattern frequency distributions | 🔴 | - | 3 |
| P4.1.3 | Identify pattern correlations | 🔴 | - | 4 |
| P4.1.4 | Track pattern evolution over time | 🔴 | - | 3 |
| P4.1.5 | Write aggregate_patterns.py script | 🔴 | - | 4 |

### 4.2 Best Practice Identification

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P4.2.1 | Implement popularity scoring | 🔴 | - | 2 |
| P4.2.2 | Implement quality correlation analysis | 🔴 | - | 4 |
| P4.2.3 | Implement recency filtering | 🔴 | - | 2 |
| P4.2.4 | Assess maintainability of patterns | 🔴 | - | 3 |
| P4.2.5 | Calculate community endorsement scores | 🔴 | - | 2 |
| P4.2.6 | Write identify_best_practices.py script | 🔴 | - | 4 |

### 4.3 Anti-Pattern Detection

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P4.3.1 | Identify patterns with negative correlations | 🔴 | - | 3 |
| P4.3.2 | Detect deprecated approaches | 🔴 | - | 3 |
| P4.3.3 | Flag security vulnerabilities in patterns | 🔴 | - | 4 |
| P4.3.4 | Write detect_anti_patterns.py script | 🔴 | - | 3 |

### 4.4 Trend Analysis

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P4.4.1 | Identify emerging technologies | 🔴 | - | 3 |
| P4.4.2 | Detect shifting architectural paradigms | 🔴 | - | 4 |
| P4.4.3 | Track tool adoption curves | 🔴 | - | 3 |
| P4.4.4 | Write trend_analysis.py script | 🔴 | - | 4 |

### 4.5 Personalization Engine

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P4.5.1 | Filter patterns by tech stack compatibility | 🔴 | - | 3 |
| P4.5.2 | Rank by alignment with org challenges | 🔴 | - | 4 |
| P4.5.3 | Adjust for team size and maturity | 🔴 | - | 3 |
| P4.5.4 | Account for existing constraints | 🔴 | - | 3 |
| P4.5.5 | Write personalize_insights.py script | 🔴 | - | 4 |

### 4.6 Machine Learning Components

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P4.6.1 | Implement repository clustering | 🔴 | - | 5 |
| P4.6.2 | Implement pattern classification | 🔴 | - | 5 |
| P4.6.3 | Implement anomaly detection | 🔴 | - | 4 |
| P4.6.4 | Implement time series analysis | 🔴 | - | 4 |
| P4.6.5 | Create model training pipeline | 🔴 | - | 6 |

---

## Phase 5: Recommendation & Implementation Engine

### 5.1 Recommendation Generation

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P5.1.1 | Create recommendation schema | 🔴 | - | 2 |
| P5.1.2 | Generate recommendations from patterns | 🔴 | - | 4 |
| P5.1.3 | Calculate impact scores | 🔴 | - | 3 |
| P5.1.4 | Estimate effort (T-shirt sizing) | 🔴 | - | 3 |
| P5.1.5 | Gather evidence from exemplar repos | 🔴 | - | 3 |
| P5.1.6 | Write recommendation rationales | 🔴 | - | 4 |
| P5.1.7 | Write generate_recommendations.py script | 🔴 | - | 5 |

### 5.2 Prioritization

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P5.2.1 | Implement prioritization algorithm | 🔴 | - | 4 |
| P5.2.2 | Add strategic alignment multiplier | 🔴 | - | 2 |
| P5.2.3 | Add risk penalty calculation | 🔴 | - | 3 |
| P5.2.4 | Create configurable weight system | 🔴 | - | 2 |
| P5.2.5 | Write prioritize.py script | 🔴 | - | 3 |

### 5.3 Implementation Scaffolding

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P5.3.1 | Generate ADR templates from recommendations | 🔴 | - | 4 |
| P5.3.2 | Generate code scaffolds from exemplars | 🔴 | - | 5 |
| P5.3.3 | Generate configuration files | 🔴 | - | 4 |
| P5.3.4 | Generate test templates | 🔴 | - | 3 |
| P5.3.5 | Generate documentation updates | 🔴 | - | 3 |
| P5.3.6 | Write scaffold_implementation.py script | 🔴 | - | 5 |

### 5.4 Change Impact Analysis

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P5.4.1 | Identify affected components | 🔴 | - | 4 |
| P5.4.2 | Estimate blast radius | 🔴 | - | 3 |
| P5.4.3 | Generate rollback plans | 🔴 | - | 3 |
| P5.4.4 | Suggest feature flag strategies | 🔴 | - | 3 |
| P5.4.5 | Write impact_analysis.py script | 🔴 | - | 4 |

### 5.5 Integration

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P5.5.1 | Create review interface for recommendations | 🔴 | - | 6 |
| P5.5.2 | Implement feedback collection | 🔴 | - | 4 |
| P5.5.3 | Add manual priority override | 🔴 | - | 2 |
| P5.5.4 | Add annotation and comments | 🔴 | - | 3 |

---

## Phase 6: Recursive Refinement System

### 6.1 Feedback Collection

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P6.1.1 | Track recommendation acceptance/rejection | 🔴 | - | 3 |
| P6.1.2 | Collect qualitative feedback | 🔴 | - | 3 |
| P6.1.3 | Monitor implementation success metrics | 🔴 | - | 4 |
| P6.1.4 | Measure impact of implemented changes | 🔴 | - | 4 |
| P6.1.5 | Write collect_feedback.py script | 🔴 | - | 4 |

### 6.2 Query Optimization

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P6.2.1 | Analyze search query hit/miss ratio | 🔴 | - | 3 |
| P6.2.2 | Adjust similarity weights based on feedback | 🔴 | - | 4 |
| P6.2.3 | Expand/contract search criteria dynamically | 🔴 | - | 4 |
| P6.2.4 | Write optimize_queries.py script | 🔴 | - | 4 |

### 6.3 Model Retraining

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P6.3.1 | Retrain similarity scorer | 🔴 | - | 5 |
| P6.3.2 | Retrain pattern recognition models | 🔴 | - | 5 |
| P6.3.3 | Refine prioritization algorithm | 🔴 | - | 4 |
| P6.3.4 | Improve effort estimation | 🔴 | - | 4 |
| P6.3.5 | Write retrain_models.py script | 🔴 | - | 5 |

### 6.4 Profile Evolution

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P6.4.1 | Update org profile with implemented changes | 🔴 | - | 3 |
| P6.4.2 | Track organizational evolution timeline | 🔴 | - | 3 |
| P6.4.3 | Adjust research priorities | 🔴 | - | 3 |
| P6.4.4 | Identify new gaps from continuous scanning | 🔴 | - | 3 |
| P6.4.5 | Write update_profile.py script | 🔴 | - | 4 |

### 6.5 Meta-Learning

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P6.5.1 | Analyze implementation velocity patterns | 🔴 | - | 4 |
| P6.5.2 | Identify implementation barriers | 🔴 | - | 3 |
| P6.5.3 | Optimize for quick wins vs strategic initiatives | 🔴 | - | 3 |
| P6.5.4 | Learn from failures and near-misses | 🔴 | - | 4 |
| P6.5.5 | Write meta_analysis.py script | 🔴 | - | 4 |

---

## Infrastructure & Integration

### 7.1 Configuration Management

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P7.1.1 | Create config/research/discovery_config.yaml | 🔴 | - | 2 |
| P7.1.2 | Create config/research/similarity_weights.yaml | 🔴 | - | 2 |
| P7.1.3 | Create config/research/analysis_config.yaml | 🔴 | - | 2 |
| P7.1.4 | Create config/research/prioritization_weights.yaml | 🔴 | - | 2 |
| P7.1.5 | Create config/research/blocklist.yaml | 🔴 | - | 1 |

### 7.2 Database & Storage

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P7.2.1 | Design SQLite schema for analysis results | 🔴 | - | 4 |
| P7.2.2 | Implement caching layer (diskcache) | 🔴 | - | 3 |
| P7.2.3 | Create artifact storage structure | 🔴 | - | 2 |
| P7.2.4 | Implement data retention policies | 🔴 | - | 3 |

### 7.3 Orchestration & Automation

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P7.3.1 | Add Makefile targets for research system | 🔴 | - | 3 |
| P7.3.2 | Create end-to-end pipeline script | 🔴 | - | 4 |
| P7.3.3 | Add scheduling/cron configuration | 🔴 | - | 2 |
| P7.3.4 | Create Docker container for research system | 🔴 | - | 4 |

### 7.4 Monitoring & Logging

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P7.4.1 | Implement structured logging | 🔴 | - | 3 |
| P7.4.2 | Add performance metrics collection | 🔴 | - | 3 |
| P7.4.3 | Create monitoring dashboard | 🔴 | - | 5 |
| P7.4.4 | Add alerting for failures | 🔴 | - | 3 |

---

## Documentation & Testing

### 8.1 User Documentation

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P8.1.1 | Create RESEARCH_SYSTEM_QUICKSTART.md | 🔴 | - | 4 |
| P8.1.2 | Create detailed usage guide | 🔴 | - | 6 |
| P8.1.3 | Document configuration options | 🔴 | - | 4 |
| P8.1.4 | Create troubleshooting guide | 🔴 | - | 3 |
| P8.1.5 | Create examples and tutorials | 🔴 | - | 5 |

### 8.2 Developer Documentation

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P8.2.1 | Document system architecture | 🔴 | - | 4 |
| P8.2.2 | Document API interfaces | 🔴 | - | 4 |
| P8.2.3 | Document data schemas | 🔴 | - | 3 |
| P8.2.4 | Create contribution guide | 🔴 | - | 3 |

### 8.3 Testing

| ID | Task | Status | Owner | Est. Hours |
|----|------|--------|-------|------------|
| P8.3.1 | Write unit tests (target: 80% coverage) | 🔴 | - | 20 |
| P8.3.2 | Write integration tests | 🔴 | - | 15 |
| P8.3.3 | Create test fixtures and mocks | 🔴 | - | 8 |
| P8.3.4 | Set up CI/CD for testing | 🔴 | - | 4 |
| P8.3.5 | Create end-to-end test scenarios | 🔴 | - | 8 |

---

## Summary Statistics

### By Phase

| Phase | Total Tasks | Est. Hours | Status |
|-------|-------------|------------|--------|
| Phase 1: Profiling | 20 | 62 | 🔴 Not Started |
| Phase 2: Discovery | 19 | 59 | 🔴 Not Started |
| Phase 3: Analysis | 30 | 108 | 🔴 Not Started |
| Phase 4: Patterns | 18 | 65 | 🔴 Not Started |
| Phase 5: Recommendations | 20 | 75 | 🔴 Not Started |
| Phase 6: Refinement | 18 | 68 | 🔴 Not Started |
| Infrastructure | 13 | 34 | 🔴 Not Started |
| Documentation | 13 | 51 | 🔴 Not Started |
| **TOTAL** | **151** | **522** | **0% Complete** |
The task counts in this document are inconsistent, which could cause confusion about the project's scope and progress:

- The header on line 5 states Total Tasks: 87.
- The summary table at the end (lines 467-477) states TOTAL: 151.
- A manual count of the tasks listed in the document yields a different number entirely.

Please update these counts to be consistent.
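One way to keep these totals consistent is to recount mechanically rather than by hand. A minimal sketch, assuming task IDs follow the `P<phase>.<section>.<n>` pattern used in the tables above (the function name is illustrative):

```python
import re

# Matches table rows whose first cell is a task ID like P7.1.1
TASK_ROW = re.compile(r"^\|\s*P\d+\.\d+\.\d+\s*\|")

def count_tasks(markdown_text: str) -> int:
    """Count task rows in the task-list markdown."""
    return sum(1 for line in markdown_text.splitlines() if TASK_ROW.match(line))
```

The result could then be compared against the header's "Total Tasks" value in CI so the numbers cannot drift apart silently.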
```python
for lang in list(languages.keys())[:3]:  # Top 3 languages
    queries.append(f"language:{lang} topic:best-practices stars:>100")
    queries.append(f"language:{lang} topic:architecture stars:>50")

# Framework-based queries
for lang, fw_list in frameworks.items():
    for fw in fw_list[:2]:  # Top 2 frameworks per language
        # Extract framework name (before @)
        fw_name = fw.split('@')[0].lower()
        queries.append(f"{fw_name} stars:>50")

# Research area queries
research_areas = self.org_profile.get('challenges', {}).get('research_areas', [])
for area in research_areas[:5]:  # Top 5 research areas
    queries.append(f"topic:{area} stars:>100")
```
Hardcoded slicing like [:3], [:2], and [:5] is used to limit the number of languages, frameworks, and research areas for building search queries. Additionally, the main loop on line 211 is hardcoded to [:10] queries. These limits reduce the script's flexibility and should be moved to the discovery_config.yaml file to allow for easier tuning of the discovery process without requiring code changes.
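As a sketch of that change (the `query_limits` key and its field names are illustrative, not part of the current `discovery_config.yaml`):

```python
# Defaults mirroring the current hardcoded slice limits
DEFAULT_QUERY_LIMITS = {
    "max_languages": 3,
    "max_frameworks_per_language": 2,
    "max_research_areas": 5,
    "max_queries": 10,
}

def resolve_query_limits(config: dict) -> dict:
    """Overlay a parsed discovery_config.yaml (e.g. the dict from yaml.safe_load)
    onto safe defaults so missing keys keep today's behavior."""
    limits = dict(DEFAULT_QUERY_LIMITS)
    limits.update(config.get("query_limits") or {})
    return limits
```

The hardcoded slices then become e.g. `list(languages.keys())[:limits["max_languages"]]`, tunable without code changes.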
```python
# Analyze hotspots
if risk_data.get('hotspots'):
    high_risk_files = [h for h in risk_data['hotspots'] if h.get('risk_score', 0) >= 0.7]
    if len(high_risk_files) > 10:
        challenges['high_priority'].append({
            'category': 'code_quality',
            'issue': 'high_hotspot_count',
            'description': f'{len(high_risk_files)} files with high risk scores',
            'research_focus': ['refactoring', 'testing', 'complexity reduction']
        })

# Analyze ownership
if risk_data.get('ownership_risks'):
    single_owner = [r for r in risk_data['ownership_risks'] if 'SINGLE_CONTRIBUTOR' in r.get('flags', [])]
    if len(single_owner) > 5:
        challenges['high_priority'].append({
            'category': 'knowledge_concentration',
            'issue': 'bus_factor_risk',
            'description': f'{len(single_owner)} areas with single contributor',
            'research_focus': ['documentation', 'knowledge_sharing', 'pair_programming']
        })
```
The identify_challenges function contains several hardcoded thresholds for determining challenge priority (e.g., risk_score >= 0.7, len(high_risk_files) > 10, len(single_owner) > 5). This makes it difficult to tune the sensitivity of challenge detection. These values should be extracted into a configuration file (e.g., analysis_config.yaml or a new profiling_config.yaml) to improve maintainability and flexibility.
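A sketch of what that extraction could look like (the threshold names and `hotspot_challenge` helper are illustrative; the defaults mirror the current hardcoded values):

```python
from typing import Optional

# Defaults matching the hardcoded values in identify_challenges
DEFAULT_THRESHOLDS = {
    "hotspot_risk_score": 0.7,
    "max_high_risk_files": 10,
    "max_single_owner_areas": 5,
}

def hotspot_challenge(hotspots: list, thresholds: dict) -> Optional[dict]:
    """Flag a high-priority challenge when high-risk files exceed the configured limit."""
    t = {**DEFAULT_THRESHOLDS, **(thresholds or {})}
    high_risk = [h for h in hotspots if h.get("risk_score", 0) >= t["hotspot_risk_score"]]
    if len(high_risk) <= t["max_high_risk_files"]:
        return None
    return {
        "category": "code_quality",
        "issue": "high_hotspot_count",
        "description": f"{len(high_risk)} files with high risk scores",
        "research_focus": ["refactoring", "testing", "complexity reduction"],
    }
```

The `thresholds` dict would be loaded once from the chosen YAML file, so tuning detection sensitivity becomes a config edit rather than a code change.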
```python
# Rough approximation: 1 KB ≈ 30 lines of code
repo_loc_estimate = repo_size_kb * 30
```
The approximation repo_loc_estimate = repo_size_kb * 30 uses a magic number 30. This and other magic numbers used for scoring and normalization throughout the script (e.g., on lines 135, 142, 176, 181) should be defined as named constants at the top of the file or, even better, moved to the similarity_weights.yaml configuration file. This would make the scoring logic more transparent and easier to tune.
```python
        'above_threshold': len(filtered_repos),
        'threshold': threshold
    },
    'repositories': filtered_repos[:100]  # Top 100
```
AI Code Review by LlamaPReview
🎯 TL;DR & Recommendation
Recommendation: Request Changes
This PR introduces a comprehensive research system but has critical security vulnerabilities in file parsing, API rate limit handling that will cause frequent failures, and scoring algorithm inaccuracies that undermine core functionality.
📄 Documentation Diagram
This diagram documents the core workflow of the new recursive research system from organization profiling to repository discovery and similarity scoring.
```mermaid
sequenceDiagram
    participant OP as Organization Profiler
    participant DE as Discovery Engine
    participant SS as Similarity Scorer
    participant GH as GitHub API
    participant OR as Output Results
    OP->>OP: Extract tech stack
    OP->>OP: Identify challenges
    OP->>DE: Organization profile
    DE->>GH: Build and execute search queries
    GH-->>DE: Repository metadata
    DE->>SS: Discovered repositories
    SS->>SS: Calculate multi-dimensional scores
    SS->>OR: Ranked repositories
    note over DE,SS: PR #4 implements Phases 1-2<br/>with profiling and discovery
```
🌟 Strengths
- Architecturally sound foundation for automated research and pattern discovery
- Comprehensive documentation and configuration system supporting future phases
| Priority | File | Category | Impact Summary | Anchors |
|---|---|---|---|---|
| P1 | scripts/research/discover_repos.py | Architecture | GitHub API rate limit handling causes frequent failures | path:config/research/discovery_config.yaml |
| P1 | scripts/research/extract_tech_stack.py | Security | Arbitrary JSON file reading creates security vulnerabilities | |
| P1 | scripts/research/similarity_scorer.py | Bug | Inaccurate scale similarity scoring undermines matching | path:config/research/similarity_weights.yaml |
| P2 | Makefile | Maintainability | Complex inline Python scripts are hard to maintain | |
| P2 | scripts/research/profile_org.py | Architecture | Subprocess calls reduce efficiency and complicate error handling | path:scripts/research/extract_tech_stack.py |
| P2 | config/research/analysis_config.yaml | Security | Workspace directory lacks isolation for cloning untrusted repos | path:docs/ROADMAP_RECURSIVE_RESEARCH_SYSTEM.md |
| P2 | requirements-research.txt | Maintainability | Dependency versions lack upper bounds risking breaks |
🔍 Notable Themes
- Security Hardening Needed: Multiple findings highlight vulnerabilities in file parsing and workspace isolation that could be exploited in production.
- API Integration Robustness: Rate limiting and error handling improvements are critical for reliable GitHub API usage.
- Maintainability Enhancements: Build scripts and dependency management would benefit from standardization and error handling.
📈 Risk Diagram
This diagram illustrates the GitHub API integration risks and file parsing vulnerabilities identified in the research system.
```mermaid
sequenceDiagram
    participant User
    participant DE as Discovery Engine
    participant GH as GitHub API
    participant FS as File System
    participant SS as Similarity Scorer
    User->>DE: Start discovery
    DE->>GH: Search queries
    note over DE,GH: R1(P1): Rate limit handling<br/>may cause frequent failures
    GH-->>DE: Repository data
    DE->>FS: Read package manifests
    note over DE,FS: R2(P1): Arbitrary JSON reading<br/>creates security vulnerabilities
    DE->>SS: Pass data for scoring
    note over SS: R3(P1): Hardcoded approximations<br/>lead to inaccurate similarity scores
```
```python
    return results

def discover_from_all_sources(self) -> List[Dict[str, Any]]:
```
P1 | Confidence: High
The GitHub API integration lacks robust rate limit handling. The current implementation uses a simple time.sleep(1) but doesn't respect GitHub's actual rate limits (5000 requests/hour, 30 requests/minute). The discovery_config.yaml defines these limits but the code doesn't implement proper rate limiting logic. This will cause frequent rate limit exceptions in production use, especially when running multiple queries.
Code Suggestion:

```python
def check_and_wait_rate_limit(self):
    """Check GitHub rate limit and wait if necessary."""
    rate_limit = self.github.get_rate_limit()
    core = rate_limit.core
    if core.remaining < 10:  # Buffer threshold
        reset_time = core.reset.replace(tzinfo=None)
        wait_seconds = (reset_time - datetime.utcnow()).total_seconds() + 10
        print(f"[DISCOVERY] Rate limit low. Waiting {wait_seconds} seconds...")
        time.sleep(max(1, wait_seconds))
```

Evidence: path:config/research/discovery_config.yaml
```python
    return file_stats, found_files

def detect_frameworks(root_path: str, files: List[str]) -> Dict[str, List[str]]:
```
P1 | Confidence: High
The code reads arbitrary JSON files without validation, creating a path traversal and deserialization vulnerability. An attacker could exploit this by placing malicious package.json files with circular references or extremely large payloads, potentially causing denial of service or remote code execution through JSON deserialization attacks.
Code Suggestion:
```python
def safe_json_load(file_path: str, max_size: int = 10 * 1024 * 1024) -> Dict:
    """Safely load JSON file with size and content validation."""
    if os.path.getsize(file_path) > max_size:
        return {}
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            return json.load(f)
    except (json.JSONDecodeError, UnicodeDecodeError):
        return {}
```

```python
def calculate_scale_similarity(org_profile: Dict, repo: Dict, weights: Dict) -> float:
    """Calculate scale/size similarity."""
    score = 0.0
    w = weights.get('scale', {})

    # Repository size similarity
    org_loc = org_profile.get('metrics', {}).get('total_lines', 0)
    repo_size_kb = repo.get('size', 0)
```
P1 | Confidence: High
The scale similarity calculation uses a hardcoded approximation (1KB = 30 LOC) that doesn't account for language differences. This will produce inaccurate similarity scores since different languages have vastly different line-to-byte ratios (e.g., Python vs. minified JavaScript). The related similarity_weights.yaml configures scale weight at 15%, making this a significant scoring component.
Code Suggestion:
```python
# Language-specific approximations (lines per KB)
LANGUAGE_DENSITY = {
    'Python': 25, 'JavaScript': 15, 'TypeScript': 15,
    'Java': 10, 'Go': 20, 'Rust': 18, 'C++': 8
}

def estimate_loc_from_size(repo_size_kb: int, primary_language: str) -> int:
    density = LANGUAGE_DENSITY.get(primary_language, 20)
    return repo_size_kb * density
```

Evidence: path:config/research/similarity_weights.yaml
```makefile
research-report: research-similarity
	@echo "========================================="
	@echo "Research System Summary"
	@echo "========================================="
	@echo ""
	@echo "Organization Profile:"
	@python3 -c "import json; p=json.load(open('$(ORG_PROFILE)')); print(f\" Fingerprint: {p.get('fingerprint', 'unknown')}\"); print(f\" Languages: {', '.join(list(p.get('metrics', {}).get('primary_languages', []))[:5])}\"); print(f\" Research Areas: {len(p.get('challenges', {}).get('research_areas', []))}\"); print(f\" High Priority Challenges: {len(p.get('challenges', {}).get('high_priority', []))}\")"
```
P2 | Confidence: High
The Makefile embeds complex Python one-liners that are difficult to maintain and debug. These inline scripts lack proper error handling and will fail silently if JSON structure changes. This violates the principle of keeping build logic separate from complex data processing.
Code Suggestion:
```makefile
research-report: research-similarity
	@echo "========================================="
	@echo "Research System Summary"
	@echo "========================================="
	@python3 scripts/research/generate_report.py \
		--profile $(ORG_PROFILE) \
		--scores $(SIMILARITY_SCORES)
```

```text
# Research System Dependencies
```
P2 | Confidence: High
The dependencies file specifies minimum versions but doesn't include upper bounds or compatibility constraints. This could lead to breaking changes when dependencies update, especially for major version bumps in pandas/numpy. The current constraints don't protect against known incompatible versions.
Suggested change:

```text
# Research System Dependencies
PyGithub>=2.1.1,<3.0.0
PyYAML>=6.0.1,<7.0.0
pandas>=2.0.0,<3.0.0
numpy>=1.24.0,<2.0.0
```
```python
from typing import Dict, Any

def run_script(script_name: str, args: list) -> Dict[str, Any]:
```
P2 | Confidence: Medium
The orchestrator uses subprocess calls to run Python modules instead of direct imports. This creates unnecessary process overhead and complicates error handling and data passing. The system would be more efficient and maintainable using direct Python imports and function calls.
Code Suggestion:
```python
def run_tech_stack_extraction(codebase_path: str, output_path: str) -> Dict[str, Any]:
    """Run tech stack extraction as module import."""
    try:
        from .extract_tech_stack import extract_tech_stack
        return extract_tech_stack(codebase_path, output_path)
    except ImportError as e:
        return {'success': False, 'error': f'Import failed: {e}'}
```

Evidence: path:scripts/research/extract_tech_stack.py, path:scripts/research/similarity_scorer.py
```yaml
last_updated: "2025-11-18"

# Cloning configuration
cloning:
```
P2 | Confidence: Medium
Speculative: The Phase 3 analysis configuration uses /tmp/research_clones as workspace directory without proper isolation. This creates potential security risks when cloning untrusted repositories, including path traversal attacks and conflicts between parallel analysis runs. The roadmap indicates Phase 3 will involve cloning external repositories.
Code Suggestion:
```yaml
cloning:
  workspace_dir: "/tmp/research_clones_${TIMESTAMP}_${RANDOM_SUFFIX}"
  use_docker: true
  docker_image: "research-analysis:latest"
  read_only_mounts: true
```

Evidence: path:docs/ROADMAP_RECURSIVE_RESEARCH_SYSTEM.md
This PR has been automatically marked as stale because it has not had recent activity. It will be closed in 14 days if no further activity occurs. |
User description
Implements a comprehensive system for automatically discovering,
analyzing, and learning from similar organizations and repositories
to continuously improve architecture governance practices.
What's Added
Core System (Phases 1-2 Complete)
Phase 1: Organization Profiling
Phase 2: Repository Discovery
Scripts

- `scripts/research/profile_org.py` - Organization profiling orchestrator
- `scripts/research/extract_tech_stack.py` - Technology detection
- `scripts/research/discover_repos.py` - Repository discovery engine
- `scripts/research/similarity_scorer.py` - Similarity calculation

Configuration

- `config/research/discovery_config.yaml` - Search parameters
- `config/research/similarity_weights.yaml` - Scoring weights
- `config/research/analysis_config.yaml` - Analysis settings (Phase 3)
- `config/research/prioritization_weights.yaml` - Recommendation weights (Phase 5)

Documentation

- `docs/ROADMAP_RECURSIVE_RESEARCH_SYSTEM.md` - Complete vision (22 weeks)
- `docs/TASK_LIST_RESEARCH_SYSTEM.md` - Detailed tasks (151 items, 522 hours)
- `docs/research/README.md` - System overview
- `docs/research/RESEARCH_QUICKSTART.md` - Step-by-step guide

Build System

- Updated `Makefile` with research-* targets
- `requirements-research.txt` - Python dependencies

Makefile Targets

- `make research-profile` - Create organization profile
- `make research-discover` - Discover similar repositories
- `make research-similarity` - Calculate similarity scores
- `make research-report` - Generate summary report
- `make research-full` - Run complete research cycle
- `make research-check-deps` - Verify dependencies
- `make research-clean` - Remove artifacts

Key Features
Multi-Dimensional Similarity Scoring
Repositories ranked by 5 dimensions:
Intelligent Discovery
Recursive Design
Foundation for continuous improvement:
Usage
What's Next
Phase 3: Automated Analysis (In Progress)
Phase 4: Pattern Recognition
Phase 5: Recommendations
Phase 6: Recursive Refinement
Benefits
Architecture
Directory structure:
Implements roadmap items for automated research, pattern discovery,
and continuous improvement of architecture governance toolkit.
Related: #research #automation #ml #best-practices
PR Type
Enhancement, Documentation
Description
Implements a comprehensive two-phase recursive research system for discovering and analyzing similar organizations and repositories to improve architecture governance practices
Phase 1: Organization Profiling - Detects technology stacks, extracts architecture patterns, aggregates baseline metrics, and identifies organizational challenges
Phase 2: Repository Discovery - Integrates with GitHub API to discover similar repositories using multi-dimensional similarity scoring (tech stack, problem domain, scale, activity, maturity)
Adds four core research scripts: `profile_org.py` (orchestrator), `extract_tech_stack.py` (fingerprinting), `discover_repos.py` (GitHub discovery), and `similarity_scorer.py` (ranking algorithm)

Provides a comprehensive configuration system with five YAML files for discovery parameters, similarity weights, analysis settings, and prioritization rules

Includes extensive documentation: strategic roadmap (22-week vision), detailed task list (151 tasks, 522 hours), quick start guide, and system overview

Adds seven new `make` targets for research workflow automation: `research-profile`, `research-discover`, `research-similarity`, `research-report`, `research-full`, `research-check-deps`, and `research-clean`

Establishes artifact directory structure for profiles, discoveries, analysis results, patterns, recommendations, and feedback logs
Includes Python dependencies file with PyGithub, PyYAML, pandas, numpy, and optional ML/NLP libraries for future phases
Diagram Walkthrough
File Walkthrough
5 files
similarity_scorer.py
Multi-dimensional repository similarity scoring engine (scripts/research/similarity_scorer.py)
discovered repositories against organization profile
domain (keywords), scale (size/complexity), activity patterns, and
maturity alignment
overall similarity scores
with detailed breakdown
extract_tech_stack.py
Technology stack fingerprinting and detection system (scripts/research/extract_tech_stack.py)
infrastructure patterns
etc.) to extract framework versions
configuration files
profile as JSON
discover_repos.py
GitHub-based repository discovery with query generation (scripts/research/discover_repos.py)
integration
frameworks, research areas)
repositories with metadata
profile_org.py
Organization profiling orchestrator and fingerprint generator (scripts/research/profile_org.py)
and aggregating existing risk data
gaps
metrics, and research priorities
analysis phases
Makefile
Add Research System Make Targets and Commands (Makefile)
`research-check-deps`, `research-profile`, `research-discover`, `research-similarity`, `research-report`, `research-full`, and `research-clean`
management for profiles, discoveries, analysis, patterns, recommendations, and feedback
`GITHUB_TOKEN` and reorganizes help output with "Core Analysis" and "Research System (NEW)" sections
research cycle orchestration with progress reporting
4 files
TASK_LIST_RESEARCH_SYSTEM.md
Detailed implementation task list for research system (docs/TASK_LIST_RESEARCH_SYSTEM.md)
implementation (151 total tasks)
and dependencies
recommendations, and refinement phases
distribution and team sizing estimates
ROADMAP_RECURSIVE_RESEARCH_SYSTEM.md
Strategic roadmap for recursive research system architecture (docs/ROADMAP_RECURSIVE_RESEARCH_SYSTEM.md)
development plan (22 weeks total)
discovery, analysis, patterns, recommendations, and refinement
output artifacts
criteria, and future enhancements
RESEARCH_QUICKSTART.md
User-friendly quick start guide for research system (docs/research/RESEARCH_QUICKSTART.md)
`make` targets
organization profiles
analysis workflow
README.md
Complete Research System Documentation and User Guide (docs/research/README.md)
System with system overview, architecture diagrams, and feature
descriptions
and running the research cycle
weights with example YAML snippets
automated analysis and recursive learning
1 file
requirements-research.txt
Python dependencies for research system implementation (requirements-research.txt)
PyYAML, pandas, numpy
for future phases
scraping libraries
4 files
prioritization_weights.yaml
Recommendation Prioritization and Scoring Configuration (config/research/prioritization_weights.yaml)
alignment, effort, and risk factors
testing, documentation, devops, tooling)
deadlines with time-based factors
configuration for recommendation ranking
analysis_config.yaml
Repository Analysis Pipeline Configuration (config/research/analysis_config.yaml)
timeouts, and parallel processing
devops, and documentation analysis
infrastructure-as-code, monitoring, and security tools
formatting for the analysis pipeline
similarity_weights.yaml
Repository Similarity Scoring Weights Configuration (config/research/similarity_weights.yaml)
tech stack (30%), problem domain (25%), scale (15%), activity (15%),
and maturity (15%)
tool) and domain similarity (topics, README, description)
documentation, test coverage, and CI/CD; penalties for security
vulnerabilities and stale repos
ranges for scale and maturity comparisons
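The weighted split described for this config (tech stack 30%, problem domain 25%, scale 15%, activity 15%, maturity 15%) reduces to a simple dot product over per-dimension scores. A minimal sketch (`overall_similarity` is an illustrative name, not necessarily the function in `similarity_scorer.py`):

```python
# Dimension weights as documented for similarity_weights.yaml (sum to 1.0)
DIMENSION_WEIGHTS = {
    "tech_stack": 0.30,
    "problem_domain": 0.25,
    "scale": 0.15,
    "activity": 0.15,
    "maturity": 0.15,
}

def overall_similarity(scores: dict) -> float:
    """Weighted sum of per-dimension scores, each expected in [0, 1]."""
    return round(sum(w * scores.get(dim, 0.0) for dim, w in DIMENSION_WEIGHTS.items()), 4)
```

Because the weights sum to 1.0, the overall score stays in [0, 1], which keeps ranked results directly comparable across runs.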
discovery_config.yaml
Repository Discovery Search and Filtering Configuration (config/research/discovery_config.yaml)
criteria (min stars, recency), and result pagination
profile, manual queries, and template-based queries
lists, and topic-based discovery with organization similarity analysis
fork/archive handling, and blocklist/allowlist support with caching
and retry logic