Releases: MrBinnacle/azimuth

v1.4.0 — Minimal Architecture Redesign

18 May 20:15

What changed

SKILL.md reduced from 865 to 160 lines. The partial-load problem is eliminated: the entire skill file now fits inside the always-loaded zone, so all 5 load-bearing behavioral rules fire reliably regardless of conversation length.

Architecture

  • Depth-on-demand via three new reference files: references/module-guide.md, references/mode-behaviors.md, and references/output-template.md. Module bodies, output format scaffolding, anti-slop rules, escalation logic, and heuristics load conditionally per mode. FAST mode loads only what it needs; STANDARD/RAPID/DEEP load the full depth.

  • Five load-bearing rules now explicit and always loaded — M4 PRE-CHECK, M2 sycophancy circuit-breaker, M10 confidence ceiling, M1 commitment-state inference, and the output lead rule are in a named block at lines 76–86. Previously all five were post-line-225 and invisible under partial load.

  • Post-build eval passed — 3 scenarios: DEEP full-load (EU SaaS expansion), FAST file-load discipline (CI/CD migration), M4 PRE-CHECK self-advocacy (VP Sales hire). All 5 load-bearing rules behaved correctly across all scenarios.
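
The placement guarantee described above (all load-bearing rules inside the always-loaded zone) can be checked mechanically. The following is a minimal sketch, not part of the release: the function name, the marker strings, and the 225-line zone limit are assumptions, the last inferred from the "post-line-225" wording in these notes.

```python
def rules_outside_zone(skill_text, markers, zone_limit=225):
    """Return the markers that are missing or first appear after zone_limit.

    Line numbers are 1-indexed. zone_limit=225 is an assumption taken from
    the "post-line-225" wording in the release notes, not a documented value.
    """
    first_seen = {}
    for lineno, line in enumerate(skill_text.splitlines(), start=1):
        for marker in markers:
            if marker in line and marker not in first_seen:
                first_seen[marker] = lineno
    # A marker that never appears is treated as outside the zone.
    return [m for m in markers if first_seen.get(m, zone_limit + 1) > zone_limit]


if __name__ == "__main__":
    # Hypothetical 160-line file with one rule marker on line 80.
    sample = "\n".join(["filler"] * 79 + ["M4 PRE-CHECK: ..."] + ["filler"] * 80)
    print(rules_outside_zone(sample, ["M4 PRE-CHECK", "M10 confidence ceiling"]))
```

An empty result means every marker was found inside the zone; anything returned is either absent or placed too late to survive a partial load.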

Skill fixes

  • FIX-3: SPOF + unconfirmed dependency now produces a VERDICT-BLOCKING verdict consequence (DELAY PENDING EVIDENCE), not just an action list.
  • M10 confidence ceiling promoted to Core Principle 8 — always loaded.
  • M4 PRE-CHECK clarifications: question mapping for assistant-as-proposer, FAST mode self-advocacy disclosure.
  • FAST mode domain template loading explicitly clarified.

Install / upgrade

npx skills add https://github.com/MrBinnacle/azimuth --skill azimuth -a claude-code -y

Full changelog: CHANGELOG.md

v1.3.0 — Routing spec, verdict semantics, adversarial robustness

12 May 14:43

See the tagged v1.3.0 CHANGELOG for full notes:
https://github.com/MrBinnacle/azimuth/blob/v1.3.0/CHANGELOG.md#130--2026-05-08

Highlights

  • Phrasing vs. stakes tiebreaker: stakes signals always win, escalation always visible
  • Verdict trichotomy named: action / refusal / alternative-deliverable
  • RESIDUAL-RISK-REGISTER: positive output spec added (3–5 residual risks with leading indicators, escalation triggers, owners)
  • PROCEED WITH SAFEGUARDS: cap added (>3 structural changes or scope/budget/headcount impact → unavailable)
  • Reframe-to-WRONG-TOOL escape closed via adversarial reframe gate in Module 1
  • Module 4 / Module 9 roles in Output Reduction specified
  • RAPID mode Module 7 omission made explicit
  • DEEP mode gotchas.md activation rule clarified
  • Interaction Effects count aligned to 1–5 across spec and template
  • FAST mode PRE-CHECK disclosure extended to name self-proposal gap
  • Escalation header positioned in output template as zeroth line
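
The PROCEED WITH SAFEGUARDS cap in the highlights above reads naturally as a predicate. A minimal sketch under stated assumptions: the `SafeguardPlan` type and its field names are hypothetical, and only the thresholds come from the release notes.

```python
from dataclasses import dataclass


@dataclass
class SafeguardPlan:
    # Hypothetical shape for the safeguards a verdict would require.
    structural_changes: int
    affects_scope: bool = False
    affects_budget: bool = False
    affects_headcount: bool = False


def proceed_with_safeguards_available(plan: SafeguardPlan) -> bool:
    """Apply the cap: more than 3 structural changes, or any scope,
    budget, or headcount impact, makes the verdict unavailable."""
    if plan.structural_changes > 3:
        return False
    return not (plan.affects_scope or plan.affects_budget or plan.affects_headcount)


if __name__ == "__main__":
    print(proceed_with_safeguards_available(SafeguardPlan(structural_changes=2)))
    print(proceed_with_safeguards_available(SafeguardPlan(2, affects_budget=True)))
```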

Note: two later master commits update README/landing-page presentation and accessibility only; they do not change SKILL.md behavior and are tracked under [Unreleased] → Meta.

v1.2.3 — Documentation & Characterization Release

08 May 02:14

Documentation and characterization release — coverage testing program complete (6 sessions, Tier 1 + Tier 2), M4 PRE-CHECK validated as LOAD-BEARING in unprimed test, M10 confidence ceiling operative domain narrowed, M5 verdict-delta documented as headline constraint-vs-guide finding. No SKILL.md behavioral changes.


What's in this release

Coverage testing program complete (Tier 1 + Tier 2, 6 sessions). Production-vs-control paired comparison for 6 hooks: M9 (Mitigation Design), M1 (Objective Integrity — WRONG TOOL and RESIDUAL-RISK-REGISTER branches), M4 (PRE-CHECK self-proposal), M5 (Dependency Fragility), M8 (Detectability & Recovery), gotchas.md (3-pattern sample). Control agent reads SKILL.md with targeted hook replaced by a redaction marker; no agent informed of paired comparison. Results: evals/results/. Synthesis: evals/methodology/coverage-program-synthesis.md.

Hook classification summary (Tier 1 + Tier 2 scope). LOAD-BEARING: M4 PRE-CHECK. PARTIAL: gotchas.md (named patterns and check questions uncompensated; underlying risk identification compensated by software-failure-patterns.md + diagnostics). CORROBORATING: M9, M1 WRONG TOOL branch, M5, M8. UNCLASSIFIED: M1 RESIDUAL-RISK-REGISTER (adversarial input confound). Cumulative load-bearing count including prior evals (M2 circuit-breaker, M10 confidence ceiling): at minimum 3.

M4 PRE-CHECK reclassified LOAD-BEARING (unprimed test). Control exited WRONG TOOL via pre-225 "Do Not Use When" clause — no analysis. Production inferred self-proposal from conversation history and delivered full analysis with self-proposer reframing. The "Do Not Use When" clause and PRE-CHECK are complementary, not redundant: clause covers self-advocacy → exit; PRE-CHECK covers self-advocacy → proceed with reframing.

M10 confidence ceiling operative domain narrowed. Operative domain is UNSUPPORTED top assumption + strong secondary evidence. In mixed-evidence scenarios where the top assumption is CONTRADICTED, assumption classification anchors self-assessment at MEDIUM without the ceiling; ceiling is CORROBORATING in that domain.

M5 verdict-delta documented as headline constraint-vs-guide finding. Production reached PROCEED WITH SAFEGUARDS; unanchored control reached DELAY PENDING EVIDENCE for an unsigned vendor contract on the core capability. Module taxonomy may bias toward structured action frame when a conserving delay is the correct call. Single data point.

Position-correlated redundancy documented. Most post-225 enforcement mechanisms fail simultaneously under partial load. Only M1 WRONG TOOL branch has position-diverse redundancy (pre-225 "Do Not Use When" clause survives partial load). Under partial load: exit-path coverage maintained; mitigation quality enforcement, detectability taxonomy, and named pattern identification degrade.

Case-study load-condition check complete. Healthcare.gov: 5 of 6 findings confirmed hook-dependent under full load; claims restricted to full-load condition with disclosure language in place.

No SKILL.md behavioral changes in this release.

v1.2.0 — Intake routing, incentive interview, RAPID mode, scope guardrails, org-change domain

07 May 15:06

What's in v1.2.0

Added

  • Intake Routing (pre-analysis triage). Three-layer triage fires before the 10-module pipeline: Layer 1 maps situation type to mode (stress-test → proceed; post-decision validation or pre-plan exploration → firm out-of-scope exits); Layer 2 maps stakes and reversibility to FAST / STANDARD / DEEP / RAPID; Layer 3 routes domain to the correct template. Bypass handling for users who supply structured context directly.

  • Module 4 expanded to Incentive Scan & Interview. Seven structured questions collect incentive context before any inference from plan text. GREEN / YELLOW / RED tiering: RED locks verdict confidence at LOW and removes PROCEED verdicts.

  • Module 10 RED-tier enforcement. Pre-verdict check tests Module 4 tier before verdict selection. PROCEED verdicts are blocked at Module 10, not just declared in Module 4.

  • RAPID mode. For high-stakes decisions under 24-hour time constraints. Modules 1, 4, 8, 10 at full depth; Modules 2, 3, 5, 6, 9 abbreviated.

  • FAST mode disclosure. FAST outputs explicitly state that Module 4 interview was not conducted and incentive misalignment is unverified.

  • LLM bias externalizations at four modules. Sycophancy circuit-breaker (Module 2), availability inversion (Module 6), domain calibration boundary (Module 7), verdict softening pre-check (Module 10).

  • STANDARD mode conditional gotchas.md load. Fires on RED tier or canonical-only failure chains.

  • WRONG TOOL verdict. Fires when the input is not a pre-commitment decision question (fact-finding, architecture review, exploration). No analysis produced.

  • RESIDUAL-RISK-REGISTER verdict. Fires when the decision is already made or execution is substantially underway. No go/no-go analysis produced.

  • Module 1 input classification. Determines pre-commitment, post-commitment, or non-decision — drives pre-verdict check items 4 and 5.

  • Module 7 backpropagation check. After base-rate grounding, reviews Module 6 chains for the historically common failure mode; adds to register if absent.

  • Module 4 Incentive Alignment Scan adapted for three templates. codebase-azimuth.md, product-launch-azimuth.md, and hiring-azimuth.md now have domain-adapted actor matrices with flag conditions and confidence ceiling consequences.

  • Market Timing and External Conditions gate in product-launch-azimuth.md. Covers competitive timing, regulatory clearance, platform dependencies, and market condition changes.

  • references/org-change-patterns.md. Six structural failure patterns for restructures: Symbolic Restructure, Change Fatigue Stacking, Informal Authority Network Destruction, Communication Sequencing Failure, Behavioral Change Timeline Compression, Accountability Transfer Gap.

  • references/base-rates.md — Structured Failure Analysis section. Fasolo, Heard & Scopelliti 2025 (Journal of Management); Roose, Lehman & Veinott 2023 (Human Factors).

  • templates/secondaries-ic-azimuth.md. IC recommendation template for PE secondaries investment partners. Adverse selection gate, process integrity gate (ILPA 2023), NAV reliability assessment, GP alignment signals, pricing discipline. Verdict taxonomy: COMMIT-AT-PRICE / BID-BELOW-INDICATED / COUNTER-AT-PRICE / CONDITIONAL-ON-TERMS / PASS-PROCESS / PASS-PRICING.

  • Layer 3 routing: domain option 6 — Org change / restructure. DEEP mode loads references/org-change-patterns.md.
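
The RED-tier behavior described in the Module 4 and Module 10 entries above can be sketched as a filter applied before verdict selection. This is an illustration only: the function name is hypothetical, the treatment of GREEN and YELLOW as pass-through is an assumption (the notes only specify RED), and matching PROCEED verdicts by name prefix is a simplification.

```python
def apply_incentive_tier(tier, candidate_verdicts, confidence):
    """Return (allowed_verdicts, confidence) after the Module 4 tier check.

    RED locks confidence at LOW and removes every PROCEED verdict.
    Other tiers are passed through unchanged, which is an assumption;
    the release notes do not describe their effect.
    """
    if tier != "RED":
        return list(candidate_verdicts), confidence
    allowed = [v for v in candidate_verdicts if not v.startswith("PROCEED")]
    return allowed, "LOW"


if __name__ == "__main__":
    verdicts = ["PROCEED WITH SAFEGUARDS", "DELAY PENDING EVIDENCE"]
    print(apply_incentive_tier("RED", verdicts, "HIGH"))
    print(apply_incentive_tier("GREEN", verdicts, "HIGH"))
```

Running the check at verdict selection rather than only recording the tier mirrors the point of the Module 10 entry: the block is enforced where the verdict is chosen.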

Changed

  • gotchas.md §7: Survivorship Framing → Plan-Revision Gap. Zero coverage elsewhere; HIGH empirical confidence (Roose 2023, N=68 real teams).

  • Routing redirects tightened. Out-of-scope responses state only what AZIMUTH cannot do and why. No alternative framings, no guidance on deriving missing information.

  • Verdict taxonomy standardized across templates. product-launch-azimuth.md and hiring-azimuth.md now use Module 10 canonical verdict names throughout.

v1.1.2 — Counterfactual layer and coupling pass

05 May 14:49

What's new

Counterfactual layer (Module 2)

Module 2 now runs a Falsifiers pass after classifying assumptions. For every strong or partial assumption, it names the specific, observable evidence that would prove it wrong. Falsifiers must be concrete and measurable — a named metric and threshold, not "if it doesn't work." Unsupported assumptions are excluded (already flagged for validation).

New output section: Falsifiers — after Weak Assumptions. Omitted if no strong or partial assumptions exist.
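
The Falsifiers pass described above can be sketched as a filter over classified assumptions. The `Assumption` type, its field names, and the example data are hypothetical; only the eligibility rule and the omit-when-empty rule come from the notes.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Assumption:
    text: str
    strength: str  # "strong", "partial", or "unsupported"
    falsifier: Optional[str] = None  # a named metric and threshold


def falsifiers_section(assumptions: List[Assumption]) -> Optional[List[str]]:
    """Return falsifier lines, or None when the section should be omitted."""
    # Unsupported assumptions are excluded; they are already flagged for validation.
    eligible = [a for a in assumptions if a.strength in ("strong", "partial")]
    if not eligible:
        return None
    lines = []
    for a in eligible:
        if not a.falsifier:
            # Every eligible assumption must name concrete disproving evidence.
            raise ValueError(f"missing falsifier for: {a.text}")
        lines.append(f"{a.text} -> {a.falsifier}")
    return lines


if __name__ == "__main__":
    # Illustrative data only.
    audit = [
        Assumption("Demand holds after launch", "strong", "trial signups below 50 per week"),
        Assumption("Vendor ships on time", "unsupported"),
    ]
    print(falsifiers_section(audit))
```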

Coupling pass (Module 6)

Module 6 now identifies pair-interactions after constructing the 3 independent failure chains. The bar is specific: when two risks fire together, one must block the other's recovery path or mask its visible signal. 3–5 interactions maximum; section omitted if no genuine multiplicative interactions exist.

New output section: Interaction Effects — after Likely Failure Paths. Omitted if no genuine multiplicative interactions exist.
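
The coupling bar described above (one risk must block the other's recovery path or mask its visible signal) can be sketched as a pairwise check. The `Risk` type and its fields are hypothetical, and reading "3–5 interactions maximum" as a hard cap of 5 is an assumption.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Risk:
    name: str
    # Names of other risks whose recovery this one blocks or whose signal it masks.
    blocks_recovery_of: frozenset = frozenset()
    masks_signal_of: frozenset = frozenset()


def couples(a: Risk, b: Risk) -> bool:
    """True when either risk blocks the other's recovery or masks its signal."""
    return (
        b.name in a.blocks_recovery_of
        or b.name in a.masks_signal_of
        or a.name in b.blocks_recovery_of
        or a.name in b.masks_signal_of
    )


def interaction_effects(risks, max_items=5):
    """Return coupled pairs (capped), or None when the section is omitted."""
    pairs = [(a.name, b.name) for a, b in combinations(risks, 2) if couples(a, b)]
    return pairs[:max_items] or None


if __name__ == "__main__":
    # Illustrative data only.
    risks = [
        Risk("vendor outage", blocks_recovery_of=frozenset({"data loss"})),
        Risk("data loss"),
        Risk("slow adoption"),
    ]
    print(interaction_effects(risks))
```

Risks that merely co-occur do not qualify under this check, which matches the stated intent of listing only multiplicative interactions.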

Eval results

| Case | Gate | Result |
|---|---|---|
| 06: Strong Evidence Only | Falsifiers | PASS: 5 specific falsifiers, each names what to measure |
| 05: Stacked Medium Risks | Interaction Effects | PASS: 3 genuine pair-interactions with multiplicative mechanisms |
| 01: INSUFFICIENT SIGNAL | Regression | PASS: fires correctly, new sections correctly absent |
| 02: Proceed With Safeguards | Regression | PASS: both new sections present, structure intact |
| 03: High-Stakes Rewrite | Regression | PASS: both new sections present, no degradation |

v1.1.1 — INSUFFICIENT SIGNAL verdict + M&A domain coverage

05 May 13:50

What's new

INSUFFICIENT SIGNAL verdict state

Module 10 now refuses to return a verdict when the input is too sparse, vague, or contradictory to support honest analysis.

Trigger conditions (any one is sufficient):

  • Core required inputs (objective, scope, reversibility, or downside) are absent and cannot be reasonably inferred
  • The stated objective is so undefined that no assumption audit is possible
  • Input is internally contradictory in a way that cannot be resolved without user clarification
  • Producing any of the six standard verdicts would require inventing facts the user did not supply

When INSUFFICIENT SIGNAL fires: returns only a Missing Inputs section listing what is absent and which question — if answered — would most unlock the analysis. No verdict, no confidence level, no mitigations, no padding.
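
The trigger logic above is an any-of check followed by a reduced output. A minimal sketch, with several simplifications that are assumptions rather than documented behavior: function and parameter names are hypothetical, an absent input is treated as not inferable, and the "most unlocking question" is left out.

```python
REQUIRED_INPUTS = ("objective", "scope", "reversibility", "downside")


def insufficient_signal_triggers(inputs, objective_undefined=False,
                                 contradictory=False, needs_invented_facts=False):
    """Return the list of trigger conditions that fired (any one is sufficient)."""
    missing = [k for k in REQUIRED_INPUTS if not inputs.get(k)]
    triggers = []
    if missing:
        triggers.append("missing core inputs: " + ", ".join(missing))
    if objective_undefined:
        triggers.append("objective too undefined for an assumption audit")
    if contradictory:
        triggers.append("input is internally contradictory")
    if needs_invented_facts:
        triggers.append("a verdict would require invented facts")
    return triggers


def module10_gate(inputs, **flags):
    """Return only a Missing Inputs section when any trigger fires, else None."""
    triggers = insufficient_signal_triggers(inputs, **flags)
    if triggers:
        # No verdict, confidence level, or mitigations are attached.
        return {"Missing Inputs": triggers}
    return None


if __name__ == "__main__":
    print(module10_gate({"objective": "Expand to EU"}))
```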

Anti-slop rule added: do not substitute DELAY PENDING EVIDENCE for INSUFFICIENT SIGNAL when the block is missing input, not missing time.

M&A / partnership domain coverage

  • references/ma-partnership-patterns.md — 8 failure patterns: Strategic Rationale Substitution, Integration Timeline Compression, Synergy Overestimation, Key Talent Flight, Due Diligence Gap, Partnership Incentive Drift, Dependency Lock-In, Governance Vacuum Post-Close
  • templates/partnership-azimuth.md — analysis template for M&A, acquisitions, strategic partnerships, and significant vendor relationships
  • DEEP mode now routes M&A/partnership decisions to both reference and template files

v1.1.0 — Audit fixes

04 May 17:32

[1.1.0] — 2026-05-04

This release implements the full set of fixes surfaced in the v1.0.0 independent audit. No breaking changes for users invoking the skill manually; behavior changes apply to automatic invocation and to default output structure.

Changed — invocation behavior

  • Tightened SKILL.md frontmatter description (669 → 489 chars). Removed over-broad triggers ("validate our plan," "timeline check," "user sounds overconfident/vague"). Now requires explicit user request to evaluate an initiative-level decision with meaningful downside. Added explicit Do NOT invoke for… clause to the description itself, not just the body.
  • Added explicit mode-selection signals for FAST / STANDARD / DEEP. Previously the body described when to use each mode but did not give Claude concrete decision rules. Now lists specific signals (reversibility, capital outlay, headcount changes, public exposure, scope window, user phrasing) per mode. Default escalation rule: when signals conflict, escalate; never silently downgrade.
  • Defined diagnostic-loading rule for STANDARD mode. Previously per-module instructions (for deep runs, load X) conflicted with the global DEEP-mode instruction. Diagnostics now load conditionally in STANDARD when the corresponding module surfaces a high-severity finding. DEEP continues to load all four diagnostics. Removed the redundant per-module load lines from Modules 2, 4, 5, 7, and 8.

Changed — output structure

  • Output now leads with the verdict. First three lines of every output: verdict line, recommended decision, confidence level. Reader must be able to act on the first paragraph alone.
  • Empty sections now omitted by default. A section header with no substantive content is a failure of the skill, not a feature. Padding is explicitly prohibited.
  • Added Module Output Reduction section. Modules 2, 5, 6, 7, and 8 share an underlying register of assumptions, dependencies, and risks. Output is now deduplicated across modules rather than emitting per-module dumps. Critical Risks section is the severity-ordered output of the register.
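
The shared-register idea in the Module Output Reduction entry above can be sketched as a deduplicating merge followed by a severity sort. Everything here is illustrative: the tuple shape, the four-level severity scale, and matching duplicates by normalized description are assumptions, not the skill's actual mechanism.

```python
# Hypothetical severity scale; lower rank sorts first.
SEVERITY_ORDER = {"critical": 0, "high": 1, "medium": 2, "low": 3}


def critical_risks(findings):
    """Merge (module, description, severity) findings into one register.

    Duplicates are matched on a case- and whitespace-normalized description;
    the highest severity wins and contributing modules are recorded.
    """
    register = {}
    for module, description, severity in findings:
        key = " ".join(description.lower().split())
        entry = register.setdefault(
            key, {"description": description, "severity": severity, "modules": []}
        )
        if SEVERITY_ORDER[severity] < SEVERITY_ORDER[entry["severity"]]:
            entry["severity"] = severity
        if module not in entry["modules"]:
            entry["modules"].append(module)
    # sorted() is stable, so equal severities keep first-seen order.
    return sorted(register.values(), key=lambda e: SEVERITY_ORDER[e["severity"]])


if __name__ == "__main__":
    # Illustrative data only.
    findings = [
        (5, "Unsigned vendor contract", "high"),
        (6, "unsigned  vendor contract", "critical"),
        (8, "No rollback path", "medium"),
    ]
    for entry in critical_risks(findings):
        print(entry)
```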

Changed — content

  • Compressed gotchas from 12 patterns to 8. Removed: Availability Illusion (duplicated Module 5 + fragility-scan indicator 6), Quiet Dependency (duplicated Module 5), Confidence-Competence Gap (overlapped assumption-audit + incentive-conflicts), Integration Tax (already covered in software-failure-patterns.md patterns 2 and 4). Renumbered remaining 8.
  • Rewrote references/base-rates.md with proper attribution. Every numeric claim is now cited to a real primary or widely cited secondary source (BLS Business Employment Dynamics, CB Insights post-mortem reports, McKinsey/Oxford BT Centre 2012 IT projects study, Standish CHAOS, Pendo Feature Adoption, Kotter, Schmidt & Hunter, Christensen). Numbers without defensible sourcing are softened from precise to ranged with hedged language. The legacy-rewrite "70–80% failure rate" claim is reframed against the McKinsey/Oxford finding rather than Joel Spolsky's opinion piece. Each section ends with a Sources block.

Fixed

  • Resolved "precommitment" / "pre-commitment" inconsistency. Standardized on "pre-commitment" across SKILL.md frontmatter, README.md, and MARKETING.md.
  • Fixed README brand-clarity confusion. "Most azimuth tools for AI agents…" was using the proper noun as a generic, which read as forced to anyone who didn't already know the brand. Changed to "code-focused agent skills" — clearer prose, doesn't name competitors.

Verification before push

  • SKILL.md frontmatter description: 489 chars (under 500-char installer limits).
  • All file paths in SKILL.md (references/, diagnostics/, templates/) match folder structure on disk.
  • All 14 files present and structurally valid.
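
The first verification item above (description length under the installer limit) could be automated along these lines. This is a sketch under assumptions: it expects a single-line `description:` field inside a `---` delimited frontmatter block and does not handle multi-line YAML values; the function names are hypothetical, and only the 500-character limit comes from the notes.

```python
import re


def description_length(skill_text):
    """Return the character count of the frontmatter description field."""
    match = re.search(r"\A---\n(.*?)\n---", skill_text, re.DOTALL)
    if not match:
        raise ValueError("no frontmatter block found")
    for line in match.group(1).splitlines():
        if line.startswith("description:"):
            # Strip surrounding whitespace and optional quotes before counting.
            return len(line[len("description:"):].strip().strip("\"'"))
    raise ValueError("no description field in frontmatter")


def within_limit(skill_text, limit=500):
    """True when the description is strictly under the character limit."""
    return description_length(skill_text) < limit


if __name__ == "__main__":
    # Hypothetical file contents.
    sample = "---\nname: azimuth\ndescription: " + "x" * 489 + "\n---\nbody\n"
    print(description_length(sample), within_limit(sample))
```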