Status: ⏳ Ready for Manual Testing
Created: 2025-10-21
PR: #156
Branch: feat/issue-154-agentic-do-mode
The agentic "azlin do" command has NOT been tested against real Azure resources yet. This document provides:
- Why real testing is essential
- What has been tested (unit tests only)
- Step-by-step manual testing procedure
- Automated test script usage
- Expected results and known risks
✅ Unit Tests (83 passing)
- IntentParser logic
- CommandExecutor subprocess handling
- ResultValidator error handling
- ObjectiveManager state persistence
- AuditLogger security features
❌ NOT Tested
- Claude API intent parsing with real requests
- Actual Azure resource creation via natural language
- End-to-end flow from NL → Azure CLI → validation
- Error handling with real Azure failures
- Multi-step operations with real timing
- Cost estimation accuracy
- User confirmation flows
High Risk:
- Natural language → command translation errors could create/delete wrong resources
- Ambiguous requests might execute unintended operations
- API rate limiting or timeouts not tested
- Cost estimation could be wildly inaccurate
Medium Risk:
- User confirmation bypasses not tested
- Concurrent operation handling
- Large-scale operations (10+ VMs)
Low Risk:
- Basic list/status commands (read-only)
- Dry-run mode execution
# Get your API key from: https://console.anthropic.com/
export ANTHROPIC_API_KEY=sk-ant-xxxxx...
# Ensure you're logged in
az login
az account show
# Set default subscription if needed
az account set --subscription "your-subscription-id"
# Set default resource group
azlin config set default_resource_group=azlin-test-rg
# Or specify with --rg flag in each command
cd /path/to/azlin
uv pip install -e .
# Verify installation
python -m azlin.cli do --help
These tests only call the Claude API to parse intent - they don't execute commands.
python -m azlin.cli do "list all my vms" --dry-run --verbose
Expected Output:
- Parsed intent: list_vms
- Generated command: azlin list
- No actual execution
- Confidence score > 0.9
python -m azlin.cli do "create a new vm called test-agentic-001" --dry-run --verbose
Expected Output:
- Parsed intent: provision_vm
- Parameters: vm_name = test-agentic-001
- Generated command: azlin new --name test-agentic-001
- No actual provisioning
python -m azlin.cli do "provision 3 vms and sync them all" --dry-run --verbose
Expected Output:
- Parsed intent: provision_vm + sync_vms
- Multiple commands planned
- Shows execution plan
- Asks for confirmation (even in dry-run)
python -m azlin.cli do "do something with Sam" --dry-run --verbose
Expected Output:
- Low confidence score (< 0.7)
- Warning about ambiguity
- May ask for clarification
- Should NOT execute anything
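The confidence expectations in these dry-run tests (above 0.9 for a clear request, below 0.7 for an ambiguous one) suggest a gate along the following lines. This is a minimal sketch and not azlin's implementation: the `ParsedIntent` type, the `decide` function, and the use of 0.7 as a hard cutoff are all assumptions made for illustration, and the real threshold has not been tuned yet.

```python
from dataclasses import dataclass

# Hypothetical cutoff taken from the expected outputs above; not tuned.
CLARIFY_BELOW = 0.7


@dataclass(frozen=True)
class ParsedIntent:
    """Hypothetical stand-in for whatever the intent parser returns."""
    intent: str
    confidence: float


def decide(parsed: ParsedIntent) -> str:
    """Return 'clarify' for low-confidence parses, otherwise 'proceed'."""
    if parsed.confidence < CLARIFY_BELOW:
        return "clarify"  # warn about ambiguity and execute nothing
    return "proceed"      # continue to plan display and confirmation


print(decide(ParsedIntent("list_vms", 0.95)))
print(decide(ParsedIntent("unknown", 0.40)))
```

When testing manually, the useful check is that a request like "do something with Sam" ends on the "clarify" path and never reaches command execution.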
python -m azlin.cli do "make me coffee" --dry-run --verbose
Expected Output:
- Recognizes out-of-scope request
- Friendly error message
- No command generation
- Suggests valid alternatives
These tests query Azure but don't create/modify/delete resources.
python -m azlin.cli do "show me all my vms" --verbose
Expected Output:
- Executes: azlin list
- Shows actual VMs in resource group
- Returns success
- Result validation confirms list displayed
python -m azlin.cli do "what is the status of my vms" --verbose
Expected Output:
- Executes: azlin status
- Shows power states, IPs, regions
- Returns success
- Validates status information retrieved
python -m azlin.cli do "what are my azure costs" --verbose
Expected Output:
- Executes: azlin cost
- Shows running costs
- Estimates monthly spend
- Returns success
python -m azlin.cli do "create a new vm called agentic-test-001" --verbose
Expected Behavior:
- Parses intent correctly
- Shows command to execute
- Asks for user confirmation
- Provisions VM with default settings
- Waits for IP assignment
- Validates VM created successfully
- Returns VM details (name, IP, region)
Cost: ~$0.10/hour for Standard_B2s
Manual Verification:
# Check VM exists
azlin list
# Check actual Azure resource
az vm show --resource-group <rg> --name agentic-test-001
# Assumes agentic-test-001 from 3.1 exists
python -m azlin.cli do "sync my home directory to vm agentic-test-001" --verbose
Expected Behavior:
- Parses intent as sync_vms
- Shows: azlin sync --vm-name agentic-test-001
- Syncs ~/.azlin/home/ to VM
- Shows files transferred
- Validates sync completed
# Stop VM
python -m azlin.cli do "stop vm agentic-test-001" --verbose
# Verify stopped
azlin status
# Start VM
python -m azlin.cli do "start vm agentic-test-001" --verbose
# Verify running
azlin status
python -m azlin.cli do "delete vm agentic-test-001" --verbose
Expected Behavior:
- Parses as delete_vm
- MUST ask for confirmation (destructive operation)
- Shows what will be deleted
- Allows cancellation
- If confirmed, deletes VM and resources
- Validates deletion successful
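The confirmation requirement for the delete test can be pictured as a small gate in front of execution. The sketch below is illustrative only: `DESTRUCTIVE_INTENTS`, `confirm_if_destructive`, and the prompt wording are hypothetical names and text, not taken from azlin. It defaults to cancelling unless the user answers affirmatively.

```python
from typing import Callable

# Hypothetical set of intents treated as destructive; delete_vm is the
# only destructive intent named in this document.
DESTRUCTIVE_INTENTS = {"delete_vm"}


def confirm_if_destructive(
    intent: str,
    target: str,
    ask: Callable[[str], str] = input,
) -> bool:
    """Return True if execution may proceed.

    Non-destructive intents pass straight through. Destructive intents
    need an explicit 'y' or 'yes'; anything else, including an empty
    answer, cancels the operation.
    """
    if intent not in DESTRUCTIVE_INTENTS:
        return True
    answer = ask(f"This will delete {target}. Continue? [y/N] ")
    return answer.strip().lower() in {"y", "yes"}
```

Passing `ask` as a parameter keeps the gate testable without a terminal, for example `confirm_if_destructive("delete_vm", "agentic-test-001", ask=lambda _: "n")` returns `False`.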
Manual Verification:
# VM should be gone
azlin list
# Azure resources cleaned up
az vm show --resource-group <rg> --name agentic-test-001
# Should return: ResourceNotFound
# Request more resources than quota allows
python -m azlin.cli do "create 100 vms" --verbose
Expected Behavior:
- Should detect high resource count
- Warn about cost and quota
- If executed, handle quota error gracefully
- Suggest alternatives (different region, smaller count)
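To make the cost side of this test concrete, here is a rough sketch of the arithmetic a pre-execution warning could use. It takes the ~$0.10/hour Standard_B2s figure quoted earlier in this document at face value, and the warning threshold of 10 is borrowed from the "10+ VMs" risk item; both constants and both function names are assumptions, and quota limits are not modelled at all.

```python
# Rate quoted earlier in this document for Standard_B2s; treat as approximate.
HOURLY_RATE_USD = 0.10
# Hypothetical warning threshold, borrowed from the "10+ VMs" risk item.
LARGE_REQUEST = 10


def estimate_hourly_cost(vm_count: int, rate: float = HOURLY_RATE_USD) -> float:
    """Estimated combined hourly cost in USD for vm_count identical VMs."""
    if vm_count < 0:
        raise ValueError("vm_count must be non-negative")
    return round(vm_count * rate, 2)


def needs_cost_warning(vm_count: int) -> bool:
    """True when the request is large enough to warrant an explicit warning."""
    return vm_count >= LARGE_REQUEST


count = 100
print(f"{count} VMs: ~${estimate_hourly_cost(count):.2f}/hour, "
      f"warn={needs_cost_warning(count)}")
```

At the quoted rate, the "create 100 vms" request works out to roughly $10 per hour if every VM were actually provisioned, which is why this test should stop at the warning unless you intend to pay for it.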
# Disconnect network during execution (manually)
python -m azlin.cli do "create vm test" --verbose
# (disconnect WiFi)
Expected Behavior:
- Timeout handling
- Graceful error message
- No orphaned resources
- State saved for recovery
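For reference while evaluating the timeout behaviour, the sketch below shows one common way to bound a subprocess call in Python using the standard library's `subprocess.run(..., timeout=...)`. It is not a description of how CommandExecutor works; `run_with_timeout` and the shape of the returned dictionary are hypothetical.

```python
import subprocess
import sys
from typing import Sequence


def run_with_timeout(cmd: Sequence[str], timeout_s: float) -> dict:
    """Run cmd without a shell and report a timeout instead of raising."""
    try:
        proc = subprocess.run(
            list(cmd), capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child before re-raising, so nothing is
        # left running locally; remote resources still need a separate check.
        return {"ok": False, "error": f"timed out after {timeout_s}s"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "error": proc.stderr}


# Demonstration with a stand-in command that sleeps longer than the timeout.
slow = [sys.executable, "-c", "import time; time.sleep(5)"]
print(run_with_timeout(slow, timeout_s=0.5))
```

A local timeout only stops the client process. Whether Azure finished creating anything in the meantime has to be verified separately, for example with `azlin list` after the network is restored.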
python -m azlin.cli do "create a vm called INVALID@NAME!" --verbose
Expected Behavior:
- Validation error from Azure CLI
- Clear error message to user
- No resources created
- Suggests valid naming pattern
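This test expects the validation error to come from the Azure CLI, but the "invalid parameters caught before execution" goal implies a local check as well. A sketch of what such a check might look like follows. The pattern is deliberately conservative and is an assumption, not Azure's authoritative rule set (the real limits differ by OS and resource type); `is_valid_vm_name` is a hypothetical helper.

```python
import re

# Conservative illustrative pattern: 1-64 characters, letters, digits and
# hyphens, starting with a letter or digit and not ending with a hyphen.
# Azure's real rules vary by OS and resource type; check the official docs.
_VM_NAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9-]{0,62}[A-Za-z0-9])?")


def is_valid_vm_name(name: str) -> bool:
    """Return True if name satisfies the conservative pattern above."""
    return _VM_NAME.fullmatch(name) is not None


for candidate in ("agentic-test-001", "INVALID@NAME!"):
    print(candidate, is_valid_vm_name(candidate))
```

Rejecting a name locally avoids a round trip to Azure, while the Azure CLI error remains the final authority for names this pattern happens to accept.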
We've created an automated test script that runs all the above tests systematically.
cd /path/to/azlin
# Set API key
export ANTHROPIC_API_KEY=your-key-here
# Run all tests (including VM creation)
./scripts/test_agentic_integration.sh
# Skip VM creation tests (safer, no costs)
SKIP_VM_CREATION=1 ./scripts/test_agentic_integration.sh
The script runs:
- 3 dry-run tests (safe)
- 4 read-only tests (safe)
- 3 VM creation tests (costs money, optional)
- 2 error handling tests
Total: 12 tests
[INFO] Starting azlin agentic integration tests...
[INFO] Running pre-flight checks...
[INFO] ✓ ANTHROPIC_API_KEY is set
[INFO] ✓ Azure CLI authenticated
[INFO] ✓ azlin available
[INFO] All pre-flight checks passed!
[INFO] ===== DRY-RUN TESTS =====
[INFO] Running test: Dry-run: List VMs
[INFO] ✅ PASSED: Dry-run: List VMs
[INFO] Running test: Dry-run: Create VM
[INFO] ✅ PASSED: Dry-run: Create VM
...
========================================
INTEGRATION TEST SUMMARY
========================================
Total tests passed: 12
Total tests failed: 0
[INFO] 🎉 ALL TESTS PASSED!
========================================
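The pre-flight checks at the top of the sample output can be approximated in a few lines if you want to verify your environment before launching the script. This is a sketch under assumptions, not the script's actual logic: `preflight` is a hypothetical function, and it only checks that the `az` and `azlin` executables are on the PATH. Confirming that the Azure CLI is authenticated still requires running `az account show`.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def preflight(
    env: Mapping[str, str] = os.environ,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return a list of human-readable problems; empty means all checks passed."""
    problems = []
    if not env.get("ANTHROPIC_API_KEY"):
        problems.append("ANTHROPIC_API_KEY is not set")
    for tool in ("az", "azlin"):
        if which(tool) is None:
            problems.append(f"{tool} not found on PATH")
    return problems


for problem in preflight():
    print(f"[WARN] {problem}")
```

The `env` and `which` parameters exist only so the function can be exercised with fake inputs; called with no arguments it inspects the real environment.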
✅ Required:
- All dry-run tests pass
- Read-only tests pass (list, status, cost)
- At least 1 full VM lifecycle test (create → verify → delete)
- Error handling test passes (invalid request)
- No unexpected Azure resources created
- No credential leaks or security issues
⏳ Recommended:
- Multi-step operation test
- File sync test
- Ambiguous request handling
- Cost estimation accuracy validation
❌ Optional (Post-Merge):
- Large-scale testing (10+ VMs)
- Concurrent operation testing
- Failure recovery testing
- Performance benchmarking
- Phase 1 Only: azdoit advanced features not yet implemented
  - No strategy selection (Azure CLI, Terraform, MCP)
  - No cost estimation (uses placeholder)
  - No failure recovery
  - No research mode
- Natural Language Parsing: Accuracy depends on Claude API
  - May misinterpret ambiguous requests
  - Confidence threshold not tuned yet
  - No context memory between commands
- Error Handling: Basic implementation
  - Some Azure error messages not parsed
  - Timeout handling not comprehensive
  - No partial rollback on multi-step failures
- ✅ User confirmation for destructive operations
- ✅ Dry-run mode
- ✅ API key validation
- ✅ Azure auth check
- ✅ Command validation before execution
- ✅ Execution history tracking
- ✅ Audit logging
Before declaring "azlin do" production-ready:
- All dry-run tests pass
- Read-only operations work (list, status, cost)
- Single VM creation works end-to-end
- VM deletion with confirmation works
- File sync works
- Multi-step operations work
- Ambiguous request handling works
- Invalid request handling works
- API key not logged or exposed
- User confirmation enforced for destructive ops
- No command injection vulnerabilities
- Audit log captures all operations
- Failed operations logged
- Quota exceeded handled gracefully
- Network failures don't leave orphaned resources
- Invalid parameters caught before execution
- Timeout handling works
- Azure errors translated to user-friendly messages
- Intent parsing < 2 seconds
- Command execution matches native azlin
- Result validation < 1 second
- No unnecessary API calls
- Cost warnings shown for expensive operations
- Actual costs match estimates
- No surprise charges from failed operations
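One way to spot-check the "API key not logged or exposed" item is to run log output through a redaction filter and confirm nothing changes. The sketch below assumes keys follow the `sk-ant-` prefix shown in the setup commands; the exact key format is not specified in this document, and `redact` is a hypothetical helper rather than part of AuditLogger.

```python
import re

# Assumes keys look like the sk-ant-... placeholder shown in the setup
# commands; the exact key format is not specified in this document.
_KEY_PATTERN = re.compile(r"sk-ant-[A-Za-z0-9_\-]+")


def redact(text: str) -> str:
    """Replace anything that looks like an Anthropic API key with a marker."""
    return _KEY_PATTERN.sub("sk-ant-[REDACTED]", text)


line = "export ANTHROPIC_API_KEY=sk-ant-abc123_DEF-456 # set key"
print(redact(line))
# prints: export ANTHROPIC_API_KEY=sk-ant-[REDACTED] # set key
```

If `redact(log_text) != log_text` for the audit log or the test script log, a key-shaped string was written somewhere and the checklist item fails.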
Copy this template for each testing session:
## Testing Session
**Date:** YYYY-MM-DD
**Tester:** [Your Name]
**Branch:** feat/issue-154-agentic-do-mode
**Commit:** [git rev-parse HEAD]
**Environment:**
- Azure Subscription: [subscription-id]
- Resource Group: [rg-name]
- Region: [region]
### Pre-Flight Checks
- [ ] ANTHROPIC_API_KEY set
- [ ] Azure authenticated
- [ ] azlin configured
- [ ] Branch installed
### Test Results
#### Dry-Run Tests
- [ ] PASS/FAIL: List VMs
- [ ] PASS/FAIL: Create VM
- [ ] PASS/FAIL: Multi-step
- [ ] PASS/FAIL: Ambiguous request
- [ ] PASS/FAIL: Invalid request
#### Read-Only Tests
- [ ] PASS/FAIL: List VMs
- [ ] PASS/FAIL: VM status
- [ ] PASS/FAIL: Cost query
#### Write Operations (Optional)
- [ ] PASS/FAIL/SKIP: Create VM
- [ ] PASS/FAIL/SKIP: Sync files
- [ ] PASS/FAIL/SKIP: Delete VM
### Issues Found
[List any bugs, unexpected behavior, or concerns]
### Resources Created
[List any Azure resources that were created during testing]
### Cleanup Status
[Confirm all test resources deleted]
### Recommendation
- [ ] Ready to merge
- [ ] Needs fixes before merge
- [ ] Requires additional testing
### Notes
[Any additional observations or recommendations]
- Run Automated Script:
  export ANTHROPIC_API_KEY=your-key-here
  SKIP_VM_CREATION=1 ./scripts/test_agentic_integration.sh
- Manual Verification:
  - Test one complete VM lifecycle manually
  - Document results in testing log
  - File any bugs found
- Update PR:
  - Add testing results to PR description
  - Mark integration tests as ✅ or ⏳
  - Note any known issues
- Review Test Plan: Does it cover critical scenarios?
- Check Test Results: Were all required tests run?
- Verify Safety: Are destructive operations properly gated?
- Assess Risk: Is it safe to merge given test coverage?
- Start with dry-run mode
- Test non-destructive operations first
- Monitor costs closely
- Report any issues immediately
If you encounter issues during testing:
- Check logs:
  - Test script log: /tmp/azlin-agentic-test-*.log
  - Audit log: ~/.azlin/audit.log
  - Claude API responses: Enable --verbose
- Common issues:
  - "API key not set": export ANTHROPIC_API_KEY=...
  - "Not authenticated": az login
  - "No resource group": azlin config set default_resource_group=...
  - "Command not found": Reinstall with uv pip install -e .
- Report bugs:
  - GitHub Issue: #156
  - Include: command, error message, logs
  - Expected vs actual behavior
The agentic "azlin do" command requires real Azure integration testing before being considered production-ready. This document provides:
- ✅ Comprehensive test plan
- ✅ Automated test script
- ✅ Safety guidelines
- ✅ Success criteria
- ✅ Documentation template
Next Action: Run the automated test script with your ANTHROPIC_API_KEY and document results.
🤖 Generated with Claude Code
Status: Ready for testing
Recommendation: Test before merging to ensure quality