A production-grade Internal Developer Platform (IDP) that reduces the time for a developer to get a new production-ready microservice from days to under 5 minutes. Built on Backstage (service catalog + golden path templates), Crossplane (self-service infrastructure), ArgoCD (GitOps), and Tekton (Kubernetes-native CI). The platform codifies every engineering standard (security scanning, observability, GitOps workflows) into templates so that following best practices is the default, not an afterthought.
- The Problem This Solves
- Architecture Overview
- Repository Structure
- Backstage — Internal Developer Portal
- Crossplane — Self-Service Infrastructure
- ArgoCD — GitOps Deployments
- Tekton — Kubernetes-Native CI
- Infrastructure
- Developer Experience — New Service in 5 Minutes
- Platform Maturity Scorecards
- Deployment Guide
- Runbook
- Troubleshooting
Without an Internal Developer Platform, engineering organizations accumulate toil:
- Onboarding a new service takes days — set up CI/CD, configure monitoring, write Dockerfiles, set up ArgoCD applications, create PagerDuty services, register in Backstage (if it exists), request a database via ticket, wait for ops approval
- Platform standards are unevenly adopted — some services have security scanning, others don't. Some have runbooks, most don't. There's no visibility into which services are following best practices
- Knowledge is tribal — how do I deploy to staging? Which secrets should I use? What's the on-call rotation? The answers live in Slack messages and individual developers' heads
- Infrastructure provisioning is a bottleneck — developers open tickets to ops for RDS instances, S3 buckets, SQS queues. Tickets sit for days. Services get deployed without the infrastructure they need
This platform solves all four:
- A Backstage template creates a complete production-ready service in < 5 minutes
- Standards are baked into the templates — security scanning is not optional
- Backstage is the single source of truth — ownership, docs, dependencies, health status
- Crossplane lets developers self-serve infrastructure via a Kubernetes CRD
┌─────────────────────────────────────────────────────────────────────────────┐
│ INTERNAL DEVELOPER PLATFORM │
│ │
│ DEVELOPER PORTAL │
│ ┌─────────────────────────────────────────────────────────────────────┐ │
│ │ Backstage │ │
│ │ │ │
│ │ Service Catalog Software Templates TechDocs │ │
│ │ ───────────────── ────────────────── ──────── │ │
│ │ Every service, Golden path for Auto-generated │ │
│ │ API, resource, new services, docs from markdown │ │
│ │ and team in libraries, and in each repo │ │
│ │ one place infra requests │ │
│ │ │ │
│ │ Plugins: ArgoCD · Kubernetes · GitHub Actions · PagerDuty · Cost │ │
│ └──────────────────────────────────┬──────────────────────────────────┘ │
│ │ Scaffolder actions │
│ ▼ │
│ GOLDEN PATH (what a template creates automatically) │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ GitHub │ │ Tekton │ │ ArgoCD │ │Crossplane│ │PagerDuty │ │
│ │ Repo │ │ Pipeline│ │ App │ │ Claim │ │ Service │ │
│ │+ branch │ │ 10 stages│ │ GitOps │ │ (RDS, │ │+ escalat-│ │
│ │ protect │ │ CI │ │ CD │ │ S3, SQS)│ │ ion │ │
│ └──────────┘ └──────────┘ └──────────┘ └──────────┘ └──────────┘ │
│ │ │ │
│ CI LAYER │ │ CD LAYER │
│ ┌──────────────────▼─┐ ┌────────▼────────────────────┐ │
│ │ Tekton Pipelines │ │ ArgoCD + ApplicationSets │ │
│ │ │ │ │ │
│ │ Clone → Test → │ │ Matrix generator: │ │
│ │ SAST → OWASP → │ │ services × environments │ │
│ │ Build → Trivy → │ │ Auto-sync dev/staging │ │
│ │ Push → Update │ │ Manual approval for prod │ │
│ │ GitOps repo │ │ RBAC: devs sync, platform │ │
│ └────────────────────┘ │ team approves prod │ │
│ └─────────────────────────────┘ │
│ │
│ SELF-SERVICE INFRASTRUCTURE │
│ ┌──────────────────────────────────────────────────────────────────────┐ │
│ │ Crossplane │ │
│ │ │ │
│ │ Developer writes: Platform provisions: │ │
│ │ PostgreSQLInstance claim → RDS (encrypted, private, Multi-AZ) │ │
│ │ S3Bucket claim → S3 (encrypted, versioned, lifecycle) │ │
│ │ SQSQueue claim → SQS (encrypted, DLQ configured) │ │
│ │ │ │
│ │ Composition enforces all security defaults — devs can't opt out │ │
│ └──────────────────────────────────────────────────────────────────────┘ │
└─────────────────────────────────────────────────────────────────────────────┘
platform-engineering/
│
├── README.md
│
├── backstage/
│ ├── app-config/
│ │ └── app-config.production.yaml # Full Backstage config: auth, catalog,
│ │ # plugins, TechDocs, Kubernetes, ArgoCD
│ ├── catalog/
│ │ └── all-components.yaml # Catalog: Domains, Systems, Components,
│ │ # APIs, Resources, Groups
│ └── templates/
│ └── microservice/
│ └── template.yaml # Golden path: Spring Boot microservice
│ # (9 scaffolder steps, creates everything)
│
├── crossplane/
│ ├── xrds/
│ │ └── postgresql-xrd.yaml # CompositeResourceDefinition — the API
│ │ # developers see (4 simple fields)
│ ├── compositions/
│ │ └── postgresql-composition.yaml # How to fulfill a claim: subnet group,
│ │ # security group, parameter group, RDS
│ └── claims/
│ └── order-service-db-claim.yaml # Example developer claim
│
├── argocd/
│ └── applicationsets/
│ └── services-applicationset.yaml # Matrix generator: services × environments
│ # + AppProject RBAC definitions
│
├── tekton/
│ ├── pipelines/
│ │ └── microservice-ci-pipeline.yaml # 10-stage pipeline: clone → test → SAST
│ │ # → OWASP → build → Docker → Trivy →
│ │ # push → GitOps update → notify
│ └── triggers/
│ └── github-trigger.yaml # EventListener + TriggerBinding +
│ # TriggerTemplate for GitHub webhooks
│
└── terraform/
└── environments/prod/
├── main.tf # VPC, EKS, Backstage (RDS+S3),
│ # ArgoCD, Tekton, Crossplane,
│ # External Secrets Operator
└── variables.tf
The Backstage catalog is the source of truth for every software component in the engineering organization. Every service registers a catalog-info.yaml in its repository. The catalog tracks:
Entity types:
- Domain — Business domain (Payments, Identity, Platform). Services belong to domains.
- System — Group of related services (Order Management = order-service + payment-service + inventory-service). Provides architectural context.
- Component — Individual deployable unit (microservice, library, data pipeline). The most common entity type.
- API — External interface of a component. Allows consumers to browse available APIs before building integrations.
- Resource — Infrastructure (RDS instances, S3 buckets, Kafka topics). Linked to Crossplane claims for provisioning traceability.
- Group — Engineering team. Every component has an owner team.
Annotations enable plugin integrations:
annotations:
  argocd/app-name: order-service-production        # ArgoCD plugin shows deploy status
  backstage.io/kubernetes-id: order-service        # K8s plugin shows pod status
  pagerduty.com/service-id: P1234AB                # PagerDuty plugin shows incidents
  grafana/dashboard-url: https://grafana.../d/...  # Direct link to dashboard
  security.company.com/pci-scope: "true"           # Scorecard: PCI compliance check

Auto-discovery: The github-discovery location type scans all GitHub repositories for catalog-info.yaml files. New services automatically appear in the catalog without manual registration.
The Spring Boot microservice template creates a production-ready service in under 5 minutes through 9 automated steps:
| Step | Action | Result |
|---|---|---|
| 1 | fetch:template | Clones skeleton with platform integrations |
| 2 | publish:github | Creates repo with branch protection + CODEOWNERS |
| 3 | catalog:register | Registers service in Backstage catalog |
| 4 | publish:github:file | Creates Tekton pipeline in platform-gitops |
| 5 | argocd:create-resources | Creates ArgoCD application |
| 6 | publish:github:file (conditional) | Creates Crossplane RDS claim |
| 7 | pagerduty:service:create | Creates PagerDuty service with escalation policy |
The developer inputs 4 fields:
- Service name
- Owning team
- Do you need a database? (yes/no)
- Is this PCI-scoped? (yes/no)
Everything else — Tekton pipeline, ArgoCD application, Grafana dashboard, PagerDuty service, security controls — is created automatically.
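As a rough illustration of how the four inputs drive the scaffolder, the sketch below models the step plan in Python. It is not the template itself: `ServiceRequest` and `plan_scaffolder_steps` are hypothetical names, the action IDs are the ones listed in the step table, and only the database question is assumed to change which steps run.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class ServiceRequest:
    """The four inputs a developer supplies (field names are illustrative)."""
    name: str
    owner: str
    needs_database: bool
    pci_scoped: bool


def plan_scaffolder_steps(req: ServiceRequest) -> list[str]:
    """Return the scaffolder actions that would run for this request."""
    steps = [
        "fetch:template",           # skeleton with platform integrations
        "publish:github",           # repo + branch protection + CODEOWNERS
        "catalog:register",         # Backstage catalog entry
        "publish:github:file",      # Tekton pipeline in platform-gitops
        "argocd:create-resources",  # ArgoCD application
    ]
    if req.needs_database:
        # Conditional step: only emitted when a database was requested.
        steps.append("publish:github:file")  # Crossplane RDS claim
    steps.append("pagerduty:service:create")
    return steps
```

A request with `needs_database=True` yields all seven listed actions; without a database the Crossplane claim step is skipped.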
Why this matters: Without the template, a developer might skip security scanning because they don't know how to configure Trivy. They might skip PagerDuty integration because it's tedious. By automating everything in the template, the secure, observable, well-operated service is the default — not the exception.
TechDocs converts markdown in each service repository into a searchable documentation portal. The backstage.io/techdocs-ref annotation tells Backstage where to find the docs. A CI job in each repo builds the docs (using MkDocs) and publishes them to the S3 bucket. Backstage serves them from S3 at backstage.internal.company.com/docs.
Result: engineers find documentation in the same place they find everything else — no more hunting through wikis, Confluence, and scattered Google Docs.
| Plugin | What it shows |
|---|---|
| Kubernetes | Pod status, resource usage, recent events for each catalog component |
| ArgoCD | Deployment sync status, last sync time, diff from desired state |
| GitHub Actions | CI pipeline history and status |
| PagerDuty | Current on-call, open incidents, recent alerts |
| Cost Insights | AWS cost per team, per service, with trendlines |
| SonarQube | Code quality metrics (coverage, bugs, vulnerabilities) |
XRD (CompositeResourceDefinition): Defines the API that developers see. The PostgreSQL XRD exposes 4 fields: storageGB, instanceClass, multiAZ, backupRetentionDays. The developer never sees VPC IDs, KMS key ARNs, parameter groups, or subnet groups.
Composition: The platform team's implementation. Defines exactly which AWS resources to create and with what defaults. The PostgreSQL composition creates: a DB subnet group (private subnets only), a security group, a parameter group (with rds.force_ssl=1, connection logging, statement timeouts enforced), and the RDS instance (encrypted, private, automated backups, enhanced monitoring, KMS encryption). Developers cannot override these security defaults — they're enforced at the composition level.
Claim: What developers write. A short YAML manifest that sets only the four exposed parameters:
apiVersion: platform.company.com/v1alpha1
kind: PostgreSQLInstance
metadata:
  name: order-service-db
  namespace: production
spec:
  parameters:
    storageGB: 50
    instanceClass: db.t4g.large
    multiAZ: true
    backupRetentionDays: 14
  writeConnectionSecretToRef:
    name: order-service-db-secret
    namespace: production

Crossplane provisions the full RDS stack (~5 minutes) and writes the connection details (endpoint, username, password) to the Kubernetes secret. The application reads the secret as environment variables.
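The idea that security defaults are enforced at the composition level can be pictured with a small Python model. This is only a sketch of the layering: in Crossplane the work is done by the XRD schema and the Composition, not by application code, and the `ENFORCED` keys and the `render_rds_settings` name are hypothetical.

```python
# The four fields the XRD exposes to developers (taken from the text above).
ALLOWED_FIELDS = {"storageGB", "instanceClass", "multiAZ", "backupRetentionDays"}

# Settings the platform team pins in the Composition.
# Key names here are illustrative, not real provider field names.
ENFORCED = {
    "storageEncrypted": True,
    "publiclyAccessible": False,
    "rds.force_ssl": "1",
}


def render_rds_settings(claim_parameters: dict) -> dict:
    """Combine developer parameters with enforced defaults.

    Unknown fields are rejected, and enforced values always win.
    """
    unknown = set(claim_parameters) - ALLOWED_FIELDS
    if unknown:
        raise ValueError(f"fields not exposed by the XRD: {sorted(unknown)}")
    settings = dict(claim_parameters)
    settings.update(ENFORCED)  # applied last so nothing can override them
    return settings
```

A claim that tries to set `publiclyAccessible` is rejected outright, while a valid claim always comes back with encryption and SSL settings attached.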
Why Crossplane over Terraform for self-service:
- Developers don't need Terraform, S3 backends, or AWS IAM permissions
- Claims are committed to the service repo and deployed via ArgoCD — full audit trail
- Kubernetes RBAC controls who can create which resource types
- Crossplane continuously reconciles — if someone manually modifies the RDS instance, Crossplane detects drift and corrects it
- Platform team controls the Composition centrally — one change enforces a security policy across every database in the organization
Instead of creating one ArgoCD Application per service per environment (which quickly becomes hundreds of Applications to manage), the ApplicationSet uses a matrix generator:
services × environments = Applications
order-service × [dev, staging, production] = 3 Applications
payment-service × [dev, staging, production] = 3 Applications
...
Services are auto-discovered from the services/*/k8s directory structure in the gitops repo. When a new service is scaffolded, it gets ArgoCD applications in all three environments automatically.
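The matrix expansion is a Cartesian product of the two lists. The Python sketch below shows the shape of the result. It assumes the `<service>-<environment>` naming seen elsewhere in this document (for example `order-service-production`) and a namespace per environment; `expand_matrix` is a hypothetical helper, not part of ArgoCD.

```python
from itertools import product

ENVIRONMENTS = ["dev", "staging", "production"]


def expand_matrix(services, environments=ENVIRONMENTS):
    """Return one Application description per (service, environment) pair."""
    return [
        {
            "name": f"{svc}-{env}",
            "service": svc,
            "namespace": env,                # assumed: one namespace per environment
            "path": f"services/{svc}/k8s",   # directory the service was discovered from
        }
        for svc, env in product(services, environments)
    ]
```

Two services expand to six Applications; adding a third directory under `services/` would add three more without any other change.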
Environment sync policies:
- Dev/Staging: automated. Every commit to the gitops repo syncs immediately.
- Production: manual. Requires a human to click sync in the ArgoCD UI. This is the gate between staging and production.
Two ArgoCD projects enforce team boundaries:
microservices project:
- Developers (backend-team, payments-team) can get and sync applications but cannot delete
- Release managers (platform-team) have full control
- Applications can only deploy to dev/staging/production namespaces and cannot touch platform namespaces like argocd, monitoring, or crossplane-system
- Cannot manage Kubernetes Secrets (External Secrets Operator manages these)
platform project:
- Platform team only — full cluster access
- Manages Backstage, Tekton, Crossplane, monitoring stack
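The project boundaries for the microservices project can be summarised as two checks: what a role may do, and where an Application may deploy. The Python sketch below is a simplified model of those rules as described above, not ArgoCD's policy engine; the role keys and function names are hypothetical.

```python
# Actions granted per role in the microservices project ("*" means full control).
PROJECT_ROLES = {
    "developer": {"get", "sync"},
    "release-manager": {"*"},
}

# Namespaces that Applications in the microservices project may target.
ALLOWED_NAMESPACES = {"dev", "staging", "production"}


def is_allowed(role: str, action: str) -> bool:
    """True if the role may perform the action on an Application."""
    granted = PROJECT_ROLES.get(role, set())
    return "*" in granted or action in granted


def destination_allowed(namespace: str) -> bool:
    """True if an Application in this project may deploy to the namespace."""
    return namespace in ALLOWED_NAMESPACES
```

Under this model a developer can sync but not delete, and a manifest targeting `crossplane-system` is refused regardless of who submits it.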
Developer pushes to main branch
→ Tekton CI runs (test, scan, build, push)
→ Tekton updates image tag in platform-gitops repo
→ ArgoCD detects change in staging Application
→ ArgoCD syncs staging cluster
→ Smoke tests run
→ Release manager reviews staging
→ Release manager syncs production Application
→ Production deploy completes
→ ArgoCD notifies Slack #deployments
Every cluster change traces back to a git commit. git log is the deployment history. Rolling back is git revert.
| | Jenkins | Tekton |
|---|---|---|
| Infrastructure | Dedicated server + agents | Runs as Pods in EKS |
| Scaling | Fixed agent pool | Scales to zero (no idle cost) |
| Environment | Shared, stateful agents | Ephemeral Pods (fresh every run) |
| Configuration | Groovy DSL in Jenkinsfile | Kubernetes-native YAML |
| Security | Jenkins credentials store | Kubernetes Secrets + IRSA |
| RBAC | Jenkins roles | Kubernetes RBAC |
The 10-stage golden path CI pipeline:
| Stage | Tool | Gate |
|---|---|---|
| Clone | git-clone Task | — |
| Unit Tests | Maven + JaCoCo | Fails if coverage < 80% |
| SAST | SonarQube | Fails if quality gate fails |
| OWASP Dependency Check | dependency-check | Fails if CVSS ≥ 7.0 |
| Build JAR | Maven | — |
| Docker Build | Kaniko (no daemon) | — |
| Trivy Scan | Trivy | Fails on CRITICAL CVE |
| Push to ECR | Kaniko | Main/develop branches only |
| Update GitOps Repo | yq + git | Triggers ArgoCD staging sync |
| Notify | Slack webhook | Always (success or failure) |
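The gate column amounts to a handful of threshold checks. The Python sketch below restates them in one place using the thresholds from the table. The `ScanResults` fields and the `failed_gates` function are hypothetical; in the pipeline each gate is enforced by its own Tekton task.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class ScanResults:
    coverage_percent: float      # reported by JaCoCo
    sonar_gate_passed: bool      # SonarQube quality gate outcome
    max_dependency_cvss: float   # highest CVSS score from dependency-check
    critical_image_cves: int     # CRITICAL findings from Trivy


def failed_gates(r: ScanResults) -> list[str]:
    """Return the names of the gates that would fail the pipeline."""
    failures = []
    if r.coverage_percent < 80.0:
        failures.append("unit-tests")
    if not r.sonar_gate_passed:
        failures.append("sast")
    if r.max_dependency_cvss >= 7.0:
        failures.append("owasp-dependency-check")
    if r.critical_image_cves > 0:
        failures.append("trivy-scan")
    return failures
```

An empty list means the run proceeds to the push stage; any entry stops it before an image reaches ECR.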
Kaniko builds Docker images without a Docker daemon — no privileged pods needed. The Kaniko executor Pod reads the Dockerfile, builds the image layer by layer, and pushes directly to ECR.
Parallel execution: OWASP dependency check and unit tests run in parallel (both runAfter: [clone]), reducing total pipeline time.
GitOps update step: After a successful build, Tekton updates the image tag in the platform-gitops repository. ArgoCD detects this change and syncs the staging cluster. This is the bridge between CI (Tekton) and CD (ArgoCD) — the GitOps repo is the contract between them.
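The tag update itself is a one-line edit to a manifest. The pipeline uses yq for it; the Python sketch below is a plain-text stand-in that shows the effect. It assumes the manifest holds a single tagged `image:` line (not a digest reference), and `set_image_tag` is a hypothetical helper.

```python
import re

# Matches "image: <repository>:<tag>" on one line, optionally as a list item.
IMAGE_LINE = re.compile(
    r"^([ \t]*(?:-[ \t]+)?image:[ \t]*\S+?):[\w.\-]+[ \t]*$",
    re.MULTILINE,
)


def set_image_tag(manifest: str, new_tag: str) -> str:
    """Return the manifest with the image tag replaced by new_tag."""
    updated, count = IMAGE_LINE.subn(
        lambda m: f"{m.group(1)}:{new_tag}", manifest
    )
    if count != 1:
        raise ValueError(f"expected exactly one image line, found {count}")
    return updated
```

Committing the changed file is what ArgoCD reacts to, so the commit in the gitops repo doubles as the deployment record.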
GitHub webhooks POST to the Tekton EventListener on every push. The EventListener:
- Validates the HMAC signature (prevents forgery)
- CEL-filters to only main and develop branches
- Extracts service name from repository name
- Creates a PipelineRun resource
Developers don't configure CI — pushing to main just works.
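The checks the EventListener performs can be sketched in plain Python to make the order of operations concrete. This is an approximation for illustration: Tekton implements these checks with its GitHub and CEL interceptors, `pipeline_params` is a hypothetical function, and the returned keys simply mirror the pipeline parameters used in the runbook (`git-url`, `git-revision`, `service-name`).

```python
import hashlib
import hmac
import json
from typing import Optional

ALLOWED_REFS = {"refs/heads/main", "refs/heads/develop"}


def signature_is_valid(secret: bytes, body: bytes, header: str) -> bool:
    """Compare the X-Hub-Signature-256 header with our own HMAC of the body."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected.encode(), header.encode())


def pipeline_params(secret: bytes, body: bytes, header: str) -> Optional[dict]:
    """Return PipelineRun parameters, or None if the event is rejected."""
    if not signature_is_valid(secret, body, header):
        return None
    payload = json.loads(body)
    ref = payload.get("ref", "")
    if ref not in ALLOWED_REFS:  # branch filter: main and develop only
        return None
    repo = payload["repository"]
    return {
        "git-url": repo["clone_url"],
        "git-revision": ref.split("/", 2)[2],  # "refs/heads/main" -> "main"
        "service-name": repo["name"],
    }
```

The signature is checked against the raw request body before any JSON parsing, so a forged payload is dropped without being interpreted.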
EKS cluster: Platform tooling (Backstage, ArgoCD, Tekton, Crossplane) runs on a dedicated EKS cluster separate from application workloads. This prevents platform tooling from competing with production workloads for resources, and limits blast radius — a misbehaving application workload cannot affect the CI/CD platform.
Backstage persistence:
- PostgreSQL on RDS (encrypted, Multi-AZ in production) — stores catalog entities, user sessions, scaffolder runs
- Redis (ElastiCache) — session cache for Backstage's backend
- S3 — TechDocs static assets (served via Backstage's TechDocs publisher)
External Secrets Operator: All application secrets (database credentials, API keys, OAuth tokens) are stored in AWS Secrets Manager and synced into Kubernetes Secrets by ESO. Secrets are never stored in git or in Kubernetes YAML files.
TechDocs S3 access: TechDocs assets are KMS-encrypted and blocked from public access. Only the Backstage service account's IRSA role can read the bucket.
This is what the developer actually does:
1. Open Backstage → Create → Spring Boot Microservice
2. Fill in 4 fields: name, owner, needs database?, PCI scoped?
3. Click "Create"
4. Wait ~2 minutes
5. Get output links:
- GitHub repo (with branch protection, CODEOWNERS, skeleton code)
- Backstage catalog page
- ArgoCD application (already deployed to dev)
- PagerDuty service
6. git clone the repo
7. ./mvnw spring-boot:run → service is running locally (the golden path CI builds with Maven)
8. git commit && git push → Tekton CI runs automatically
Compare to without the platform:
- Create GitHub repo → 5 minutes
- Set up Dockerfile → 30 minutes (if they know how to write a distroless multi-stage build)
- Set up Tekton/Jenkins pipeline → 2 hours
- Set up ArgoCD application → 30 minutes
- Set up Grafana dashboard → 1 hour
- Open ticket for RDS → 2 days wait
- Set up PagerDuty → 30 minutes
- Register in Backstage → 20 minutes
- Write catalog-info.yaml → 20 minutes
Total without platform: ~3 days (roughly 5 hours of hands-on work plus a 2-day wait for the RDS ticket). With platform: 5 minutes.
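The manual total can be checked by adding up the listed estimates. The short Python calculation below does that, treating the RDS ticket as calendar wait time rather than hands-on work; the variable names are illustrative.

```python
# Hands-on estimates from the list above, in minutes.
MANUAL_TASK_MINUTES = {
    "Create GitHub repo": 5,
    "Set up Dockerfile": 30,
    "Set up Tekton/Jenkins pipeline": 120,
    "Set up ArgoCD application": 30,
    "Set up Grafana dashboard": 60,
    "Set up PagerDuty": 30,
    "Register in Backstage": 20,
    "Write catalog-info.yaml": 20,
}
RDS_TICKET_WAIT_DAYS = 2  # waiting time, not effort

hands_on_minutes = sum(MANUAL_TASK_MINUTES.values())
hands_on_hours = hands_on_minutes / 60

print(f"Hands-on work: {hands_on_minutes} min ({hands_on_hours:.2f} h)")
print(f"Plus ticket wait: {RDS_TICKET_WAIT_DAYS} days")
```

The hands-on portion comes to a little over five hours, so most of the elapsed time in the manual path is the ticket wait.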
Backstage scorecards (via the TechInsights plugin) track engineering maturity across every service. Each service gets a score across dimensions:
| Dimension | Checks | Weight |
|---|---|---|
| Reliability | Has PagerDuty service, has runbook, has health check endpoint | High |
| Observability | Has Grafana dashboard, has Prometheus metrics, has distributed tracing | High |
| Security | Security scanning in CI, no critical CVEs, has DAST results | Critical |
| Documentation | TechDocs present, README exists, API contract documented | Medium |
| Ownership | Has owner team, owner team has oncall, has CODEOWNERS | High |
| Operations | Last deployed < 90 days, no failing ArgoCD sync | Medium |
Scores are visible on every catalog page. Teams with low scores get automated Slack nudges. Platform team reviews the org-wide scorecard monthly to identify systemic gaps.
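One way to turn the dimension table into a single number is a weighted average. The Python sketch below shows the idea. The numeric weights (Critical=3, High=2, Medium=1) and the `maturity_score` function are assumptions made for illustration; the document does not say how weights are combined.

```python
# Assumed numeric mapping for the qualitative weights in the table.
WEIGHTS = {"Critical": 3, "High": 2, "Medium": 1}

# Dimension -> weight label, as listed in the scorecard table.
DIMENSIONS = {
    "Reliability": "High",
    "Observability": "High",
    "Security": "Critical",
    "Documentation": "Medium",
    "Ownership": "High",
    "Operations": "Medium",
}


def maturity_score(pass_rates: dict) -> float:
    """Weighted score from 0 to 100.

    pass_rates maps a dimension to the fraction of its checks that pass
    (0.0 to 1.0); missing dimensions count as 0.
    """
    total = sum(WEIGHTS[label] for label in DIMENSIONS.values())
    earned = sum(
        WEIGHTS[label] * pass_rates.get(dim, 0.0)
        for dim, label in DIMENSIONS.items()
    )
    return round(100 * earned / total, 1)
```

With these assumed weights, a service that passes only its Security checks scores 27.3, which shows how heavily the Critical dimension counts relative to the others.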
# 1. Deploy platform infrastructure
cd terraform/environments/prod
terraform init
terraform plan -out=tfplan
terraform apply tfplan
# 2. Configure kubectl for the new cluster
aws eks update-kubeconfig --name platform-prod --region us-east-1
# 3. Deploy ArgoCD ApplicationSets
kubectl apply -f argocd/applicationsets/services-applicationset.yaml
# 4. Deploy Tekton resources
kubectl apply -f tekton/pipelines/
kubectl apply -f tekton/triggers/
# 5. Deploy Crossplane XRDs and Compositions
kubectl apply -f crossplane/xrds/
kubectl apply -f crossplane/compositions/
# 6. Create GitHub webhook for Tekton EventListener
LISTENER_URL=$(kubectl get svc tekton-triggers-eventlistener -n tekton-pipelines \
-o jsonpath='{.status.loadBalancer.ingress[0].hostname}')
echo "Add GitHub webhook to: http://${LISTENER_URL}:8080"
# 7. Verify Backstage is running
kubectl get pods -n backstage
kubectl logs -n backstage deployment/backstage
# 8. Access the portal (port-forward blocks, so open the URL from a second terminal)
kubectl port-forward -n backstage svc/backstage 3000:80
open http://localhost:3000

# Add catalog-info.yaml to the service repo (if not scaffolded via Backstage)
cat > catalog-info.yaml << 'EOF'
apiVersion: backstage.io/v1alpha1
kind: Component
metadata:
  name: my-service
  description: "My service"
  annotations:
    github.com/project-slug: company/my-service
    argocd/app-name: my-service-production
    backstage.io/kubernetes-id: my-service
spec:
  type: service
  lifecycle: production
  owner: group:my-team
EOF
git add catalog-info.yaml && git commit -m "feat: add Backstage catalog entry"

# 1. Verify staging is healthy
kubectl get pods -n staging -l app=order-service
# 2. Check ArgoCD staging status
argocd app get order-service-staging
# 3. Sync production in ArgoCD UI, or via CLI
argocd app sync order-service-production --revision HEAD
argocd app wait order-service-production --health --timeout 300
# 4. Verify production is healthy
kubectl rollout status deployment/order-service -n production

# List recent pipeline runs
tkn pipelinerun list -n tekton-pipelines --limit 10
# Get logs from a failed run
tkn pipelinerun logs <run-name> -n tekton-pipelines -f
# Get logs from a specific failed task
tkn taskrun logs <taskrun-name> -n tekton-pipelines
# Re-run a failed pipeline
tkn pipeline start microservice-ci-pipeline \
  -p git-url=https://github.com/company/order-service \
  -p git-revision=main \
  -p service-name=order-service \
  -w name=source,claimName=order-service-source-pvc \
  -n tekton-pipelines

# List all PostgreSQL claims
kubectl get postgresqlinstance -A
# Describe a specific claim (shows Crossplane events)
kubectl describe postgresqlinstance order-service-db -n production
# Get the connection secret (each value is base64-encoded, so decode one key at a time)
kubectl get secret order-service-db-secret -n production -o jsonpath='{.data.endpoint}' | \
  base64 -d
# Check managed resources
kubectl get managed | grep order-service

Backstage catalog not showing new services:
- Check the github-discovery location is configured in app-config.yaml
- Force a catalog refresh: Settings → Backstage → Catalog → Refresh
- Check Backstage logs: kubectl logs -n backstage deployment/backstage
ArgoCD Application stuck in Progressing:
# Check application events
argocd app get <app-name> --show-operation
# Check Kubernetes events
kubectl get events -n production --sort-by='.lastTimestamp' | tail -20
# Force a hard refresh
argocd app get <app-name> --hard-refresh

Crossplane claim stuck in Creating:
# Check Crossplane controller logs
kubectl logs -n crossplane-system deployment/crossplane
# Check the composite resource
kubectl describe xpostgresqlinstance -A
# Check managed resources for errors
kubectl get managed -o wide | grep -v Ready

Tekton EventListener not triggering:
# Check EventListener pod is running
kubectl get pods -n tekton-pipelines -l eventlistener=github-event-listener
# Test webhook manually
curl -X POST http://LISTENER_URL:8080 \
  -H "X-GitHub-Event: push" \
  -H "X-Hub-Signature-256: sha256=..." \
  -d '{"ref":"refs/heads/main","repository":{"name":"order-service","clone_url":"..."}}'
# Check trigger interceptor logs
kubectl logs -n tekton-pipelines -l app.kubernetes.io/component=interceptors

License: MIT