43 commits
ccf144e
Add network churn validation and enhance test execution with timestamps
agrawaliti Apr 2, 2025
1456095
Update argument parsing to allow optional parameters for network and …
agrawaliti Apr 2, 2025
6108368
Add CL2_CONFIG_FILE parameter to benchmark execution step
agrawaliti Apr 2, 2025
b7e23c0
Refactor network test parameters and update image version for network…
agrawaliti Apr 2, 2025
4dc8f36
Update AKS CLI module to use fixed version (#602)
sumanthreddy29 Apr 24, 2025
31b99cd
Merge remote-tracking branch 'origin/main' into itia/network-churn
agrawaliti Apr 25, 2025
e113c29
Refactor YAML files to improve formatting and maintain consistency
agrawaliti Apr 25, 2025
1e9ea17
finxing typo
agrawaliti Apr 29, 2025
ec34c12
Merge remote-tracking branch 'origin/main' into itia/network-churn
agrawaliti Apr 29, 2025
3a2e9f2
removing deploymentQPS from load config
agrawaliti Apr 29, 2025
5265717
Cleaning code
agrawaliti Apr 29, 2025
3c0657e
Adding schedule
agrawaliti Apr 29, 2025
612e5aa
updating time and fix lint
agrawaliti Apr 30, 2025
493e802
Merge remote-tracking branch 'origin/main' into itia/network-churn
agrawaliti Apr 30, 2025
eb57652
Merge remote-tracking branch 'origin/main' into itia/network-churn
agrawaliti May 7, 2025
3144df6
Add Azure Terraform configuration for network policy churn scenario
agrawaliti May 7, 2025
d28c866
Update resource management conditions to check for 'false' instead of…
agrawaliti May 8, 2025
49e0261
Terraform configuration for net role
agrawaliti May 8, 2025
1921bdf
Add post-provisioning step for Azure NPM configuration and update res…
agrawaliti May 9, 2025
02b6218
Remove unused parameters and simplify run ID handling in setup-tests …
agrawaliti May 9, 2025
5133c99
Remove npm_enabled configuration from AKS resource
agrawaliti May 9, 2025
f9e0579
Remove npm_enabled variable and its references from AKS configuration
agrawaliti May 9, 2025
5248935
Remove npm_enabled argument from Terraform input variables
agrawaliti May 9, 2025
bc80b8f
Add Azure NPM configuration step with conditional execution
agrawaliti May 13, 2025
7f645e8
Set npm_enabled to False in network-churn configuration
agrawaliti May 13, 2025
79e3863
Remove post-provisioning configuration step and clean up credential h…
agrawaliti May 13, 2025
67a992b
Add scale-cluster template to network-churn steps for resource valida…
agrawaliti May 13, 2025
de12301
debug steps
agrawaliti May 13, 2025
18e17d1
Refactor NPM_ENABLED check and update Azure NPM configuration file path
agrawaliti May 13, 2025
45593d1
Restore scale-cluster template and remove debug environment variables…
agrawaliti May 13, 2025
456f726
Rename "prompool" to "promnodepool" in extra node pool configuration
agrawaliti May 13, 2025
c0316fc
Update default node_count parameter to 24 in network-churn configuration
agrawaliti May 14, 2025
a0c6e8d
Remove hardcoded nodes_per_namespace value for network test in calcul…
agrawaliti May 14, 2025
5ce8813
Comment out the schedules section in cilium-network-churn configuration
agrawaliti May 14, 2025
3fdfb57
Update load-config to enable service test and fix total pods calculation
agrawaliti May 14, 2025
d250c00
Merge branch 'main' into itia/network-churn
agrawaliti May 15, 2025
adc8831
Refactor YAML configurations for clarity and consistency
agrawaliti May 15, 2025
a58ef9a
Merge branch 'main' into itia/network-churn
agrawaliti May 19, 2025
f02e171
Refactor test execution scripts to remove redundant timestamp setup a…
agrawaliti May 19, 2025
2188ec3
Update Azure Terraform inputs and validate resources configuration fo…
agrawaliti May 20, 2025
21c5c20
Merge remote-tracking branch 'origin' into itia/network-churn
agrawaliti May 20, 2025
a805425
Refactor Azure NPM configuration paths and migrate azure-npm.yaml to …
agrawaliti May 20, 2025
239c7bc
Add functionality to scrape kubelets metrics and capture npm metrics …
agrawaliti May 21, 2025
@@ -3,6 +3,13 @@
{{$cnp_test:= .cnp_test}}
{{$ccnp_test:= .ccnp_test}}

{{$EnableNetworkPolicyEnforcementLatencyTest := DefaultParam .EnableNetworkPolicyEnforcementLatencyTest false}}
{{$TargetLabelValue := DefaultParam .TargetLabelValue "enforcement-latency"}}
# Run a server pod for the network policy enforcement latency test only on every Nth pod.
# By default (N = 1), a server runs on every pod.
{{$NetPolServerOnEveryNthPod := 1}}
{{$RunNetPolicyTest := and $EnableNetworkPolicyEnforcementLatencyTest (eq (Mod .Index $NetPolServerOnEveryNthPod) 0)}}

{{$Image := DefaultParam .Image "mcr.microsoft.com/oss/kubernetes/pause:3.6"}}

apiVersion: apps/v1
@@ -18,7 +25,7 @@ spec:
replicas: {{.Replicas}}
selector:
matchLabels:
name: {{.Name}}
name: {{if $RunNetPolicyTest}}policy-load-{{end}}{{.Name}}
strategy:
type: RollingUpdate
rollingUpdate:
@@ -27,27 +34,43 @@ spec:
template:
metadata:
labels:
name: {{.Name}}
name: {{if $RunNetPolicyTest}}policy-load-{{end}}{{.Name}}
group: {{.Group}}
{{if .SvcName}}
svc: {{.SvcName}}-{{.Index}}
{{end}}
restart: {{.deploymentLabel}}
{{if $RunNetPolicyTest}}
net-pol-test: {{$TargetLabelValue}}
{{end}}
spec:
nodeSelector:
slo: "true"
{{if $RunNetPolicyTest}}
hostNetwork: false
containers:
- image: acnpublic.azurecr.io/scaletest/nginx:latest
name: nginx-server
ports:
- containerPort: 80
resources:
requests:
cpu: {{$CpuRequest}}
memory: {{$MemoryRequest}}
{{else}}
containers:
- env:
- name: ENV_VAR
value: a
image: {{$Image}}
imagePullPolicy: IfNotPresent
name: {{.Name}}
ports:
ports: []
resources:
requests:
cpu: {{$CpuRequest}}
memory: {{$MemoryRequest}}
{{end}}
# Add not-ready/unreachable tolerations for 15 minutes so that node
# failure doesn't trigger pod deletion.
tolerations:
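Reviewer note: the selection logic this template adds can be sketched in Python as below. This is an illustration only, not code from the PR. `runs_net_policy_test` and `deployment_name` are hypothetical helper names, and the `index` and `name` arguments stand in for the template's `.Index` and `.Name`.

```python
def runs_net_policy_test(enabled: bool, index: int, every_nth: int = 1) -> bool:
    """Model of $RunNetPolicyTest: the test is enabled AND index is a multiple of N."""
    return enabled and index % every_nth == 0


def deployment_name(name: str, enabled: bool, index: int, every_nth: int = 1) -> str:
    """Deployments selected as servers get the "policy-load-" prefix on their name label."""
    prefix = "policy-load-" if runs_net_policy_test(enabled, index, every_nth) else ""
    return prefix + name
```

With the hardcoded `$NetPolServerOnEveryNthPod := 1`, every deployment index satisfies the modulo check, so enabling the test switches all deployments rendered from this template to the nginx server container.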
69 changes: 67 additions & 2 deletions modules/python/clusterloader2/slo/config/load-config.yaml
@@ -2,6 +2,7 @@ name: load-config

# Config options for test type
{{$SERVICE_TEST := DefaultParam .CL2_SERVICE_TEST true}}
{{$NETWORK_TEST := DefaultParam .CL2_NETWORK_TEST false}}
{{$CNP_TEST := DefaultParam .CL2_CNP_TEST false}}
{{$CCNP_TEST := DefaultParam .CL2_CCNP_TEST false}}

@@ -14,7 +15,7 @@ name: load-config
{{$groupName := DefaultParam .CL2_GROUP_NAME "service-discovery"}}

# TODO(jshr-w): This should eventually use >1 namespace.
{{$namespaces := 1}}
{{$namespaces := DefaultParam .CL2_NO_OF_NAMESPACES 1}}
{{$nodes := DefaultParam .CL2_NODES 1000}}

{{$operationTimeout := DefaultParam .CL2_OPERATION_TIMEOUT "15m"}}
@@ -27,6 +28,7 @@ name: load-config
{{$podStartupLatencyThreshold := DefaultParam .CL2_POD_STARTUP_LATENCY_THRESHOLD "15s"}}

{{$CILIUM_METRICS_ENABLED := DefaultParam .CL2_CILIUM_METRICS_ENABLED false}}
{{$SCRAPE_KUBELETS := DefaultParam .CL2_SCRAPE_KUBELETS false}}
{{$SCRAPE_CONTAINERD := DefaultParam .CL2_SCRAPE_CONTAINERD false}}

# Service test
@@ -75,7 +77,7 @@ tuningSets:
timeLimit: {{$deletionTime}}s

steps:
- name: Log - namespaces={{$namespaces}}, nodesPerNamespace={{$nodesPerNamespace}}, podsPerNode={{$podsPerNode}}, totalPods={{$totalPods}}, podsPerNamespace={{$podsPerNamespace}}, bigDeploymentsPerNamespace={{$bigDeploymentsPerNamespace}}, smallDeploymentsPerNamespace={{$smallDeploymentsPerNamespace}}, bigGroupSize={{$BIG_GROUP_SIZE}}, smallGroupSize={{$SMALL_GROUP_SIZE}}, repeats={{$repeats}}, $saturationTime={{$saturationTime}}, $deletionTime={{$deletionTime}}
- name: Log - namespaces={{$namespaces}}, nodes={{$nodes}}, nodesPerNamespace={{$nodesPerNamespace}}, podsPerNode={{$podsPerNode}}, totalPods={{$totalPods}}, podsPerNamespace={{$podsPerNamespace}}, bigDeploymentsPerNamespace={{$bigDeploymentsPerNamespace}}, smallDeploymentsPerNamespace={{$smallDeploymentsPerNamespace}}, bigGroupSize={{$BIG_GROUP_SIZE}}, smallGroupSize={{$SMALL_GROUP_SIZE}}, repeats={{$repeats}}, $saturationTime={{$saturationTime}}, $deletionTime={{$deletionTime}}
measurements:
- Identifier: Dummy
Method: Sleep
@@ -96,6 +98,20 @@ steps:
action: start
{{end}}

{{if $NETWORK_TEST}}
- module:
path: /modules/network-policy/net-policy-metrics.yaml
params:
action: start

{{if $SCRAPE_KUBELETS}}
- module:
path: /modules/npm-measurements.yaml
params:
action: start
{{end}}
{{end}}

{{if $SCRAPE_CONTAINERD}}
- module:
path: /modules/containerd-measurements.yaml
@@ -133,6 +149,15 @@ steps:
ccnps: {{$CCNPS}}
{{end}}

{{if $NETWORK_TEST}}
- module:
path: modules/network-policy/net-policy-enforcement-latency.yaml
params:
setup: true
run: true
testType: "pod-creation"
{{end}}

- module:
path: /modules/reconcile-objects.yaml
params:
@@ -156,6 +181,26 @@ steps:
Group: {{$groupName}}
deploymentLabel: start

{{if $NETWORK_TEST}}
- module:
path: modules/network-policy/net-policy-metrics.yaml
params:
action: gather
usePolicyCreationMetrics: false

- module:
path: modules/network-policy/net-policy-enforcement-latency.yaml
params:
complete: true
testType: "pod-creation"

- module:
path: modules/network-policy/net-policy-enforcement-latency.yaml
params:
run: true
testType: "policy-creation"
{{end}}

- module:
path: /modules/reconcile-objects.yaml
params:
@@ -252,3 +297,23 @@ steps:
params:
action: gather
group: {{$groupName}}

{{if $NETWORK_TEST}}
- module:
path: modules/network-policy/net-policy-metrics.yaml
params:
action: gather

{{if $SCRAPE_KUBELETS}}
- module:
path: /modules/npm-measurements.yaml
params:
action: gather
{{end}}

- module:
path: modules/network-policy/net-policy-enforcement-latency.yaml
params:
complete: true
testType: "policy-creation"
{{end}}
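Reviewer note: the new switches in this file (`CL2_NETWORK_TEST`, `CL2_NO_OF_NAMESPACES`, `CL2_SCRAPE_KUBELETS`) are all read through `DefaultParam`, so they stay off or at their defaults unless overridden. A minimal sketch of producing such overrides follows, assuming they are passed to ClusterLoader2 as a flat `key: value` overrides file (the `--testoverrides` mechanism). `render_overrides` is a hypothetical helper, and the values in the usage note are placeholders rather than settings from this PR.

```python
def render_overrides(params: dict) -> str:
    """Render a flat mapping as `KEY: value` lines, one per parameter.

    Booleans are written in lowercase so that YAML parses them as booleans.
    """
    lines = []
    for key, value in params.items():
        if isinstance(value, bool):
            rendered = "true" if value else "false"
        else:
            rendered = str(value)
        lines.append(f"{key}: {rendered}")
    return "\n".join(lines) + "\n"
```

For example, `render_overrides({"CL2_NETWORK_TEST": True, "CL2_NO_OF_NAMESPACES": 5})` yields two lines, `CL2_NETWORK_TEST: true` and `CL2_NO_OF_NAMESPACES: 5`.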
@@ -0,0 +1,55 @@
{{$NETWORK_POLICY_ENFORCEMENT_LATENCY_BASELINE := DefaultParam .CL2_NETWORK_POLICY_ENFORCEMENT_LATENCY_BASELINE false}}
{{$NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_KEY := DefaultParam .CL2_NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_KEY "net-pol-test"}}
{{$NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_VALUE := DefaultParam .CL2_NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_VALUE "enforcement-latency"}}
{{$NET_POLICY_ENFORCEMENT_LATENCY_NODE_LABEL_VALUE := DefaultParam .CL2_NET_POLICY_ENFORCEMENT_LATENCY_NODE_LABEL_VALUE "net-policy-client"}}
{{$NET_POLICY_ENFORCEMENT_LATENCY_MAX_TARGET_PODS_PER_NS := DefaultParam .CL2_NET_POLICY_ENFORCEMENT_LATENCY_MAX_TARGET_PODS_PER_NS 100}}
{{$NET_POLICY_ENFORCEMENT_LOAD_COUNT := DefaultParam .CL2_NET_POLICY_ENFORCEMENT_LOAD_COUNT 1000}}
{{$NET_POLICY_ENFORCEMENT_LOAD_QPS := DefaultParam .CL2_NET_POLICY_ENFORCEMENT_LOAD_QPS 10}}
{{$NET_POLICY_ENFORCEMENT_LOAD_TARGET_NAME := DefaultParam .CL2_POLICY_ENFORCEMENT_LOAD_TARGET_NAME "small-deployment"}}

{{$setup := DefaultParam .setup false}}
{{$run := DefaultParam .run false}}
{{$complete := DefaultParam .complete false}}
{{$testType := DefaultParam .testType "policy-creation"}}
# Target port needs to match the server container port of target pods that have
# "targetLabelKey: targetLabelValue" label selector.
{{$targetPort := 80}}

steps:
{{if $setup}}
- name: Setup network policy enforcement latency measurement
measurements:
- Identifier: NetworkPolicyEnforcement
Method: NetworkPolicyEnforcement
Params:
action: setup
targetLabelKey: {{$NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_KEY}}
targetLabelValue: {{$NET_POLICY_ENFORCEMENT_LATENCY_TARGET_LABEL_VALUE}}
baseline: {{$NETWORK_POLICY_ENFORCEMENT_LATENCY_BASELINE}}
testClientNodeSelectorValue: {{$NET_POLICY_ENFORCEMENT_LATENCY_NODE_LABEL_VALUE}}
{{end}}

{{if $run}}
- name: "Run network policy enforcement latency measurement (testType={{$testType}})"
measurements:
- Identifier: NetworkPolicyEnforcement
Method: NetworkPolicyEnforcement
Params:
action: run
testType: {{$testType}}
targetPort: {{$targetPort}}
maxTargets: {{$NET_POLICY_ENFORCEMENT_LATENCY_MAX_TARGET_PODS_PER_NS}}
policyLoadCount: {{$NET_POLICY_ENFORCEMENT_LOAD_COUNT}}
policyLoadQPS: {{$NET_POLICY_ENFORCEMENT_LOAD_QPS}}
policyLoadTargetBaseName: {{$NET_POLICY_ENFORCEMENT_LOAD_TARGET_NAME}}
{{end}}

{{if $complete}}
- name: "Complete network policy enforcement latency measurement (testType={{$testType}})"
measurements:
- Identifier: NetworkPolicyEnforcement
Method: NetworkPolicyEnforcement
Params:
action: complete
testType: {{$testType}}
{{end}}
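Reviewer note: this module emits up to three steps depending on which of `setup`, `run`, and `complete` the caller passes. The sketch below models that gating in Python. It assumes `DefaultParam` falls back to the default when a parameter is missing, and `default_param` and `emitted_actions` are hypothetical names used only for illustration.

```python
def default_param(params: dict, key: str, default):
    """Assumed DefaultParam behaviour: use the default when the key is absent or None."""
    value = params.get(key)
    return default if value is None else value


def emitted_actions(params: dict) -> list:
    """Return (action, testType) pairs in the order the module would emit steps.

    The setup step does not pass testType to the measurement, so it is paired with None.
    """
    test_type = default_param(params, "testType", "policy-creation")
    actions = []
    if default_param(params, "setup", False):
        actions.append(("setup", None))
    if default_param(params, "run", False):
        actions.append(("run", test_type))
    if default_param(params, "complete", False):
        actions.append(("complete", test_type))
    return actions
```

Applied to the calls in `load-config.yaml`, the first invocation (`setup: true`, `run: true`, `testType: "pod-creation"`) produces a setup step followed by a run step, and a later invocation with only `complete: true` falls back to the `policy-creation` default.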
@@ -0,0 +1,122 @@
# Valid actions: "start", "gather"
{{$action := .action}}
{{$usePolicyCreationMetrics := DefaultParam .usePolicyCreationMetrics true}}
{{$usePodCreationMetrics := DefaultParam .usePodCreationMetrics true}}
{{$useCiliumMetrics := DefaultParam .useCiliumMetrics true}}

# CL2 params
# Negative default values are used to turn thresholds off if not overridden. Thresholds are only enabled with values of zero or higher.
{{$NP_ENFORCE_POLICY_CREATION_99_THRESHOLD_SECONDS := DefaultParam .CL2_NP_ENFORCE_POLICY_CREATION_99_THRESHOLD_SECONDS -1}}
{{$NP_ENFORCE_POD_CREATION_99_THRESHOLD_SECONDS := DefaultParam .CL2_NP_ENFORCE_POD_CREATION_99_THRESHOLD_SECONDS -1}}
{{$NP_ENFORCE_POD_IP_ASSIGNED_99_THRESHOLD_SECONDS := DefaultParam .CL2_NP_ENFORCE_POD_IP_ASSIGNED_99_THRESHOLD_SECONDS -1}}
{{$CILIUM_POLICY_IMPORTS_ERROR_THRESHOLD := DefaultParam .CL2_CILIUM_POLICY_IMPORTS_ERROR_THRESHOLD 0}}
{{$CILIUM_ENDPOINT_REGEN_FAIL_PERC_THRESHOLD := DefaultParam .CL2_CILIUM_ENDPOINT_REGEN_FAIL_PERC_THRESHOLD 0.01}}
{{$CILIUM_POLICY_REGEN_TIME_99_THRESHOLD := DefaultParam .CL2_CILIUM_POLICY_REGEN_TIME_99_THRESHOLD -1}}
{{$CILIUM_ENDPOINT_REGEN_TIME_99_THRESHOLD := DefaultParam .CL2_CILIUM_ENDPOINT_REGEN_TIME_99_THRESHOLD -1}}

steps:
- name: "{{$action}}ing network policy metrics"
measurements:
- Identifier: NetworkPolicyEnforcementLatency
Method: GenericPrometheusQuery
Params:
action: {{$action}}
metricName: "Network Policy Enforcement Latency"
metricVersion: v1
unit: s
queries:
# Network policy enforcement metrics gathered from the test clients.
{{if $usePolicyCreationMetrics}}
- name: PolicyCreation - TargetCount
query: sum(policy_enforcement_latency_policy_creation_seconds_count)
- name: PolicyCreation - Perc50
query: histogram_quantile(0.5, sum(policy_enforcement_latency_policy_creation_seconds_bucket) by (le))
- name: PolicyCreation - Perc90
query: histogram_quantile(0.9, sum(policy_enforcement_latency_policy_creation_seconds_bucket) by (le))
- name: PolicyCreation - Perc95
query: histogram_quantile(0.95, sum(policy_enforcement_latency_policy_creation_seconds_bucket) by (le))
- name: PolicyCreation - Perc99
query: histogram_quantile(0.99, sum(policy_enforcement_latency_policy_creation_seconds_bucket) by (le))
{{if ge $NP_ENFORCE_POLICY_CREATION_99_THRESHOLD_SECONDS 0}}
threshold: {{$NP_ENFORCE_POLICY_CREATION_99_THRESHOLD_SECONDS}}
{{end}}
{{end}}
{{if $usePodCreationMetrics}}
- name: PodCreation - TargetCount
query: sum(pod_creation_reachability_latency_seconds_count)
- name: PodCreation - Perc50
query: histogram_quantile(0.5, sum(rate(pod_creation_reachability_latency_seconds_bucket[%v])) by (le))
- name: PodCreation - Perc90
query: histogram_quantile(0.9, sum(rate(pod_creation_reachability_latency_seconds_bucket[%v])) by (le))
- name: PodCreation - Perc95
query: histogram_quantile(0.95, sum(rate(pod_creation_reachability_latency_seconds_bucket[%v])) by (le))
- name: PodCreation - Perc99
query: histogram_quantile(0.99, sum(rate(pod_creation_reachability_latency_seconds_bucket[%v])) by (le))
{{if ge $NP_ENFORCE_POD_CREATION_99_THRESHOLD_SECONDS 0}}
threshold: {{$NP_ENFORCE_POD_CREATION_99_THRESHOLD_SECONDS}}
{{end}}
- name: PodIpAssignedLatency - TargetCount
query: sum(pod_ip_address_assigned_latency_seconds_count)
- name: PodIpAssignedLatency - Perc50
query: histogram_quantile(0.50, sum(rate(pod_ip_address_assigned_latency_seconds_bucket[%v])) by (le))
- name: PodIpAssignedLatency - Perc90
query: histogram_quantile(0.90, sum(rate(pod_ip_address_assigned_latency_seconds_bucket[%v])) by (le))
- name: PodIpAssignedLatency - Perc95
query: histogram_quantile(0.95, sum(rate(pod_ip_address_assigned_latency_seconds_bucket[%v])) by (le))
- name: PodIpAssignedLatency - Perc99
query: histogram_quantile(0.99, sum(rate(pod_ip_address_assigned_latency_seconds_bucket[%v])) by (le))
{{if ge $NP_ENFORCE_POD_IP_ASSIGNED_99_THRESHOLD_SECONDS 0}}
threshold: {{$NP_ENFORCE_POD_IP_ASSIGNED_99_THRESHOLD_SECONDS}}
{{end}}
{{end}}

{{if $useCiliumMetrics}}
- Identifier: NetworkPolicyMetrics
Method: GenericPrometheusQuery
Params:
action: {{$action}}
metricName: "Network Policy Performance"
metricVersion: v1
unit: s
queries:
# Cilium agent metrics that are related to network policies.
- name: Number of times a policy import has failed
# To be replaced with the new Cilium metric that counts all policy changes, not just import errors.
# With that, this can be a percentage of failed imports.
# https://github.com/cilium/cilium/pull/23349
query: sum(cilium_policy_import_errors_total)
threshold: {{$CILIUM_POLICY_IMPORTS_ERROR_THRESHOLD}}
- name: Failed endpoint regenerations percentage
query: sum(cilium_endpoint_regenerations_total{outcome="fail"}) / sum(cilium_endpoint_regenerations_total) * 100
threshold: {{$CILIUM_ENDPOINT_REGEN_FAIL_PERC_THRESHOLD}}
- name: Policy regeneration time - Perc50
query: histogram_quantile(0.50, sum(cilium_policy_regeneration_time_stats_seconds_bucket{scope="total"}) by (le))
- name: Policy regeneration time - Perc99
query: histogram_quantile(0.99, sum(cilium_policy_regeneration_time_stats_seconds_bucket{scope="total"}) by (le))
{{if ge $CILIUM_POLICY_REGEN_TIME_99_THRESHOLD 0}}
threshold: {{$CILIUM_POLICY_REGEN_TIME_99_THRESHOLD}}
{{end}}
- name: Time between a policy change and it being fully deployed into the datapath - Perc50
query: histogram_quantile(0.50, sum(cilium_policy_implementation_delay_bucket) by (le))
- name: Time between a policy change and it being fully deployed into the datapath - Perc99
query: histogram_quantile(0.99, sum(cilium_policy_implementation_delay_bucket) by (le))
- name: Latency of policy update trigger - Perc50
query: histogram_quantile(0.50, sum(cilium_triggers_policy_update_call_duration_seconds_bucket{type="latency"}) by (le))
- name: Latency of policy update trigger - Perc99
query: histogram_quantile(0.99, sum(cilium_triggers_policy_update_call_duration_seconds_bucket{type="latency"}) by (le))
- name: Duration of policy update trigger - Perc50
query: histogram_quantile(0.50, sum(cilium_triggers_policy_update_call_duration_seconds_bucket{type="duration"}) by (le))
- name: Duration of policy update trigger - Perc99
query: histogram_quantile(0.99, sum(cilium_triggers_policy_update_call_duration_seconds_bucket{type="duration"}) by (le))
- name: Endpoint regeneration latency - Perc50
query: histogram_quantile(0.50, sum(cilium_endpoint_regeneration_time_stats_seconds_bucket{scope="total"}) by (le))
- name: Endpoint regeneration latency - Perc99
query: histogram_quantile(0.99, sum(cilium_endpoint_regeneration_time_stats_seconds_bucket{scope="total"}) by (le))
{{if ge $CILIUM_ENDPOINT_REGEN_TIME_99_THRESHOLD 0}}
threshold: {{$CILIUM_ENDPOINT_REGEN_TIME_99_THRESHOLD}}
{{end}}
- name: Number of policies currently loaded
query: avg(cilium_policy)
- name: Number of endpoints labeled by policy enforcement status
query: sum(cilium_policy_endpoint_enforcement_status)
{{end}}
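Reviewer note: most queries above are `histogram_quantile(...)` over cumulative `le` buckets, and a threshold is attached only when its parameter is zero or higher. The sketch below illustrates both ideas in Python. It is a simplified model, not Prometheus or ClusterLoader2 code: it assumes positive bucket bounds with a `+Inf` bucket present, and it assumes a sample fails when it exceeds the threshold. `histogram_quantile` here is a local function, and `threshold_violated` is a hypothetical helper.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate a quantile from (upper_bound, cumulative_count) pairs.

    Interpolates linearly inside the bucket that contains the target rank,
    treating the first bucket as starting at 0.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    buckets = sorted(buckets)
    if len(buckets) < 2:
        return math.nan
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return buckets[-1][0]


def threshold_violated(value: float, threshold: float) -> bool:
    """Negative thresholds are treated as disabled, mirroring the `ge ... 0` guards."""
    return threshold >= 0 and value > threshold
```

One point worth checking in review: the failed endpoint regeneration query multiplies by 100, so its default threshold of `0.01` reads as 0.01 percent, not 1 percent.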