Adding k8s network policy benchmarking test between cilium and npm#599

Open
agrawaliti wants to merge 43 commits into main from itia/network-churn

@agrawaliti agrawaliti commented Apr 17, 2025

Integrate network policy enforcement latency measurement.

Developed pipelines to compare network policy-related metrics between Azure CNI powered by Cilium and Azure CNI Overlay using Network Policy Manager (NPM).
All configuration, such as the number of nodes, pods, namespaces, and policies per namespace, can be updated in the pipeline.

This pull request introduces significant changes to support network policy enforcement latency testing and related metrics in the cluster loader configuration. It also includes updates to enable secondary cluster usage in competitive tests. Below is a categorized breakdown of the most important changes:

Network Policy Enforcement Latency Testing:

  • Added a new deployment configuration in deployment_template.yaml to support network policy enforcement latency tests, including conditional labels, a new server container, and specific configurations for test pods.
  • Introduced new modules net-policy-enforcement-latency.yaml and net-policy-metrics.yaml to define steps for measuring network policy enforcement latency and collecting related metrics.
  • Updated load-config.yaml to include parameters and steps for enabling and running network policy tests, such as pod creation and policy creation latency measurements.

Competitive Test Enhancements:

  • Added run_id_2 and use_secondary_cluster parameters to competitive-test.yml to allow tests to optionally use a secondary cluster. Updated job parameters to support these additions.

Configuration Updates:

  • Modified load-config.yaml to dynamically calculate the number of namespaces and pods per namespace based on new parameters.
  • Adjusted slo.py to account for network tests, including a hardcoded value for nodes per namespace during network testing.

Reconciliation Enhancements:

  • Updated reconcile-objects.yaml to include network policy-related parameters and integration with the new deployment template.
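
To make the Configuration Updates item more concrete, here is a minimal sketch of how a namespace count and a pods-per-namespace value could be derived from a few inputs. This is not the code in slo.py or load-config.yaml: the function name `namespace_layout`, its parameters, and the formula itself are assumptions chosen for illustration only. It takes nodes per namespace as an argument rather than a hardcoded constant.

```python
from typing import Tuple


def namespace_layout(node_count: int, nodes_per_namespace: int, pods_per_node: int) -> Tuple[int, int]:
    """Return (namespaces, pods_per_namespace) for a cluster of `node_count` nodes.

    Hypothetical helper: the formula is an assumption, not the PR's logic.
    """
    if min(node_count, nodes_per_namespace, pods_per_node) <= 0:
        raise ValueError("all inputs must be positive")
    # Never group more nodes into a namespace than the cluster actually has.
    group_size = min(node_count, nodes_per_namespace)
    # One namespace per full group of nodes (integer division drops any remainder).
    namespaces = node_count // group_size
    # Each namespace hosts the pods of every node in its group.
    pods_per_namespace = group_size * pods_per_node
    return namespaces, pods_per_namespace
```

Under these assumptions, `namespace_layout(240, 24, 10)` returns `(10, 240)`: ten namespaces of 240 pods each. The values 24 and 10 are illustrative only.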

Pipeline: https://dev.azure.com/akstelescope/telescope/_build?definitionId=41

Dashboard with new metrics: https://dataexplorer.azure.com/dashboards/e033bb3b-2cf4-4263-b41b-31597a8c4401?p-_startTime=24hours&p-_endTime=now&p-_cluster=v-cilium_network_churn_main&p-_test-type=v-default-config#5117e0aa-eb12-4f7f-b55d-6ffba1eab4ad

Comment thread modules/python/clusterloader2/slo/slo.py Outdated
Contributor

@MikeZappa87 MikeZappa87 left a comment

RCE present

@agrawaliti agrawaliti requested a review from MikeZappa87 April 25, 2025 13:26
sumanthreddy29 and others added 10 commits April 25, 2025 14:28
This pull request updates the Azure AKS CLI preview extension
configuration in the Terraform module to specify a version for the
extension.

* `modules/terraform/azure/aks-cli/main.tf`: Added the `--version` flag
  with a value of `14.0.0b2` to ensure a specific version of the
  `aks-preview` extension is used.
Comment thread steps/setup-tests.yml Outdated
Comment thread steps/topology/network-churn/collect-clusterloader2.yml
Comment thread jobs/competitive-test.yml Outdated
Comment thread steps/execute-tests.yml Outdated
default: {}

steps:
- script: |
Collaborator

Same here. Please revert this, as it will affect all pipelines.

Author

Is that a problem? It's just collecting the start time and showing it.

Contributor

As a best practice, minimize the code changes to unrelated areas. While this most likely has 0 impact, we should scope changes to impact just our code.

Collaborator

@alyssa1303 alyssa1303 May 19, 2025

You can put this in steps/topology/network-churn/validate-resources.yml. We only change framework files if we want the change to apply to all tests.

Contributor

@jshr-w jshr-w left a comment

Please share the runs on our default Cluster Churn and Service Churn pipelines off this branch first so we can make sure they aren't broken.

Comment thread modules/python/clusterloader2/slo/config/load-config.yaml Outdated
@agrawaliti agrawaliti requested review from alyssa1303 and jshr-w May 15, 2025 13:11
Comment thread steps/collect-telescope-metadata.yml Outdated
Comment on lines +93 to +94
jq --arg telescope_run_id $RUN_ID --arg start_timestamp $START \
-c '. + {telescope_run_id: $telescope_run_id, start_timestamp: $start_timestamp}' $(TEST_RESULTS_FILE) > temp-$RUN_ID.json \
Collaborator

This change affects all tests running in Telescope. Please revert this change and collect the start time as part of the collect implementation for this test.
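
For readers less familiar with jq, the two lines under review merge a `telescope_run_id` and a `start_timestamp` field into each JSON object in the results file and write the result in compact form. A rough Python equivalent is sketched below. The helper name `add_run_metadata` is hypothetical, and the sketch assumes one JSON object per line, whereas jq also accepts objects that span several lines.

```python
import json
from typing import Iterable, Iterator


def add_run_metadata(lines: Iterable[str], run_id: str, start_timestamp: str) -> Iterator[str]:
    """Yield each JSON object from `lines` with the run metadata merged in."""
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines between records
        record = json.loads(line)
        # Like jq's `. + {...}`: keys on the right-hand side win on conflict.
        # `--arg` always passes strings, so both values are stored as strings.
        record.update(
            {"telescope_run_id": str(run_id), "start_timestamp": str(start_timestamp)}
        )
        # Compact separators approximate the output of `jq -c`.
        yield json.dumps(record, separators=(",", ":"))
```

Keeping this kind of merge inside the test's own collect step, as requested above, leaves the shared metadata step unchanged for every other pipeline.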

Comment on lines +15 to +20
- template: /steps/engine/clusterloader2/cilium/scale-cluster.yml
  parameters:
    role: net
    region: ${{ parameters.regions[0] }}
    nodes_per_nodepool: 240
    enable_autoscale: "false"
Collaborator

This step should no longer be needed, as we use Terraform to create a new cluster for every run. You should update the Terraform file to create the cluster with your desired number of nodes.

},
{
name = "userpool0"
node_count = 0
Collaborator

Just put 240 here and disable auto-scaling. This should create a cluster with 240 nodes in the user pool.

Comment thread scripts/azure-npm.yaml
Collaborator

This file can go in a folder called kubernetes under your scenario folder: scenarios/perf-eval/network-policy-churn/. That's where we usually store the k8s files applied for a specific scenario. Example: file

5 participants