sig-node e2e tests machine hardware requirements #7339

ffromani · 2024-09-25T06:43:18Z

sig-node owns a set of features that expose and use hardware details, and exercising that code requires specific hardware features. Examples are exclusive CPU allocation (cpumanager), device allocation (device manager), NUMA alignment (topology manager), and NUMA alignment that considers distances between NUMA zones (topology manager).

Note: some requirements overlap. Easy example: a powerful high-end (at the time of writing) server CPU can at the same time have a high core count, expose multiple NUMA nodes, and have split L3, satisfying all the cpumanager requirements in one go.

Hardware requirements by feature, with rationale

cpumanager (GA): x86_64, arm: a machine with at least 4 cores exposed, preferably 16 or more. More cores let us run different sets of tests in different scenarios; 4 cores is the minimum to run basic tests of the feature. We already have machines with 4 cores.
topology manager (GA): x86_64, arm: a machine with 2 or more NUMA nodes. We need to align on NUMA nodes, so we need 2 or more nodes to begin with.
topology manager (KEP Improved multi-numa alignment in Topology Manager enhancements#3545): x86_64, arm: a machine with 4 or more NUMA nodes. We need to consider NUMA distances in allocation, so we need 4 or more nodes with different distances between each other.
topology manager (KEP KEP-4622: Add a TopologyManager policy option for MaxAllowableNUMANodes enhancements#4622): arm only?: a machine with 9 or more NUMA nodes (!), like grace gpus.
cpumanager (KEP KEP-4800: Split UnCoreCache awareness enhancements#4810): x86_64, arm: a machine with split (non-uniform) L3 cache, like EPYC CPUs.
cpumanager (KEP Add CPUManager policy option to align CPUs by Socket instead of by NUMA node enhancements#3327): x86_64, arm: a machine with multiple CPU sockets, in order to exercise alignment by socket.
devicemanager (GA): x86_64, arm: hardware devices controlled by device plugins. The most common use cases are SR-IOV cards and GPUs. One device is the bare minimum; we would like 2 or more so we can also use them in the topology manager tests as a deciding factor (see topology manager in this list).
memorymanager (Beta, graduating to GA): x86_64, arm: fully overlaps with the topology manager requirements; listed here for the sake of completeness.
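
The list above can be read as a set of thresholds on a machine's topology. The following is a minimal sketch of that reading, not an agreed-upon check: `MachineTopology` and `coverable_features` are hypothetical names, "split L3" is approximated as "more L3 cache groups than sockets", and "different distances" is approximated as "at least two distinct distance values between different NUMA nodes". Both approximations are assumptions.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MachineTopology:
    """Hypothetical summary of what a candidate test machine exposes."""
    cores: int
    numa_nodes: int
    sockets: int
    l3_domains: int                 # distinct L3 cache groups on the machine
    distinct_remote_distances: int  # distinct distances between different NUMA nodes
    plugin_devices: int             # devices managed by device plugins


def coverable_features(m: MachineTopology) -> dict[str, bool]:
    """Map each requirement from the list to whether this machine meets it."""
    return {
        "cpumanager (GA)": m.cores >= 4,
        "topology manager (GA)": m.numa_nodes >= 2,
        # Assumption: "different distances" means >= 2 distinct remote distances.
        "topology manager (enhancements#3545)":
            m.numa_nodes >= 4 and m.distinct_remote_distances >= 2,
        "topology manager (enhancements#4622)": m.numa_nodes >= 9,
        # Assumption: split L3 shows up as more L3 groups than sockets.
        "cpumanager (enhancements#4810)": m.l3_domains > m.sockets,
        "cpumanager (enhancements#3327)": m.sockets >= 2,
        "devicemanager (GA)": m.plugin_devices >= 1,
        # memorymanager overlaps with the topology manager GA requirement.
        "memorymanager": m.numa_nodes >= 2,
    }


if __name__ == "__main__":
    # Made-up dual-socket machine with one L3 per socket and one plugin device.
    example = MachineTopology(cores=32, numa_nodes=2, sockets=2, l3_domains=2,
                              distinct_remote_distances=1, plugin_devices=1)
    for feature, ok in coverable_features(example).items():
        print(f"{feature}: {'ok' if ok else 'not covered'}")
```

A table like this makes the overlap between requirements visible: one sufficiently large machine can turn most entries to true at once.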

This list will be updated after further review of the ongoing sig-node features.

ffromani · 2024-09-25T06:43:44Z

tagging some relevant sig-node people: @kannon92 @PiotrProkop @klueska

ffromani · 2024-09-25T06:53:58Z

slack thread for context: https://kubernetes.slack.com/archives/CCK68P2Q2/p1727202732284529

ameukam · 2024-09-25T06:55:24Z

cc @dims @upodroid @BenTheElder

BenTheElder · 2024-09-25T15:20:28Z

We have EC2 and GCE pretty well set up at the moment; do any of the machine types available there meet your requirements?

Please make sure any new resources you use on any platform are handled by the kubernetes-sigs/boskos cleanup scripts.
If you're using GCP projects / AWS accounts with VMs that should already work.

ffromani · 2024-09-25T15:36:08Z

@catblade kindly pointed out that Equinix donated cloud credits; their offering also seems interesting and maybe we can use it. Some CNCF TAGs already make use of it. This is the reference I got: https://github.com/cncf-tags/green-reviews-tooling/

ameukam · 2024-09-25T15:56:38Z

@ffromani why can't we use AWS EC2 instances to run those tests ?

ameukam · 2024-09-25T15:58:39Z

CPU architectures available:

GCP: https://cloud.google.com/compute/docs/cpu-platforms
AWS:

ffromani · 2024-09-25T16:01:00Z

> @ffromani why can't we use AWS EC2 instances to run those tests ?

I think we totally can, I'm not aware of any blocker. The efforts in this area have been somewhat sparse; we're taking the opportunity of sig-node 1.32 planning to re-evaluate and improve the current state. I will review the GCP/AWS offerings and comment.

BenTheElder · 2024-09-25T18:48:27Z

> @catblade kindly pointed out equinix donated cloud credits and their offering seems also interesting and maybe we can use it. Some CNCF TAGs already make use of it. This is the reference I got: https://github.com/cncf-tags/green-reviews-tooling/

Yes, however we generally run critical infra on Kubernetes-specific resource allocations, and we don't currently have a lot set up to manage this. For Equinix, SIG K8s Infra doesn't currently have observability into the amount of resources available and the spending trends, which has bitten us in the past (see reports like https://kubernetes.slack.com/archives/CCK68P2Q2/p1727127173398879 for some of the others).

(@dims does have cs.k8s.io running on equinix currently, we also have some presence in DO and Azure but not as mature yet, and Fastly for CDN)

It would be easier if we can use one of the vendors for which we already have tooling (like https://github.com/kubernetes-sigs/boskos) set up, to avoid resource leaks etc.

Otherwise we need help to invest in and onboard new resource types, observability into utilization and remaining credits, etc.

ffromani · 2024-10-21T08:01:01Z

review of the GCP offering:
AMD: https://cloud.google.com/compute/docs/cpu-platforms#amd_processors
We do have some EPYCs which seem to fit all the requirements. We perhaps need to double-check the > 2 NUMA nodes requirement, which brings us to C2D or Tau T2D instances, which are surely good:

c2d >= c2d-standard-32 <- note SMT is enabled here
tau-t2d >= t2d-standard-32 <- note SMT is disabled here

Considering the SMT support status, c2d seems better for our purposes.
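
Whether SMT is actually visible inside a given instance can be checked from the guest. Below is a minimal sketch, assuming a Linux guest that exposes the standard `thread_siblings_list` files under `/sys/devices/system/cpu`; the helper names `parse_cpu_list` and `smt_enabled` are hypothetical, and the result only reflects what the hypervisor chooses to expose.

```python
from __future__ import annotations

from pathlib import Path


def parse_cpu_list(text: str) -> set[int]:
    """Parse a kernel CPU list such as '0-3,8' into a set of CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue
        if "-" in part:
            lo, hi = part.split("-")
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def smt_enabled(sysfs_cpu: Path = Path("/sys/devices/system/cpu")) -> bool:
    """Return True if any CPU reports more than one hardware thread per core."""
    pattern = "cpu[0-9]*/topology/thread_siblings_list"
    for siblings in sysfs_cpu.glob(pattern):
        if len(parse_cpu_list(siblings.read_text())) > 1:
            return True
    return False


if __name__ == "__main__":
    print("SMT visible in this guest:", smt_enabled())
```

Running this on one c2d and one t2d instance would be a cheap way to confirm the SMT notes above before committing to a machine type.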

Intel: https://cloud.google.com/compute/docs/cpu-platforms#intel_processors
IIRC we already have lanes running on n1 or n2 instances, which should be good for everything except the split L3 and > 2 NUMA nodes requirements:

n2 >= n2-standard-32

Action item: review current usage of Intel n2 instances and comment here.

ameukam · 2024-10-21T08:25:40Z

Can we also get the same analysis on AWS ? and possibly on Azure ?

ffromani · 2024-10-21T08:26:34Z

> Can we also get the same analysis on AWS ? and possibly on Azure ?

yes, ongoing

ffromani · 2024-10-21T08:49:14Z

review of the AWS offering:
Looking at the available instance types, the best candidates (all factors considered, including budget-aware selection) seem to be "compute intensive" or "memory intensive".

compute intensive:
Intel: c5 >= c5.12xlarge seems OK but is unlikely to be multi-NUMA and multi-socket (the docs seem more opaque here). It is also unclear whether it has split L3.
AMD: I managed to find some deep-dive information, and c5a >= c5a.8xlarge seems OK. Noteworthy: these instances do seem to have split L3, but whether they are multi-NUMA needs to be verified. They are unlikely to be multi-socket.

memory intensive:
AMD: there are interesting EPYC 7571 CPUs which, according to chatter on the internet, seem to be AWS-specific versions of the AMD 7601 SKUs, which in turn do have split L3; the multi-NUMA and multi-socket status is still unclear.
r5a >= r5a.16xlarge seems to be OK with the usual caveats

Intel: likewise, with the usual caveat,
r5 >= r5.16xlarge seems to be OK

ffromani · 2024-10-21T08:51:32Z

So I need to figure out the multi-NUMA and multi-socket situation on both AWS and GCP.
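
One way to settle this is to boot a candidate instance and read the topology the guest actually sees. The sketch below assumes a Linux guest with the usual sysfs layout under `/sys/devices/system` (`node/nodeN`, `cpu/cpuN/topology/physical_package_id`, `cpu/cpuN/cache/indexM/level` and `shared_cpu_list`); the function names are hypothetical, and the numbers only describe what the hypervisor exposes, which may differ from the physical host.

```python
from __future__ import annotations

from pathlib import Path

SYSFS = Path("/sys/devices/system")


def count_numa_nodes(sysfs: Path = SYSFS) -> int:
    """Count nodeN directories, one per NUMA node visible to the guest."""
    return sum(1 for p in (sysfs / "node").glob("node[0-9]*") if p.is_dir())


def count_sockets(sysfs: Path = SYSFS) -> int:
    """Count distinct physical package ids across all CPUs."""
    pattern = "cpu[0-9]*/topology/physical_package_id"
    return len({f.read_text().strip() for f in (sysfs / "cpu").glob(pattern)})


def count_l3_domains(sysfs: Path = SYSFS) -> int:
    """Count distinct groups of CPUs that share a level 3 cache."""
    groups = set()
    for level in (sysfs / "cpu").glob("cpu[0-9]*/cache/index[0-9]*/level"):
        if level.read_text().strip() == "3":
            shared = level.parent / "shared_cpu_list"
            groups.add(shared.read_text().strip())
    return len(groups)


if __name__ == "__main__":
    print("NUMA nodes:", count_numa_nodes())
    print("sockets:   ", count_sockets())
    print("L3 domains:", count_l3_domains())
```

More L3 domains than sockets would hint at split L3, and more than one socket or NUMA node would answer the open questions for that instance type.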

Chances are high that if we require a "high enough" core count, we exceed the silicon limits and the cloud provider is then forced to give us multi-socket instances (a single physical socket can only have so many cores), but this is, and likely will remain, us second-guessing them.
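
The core-count argument is simple arithmetic. As a rough sketch (the function name and all example numbers are hypothetical, and the real vCPU-to-core mapping is up to the provider), the minimum number of sockets an instance must span is:

```python
import math


def min_sockets(vcpus: int, threads_per_core: int, max_cores_per_socket: int) -> int:
    """Lower bound on sockets needed to back `vcpus` hardware threads.

    Assumes every vCPU maps to one hardware thread, which a provider
    is not obliged to guarantee.
    """
    if min(vcpus, threads_per_core, max_cores_per_socket) <= 0:
        raise ValueError("all arguments must be positive")
    physical_cores = math.ceil(vcpus / threads_per_core)
    return math.ceil(physical_cores / max_cores_per_socket)


if __name__ == "__main__":
    # Made-up numbers: 112 vCPUs, 2 threads per core, 28 cores per socket.
    print(min_sockets(112, 2, 28))
```

This is only a lower bound: a provider may still spread a smaller instance across sockets, or present a single virtual socket regardless of the physical layout.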

In addition, I need to review the current usage for periodic non-presubmit jobs. IIRC we have n1 or n2 instances; I will check and comment here and cross-link old conversations.

Coming up next: Azure.

ffromani added the sig/k8s-infra label Sep 25, 2024

PiotrProkop mentioned this issue Sep 25, 2024: Improved multi-numa alignment in Topology Manager kubernetes/enhancements#3545 (Open)

BenTheElder mentioned this issue Sep 25, 2024: Multi-NUMA systems for testing on Kubernetes test infrastructure kubernetes/test-infra#28211 (Closed)

This was referenced Oct 1, 2024:
KEP-4800: Split UnCoreCache awareness kubernetes/enhancements#4810 (Merged)
KEP-3545: graduate to GA kubernetes/enhancements#4882 (Merged)

PiotrProkop mentioned this issue Oct 24, 2024: add e2e tests for prefer-closest-numa-nodes TopologyManagerPolicyOption kubernetes/kubernetes#127922 (Merged)
