VMware vSphere CPU topology effects on VDI performance: NUMA, vNUMA, and CPU scheduling

Introduction
Understanding the technology
Problem Statement
Research Objectives
Environment Configuration
Methodology
Results
Conclusion and Take Away

Introduction

CPU, NUMA, and vNUMA topology choices, including single socket versus multiple sockets, cores per socket, and virtual NUMA alignment, are often discussed in VDI design. At the same time, they are sometimes completely ignored in deployments.

Historically, administrators preferred specific layouts due to scheduler behavior, licensing considerations and early hypervisor co-scheduling constraints. With the evolution of hypervisors and operating systems, it is important to validate whether manual CPU topology tuning still provides measurable benefits.

This research evaluates the impact of CPU topology on virtual desktop environments running a standard knowledge worker workload, to determine whether topology tuning provides measurable benefits for performance and efficiency.

Although the test environment is based on Omnissa Horizon Instant Clones, the findings are not limited to this platform and are expected to apply to other VDI solutions as well, including Citrix Virtual Apps and Desktops and Microsoft RDS.

Understanding the technology

NUMA Architecture

Servers use Non-Uniform Memory Access (NUMA) architecture, where each CPU socket has its own local memory. Accessing memory from another socket introduces latency. VMware ESXi detects the physical NUMA topology and automatically aligns virtual CPUs and memory to maintain locality whenever possible. If a virtual machine fits within a single physical NUMA node, ESXi keeps its vCPUs and memory within that node. Only when a VM exceeds the capacity of a physical NUMA node does ESXi distribute it across multiple nodes and expose virtual NUMA to the guest OS. Improvements such as relaxed co-scheduling and NUMA-aware CPU scheduling have significantly reduced historical performance penalties.
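
The placement rule described above can be sketched as a simple fit check. This is a simplified illustration and not ESXi's actual scheduler logic: the names `NumaNode` and `fits_single_numa_node`, as well as the example node size, are assumptions made for this sketch, and the check counts physical cores only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NumaNode:
    """Capacity of one physical NUMA node (illustrative model)."""
    cores: int       # physical cores, hyperthreads not counted
    memory_gb: int   # local memory attached to the node


def fits_single_numa_node(vcpus: int, vm_memory_gb: int, node: NumaNode) -> bool:
    """Return True when a VM's vCPUs and memory both fit inside one node.

    Simplified assumption: a VM "fits" when its vCPU count does not exceed
    the node's physical core count and its memory does not exceed the
    node's local memory. The real ESXi scheduler weighs more factors.
    """
    return vcpus <= node.cores and vm_memory_gb <= node.memory_gb


# Hypothetical node size, chosen only for illustration.
example_node = NumaNode(cores=8, memory_gb=384)

small_desktop = fits_single_numa_node(vcpus=4, vm_memory_gb=8, node=example_node)
wide_vm = fits_single_numa_node(vcpus=12, vm_memory_gb=64, node=example_node)
print(small_desktop, wide_vm)
```

In this model the 4-vCPU desktop stays inside one node, while the 12-vCPU VM is the kind of machine that would be spread across nodes and see virtual NUMA.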

Historical vNUMA Context

Before vSphere 5.0, the guest operating system was not aware of the underlying NUMA topology, which could result in larger VMs accessing memory across multiple NUMA nodes. This cross-node memory access increased latency and reduced performance.

vSphere 5.0 introduced virtual NUMA (vNUMA), allowing virtual machines that exceed the capacity of a single physical NUMA node to expose multiple virtual NUMA nodes to the guest operating system. This enables NUMA-aware guest operating systems to optimize memory allocation and CPU scheduling based on the topology presented by the hypervisor. ESXi exposes this topology automatically once a virtual machine becomes large enough to span multiple physical NUMA nodes.

In later vSphere releases, the NUMA scheduler and automatic placement logic were further refined. For example, starting with vSphere 6.5, NUMA client creation was decoupled from the virtual machine’s cores-per-socket configuration, allowing ESXi to determine the optimal virtual NUMA topology more intelligently while still allowing administrators to configure advanced NUMA settings when required.

In recent vSphere versions, administrators can influence the virtual NUMA topology through settings such as NUMA nodes per socket in the vSphere Client.

When multiple vNUMA nodes are presented to a VM, the guest operating system detects this topology and schedules workloads accordingly. Modern operating systems are NUMA-aware and can optimize thread scheduling and memory allocation based on the NUMA layout exposed by the hypervisor.

Guest OS NUMA Awareness

Changing the number of vNUMA nodes in the VM configuration changes the NUMA topology exposed to the guest operating system.

Figure: NUMA nodes configuration on a virtual machine

This can be verified within the guest VM, for example in a Windows VM using Windows performance counters or WMI queries such as the following PowerShell command:

# Count the NUMA nodes Windows reports, skipping the "_Total" aggregate instance
@(Get-CimInstance -ClassName Win32_PerfFormattedData_PerfOS_NUMANodeMemory | Where-Object { $_.Name -ne "_Total" }).Count

For example, configuring 4 vNUMA nodes on a virtual machine returns 4 for this command.

Figure: Number of vNUMA nodes seen by the virtual machine

However, if a VM still fits within a single physical NUMA node, increasing the number of vNUMA nodes does not change the underlying physical memory locality. It only changes the topology presented to the guest OS, which makes it possible to test operating system scheduling behavior without spanning multiple physical NUMA nodes. For more detail, see the chapter "Sockets and NUMA" in this performance study.

Problem Statement

Despite advancements in hypervisor scheduling and automatic vNUMA alignment, CPU topology recommendations are still frequently referenced in VDI design discussions. Traditional guidance often suggests carefully configuring cores per socket or manually defining virtual NUMA nodes to optimize performance, even for relatively small virtual desktops.

In more recent versions of VMware ESXi, the scheduler is capable of automatically aligning vCPUs and memory to physical NUMA boundaries.

For virtual machines that fit entirely within a single physical NUMA node, manual CPU topology adjustments may therefore have limited or no practical impact.

However, VDI environments often operate under conditions that differ from typical server workloads. High host density, bursty login storms, and simultaneous user activity may amplify scheduling behavior or introduce contention that could expose performance differences between CPU topology configurations.

This raises the question of whether historical CPU topology tuning practices remain relevant for VDI environments.

Research Objectives

Does vCPU topology influence user experience in VDI environments, as measured by LoginVSI EUXscore?
Does splitting vCPUs across multiple vNUMA nodes provide measurable performance improvements?
Does manual vNUMA configuration provide benefits compared to automatic vNUMA placement in ESXi?

Environment Configuration

The infrastructure used for this validation was selected to provide sufficient compute, memory, and storage resources for high-density VDI workloads, while minimizing external bottlenecks that could influence CPU topology results.

Resource Infrastructure

Hypervisor: VMware ESXi 8.0.3 Update 3
Host: Cisco UCSC-C240-M4S2
CPU: 2x Intel Xeon CPU E5-2680 v4, 56 Logical Processors including Hyperthreading
Memory: 256 GB DRAM
Storage: Multiple SAS SSD local datastores

VDI Infrastructure

Hypervisor: VMware ESXi 8.0.3 Update 3
Host: Cisco UCSC-C240-M5SX
CPU: 2x Intel Xeon Gold 6144 @ 3.50GHz, 32 Logical Processors including Hyperthreading
Memory: 768 GB DRAM
Storage: Intel Optane P4800X 750 GB
Memory is evenly distributed across the two CPUs, forming two physical NUMA nodes with 8 physical cores and 384 GB of RAM each; hyperthreads are not counted in the NUMA node core count.

VDI Platform

Omnissa Horizon 2512
64 Instant Clones
Windows 11 25H2
Microsoft 365 Apps for Enterprise
4 vCPUs per Instant Clone, resulting in a ratio of 16 vCPUs per physical core (8 vCPUs per logical processor with hyperthreading)
Each virtual desktop is configured with 8 GB of fully reserved RAM, which rules out latency that could otherwise arise from memory management mechanisms.

With 4 vCPUs per virtual desktop, each VM resides entirely within a single physical NUMA node, ensuring local memory access and minimal cross-node latency.

This configuration results in the following NUMA layout:

Section	NUMA Node 0	NUMA Node 1
Virtual vCPUs	0–127 (32 VMs × 4 vCPUs each)	128–255 (32 VMs × 4 vCPUs each)
Virtual RAM Assigned	32 × 8 GB = 256 GB	32 × 8 GB = 256 GB
Physical DRAM	384 GB	384 GB
Physical Cores	8 cores (Core 0–7)	8 cores (Core 10–17)
Threads	16 HT (2 per core)	16 HT (2 per core)
PCPU Blocks	8 PCPU blocks	8 PCPU blocks
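
The layout in the table follows directly from the host and desktop figures listed above. The short calculation below is a sketch that reproduces those numbers; the only assumption is that the 64 desktops are spread evenly across the two NUMA nodes, as the table shows.

```python
# Figures taken from the environment description above.
SOCKETS = 2
CORES_PER_SOCKET = 8
THREADS_PER_CORE = 2
HOST_MEMORY_GB = 768
DESKTOPS = 64
VCPUS_PER_DESKTOP = 4
MEMORY_PER_DESKTOP_GB = 8

physical_cores = SOCKETS * CORES_PER_SOCKET
logical_processors = physical_cores * THREADS_PER_CORE
total_vcpus = DESKTOPS * VCPUS_PER_DESKTOP

# CPU overcommitment, expressed per physical core and per logical processor.
vcpus_per_core = total_vcpus / physical_cores
vcpus_per_logical_cpu = total_vcpus / logical_processors

# Assumption: desktops are spread evenly over the two NUMA nodes.
desktops_per_node = DESKTOPS // SOCKETS
vcpus_per_node = desktops_per_node * VCPUS_PER_DESKTOP
reserved_gb_per_node = desktops_per_node * MEMORY_PER_DESKTOP_GB
node_memory_gb = HOST_MEMORY_GB // SOCKETS
free_gb_per_node = node_memory_gb - reserved_gb_per_node

print(f"Total vCPUs:                  {total_vcpus}")
print(f"vCPUs per physical core:      {vcpus_per_core:.0f}")
print(f"vCPUs per logical processor:  {vcpus_per_logical_cpu:.0f}")
print(f"vCPUs per NUMA node:          {vcpus_per_node}")
print(f"Reserved RAM per NUMA node:   {reserved_gb_per_node} GB of {node_memory_gb} GB")
print(f"Unreserved RAM per NUMA node: {free_gb_per_node} GB")
```

Memory is far from overcommitted in this layout, so CPU scheduling is the resource under pressure in these tests.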

Golden Image Configuration

Default OS optimization by Omnissa OSOT 2512
Antivirus disabled
Basic DEM configuration
No profile containers or application layering
All desktops were non-persistent and destroyed after logoff, allowing each test run to start from a clean state, ensuring that performance measurements reflected fresh provisioning and CPU topology effects rather than session accumulation or caching artifacts.

Four configurations were evaluated by adjusting the settings on the VM used for the Master Image. The settings are propagated during the provisioning process to the Instant Clones:

Scenario 1: Single vNUMA node assigned
Scenario 2: Four vNUMA nodes assigned
Scenario 3: Two vNUMA nodes assigned
Scenario 4: Automatic vNUMA assigned at power on

If a VM still fits within a single physical NUMA node, increasing the number of vNUMA nodes typically does not change the underlying physical memory locality.

Methodology

This research evaluates the performance and user experience impact of different infrastructure topologies using the Login VSI Knowledge Worker 2022 default workload. The Knowledge Worker profile simulates end-user behavior, including common office applications, web browsing, and document handling, providing a representative workload for digital workspace environments.

Each test scenario was executed with 64 concurrent sessions, following a 16-minute ramp-up period to gradually introduce user load in a controlled manner. This corresponds to a launch rate of 4 sessions per minute.

After the final session was established, a 14-minute steady-state phase was maintained to capture stable performance characteristics under consistent load. To eliminate cross-test influence and ensure that results were not affected by residual resource utilization, a 900-second idle period was enforced between consecutive runs. All topology scenarios were executed under identical configuration, workload, and timing conditions to ensure reproducibility and enable objective comparison.
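
The timing figures above can be checked with a few lines of arithmetic. This sketch assumes sessions are launched at evenly spaced intervals during ramp-up; the logoff phase is left out because its duration is not stated.

```python
SESSIONS = 64
RAMP_UP_MIN = 16
STEADY_STATE_MIN = 14
IDLE_BETWEEN_RUNS_S = 900

# Launch rate during ramp-up.
sessions_per_minute = SESSIONS / RAMP_UP_MIN

# Assumption: sessions start at evenly spaced intervals.
launch_interval_s = RAMP_UP_MIN * 60 / SESSIONS

# Time under load (ramp-up plus steady state), excluding logoff.
loaded_phase_min = RAMP_UP_MIN + STEADY_STATE_MIN

# Idle gap between two consecutive runs, in minutes.
idle_min = IDLE_BETWEEN_RUNS_S / 60

print(f"Launch rate:      {sessions_per_minute:.0f} sessions per minute")
print(f"Launch interval:  {launch_interval_s:.0f} s between sessions")
print(f"Loaded phase:     {loaded_phase_min} min")
print(f"Idle between runs: {idle_min:.0f} min")
```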

Performance and infrastructure data were collected centrally in InfluxDB. Metrics were gathered primarily using Telegraf agents. Where native metric collection was not available, custom PowerShell scripts were used to retrieve and forward the required data. This approach ensured consistent metric ingestion, standardized time-series storage, and reliable cross-scenario comparison.

Each scenario was executed seven times. The performance metrics and graphs presented in the results section show the mean across all seven runs for each scenario, which reduces run-to-run variability and highlights structural performance differences between topology configurations.
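
One plausible way to build such an averaged series is to take the mean sample by sample across runs. The sketch below illustrates that idea only; the function name `average_runs` and the sample values are made up, and the actual aggregation used for the graphs may differ.

```python
from statistics import mean


def average_runs(runs: list[list[float]]) -> list[float]:
    """Average several runs sample by sample.

    Assumes every run holds the same number of samples, taken at the same
    offsets from the start of the test.
    """
    if not runs:
        raise ValueError("at least one run is required")
    length = len(runs[0])
    if any(len(run) != length for run in runs):
        raise ValueError("all runs must contain the same number of samples")
    # zip(*runs) groups the first sample of every run, then the second, etc.
    return [mean(samples) for samples in zip(*runs)]


# Made-up samples for three short runs, for illustration only.
example_runs = [
    [1.0, 4.0, 10.0],
    [2.0, 5.0, 12.0],
    [3.0, 6.0, 11.0],
]
averaged = average_runs(example_runs)
print(averaged)
```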

Results

Active Sessions

Metric: Number of concurrently active user sessions during the test
Risk indicator: High load, potential host/resource saturation
Source: Direct count from LoginVSI test run

Active sessions followed a consistent pattern across all four scenarios during ramp-up, steady state, and cooldown. Each scenario gradually increased to the peak of ~64 sessions, remained stable during steady state, and declined simultaneously during logoff.

Scenario 2 (4 vNUMA) and Scenario 4 (auto vNUMA) reached peak sessions slightly faster, while Scenario 1 (1 vNUMA) and Scenario 3 (2 vNUMA) followed closely. These differences are minimal and within normal run variation.

At peak load, all scenarios sustained the same number of active sessions and showed identical cooldown behavior. Overall, vNUMA configuration did not result in any meaningful difference in session capacity or stability.

EUX Score

Metric: End-User Experience Score (LoginVSI)
Risk indicator: Lower score indicates degraded user experience
Source: Direct count from LoginVSI test run

All four scenarios follow a similar EUX score pattern during ramp-up, peak load, and steady state. Scores initially rise, especially during login activity, decline as load increases, and stabilize once logins are complete and the environment reaches steady state. This is expected behavior for the EUX metric, as higher load leads to more resource sharing.

Scenario 2 shows a slightly deeper dip during peak load, indicating marginally higher contention. Scenario 3 shows minor oscillations during this phase, while Scenario 1 and Scenario 4 remain closely aligned.

During steady state, all scenarios converge and stabilize around similar EUX score levels. Overall, the results show no structural EUX advantage for any vNUMA configuration, with only small transient differences during peak contention.

Additional information on the EUX Score: Login Enterprise EUX score and VSImax - Login VSI

CPU Ready

Metric: Percentage of time a VM is ready to run but waiting for physical CPU
Risk indicator: Host CPU contention or overcommitment
Source: VMware vSphere VM performance counter cpu.ready.summation, converted to percentage over the sample interval and vCPU count
Calculation: CPU Ready % = (Total CPU Ready Time ÷ (Sample Interval × vCPU count)) × 100, with ready time and sample interval expressed in the same unit (milliseconds)
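
The conversion can be written as a small helper. This is a sketch of the calculation stated above, assuming the summation is reported in milliseconds and the sample interval is given in seconds; the function name `cpu_ready_percent` and the example value are illustrative.

```python
def cpu_ready_percent(ready_ms: float, interval_s: float, vcpus: int) -> float:
    """Convert a cpu.ready.summation value (ms) to a percentage.

    ready_ms   -- summed ready time for the VM over one sample interval
    interval_s -- length of the sample interval in seconds
    vcpus      -- number of vCPUs the summation covers
    """
    if interval_s <= 0 or vcpus <= 0:
        raise ValueError("interval_s and vcpus must be positive")
    interval_ms = interval_s * 1000  # bring the interval to the same unit
    return ready_ms * 100 / (interval_ms * vcpus)


# Illustrative value: 8,000 ms of summed ready time for a 4-vCPU desktop
# during one 20-second sample.
example = cpu_ready_percent(ready_ms=8000, interval_s=20, vcpus=4)
print(f"{example:.1f}%")
```

Dividing by the vCPU count matters: without it, a 4-vCPU desktop would appear to have four times the CPU Ready percentage of a single-vCPU VM under the same contention.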

CPU Ready is one of the primary performance metrics in virtualized environments because it measures how often a virtual machine is ready to execute but is waiting for access to physical CPU resources. High CPU Ready values indicate CPU contention on the host, which can directly impact application responsiveness and end-user experience.

Monitoring this metric helps identify whether CPU scheduling delays may be impacting workload performance.

The following table summarizes typical CPU Ready percentages, observed behavior, and the expected impact on end-user experience in a VDI environment:

CPU Ready %	Observation	Impact on User Experience
0-3%	Low CPU wait time	Users experience smooth logins and responsive applications.
3-5%	Moderate CPU wait time	Minor delays may occur during sustained peak activity; brief spikes are unlikely to affect users.
5-10%	Elevated CPU wait time	Some lag may be noticeable if sustained; short peaks are generally tolerated.
>10%	High CPU wait time	Performance may be affected if sustained; isolated spikes typically have minimal impact.

Note: A single or brief spike in CPU Ready (e.g., 10%) usually does not significantly impact end-user experience. Only sustained periods of high CPU Ready indicate potential contention that could degrade responsiveness.
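
The table and the note can be combined into a small check that separates brief spikes from sustained contention. This is a sketch under stated assumptions: the band boundaries are treated as inclusive upper bounds, and `min_consecutive` is a hypothetical parameter, since the text does not define how long "sustained" is.

```python
def classify_cpu_ready(percent: float) -> str:
    """Map a CPU Ready percentage to the observation bands in the table.

    Assumption: each upper bound is inclusive, so exactly 3% counts as
    "Low" and exactly 10% as "Elevated".
    """
    if percent < 0:
        raise ValueError("percent cannot be negative")
    if percent <= 3:
        return "Low CPU wait time"
    if percent <= 5:
        return "Moderate CPU wait time"
    if percent <= 10:
        return "Elevated CPU wait time"
    return "High CPU wait time"


def is_sustained(samples: list[float], threshold: float, min_consecutive: int) -> bool:
    """Return True when min_consecutive samples in a row exceed threshold."""
    streak = 0
    for value in samples:
        streak = streak + 1 if value > threshold else 0
        if streak >= min_consecutive:
            return True
    return False


# Made-up sample series (CPU Ready %), for illustration only.
single_spike = [2.0, 11.0, 3.0, 4.0]
long_plateau = [11.0, 12.0, 11.5, 4.0]

print(classify_cpu_ready(11.0))
print(is_sustained(single_spike, threshold=10, min_consecutive=3))
print(is_sustained(long_plateau, threshold=10, min_consecutive=3))
```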

CPU Ready increased gradually during ramp-up, stabilized around 10-12% during steady state, and dropped again during cooldown across all four scenarios.

Scenario 1 and Scenario 4 trend slightly higher during peak load, while Scenario 2 remains marginally lower, indicating slightly less CPU contention. Scenario 3 follows a similar pattern to the others.

Overall, differences between scenarios are small and within normal variation, showing no structural impact of vNUMA configuration on CPU Ready behavior.

Co-Stop Summation

Metric: Time in milliseconds that a vCPU waits because other vCPUs of the same VM must be co-scheduled
Risk indicator: Structural co-scheduling delays for multi-vCPU VMs
Source: VMware vSphere VM performance counter cpu.costop.summation

Co-Stop Summation measures the total time the vCPUs of a VM are ready to execute but are delayed because they must be scheduled simultaneously. When collected, for example, with the Telegraf vSphere plugin at a 20-second interval, the values are reported in milliseconds and are summed across all vCPUs.

The following table summarizes typical Co-Stop summation values, observed behavior, and the expected impact on end-user experience in a VDI environment, based on 256 vCPUs allocated across the test scenarios.

Average Co-Stop per vCPU per 20 s (ms)	Total Co-Stop Summation for 256 vCPUs (ms)	Observation	Impact on End-User Experience
0-500	0-128,000	Minimal scheduling delays	Users experience smooth logins and responsive applications.
500-1,500	128,000-384,000	Minor delays	Occasional slight delays during peak activity, generally tolerated.
1,500-3,000	384,000-768,000	Noticeable delays	Some responsiveness issues may be observed if sustained.
>3,000	>768,000	High scheduling delays	Sustained delays can impact performance; multi-vCPU VMs may show degraded responsiveness.
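
The two numeric columns in the table are related by the vCPU count, and the per-vCPU value can also be read as a share of the 20-second sample interval. The helper names below are assumptions made for this sketch.

```python
TOTAL_VCPUS = 256     # 64 desktops x 4 vCPUs
INTERVAL_MS = 20_000  # 20-second sample interval


def costop_per_vcpu_ms(total_costop_ms: float, vcpus: int = TOTAL_VCPUS) -> float:
    """Average Co-Stop time per vCPU for one sample interval."""
    return total_costop_ms / vcpus


def costop_percent(per_vcpu_ms: float, interval_ms: int = INTERVAL_MS) -> float:
    """Share of the sample interval that a vCPU spent in Co-Stop."""
    return per_vcpu_ms * 100 / interval_ms


# Band boundaries from the table, scaled from per-vCPU values to totals.
per_vcpu_bounds_ms = [500, 1_500, 3_000]
total_bounds_ms = [bound * TOTAL_VCPUS for bound in per_vcpu_bounds_ms]
print(total_bounds_ms)

# Reading the first boundary back as a per-vCPU value and a percentage.
per_vcpu = costop_per_vcpu_ms(128_000)
print(per_vcpu, costop_percent(per_vcpu))
```

Expressed this way, the first boundary of 500 ms per vCPU corresponds to 2.5% of each 20-second interval.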

Co-Stop remained near zero during early ramp-up and increased gradually as CPU load grew. All scenarios reached their peak around the same time during maximum workload and declined quickly during cooldown.

Scenario 3 shows the highest peak Co-Stop values, while Scenario 2 consistently trends slightly lower. Scenarios 1 and 4 fall in between with similar behavior. From a total Co-Stop perspective, Scenario 2 shows the lowest average Co-Stop across all sessions, Scenario 3 the highest, and Scenarios 1 and 4 are intermediate.

CPU runtime varies across scenarios: Scenario 2 has the shortest runtime, Scenario 3 the longest, and Scenarios 1 and 4 fall in between.

The runtime gap between the fastest and slowest scenario is roughly 58 seconds (~23%).

CPU Latency

Metric: Average time a VM waits for CPU scheduling in ms
Risk indicator: Host CPU saturation or contention
Source: VMware vSphere VM performance counter cpu.latency.average

CPU Latency measures the average time a virtual machine’s vCPUs are ready to run but are delayed waiting for physical CPU resources. It is a key metric for evaluating CPU scheduling performance in virtualized environments.

CPU Latency (ms)	Observation	Impact on End-User Experience
0-5	Minimal latency	Users experience smooth logins and responsive applications.
5-10	Low latency	Minor delays may occur during peak activity; generally tolerable.
10-20	Moderate latency	Some responsiveness issues may be noticeable if sustained.
>20	High latency	Performance may be degraded; sustained high latency can affect VDI user experience.

CPU latency increased gradually during ramp-up, stabilized around 35-45 ms during steady state, and dropped sharply once the workload ended across all scenarios. Scenario 2 trends slightly lower during ramp-up and steady state, which also gives it the lowest average CPU latency, while Scenario 3 occasionally shows slightly higher peaks. Scenario 1 and Scenario 4 track closely together.

Overall, differences are minor and within normal variation, indicating no structural impact of vNUMA configuration on host CPU latency.

Conclusion and Take Away

For the tested scenario with knowledge worker desktops with 4 vCPUs:

Single, multi, and automatic vNUMA configurations demonstrated equivalent performance under standard operating conditions.
No structural co-scheduling penalties were detected.
ESXi automatic NUMA placement effectively maintains compute and memory locality.
Differences were observed only under extreme host congestion, when performance metrics were outside the defined operational thresholds, for example when EUX scores fell below the range considered acceptable (an EUX score of 7 or higher).

Adjusting CPU topology in small desktops did not produce measurable differences in EUX scores under typical workloads. Minor improvements were observed only under extreme host congestion, when EUX scores fell below acceptable operational thresholds. Within the scope of this evaluation, manual CPU topology tuning therefore does not present a practical optimization strategy.

All tested desktops fit entirely within a single physical NUMA node. From the Windows guest perspective, modifying the number of vNUMA nodes altered only the logical NUMA topology presented to the scheduler. Physical memory locality remained unchanged, but multiple vNUMA nodes create additional logical boundaries that may influence how the Windows scheduler distributes threads under conditions of high CPU load. This can explain minor variations in co-stop or scheduler latency observed under extreme host congestion.

Due to the non-persistent and floating nature of Instant Clones, combined with the random assignment of clones to sessions during the load test, these scheduler effects are subtle and highly variable at the per-VM level. Consequently, they are not directly or consistently observable in individual NUMA or CPU metrics within the guest VM and are primarily reflected through aggregated EUX scores under stressed conditions.

Key observations addressing the research questions

Does vCPU topology influence user experience in VDI environments, as measured by LoginVSI EUXscore?

For desktops with 4 vCPUs in this scenario, CPU topology did not measurably affect EUX scores or session responsiveness under typical workloads.

Does splitting vCPUs across multiple vNUMA nodes provide measurable performance improvements?

Multi-vNUMA configurations provided no measurable benefit under normal conditions. Minor improvements were observed only under extreme host congestion, when EUX scores fell below the thresholds considered acceptable; however, these improvements are not considered a reliable tuning strategy.

Does manual vNUMA configuration provide benefits compared to automatic vNUMA placement in ESXi?

Automatic NUMA placement in ESXi 8.0.3 Update 3 effectively aligns compute and memory locality for small desktops under normal load. Manual vNUMA adjustments generally did not improve EUX scores or scheduler metrics, though marginal improvements were observed under extreme CPU congestion, that is, under conditions outside typical operational thresholds.

Bottom Line

In desktop virtualization environments running knowledge worker workloads, host sizing, density planning, and load management are the primary factors influencing end-user experience, while manual CPU or vNUMA topology tuning typically provides negligible benefit. Although this study was conducted using Omnissa Horizon Instant Clones, the findings are not platform-specific and are expected to apply broadly to other desktop virtualization solutions such as Citrix Virtual Apps and Desktops and Microsoft RDS. Adjusting vNUMA configuration changes the logical NUMA topology visible to the Windows scheduler, but measurable performance effects are generally limited to extreme host congestion conditions and are primarily observable through aggregated EUX metrics.

As expected, configuring a single vNUMA node on the virtual machine produces results nearly identical to the automatic vNUMA configuration, since the automatic mode resolves to one NUMA node in this scenario.

Based on this study, maintaining CPU Ready below ~5% per VM corresponds to a stable EUX score of ~7.5, aligning with VMware best practices for predictable end-user experience in higher-density VDI environments.

Final note

In theory, certain specialized workloads, such as single-threaded, memory-intensive applications, could benefit from NUMA-aware vCPU placement; however, these scenarios were outside the scope of this research and were not evaluated.

Would you be interested in follow-up research exploring larger VDI configurations that exceed a single NUMA node, comparing scale-out versus scale-up architectures, or examining specific workloads such as single-threaded, memory-intensive applications?

Let us know!

Photo by Minku Kang on Unsplash

Edwin de Bruin

Edwin de Bruin is a Solutions Architect at ITQ, specializing in End User Computing (EUC) and Cloud solutions.
