Does ECC matter in a vGPU powered VDI environment?

ECC (Error Correcting Code) explained
Setup and scenarios
Hypothesis and results
Conclusion

ECC (Error Correcting Code) memory is a reliability feature in modern GPUs, designed to protect workloads from memory corruption and ensure accuracy. While ECC is often considered essential for workloads such as AI and scientific computing, its benefits in desktop virtualization are less clear.

ECC, which is enabled by default on modern NVIDIA datacenter GPUs, improves reliability, but it also consumes GPU memory and introduces a performance overhead. In a vGPU environment, where user density and resource efficiency are key, this trade-off becomes particularly interesting.

This research explores the performance impact of keeping ECC enabled in a vGPU desktop scenario.

ECC (Error Correcting Code) explained

ECC is a reliability feature designed to ensure that data stored in GPU memory remains accurate, even when hardware errors occur.

GPUs store large amounts of information in memory. While this memory is highly reliable, it is not perfect. Occasionally, due to factors such as electrical interference, cosmic radiation, or hardware imperfections, information stored in memory can change unexpectedly. Without ECC, the GPU has no way of knowing this has happened. It simply continues operating under the assumption that the data is correct, even though it has been altered at the bit level.

Depending on the workload, this can lead to unexpected behavior. In gaming, for example, it may result in visual artifacts, which are often so minor that users barely notice them. In AI and scientific computing workloads, however, even a single incorrect bit can affect the outcome of complex calculations. The error may not cause an application crash, but it could produce an incorrect result without anyone realizing it.

ECC addresses this problem by storing additional information alongside the actual data. Whenever data is written to memory, the GPU calculates a set of check bits based on that data. When the data is read back, those checks are recalculated and compared against the stored values. If a single bit has changed unexpectedly, the ECC logic can identify exactly which bit is incorrect and automatically correct it before the data is returned to the application. This process happens entirely on the GPU and is invisible to both the operating system and the application.

The reason ECC became important is that modern workloads often run for days or even weeks. Imagine training a large AI model for two weeks, only to discover at the end that a memory corruption occurred halfway through the process. In that scenario, the cost of restarting the workload is far greater than the small performance penalty introduced by ECC. For this reason, modern workstation and datacenter GPUs typically run with ECC enabled by default.

This raises an important question: if ECC improves reliability, why would anyone disable it in a VDI environment?

When ECC is enabled, a portion of the physical GPU memory is reserved for ECC metadata. The exact amount varies by GPU architecture, but it is typically around 12.5% of the available memory. On a GPU with 48 GB of memory, enabling ECC can reduce the usable memory by several gigabytes.
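As a rough illustration of that arithmetic, the sketch below estimates how much memory an ECC reservation takes. The 12.5% value is simply the typical figure quoted above and is passed in as a parameter, since the actual reservation on a given GPU can differ. The function names `ecc_reserved_gb` and `ecc_usable_gb` are purely illustrative and are not part of any NVIDIA tooling.

```shell
# Illustrative helper: estimate the memory reserved for ECC.
# $1 = total GPU memory in GB
# $2 = ECC reservation in percent (an assumption, varies by GPU architecture)
ecc_reserved_gb() {
    awk -v total="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", total * pct / 100 }'
}

# Illustrative helper: memory left over after the ECC reservation.
ecc_usable_gb() {
    awk -v total="$1" -v pct="$2" 'BEGIN { printf "%.1f\n", total * (1 - pct / 100) }'
}

ecc_reserved_gb 48 12.5   # prints 6.0
ecc_usable_gb 48 12.5     # prints 42.0
```

Passing the percentage as an argument keeps the sketch usable for architectures where the reservation is smaller or larger than the figure used here.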

It is important to understand that ECC does not directly impact the number of virtual desktops that can be deployed, as vGPU allocations are based on fixed profile sizes. However, within the virtual desktop itself, less framebuffer memory is available when ECC is enabled. While this may not affect standard office workloads, it could become relevant for graphics-intensive applications or users who operate close to the limits of their assigned vGPU profile.

Note: GPUs equipped with HBM2 or HBM3 implement ECC differently from GPUs using GDDR memory. Rather than storing ECC information within the framebuffer, HBM provides dedicated storage for ECC within the memory package. As a result, enabling ECC has little to no performance impact and does not reduce the available framebuffer capacity. This behavior does not apply to the GDDR6 memory used by the NVIDIA RTX 6000 Ada GPU evaluated in this research.

In a VDI environment, GPU memory is often the most valuable resource and the primary limiting factor when it comes to scalability. Most GPU-based EUC workloads are scaled on GPU memory: depending on the vGPU profile assigned to a virtual desktop, a portion of the GPU memory is allocated to that desktop as framebuffer memory.

To verify whether ECC is enabled, run the following nvidia-smi command directly on the hypervisor:

nvidia-smi -q | grep -i "ECC Mode" -A2

To disable ECC, run the following command:

nvidia-smi -e 0

If there are multiple GPUs installed in the server, you can disable ECC on a specific GPU by using the -i parameter followed by the GPU ID. This allows you to target an individual GPU without affecting the ECC configuration of the other GPUs in the system.
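A minimal sketch of targeting a single GPU is shown below. The underlying call is simply nvidia-smi with -i followed by the GPU ID and -e followed by 0 or 1, and nvidia-smi -L lists the GPU IDs present in the host. The wrapper name `set_ecc_mode` and the `NVIDIA_SMI` override variable are assumptions made for illustration only.

```shell
# NVIDIA_SMI is a hypothetical override so the command can be swapped out for a dry run.
NVIDIA_SMI="${NVIDIA_SMI:-nvidia-smi}"

# Hypothetical wrapper: set the ECC mode (0 = disabled, 1 = enabled) on one GPU.
# $1 = GPU ID as shown by `nvidia-smi -L`, $2 = desired ECC mode
set_ecc_mode() {
    gpu_id="$1"
    mode="$2"
    case "$mode" in
        0|1) ;;
        *) echo "mode must be 0 or 1" >&2; return 2 ;;
    esac
    "$NVIDIA_SMI" -i "$gpu_id" -e "$mode"
}

# Example (not executed here): disable ECC on GPU 0 only.
# set_ecc_mode 0 0
```

The other GPUs in the host keep their existing ECC configuration, because only the GPU passed with -i is addressed.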

To re-enable ECC, simply execute the following command:

nvidia-smi -e 1

Please note that changing the ECC configuration requires a reboot of the hypervisor before the new setting becomes active. More information can be found on the NVIDIA website.
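Because the new setting only becomes active after a reboot, it can help to compare the current and the pending ECC mode per GPU. The sketch below assumes the `ecc.mode.current` and `ecc.mode.pending` fields of `nvidia-smi --query-gpu`. The function name `ecc_reboot_needed` is illustrative, and it reads the CSV output from standard input so it can also be tried on saved output.

```shell
# Reads lines of "index, current, pending" as produced by:
#   nvidia-smi --query-gpu=index,ecc.mode.current,ecc.mode.pending --format=csv,noheader
# and prints every GPU whose pending ECC mode differs from the current one.
ecc_reboot_needed() {
    awk -F', *' '$2 != $3 { printf "GPU %s: %s -> %s (reboot required)\n", $1, $2, $3 }'
}

# Example with saved output (the values are made up for illustration):
printf '0, Enabled, Disabled\n1, Enabled, Enabled\n' | ecc_reboot_needed
# prints: GPU 0: Enabled -> Disabled (reboot required)
```

On a live host, the query command from the comment can be piped straight into the function. No output means no ECC change is waiting for a reboot.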

Setup and scenarios

To test the hypothesis, the default GO-EUC lab environment is used to conduct the research. A total of 10 virtual desktops are provisioned, each configured with 8 vCPUs, 32 GB of memory, and an NVIDIA 4Q vGPU profile hosted on an NVIDIA RTX 6000 Ada Lovelace GPU. Each virtual desktop runs Windows 11 26H1 and is optimized using the Citrix Optimizer. The environment is configured to provide the highest possible vGPU density while ensuring the GPU is sufficiently utilized during the benchmarks.

To ensure consistent and reliable results, Microsoft Defender is completely disabled so it does not influence the benchmark results. Please note that this is not considered a best practice in a production environment, but it is a common approach in benchmarking scenarios where the goal is to isolate and measure the impact of a specific component or configuration change.

The following two scenarios are part of this research:

ECC enabled
ECC disabled

Apart from the ECC configuration, all hardware, software, benchmark settings, and virtual machine configurations remain identical throughout the research.

To get a complete picture of the impact of ECC memory on the GPU, two benchmark solutions are used: SPECviewperf and OBUX.

SPECviewperf is an industry-standard benchmark suite focused on measuring 3D graphics performance and is ideal for evaluating the impact when the GPU is heavily utilized. SPECviewperf includes multiple workload models, and for this research, the Creo workload is used. The Creo test in SPECviewperf is a “viewset”: a predefined workload derived from the behavior of the PTC Creo CAD application. SPECviewperf does not actually run Creo itself, but replays recorded GPU workloads from it. In effect, it simulates the viewport behavior of Creo Parametric using recorded traces. The Creo test was selected because it continuously stresses the GPU, making it well suited to evaluate the performance impact of ECC.

To measure the impact in a more typical knowledge worker environment, OBUX is also used. OBUX is designed to measure both system performance and user experience by evaluating compute performance and application responsiveness.

During each benchmark run, telemetry is collected from both the hypervisor and the virtual machines. The collected metrics include CPU utilization, GPU utilization, framebuffer usage, reserved vGPU memory, benchmark execution time, and, for OBUX, the UX Score and System Score.
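The telemetry tooling used in this research is not described here, so the following is only a sketch of how comparable host GPU metrics could be sampled and summarized with nvidia-smi. The sampling command in the comment, the log file name, and the helper name `summarize_gpu_log` are all assumptions.

```shell
# Assumed sampling command (not necessarily the tooling used in this research):
#   nvidia-smi --query-gpu=timestamp,utilization.gpu,memory.used \
#              --format=csv,noheader,nounits -l 5 > gpu.csv
# Each line then holds: timestamp, GPU utilization in %, memory used in MiB.

# Illustrative helper: average GPU utilization and peak memory used from such a log.
summarize_gpu_log() {
    awk -F', *' '
        { sum += $2; if ($3 > peak) peak = $3; n++ }
        END { if (n) printf "avg_util=%.1f peak_mem_mib=%d\n", sum / n, peak }
    '
}

# Example with three made-up samples:
printf '%s\n' 't1, 90, 1000' 't2, 100, 3000' 't3, 95, 2000' | summarize_gpu_log
# prints: avg_util=95.0 peak_mem_mib=3000
```

Summarizing each run the same way makes it easier to compare the ECC enabled and ECC disabled scenarios on equal terms.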

Each benchmark is executed ten times for both ECC configurations, resulting in a total of 40 benchmark runs. Running the benchmarks multiple times reduces the influence of run-to-run variation and increases confidence in the observed results.

Hypothesis and results

Nutanix has previously evaluated the impact of enabling ECC memory and reported that it can introduce an approximate 30% performance overhead for graphics-intensive workloads.

Based on this observation, a performance improvement of roughly the same magnitude could reasonably be expected in SPECviewperf results when ECC is disabled.

However, these findings are specifically tied to graphics-heavy scenarios and do not explicitly address the impact on typical knowledge worker environments. As a result, it remains unclear whether disabling ECC provides any meaningful benefit for everyday desktop workloads. This makes it particularly valuable to compare the SPECviewperf results with the OBUX results, as it helps determine whether the performance impact is confined to GPU-intensive workloads or also extends to more traditional virtual desktop use cases.

Let’s start with the SPECviewperf performance data from a hypervisor perspective.

When analyzing CPU utilization from the hypervisor perspective, there are a couple of interesting observations. First of all, the host is not fully utilized during the benchmark, which indicates that CPU capacity is not the bottleneck in this test. Secondly, the CPU difference is relatively small on average, but the total execution time differs significantly. ECC enabled (orange) and ECC disabled (blue) follow a very similar trend and execution pattern, but with ECC disabled, the benchmark completes in approximately 29% less time, despite only minor differences in CPU utilization.

This means that the improvement in execution time is not driven by CPU, but by faster progress through the workload (GPU-side efficiency).

To quantify this difference: with ECC enabled, which is the default configuration, the benchmark takes approximately eight minutes longer to complete. Disabling ECC therefore reduces the execution time by approximately 29%.

A 29% reduction in execution time is large enough to be considered a significant performance improvement for this particular graphics-intensive workload.

This also aligns closely with the Nutanix results, which reported an approximate 30% impact for graphics-intensive workloads.
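The two rounded figures above, roughly eight minutes and roughly 29%, can be related with some simple arithmetic. The sketch below is only an estimate derived from those rounded values, not measured data, and the helper names are illustrative. It also shows that a percentage reads differently depending on which run is taken as the baseline: a 29% shorter run with ECC disabled corresponds to the ECC enabled run taking about 41% longer than the ECC disabled run.

```shell
# Illustrative helper: given the reduction in execution time (percent) from disabling ECC,
# print how much longer the ECC enabled run takes relative to the ECC disabled run.
reduction_to_overhead_pct() {
    awk -v r="$1" 'BEGIN { printf "%.1f\n", (1 / (1 - r / 100) - 1) * 100 }'
}

# Illustrative helper: estimate both run times in minutes from the absolute
# time saved ($1, minutes) and the percentage reduction ($2).
estimate_runtimes_min() {
    awk -v saved="$1" -v r="$2" 'BEGIN {
        enabled = saved / (r / 100)
        printf "enabled=%.1f disabled=%.1f\n", enabled, enabled - saved
    }'
}

reduction_to_overhead_pct 29   # prints 40.8
estimate_runtimes_min 8 29     # prints enabled=27.6 disabled=19.6
```

When comparing these numbers with figures from other sources, it is worth checking whether a reported overhead is expressed relative to the ECC enabled or the ECC disabled run.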

The host GPU utilization chart shows both configurations ramping up to very high GPU utilization after the initial load phase, starting from the 3-minute mark. During the main active section, ECC enabled and ECC disabled both sit mostly in the ~90–98% range, which indicates the SPECviewperf workload is GPU-driven in both cases.

Looking at GPU utilization from the hypervisor perspective, it becomes clear that the GPU is the bottleneck during this benchmark. Although the average GPU utilization is slightly lower when ECC is disabled, this is mainly because the benchmark completes much faster. During the rendering phases, the GPU repeatedly reaches 100% utilization in both configurations.

ECC enabled appears to keep the GPU busy for longer, but that extra GPU activity does not translate into faster completion. That is a sign of lower effective throughput: the GPU is active, but the workload progresses more slowly.

When switching to the perspective of the virtual machine, the CPU utilization closely matches the host CPU utilization, which is expected.

The vGPU utilization is much lower inside the virtual machine because the physical GPU resources are shared across ten virtual machines simultaneously. As a result, the maximum vGPU utilization reaches only around 20% during the benchmark.

While utilization is an important factor, the most interesting metric in this comparison is the framebuffer, since ECC is a memory feature.

Overall, the framebuffer usage follows the same pattern as the previous datasets. The main observation is that disabling ECC makes more framebuffer available to the virtual machine, with a maximum difference of approximately 225 MB. ECC enabled consumes more average framebuffer while also taking longer to complete the same benchmark. ECC disabled completes the workload faster with lower average framebuffer usage and more usable framebuffer headroom.

The vGPU reserved memory provides additional insight into this difference. This is memory reserved by the NVIDIA driver that can be quickly reused when required. When ECC is disabled, this reserved memory footprint is consistently lower, explaining why more framebuffer is available to the virtual machine.

For the SPECviewperf workload, ECC disabled completes the same workload substantially faster, with lower CPU usage, lower average GPU utilization, lower average framebuffer usage, and more framebuffer available to the VM.

Now let’s switch to OBUX.

As OBUX is a workload focused on the average knowledge worker profile, the results of both configurations are very similar, showing almost no difference in either the UX Score or the System Score. This means there is no noticeable difference from an end-user perspective.

The OBUX benchmark is designed to ensure consistent test runs, so there is effectively no difference in the duration of the benchmark, unlike what was observed in SPECviewperf.

The system calculations in the OBUX benchmark are CPU-intensive and are designed to measure computational performance. This is clearly visible on the hypervisor, although the CPU is still not fully utilized while running ten virtual machines simultaneously.

There are two specific parts in the workload that make more use of the GPU, namely video playback and a small graphical pixel application. These workloads are not designed to fully stress the GPU, which is clearly reflected in the overall GPU utilization.

From the virtual machine perspective, the CPU utilization is slightly higher when ECC is disabled. However, the average difference is only around 0.4%, making it negligible in practice.

As expected, OBUX is not a graphics benchmark, which is clearly visible in the vGPU utilization from the virtual machine perspective.

The framebuffer usage is also very similar between both configurations and remains well below the allocated limit, confirming that disabling ECC provides little to no measurable benefit for a typical knowledge worker workload.

When comparing the results from both types of workloads, it is clear that the ECC impact is very workload-dependent. It is clearly visible in SPECviewperf Creo, but not significantly visible in the OBUX results.

Conclusion

There is a reason why ECC was introduced in modern GPUs, and it is important to understand the purpose and benefits of this reliability feature. As explained in the introduction, keeping ECC enabled makes perfect sense for AI, machine learning, and scientific computing workloads, where data integrity is critical and a single memory error can impact the final result. However, in the context of desktop virtualization, the benefits of ECC are more nuanced, especially for workloads where graphics performance and available framebuffer are more important than the additional reliability ECC provides.

Comparing both benchmarks demonstrates that the impact of ECC is highly workload-dependent and that its benefits must be weighed against its performance and capacity implications.

The SPECviewperf results highlight this distinction clearly. As a GPU-intensive workload that continuously stresses the graphics pipeline, SPECviewperf completed in nearly 30% less time with ECC disabled. This closely aligns with the statement from Nutanix that ECC can have a significant performance impact on graphics-intensive workloads. In addition to the reduction in execution time, disabling ECC also made more framebuffer available to the virtual machines.

In contrast to the GPU-intensive workload, OBUX showed virtually identical results from a user perspective when comparing ECC enabled to ECC disabled. For this workload, which is more representative of knowledge workers, there was no meaningful difference in the UX Score, System Score, CPU utilization, GPU utilization, framebuffer usage, or the overall duration of the benchmark. These results indicate that the performance overhead introduced by ECC only becomes noticeable when the GPU is heavily utilized, while a typical knowledge worker workload remains largely unaffected.

That leaves us with an interesting follow-up question. If disabling ECC can result in performance gains of up to ~30% for graphics-intensive workloads, can that performance uplift be translated into higher user density or more efficient vGPU profile sizing?

In practice, the answer will often be yes, depending on the workload characteristics. The additional framebuffer capacity recovered by disabling ECC, combined with improved GPU processing efficiency, may allow environments to:

Support more users per host due to reduced GPU contention
Use smaller vGPU profiles without compromising user experience
Increase overall workload consolidation efficiency

However, this is not a universal outcome. For knowledge-worker workloads, as demonstrated by OBUX, ECC has minimal impact, meaning density gains from disabling ECC are unlikely to be significant in those scenarios. The true benefit lies primarily in environments where GPU and framebuffer are already limiting factors.

Based on our findings, disabling ECC is a valid optimization strategy for EUC and virtual desktop environments where performance, scalability, and framebuffer capacity are prioritized, particularly for graphics-intensive use cases. At the same time, for environments running workloads that rely on strict data integrity, such as AI, scientific computing, or regulated industries, keeping ECC enabled remains the recommended and safest configuration.

The key takeaway is that disabling ECC won’t magically improve every VDI workload, but in the right scenarios, it can be the difference between “just working” and “optimally scaled.”

Question to our readers: If ECC disabled can unlock both performance and additional framebuffer, could it also allow you to run smaller vGPU profiles or increase user density on your hosts?

Photo by Li Zhang on Unsplash

Ryan Ververs-Bijkerk

Ryan is a self-employed technologist at GO-INIT, specializing in EUC (End-User Computing) and code. His primary focus is optimizing the user experience in centralized desktop environments.

Eltjo van Gulik

Eltjo van Gulik is the Principal Product Manager for HDX Graphics & Seamless at Citrix and one of the co-founders of GO-EUC. Social, a tad pedantic and suitably lazy.
