Evaluating the visual quality and colour differences of fullscreen H.265 Video compression with YUV 4:2:0 and YUV 4:4:4 in Citrix HDX

Chroma subsampling is a type of colour compression that reduces the colour information in an image or video in favour of luminance data. The goal of chroma subsampling is to reduce the image size and bandwidth without significantly affecting image quality.

Remoting protocols typically offer either the highest image quality with YUV 4:4:4, or reduced bandwidth and resource usage at a lower (visual) image quality with YUV 4:2:0. Citrix HDX, for example, labels YUV 4:4:4 as visually lossless, while Omnissa Blast calls this high-colour accuracy mode.

This follow-up to the previous article investigates the differences between YUV 4:2:0 and YUV 4:4:4 when used with Citrix Virtual Apps and Desktops with H.265 as the video codec. It uses quantitative image-quality metrics and introduces new perceived colour-difference calculations to determine how each format performs in terms of perceived image quality and colour accuracy.

Background

YUV 4:4:4 applies no colour compression, is often denoted as non-subsampled, and retains both the luminance and colour data in full. In a four-by-two array of pixels, YUV 4:2:0 retains a quarter of the colour information.

More information and background on chroma subsampling, YUV 4:2:0 and YUV 4:4:4 is available in the previous article.

The primary goal of this research is to evaluate the impact of the different colour compression methods on the visual quality and colour accuracy of full-screen H.265-compressed sessions.

The colour space determines whether and which colour compression mode (also known as chroma subsampling) HDX Graphics uses when a video codec is active. By default, HDX Graphics uses YUV 4:2:0 chroma subsampling.

Chroma subsampling is a technique used to reduce the bandwidth needed to transmit video data from the VDA to the client. It works by reducing the amount of colour information that is transmitted, since the human eye is more sensitive to changes in brightness (luminance) than to changes in colour (chrominance).

Different chroma subsampling schemes exist, denoted by ratios like 4:2:0 and 4:4:4. These ratios represent the amount of chrominance information sampled relative to the luminance information:

  • 4:4:4 - No chroma subsampling. All colour information is retained. This provides the highest image quality and colour accuracy, but requires the most bandwidth.
  • 4:2:0 - Chrominance information is subsampled both horizontally and vertically by a factor of 2. This offers the greatest bandwidth reduction but may result in noticeable colour artifacts.
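To make the ratio concrete, the raw (pre-codec) size of a single 8-bit frame can be computed for both schemes; this is a minimal illustrative sketch, not a measurement of actual HDX traffic:

```python
def raw_frame_bytes(width, height, subsampling):
    """Raw 8-bit frame size in bytes for a given chroma subsampling scheme."""
    luma = width * height              # one luma (Y) sample per pixel
    if subsampling == "4:4:4":
        chroma = 2 * width * height    # full-resolution U and V planes
    elif subsampling == "4:2:0":
        # U and V at half resolution both horizontally and vertically
        chroma = 2 * (width // 2) * (height // 2)
    else:
        raise ValueError(subsampling)
    return luma + chroma

# A 1920x1080 frame: 4:4:4 carries twice the raw data of 4:2:0
print(raw_frame_bytes(1920, 1080, "4:4:4"))  # 6220800 (3 bytes per pixel)
print(raw_frame_bytes(1920, 1080, "4:2:0"))  # 3110400 (1.5 bytes per pixel)
```

The actual bandwidth saving after H.265 encoding is smaller than this 2:1 raw ratio, since the codec compresses both variants further.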

Citrix HDX graphics utilizes chroma subsampling to optimize bandwidth usage in virtual desktop environments. By default, HDX Graphics uses YUV 4:2:0 for encoding screen content. This provides a good balance between image quality and bandwidth efficiency for typical office productivity applications and the majority of use cases.

Scope and goal

The goal of the research is to determine the quantitative image quality and colour difference between YUV 4:2:0 and YUV 4:4:4.

The scope of this research is deliberately kept as narrow as possible. Only image quality and colour accuracy are analysed. Bandwidth usage, CPU and GPU utilisation, frame pacing, and session interactivity are not measured. The workload is video only: no text‑heavy or productivity applications are included.

In addition, the remote session is configured to use a full‑screen video codec for the entire screen, which is different from the default selective encoding behaviour of Citrix HDX graphics.

The goal is to keep the environment as narrow as possible, isolating the differences between YUV 4:4:4 and YUV 4:2:0 in a controlled and reproducible manner.

Please note that this does not necessarily reflect real-world workloads or scenarios; this scope and setup were deliberately chosen with the sole aim of measuring the effects and collecting data in a controlled testing environment. Please don’t treat this setup as a recommendation for real-world scenarios.

Setup & methodology

The methodology used for this research is to use video content playing inside the VDA, record the Citrix session locally on the endpoint for the two different colour compression settings, and compare the results to a local recording of the same content as the baseline.

The test environment is designed to isolate the behaviour of YUV 4:4:4 compared to YUV 4:2:0 while keeping the scenarios consistent and reproducible.

The video codec for display remoting is configured to use H.265 (HEVC). H.265 is a good balance between performance and quality and requires a server-side GPU for encoding. To ensure that the video codec is applied uniformly and the scenario is reproducible, the policy “Use video codec for compression” is set to “For the entire screen.” This forces full‑screen encoding of the desktop and avoids selective encoding, where only certain regions are compressed with the video codec while the rest uses a different mechanism.

At the time of testing, Citrix’s visually lossless mode, which corresponds to a 4:4:4 chroma representation, only operates in combination with full‑screen encoding.

It is important to note that, as a result, the configuration used in this research does not reflect the default HDX behaviour; rather, it is an intentional decision to simplify the comparison and to ensure that both chroma formats are evaluated under identical conditions.

For the comparison two scenarios were defined:

  • Scenario 1 uses H.265 with YUV 4:2:0 chroma subsampling (all policies are kept at the default).
  • Scenario 2 uses H.265 with YUV 4:4:4, corresponding to Citrix’s visually lossless mode.

In both scenarios, all other settings were kept identical, and the only variable was the chroma subsampling configuration. Endpoint capabilities, such as GPU offload, were kept consistent between tests.

All tests were conducted at least three times, and for each comparison, the best result was selected. The data was extremely consistent across the three runs and therefore did not warrant extending the number of runs.

VDA configuration

For the VDA, a single VM with 4 vCPUs (Intel Xeon Gold 6346 CPU with a 3.10 GHz base clock speed) and 8 GB of memory running Windows 11 Enterprise 24H2 was used.

On the VDA, the GPU used was an Intel Data Center Flex 170 GPU with 2GB of framebuffer memory.

The VDA used Citrix Virtual Apps and Desktops version 2503 installed with the default settings, and apart from the Graphics policies used to switch between YUV 4:2:0 and YUV 4:4:4, no further customisations were applied.

Endpoint configuration

The endpoint is a Windows 11 24H2 workstation equipped with a 13th-generation Intel i9 processor and a discrete NVIDIA GeForce RTX 4080 GPU, running Citrix Workspace App for Windows version 2503.1, with all recordings performed using the RTX 4080.

Workload

The workload used in this research is the “Costa Rica 4K” YouTube video, a widely known demonstration clip containing high‑quality natural video footage. This content is chosen for its combination of visual features that stress both the codec and the chroma representation. The resolution and video quality were set to 4K 60 FPS in the YouTube web player.

For this research, OBS (Open Broadcaster Software) was used with specific settings to ensure high-quality video capture of the Citrix sessions. The recordings were made using H.265 Lossless compression with NVIDIA NVENC on an RTX 4080 and a 13th generation Intel Core i9 processor.

These settings were chosen to balance performance and quality, ensuring that the recorded videos had minimal compression artefacts and the most accurate colour representation.

These settings leverage the encoding capabilities of the NVIDIA RTX 4080 and the processing power of the Intel Core i9. Using H.265 Lossless ensures that the video quality is preserved without introducing compression artefacts. The I444 colour format captures full colour information, which is critical for accurately assessing colour differences between YUV 4:2:0 and YUV 4:4:4.

From the baseline recording, a contiguous five‑second segment is selected. At a frame rate of approximately 30 frames per second, this segment contains 165 frames. These 165 frames form the reference baseline for all of the comparisons. It is long enough to contain diverse content (different textures and lighting conditions) and still remains manageable for detailed frame‑level analysis.

For each scenario, all frames are extracted and saved to file in lossless PNG format using FFmpeg. Because the scenarios are not time-synchronised, they typically contain more than 165 frames over the same approximate 5-second interval. Therefore, the frames are not considered frame-synchronous with the baseline.
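The extraction step can be sketched as a small helper that builds the FFmpeg command line; the file names, segment offset, and output pattern below are illustrative, not the exact values used in the research:

```python
def extract_frames(video_path, out_dir, start="00:00:00", duration="5"):
    """Build an FFmpeg command that extracts a segment as lossless PNG frames."""
    return [
        "ffmpeg",
        "-ss", start,                  # seek to the start of the segment
        "-t", duration,                # limit extraction to the segment length
        "-i", video_path,              # input recording
        f"{out_dir}/frame_%05d.png",   # one zero-padded, lossless PNG per frame
    ]

# Run the command with subprocess.run(cmd, check=True)
cmd = extract_frames("baseline.mkv", "frames/baseline")
print(cmd)
```

PNG output keeps the comparison free of any additional lossy compression step between the recording and the metric computation.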

This means that directly comparing “frame N” of the baseline with “frame N” of one of the scenarios is not possible with this setup.

Even a single dropped or duplicated frame breaks the one‑to‑one correspondence across an entire segment.

To solve this, a content‑based frame matching approach is used. For each of the 165 baseline frames, every frame in a run is compared and the best match is selected based on Structural Similarity Index (SSIM). SSIM is a perceptual metric that compares two images in terms of luminance, contrast, and structural information. For a single baseline frame, the frame of the comparison that shows the highest SSIM value compared to the baseline is assumed to be the closest visual match. The metrics for that matched pair are then computed and stored.

This procedure is applied to all runs for both YUV 4:2:0 and YUV 4:4:4.

Each baseline frame therefore has two associated frame matches: one best‑match frame from the 4:2:0 session and one from the 4:4:4 session.

Because the best‑matching frame index for a given baseline frame is not necessarily the same between the two chroma configurations, the resulting sets of matched pairs are treated as independent samples for a pairwise comparison between YUV 4:2:0 and YUV 4:4:4.

This method serves two purposes: it compensates for desynchronisation in a principled way, and it ensures that the quality metrics are computed on frame pairs that are as similar as possible in content rather than in timestamp or frame numbers.
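The matching loop itself is simple; below is a sketch with the similarity metric passed in as a function (in the actual pipeline this would be an SSIM implementation such as skimage.metrics.structural_similarity; the toy metric and frame values are illustrative only):

```python
def best_match(baseline_frame, candidate_frames, similarity):
    """Return the index and score of the candidate most similar to the baseline frame."""
    scores = [similarity(baseline_frame, f) for f in candidate_frames]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]

# Toy stand-in metric for the sketch: negative mean squared error
# over flattened pixel lists (higher is better, like SSIM)
def neg_mse(a, b):
    return -sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

baseline = [10, 20, 30, 40]
candidates = [[0, 0, 0, 0], [11, 19, 31, 39], [10, 20, 30, 40]]
print(best_match(baseline, candidates, neg_mse))  # (2, 0.0): the exact match wins
```

In the research, this selection runs once per baseline frame against every frame of a scenario recording, so the matched pairs are content-aligned rather than timestamp-aligned.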

Metrics

To determine the perceived visual image quality, SSIM (Structural Similarity Index) and PSNR (Peak Signal-to-Noise Ratio) were used, as in previous research involving perceived image quality.
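As a reminder of what PSNR measures: for 8-bit content it reduces to a simple function of the mean squared error between two frames. A minimal pure-Python sketch over flattened pixel lists (a real pipeline would operate on full image arrays):

```python
import math

def psnr(frame_a, frame_b, max_value=255):
    """Peak Signal-to-Noise Ratio in dB between two equally sized 8-bit frames."""
    mse = sum((a - b) ** 2 for a, b in zip(frame_a, frame_b)) / len(frame_a)
    if mse == 0:
        return math.inf  # identical frames: no noise at all
    return 10 * math.log10(max_value ** 2 / mse)

# Two flat frames differing by 16 grey levels everywhere: MSE = 256
print(round(psnr([100] * 64, [116] * 64), 2))  # 24.05
```

Because every pixel channel contributes equally to the MSE, PSNR is dominated by luminance error, which is why it understates chroma-only differences later in the results.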

In addition to the perceived visual image quality, the Delta E metric was used to determine colour accuracy and colour differences.

While there is an expected difference in bandwidth usage, the goal of this research is limited to image quality and colour differences; bandwidth, CPU and GPU utilisation were therefore not measured, as described in the scope above.

Delta E (colour difference)

Delta E measures the colour difference or distance between two images. Lower Delta E values will indicate better colour accuracy.

For the Delta E calculations, the CIEDE2000 formula was chosen over the standard CIE76 metric.

The original CIE76 formula calculates the colour difference as a simple Euclidean distance in three-dimensional space between two colours in the CIELAB colour space. Those of you with high school mathematics should be familiar with the distance vector.

CIE76, while simple and quick to calculate, suffers from the fact that it does not accurately reflect how the human eye perceives colour differences.
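The CIE76 distance described above is simply the Euclidean distance between two L*a*b* triplets; a minimal sketch:

```python
import math

def delta_e_cie76(lab1, lab2):
    """CIE76 colour difference: Euclidean distance in CIELAB space."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(lab1, lab2)))

# A 3-4-5 right triangle in the a*/b* plane at equal lightness
print(delta_e_cie76((50, 0, 0), (50, 3, 4)))  # 5.0
```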

CIEDE2000 is a more complex calculation designed to address this shortcoming. It refines the CIE76 calculation by adding weighting factors for luminance, chroma, and hue that adjust the formula based on where the colours are in the CIELAB colour space. This provides a result that aligns much more closely with how our eyes perceive colour.

The Delta E values can be interpreted as follows:

  • ΔE < 1: Imperceptible colour difference.
  • ΔE 1 - 2: Perceptible only to a trained eye.
  • ΔE 2 - 10: Perceptible at a glance.
  • ΔE 11 - 49: Noticeable and significant colour difference.
  • ΔE ≥ 50: Completely different colours.
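This interpretation table maps directly onto a small helper function; the labels follow the list above:

```python
def interpret_delta_e(de):
    """Map a ΔE value to the perceptual categories listed above."""
    if de < 1:
        return "imperceptible"
    if de < 2:
        return "perceptible to a trained eye"
    if de < 11:
        return "perceptible at a glance"
    if de < 50:
        return "noticeable and significant"
    return "completely different colours"

print(interpret_delta_e(0.58))  # imperceptible (the mean ΔE measured for 4:2:0 later in this article)
```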

Implementation of Delta E Calculation

During this research, we modified the existing image quality code pipeline to include Delta E calculations. The initial setup for SSIM and PSNR was designed to handle pairwise comparisons of video frames, which was expanded to incorporate colour accuracy metrics for each frame pair.

To calculate Delta E, we first converted the RGB colour space of the video frames to the CIE Lab colour space, which is designed to approximate human vision and is a requirement for the Delta E calculations.
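That conversion can be sketched per pixel in pure Python, assuming sRGB input and a D65 white point (a real pipeline would use a vectorised library routine such as skimage.color.rgb2lab):

```python
def srgb_to_lab(r, g, b):
    """Convert one 8-bit sRGB pixel to CIELAB (D65 white point)."""
    # 1. Undo the sRGB gamma curve to get linear light
    def linear(c):
        c /= 255.0
        return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4
    r, g, b = linear(r), linear(g), linear(b)
    # 2. Linear RGB -> CIE XYZ (sRGB matrix, D65)
    x = 0.4124 * r + 0.3576 * g + 0.1805 * b
    y = 0.2126 * r + 0.7152 * g + 0.0722 * b
    z = 0.0193 * r + 0.1192 * g + 0.9505 * b
    # 3. XYZ -> Lab, normalised to the D65 reference white
    def f(t):
        return t ** (1 / 3) if t > (6 / 29) ** 3 else t / (3 * (6 / 29) ** 2) + 4 / 29
    fx, fy, fz = f(x / 0.95047), f(y / 1.0), f(z / 1.08883)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)

L, a, b = srgb_to_lab(255, 255, 255)  # reference white: L* near 100, a* and b* near 0
```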

For the Delta E calculations themselves, the CIEDE2000 formula was used, which provides a more accurate representation of colour differences by taking into account perceptual non-uniformities, unlike the simpler CIE76 method.

The general form of the CIEDE2000 formula used is as follows: \(\Delta E_{00}^{*}=\sqrt{ \left(\frac{\Delta L'}{k_{L}S_{L}}\right)^{2}+ \left(\frac{\Delta C'}{k_{C}S_{C}}\right)^{2}+ \left(\frac{\Delta H'}{k_{H}S_{H}}\right)^{2}+ R_{T}\left(\frac{\Delta C'}{k_{C}S_{C}}\right)\left(\frac{\Delta H'}{k_{H}S_{H}}\right) }\)
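For illustration, the formula can be transcribed directly into pure Python, following the widely used Sharma et al. reference formulation; the research pipeline itself used a library implementation, so this sketch is for clarity only:

```python
import math

def ciede2000(lab1, lab2, kL=1.0, kC=1.0, kH=1.0):
    """CIEDE2000 colour difference between two CIELAB triplets."""
    L1, a1, b1 = lab1
    L2, a2, b2 = lab2
    C1, C2 = math.hypot(a1, b1), math.hypot(a2, b2)
    Cbar = (C1 + C2) / 2
    # Compensate the a* axis for low-chroma (near-neutral) colours
    G = 0.5 * (1 - math.sqrt(Cbar ** 7 / (Cbar ** 7 + 25 ** 7)))
    a1p, a2p = (1 + G) * a1, (1 + G) * a2
    C1p, C2p = math.hypot(a1p, b1), math.hypot(a2p, b2)
    h1p = math.degrees(math.atan2(b1, a1p)) % 360
    h2p = math.degrees(math.atan2(b2, a2p)) % 360
    dLp, dCp = L2 - L1, C2p - C1p
    # Hue difference, kept within +/-180 degrees
    if C1p * C2p == 0:
        dhp = 0.0
    elif abs(h2p - h1p) <= 180:
        dhp = h2p - h1p
    elif h2p - h1p > 180:
        dhp = h2p - h1p - 360
    else:
        dhp = h2p - h1p + 360
    dHp = 2 * math.sqrt(C1p * C2p) * math.sin(math.radians(dhp) / 2)
    Lbp, Cbp = (L1 + L2) / 2, (C1p + C2p) / 2
    # Mean hue, with wrap-around handling
    if C1p * C2p == 0:
        hbp = h1p + h2p
    elif abs(h1p - h2p) <= 180:
        hbp = (h1p + h2p) / 2
    elif h1p + h2p < 360:
        hbp = (h1p + h2p + 360) / 2
    else:
        hbp = (h1p + h2p - 360) / 2
    T = (1 - 0.17 * math.cos(math.radians(hbp - 30))
         + 0.24 * math.cos(math.radians(2 * hbp))
         + 0.32 * math.cos(math.radians(3 * hbp + 6))
         - 0.20 * math.cos(math.radians(4 * hbp - 63)))
    # Weighting functions S_L, S_C, S_H and the rotation term R_T
    SL = 1 + (0.015 * (Lbp - 50) ** 2) / math.sqrt(20 + (Lbp - 50) ** 2)
    SC = 1 + 0.045 * Cbp
    SH = 1 + 0.015 * Cbp * T
    dtheta = 30 * math.exp(-(((hbp - 275) / 25) ** 2))
    RC = 2 * math.sqrt(Cbp ** 7 / (Cbp ** 7 + 25 ** 7))
    RT = -math.sin(math.radians(2 * dtheta)) * RC
    return math.sqrt((dLp / (kL * SL)) ** 2 + (dCp / (kC * SC)) ** 2
                     + (dHp / (kH * SH)) ** 2
                     + RT * (dCp / (kC * SC)) * (dHp / (kH * SH)))

# Pure lightness difference between two neutral greys
print(ciede2000((50, 0, 0), (60, 0, 0)))  # ~9.47
```

Each term maps one-to-one onto the formula above: the three weighted differences under the square root plus the R_T rotation term that couples chroma and hue in the blue region.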


Results

The data across all 165 frames show a consistent advantage for YUV 4:4:4 over YUV 4:2:0.

For SSIM, the average value for the 4:2:0 session lies slightly below 0.95, while the 4:4:4 session shows an average SSIM value across all frames of just below 0.97.

Both formats comfortably stay above the 0.95 perceptual threshold, where structural detail appears pristine, but 4:4:4 showed tighter consistency, with a median of 0.974 and fewer dips below 0.95 in complex scenes or scenes with more motion.

A visual inspection confirms that this difference corresponds to crisper edges, more stable fine textures, and reduced softness in high‑contrast regions.

The results for PSNR are less pronounced than those for SSIM. YUV 4:4:4 averaged 40.73 dB, compared with 40.18 dB for YUV 4:2:0, yielding a very small mean gain of 0.55 dB.

YUV 4:2:0 shows larger drops, down to 33 dB, in some frames.

The PSNR results align with expectations: eliminating chroma subsampling removes some colour‑related error, but PSNR, because it is luminance‑dominated, does not fully capture the perceptual impact of this change.

As expected, the most significant difference appears in the DeltaE values. Both scenarios achieve low absolute colour error against the baseline, but 4:4:4 reduces this error significantly.

The mean ΔE for the 4:2:0 session is around 0.58, while the 4:4:4 session achieves a mean near 0.38. This is roughly a 35% reduction in colour difference.

Both YUV 4:2:0 and YUV 4:4:4 fall in the “imperceptible” range according to standard thresholds, which explains why static screenshots from both sessions can appear very similar at first glance. However, a look at the full distribution reveals that 4:4:4 not only lowers the mean but also tightens the spread. The 4:2:0 distribution contains more frames with higher ΔE values, while 4:4:4 compresses the distribution towards lower differences.

From a statistical perspective, the differences between 4:2:0 and 4:4:4 are significant for all three metrics. Effect sizes are moderate for SSIM, small for PSNR, and large for ΔE. This indicates that switching to 4:4:4 provides a broadly measurable quality improvement, but the most relevant gain is in colour accuracy and stability.

Conclusion

Within the constraints of this research, the findings show that YUV 4:4:4 is the preferred choice for remote sessions where visual quality and colour fidelity are the key priorities.

For scenarios involving media review, customer‑facing digital signage, or any task where the remote image needs to be as visually accurate as possible, the combination of higher SSIM, improved PSNR, and significantly lower ΔE makes a strong case for visually lossless encoding with YUV 4:4:4.

At the same time, YUV 4:2:0 remains a very viable option for most deployments. The absolute quality achieved in this research is high for both formats, particularly in terms of colour accuracy.

For standard office workloads with regular video consumption, such as training materials, embedded clips in presentations, or streaming video, the differences measured will most likely not justify the additional bandwidth or overhead associated with 4:4:4, especially in bandwidth‑constrained environments.

These trade‑offs were intentionally out of scope for this research; no bandwidth, CPU, or GPU metrics were collected. However, the existence of a measurable quality difference means that administrators can make an informed decision based on their own constraints and scenarios. In environments where network capacity is ample and visual fidelity is the metric that matters most, enabling visually lossless 4:4:4 for relevant user groups is supported by the data. In environments with limited bandwidth, 4:2:0 offers an excellent compromise that still delivers acceptable quality for video workloads.

YUV 4:4:4 will use significantly more bandwidth than the default YUV 4:2:0. Visually lossless compression is recommended only when sufficient bandwidth is available and the best colour accuracy is required for the workload. Before enabling visually lossless compression, please determine whether the added colour accuracy is actually required and test whether the additional bandwidth usage fits the network constraints in your own environment.

It is also important to note that the configuration used here relies on full‑screen encoding. In most deployments, Citrix HDX uses selective encoding by default, where only specific regions of the screen are encoded with the video codec, while static areas may be handled by other mechanisms, which balances performance and visual quality based on what is happening on screen.

Photo by Pawel Czerwinski on Unsplash