Video QoE Model
This section describes the video QoE models that Surfmeter uses. Unless otherwise stated, the information here applies to all video QoE KQIs; other models may be used depending on the use case. We first introduce the types of models that exist, then present the ITU-T P.1203 and P.1204 model families, and finally explain the extensions and variants of these models that are used in practice.
For a comprehensive introduction to video quality assessment methods and model types, see AVEQ's blog post: Video Quality Models and Measurements — An Overview.
Video Quality vs. Delivery Quality
When measuring video streaming quality, we distinguish between two categories of degradations:
- Video quality: Issues inherent to the encoded video itself, such as compression artifacts, low resolution, or temporal/spatial downscaling
- Delivery quality: Issues caused by the transmission and playback process, such as initial loading delay, rebuffering (stalling), and quality switching during adaptive streaming
The overall Quality of Experience (QoE) is determined by a combination of both aspects. Video quality is often assessed separately from delivery quality to isolate the effects of encoding from those of transmission. For a complete understanding of the user experience, however, both must be considered together, especially when assessing a network's capability to deliver a satisfactory streaming experience. In simple terms, good video quality can be undermined by poor delivery quality, and vice versa. If you are benchmarking a network, you are very unlikely to be interested in an OTT streaming platform's encoding quality, which you can assume is handled well. It is therefore essential to use a model that captures delivery impairments.
Model Classification
Video quality models can be classified by whether they require a reference video:
- Full Reference (FR): Compare the encoded video pixel-by-pixel against the original source. These models are the most accurate, but they are computationally expensive and require access to the original.
- Reduced Reference (RR): Use extracted features from both source and encoded video, requiring less bandwidth than FR models.
- No Reference (NR): Operate without any reference, using only the encoded video or its metadata. Less accurate but practical for real-world monitoring.
Models can also be classified by what data they analyze: pixel-based models decode and analyze the video signal, bitstream-based models inspect the encoded stream (QP values, frame types, motion vectors), and metadata-based models use only high-level parameters (bitrate, resolution, framerate, codec).
Interpretation of MOS Values
The Mean Opinion Score (MOS) is a standardized metric for expressing perceived quality on a scale from 1 to 5. It was originally developed for subjective testing, in which human viewers rate video quality under controlled laboratory conditions. Algorithmic (instrumental) models also use the MOS scale so that their predictions are expressed in a comparable format.
MOS values calculated by Surfmeter reflect the average rating that a group of users would give to the video sequence. These values are computed using ITU-T standardized algorithms (such as P.1203) that have been trained on actual human ratings from subjective tests. While a perfect correspondence with subjective experience cannot be guaranteed, such instrumental models typically achieve a correlation of 0.75–0.90 with actual human ratings. The accuracy depends on the specific model used and the characteristics of the video content.
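The two quantities mentioned above can be illustrated with a minimal sketch: a subjective MOS is the arithmetic mean of individual viewer ratings, and a model's accuracy can be summarized by how well its predictions correlate with those means. The text does not say which correlation measure is meant, so the Pearson correlation is assumed here. The function names and all numbers are hypothetical and chosen for illustration only; they are neither Surfmeter's API nor real test data.

```python
from math import sqrt


def mean_opinion_score(ratings):
    """Average individual viewer ratings (each on the 1-5 scale) into a MOS."""
    if not ratings:
        raise ValueError("at least one rating is required")
    if any(r < 1 or r > 5 for r in ratings):
        raise ValueError("ratings must lie on the 1-5 scale")
    return sum(ratings) / len(ratings)


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical ratings from four viewers for four video sequences.
    viewer_ratings = [[5, 4, 4, 5], [3, 3, 4, 2], [2, 1, 2, 2], [4, 4, 3, 4]]
    subjective_mos = [mean_opinion_score(r) for r in viewer_ratings]

    # Hypothetical model predictions for the same four sequences.
    predicted_mos = [4.3, 3.2, 1.9, 3.6]

    print("Subjective MOS:", subjective_mos)
    print("Correlation:", round(pearson_correlation(predicted_mos, subjective_mos), 2))
```

A correlation close to 1 means the model ranks and spaces the sequences much like the viewers did; it says nothing about a constant offset between predicted and subjective scores.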
The MOS reflects the overall experience of the user. It includes effects of initial loading, stalling, and quality variations throughout the video. This is the primary value of concern when assessing the streaming quality.
The value range can be interpreted as follows:
- 1: Bad
- 2: Poor
- 3: Fair
- 4: Good
- 5: Excellent
In practice, a perfect score of 5 cannot be reached: the MOS is an average, and even human viewers do not all agree that a given video sequence is Excellent. The highest achievable score is therefore around 4.7, and any value above 4 is considered very good.
A value between 3 and 4 indicates issues with the streaming performance that are worth looking at, because over time they may impact the user experience.
A value below 3 indicates severe problems with the streaming performance that warrant a detailed investigation.
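The interpretation bands above can be expressed as a small helper. This is a sketch only: the function name `interpret_mos` and the returned strings are hypothetical, and assigning a value of exactly 4.0 to the middle band is an assumption, since the text says "above 4" and "between 3 and 4" without fixing the boundary.

```python
def interpret_mos(mos: float) -> str:
    """Map an overall MOS (1-5) to the interpretation bands described above."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie between 1 and 5")
    if mos > 4.0:
        # Above 4: considered very good.
        return "very good"
    if mos >= 3.0:
        # Between 3 and 4 (assumed inclusive): issues worth looking at.
        return "worth investigating"
    # Below 3: severe problems that warrant a detailed investigation.
    return "severe problems"


if __name__ == "__main__":
    # Hypothetical MOS values, for illustration only.
    for value in (4.5, 3.4, 2.1):
        print(value, "->", interpret_mos(value))
```

Such a mapping is useful for dashboards or alerting, but the thresholds should be treated as guidance rather than hard limits.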
To find out more, see the MOS Troubleshooting Guide.