Video QoE Model¶

This section describes the video QoE model in use. Unless otherwise stated, the information in this section applies to all video QoE KQIs. Other models may be used depending on the use case.

ITU-T P.1203 Introduction¶

The ITU-T Recommendation P.1203 is a family of standards that specifies the world's first model to predict the Quality of Experience (QoE) for HTTP Adaptive Streaming (HAS) services. It consists of one main and three sub-recommendations:

ITU-T P.1203: Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport
ITU-T P.1203.1: Video quality estimation module (short-term, providing per-one-second output information)
ITU-T P.1203.2: Audio quality estimation module (short-term, providing per-one-second output information)
ITU-T P.1203.3: Audiovisual integration and integration of final score, reflecting remembered quality for viewing sessions between 30 s and 5 min duration

The standard predicts the QoE in terms of Mean Opinion Scores (MOS) on a scale from 1 to 5, where 1 refers to Bad quality and 5 to Excellent. The scope of the standard includes all degradations that may occur in a video stream caused by lossy compression, temporal or spatial downscaling, and stalling effects due to rebuffering events (including initial loading). We provide the P.1203 MOS as a statistic value called p1203_overall_mos. In the standard, it is referred to as output O.46 (see Outputs below).

The models described in the standard were created by an international consortium of academic and industrial partners. They were trained and validated on over 1000 audiovisual sequences rated by human viewers, amounting to over 25,000 individual ratings. The ratings were given in the context of standardized subjective tests conducted in dedicated laboratories.

P.1203 is composed of several modules that each compute different aspects of the overall quality estimation.

The input streams are analyzed separately for audio and video quality. The Pv (P.1203.1) and Pa (P.1203.2) modules each produce a per-one-second MOS value corresponding to the per-stream video and audio quality. These values are then integrated over time, taking into account any influence of stalling and quality fluctuation during playout. The integration happens in the Pq module, which predicts the final MOS value (O.46). This MOS value corresponds to the quality rating a user would have given after watching the video.

The modular structure allows the integration module to be used with other video/audio quality models, under the condition that the combination is validated in terms of prediction accuracy.
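The modular structure can be pictured as three pluggable callables. The following is a minimal sketch of the data flow only: the class name, the type aliases, and the `placeholder_integration` function are hypothetical, and the placeholder simply averages the video scores, which is not the P.1203.3 integration algorithm.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

Scores = List[float]                      # one MOS value per second
Stalling = Sequence[Tuple[float, float]]  # (start, duration) pairs; hypothetical layout


@dataclass
class QoEPipeline:
    """Wires interchangeable short-term modules into one integration step."""
    video_module: Callable[[dict], Scores]                    # role of Pv, yields O.22
    audio_module: Callable[[dict], Scores]                    # role of Pa, yields O.21
    integration: Callable[[Scores, Scores, Stalling], float]  # role of Pq, yields O.46

    def run(self, video_meta: dict, audio_meta: dict, stalling: Stalling) -> float:
        o22 = self.video_module(video_meta)
        o21 = self.audio_module(audio_meta)
        return self.integration(o21, o22, stalling)


def placeholder_integration(o21: Scores, o22: Scores, stalling: Stalling) -> float:
    """Stand-in only: averages the video scores; NOT the P.1203.3 algorithm."""
    return sum(o22) / len(o22)
```

Swapping in a different video quality model then means passing a different `video_module`, provided the combination has been validated as noted above.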

P.1203.1, the video quality estimation module, offers four modes of operation, depending on the available information from the audiovisual stream and the required/available computational resources.

P.1203's simplest mode of operation (Mode 0) takes as input the audio and video bitrate, video resolution, frame rate, and stalling events occurring at the client side. Depending on the available data, higher modes of operation increase prediction accuracy at the expense of being more computationally intensive and requiring input data from more in-depth bitstream inspection.

While Mode 0 only has access to basic metadata, Mode 1 can inspect the packet headers of the transmitted stream to obtain frame sizes and types. Modes 2 and 3 have access to the bitstream itself; Mode 2 only accesses 2% of the stream to reduce computational effort. Mode 2 is rarely used in practice, since Mode 3 can be computed efficiently on modern hardware.
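To make the Mode 0 inputs concrete, the sketch below groups them into plain data classes. All class and field names are illustrative assumptions and do not represent the Surfmeter data format or the input format defined in the standard.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class VideoSegment:
    start: float          # seconds from the start of the media
    duration: float       # seconds
    bitrate_kbps: float
    codec: str            # e.g. "h264"
    fps: float
    resolution: str       # "WIDTHxHEIGHT", e.g. "1920x1080"


@dataclass
class AudioSegment:
    start: float
    duration: float
    bitrate_kbps: float
    codec: str            # e.g. "aaclc"


@dataclass
class Mode0Input:
    """Metadata-only input: no packet headers or bitstream access needed."""
    video: List[VideoSegment]
    audio: List[AudioSegment]
    # (media time in seconds, stall duration in seconds); initial loading at time 0
    stalling: List[Tuple[float, float]] = field(default_factory=list)

    def total_stalling(self) -> float:
        """Total time spent stalled, including initial loading."""
        return sum(duration for _, duration in self.stalling)
```

A session with a 2-second initial loading delay and one 1.5-second rebuffering event would carry `stalling=[(0.0, 2.0), (10.0, 1.5)]`.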

Outputs¶

The model produces various outputs that can be used for diagnostic purposes:

O.21: per-second audio quality scores (see statistic value p1203_average_audio_quality)
O.22: per-second video quality scores (see statistic value p1203_average_video_quality)
O.23: the "Perceptual Stalling Indicator", also called stalling quality (see statistic value p1203_stalling_quality)
O.34: the per-second audiovisual quality (see statistic value p1203_average_audiovisual_quality)
O.35: the overall audiovisual quality, excluding stalling (see statistic value p1203_overall_audiovisual_quality)
O.46: the overall audiovisual quality, including stalling (see statistic value p1203_overall_mos)
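For scripting against exported statistics, the list above can be captured as a simple lookup table. The dictionary and function names are hypothetical; the output identifiers and statistic names are taken from the list.

```python
# Maps P.1203 output identifiers to the statistic values listed above.
P1203_OUTPUT_TO_STATISTIC = {
    "O.21": "p1203_average_audio_quality",
    "O.22": "p1203_average_video_quality",
    "O.23": "p1203_stalling_quality",
    "O.34": "p1203_average_audiovisual_quality",
    "O.35": "p1203_overall_audiovisual_quality",
    "O.46": "p1203_overall_mos",
}


def statistic_for(output_id: str) -> str:
    """Return the statistic name for a P.1203 output such as 'O.46'."""
    try:
        return P1203_OUTPUT_TO_STATISTIC[output_id]
    except KeyError:
        raise ValueError(f"unknown P.1203 output: {output_id!r}") from None
```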

Scope and Limitations¶

The following limitations apply to P.1203-related KQIs:

ITU-T Rec. P.1203 has only been validated for video sequences of up to 5 minutes in length. Hence, measuring a video source longer than this is technically possible but may be considered invalid with respect to the standard.
ITU-T Rec. P.1203 has only been validated for video up to 25 fps frame rate. Use of the model for video with higher frame rates is technically possible but will not yield higher quality ratings.
ITU-T Rec. P.1203 has only been validated for video up to 1080p resolution. Use of the model for video with higher resolution is technically possible but will not yield higher quality ratings. An Appendix exists for P.1203 that enables the use of up to UHD-1 resolution.
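The limits above can be checked before interpreting a result. This is a minimal sketch; the function name, parameter names, and warning texts are assumptions, and only the thresholds come from the list above.

```python
from typing import List

MAX_DURATION_S = 5 * 60   # validated for sequences of up to 5 minutes
MAX_FPS = 25              # validated for up to 25 fps
MAX_HEIGHT_PX = 1080      # validated for up to 1080p


def scope_warnings(duration_s: float, fps: float, height_px: int) -> List[str]:
    """Return one warning per validated limit that the measurement exceeds."""
    warnings = []
    if duration_s > MAX_DURATION_S:
        warnings.append("duration exceeds 5 minutes: result may be invalid")
    if fps > MAX_FPS:
        warnings.append("frame rate exceeds 25 fps: no higher rating expected")
    if height_px > MAX_HEIGHT_PX:
        warnings.append("resolution exceeds 1080p: no higher rating expected")
    return warnings
```

An empty list means the measurement lies within the validated scope of the unextended model.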

Amendments for Impact of Stalling and Low Quality¶

During the deployment of ITU-T Rec. P.1203, it was discovered that low audiovisual quality and stalling had less impact on the overall MOS than users of the model would expect. In certain edge cases, the model predicted too high a MOS despite considerable initial loading delay or stalling. A set of modifications has been proposed and is available in Surfmeter.

To increase the impact of stalling events and very low audiovisual quality, ITU-T Rec. P.1203.3 has been officially updated with Amendment 1, "Adjustment of the audiovisual quality". This amendment is available in the Surfmeter software.

Note

The amendment is enabled by default.

Extensions or Variants of the Models¶

ITU-T Rec. P.1203.1 and P.1203.2 (the video and audio modules) have been developed for the H.264 video codec and MPEG-4 AAC (AAC-LC, HE-AAC) and AC-3 audio codecs only.

To use the model for video services that use codecs other than those specified, specific extensions or variants have been developed in collaboration with TU Ilmenau to enhance AVEQ's monitoring features.

Type	Extension/Module	Scope	Used by default?
Video (Pv)	AVQBits M0 – P.1204.3-based Mode 0 Video Quality Module	Codecs: H.264, H.265, VP9, AV1; FPS: 12–60; Resolution: 240p–2160p	No
Video (Pv)	Retrained P.1203.1 Coefficients	Codecs: H.264, H.265, VP9, AV1; FPS: 12–24; Resolution: 240p–1080p	No
Video (Pv)	Open-Source P.1203.1 Codec Extension	Codecs: H.264, H.265, VP9; FPS: 12–24; Resolution: 240p–1080p	No

These extensions may be used with the existing Pq component for final quality integration.
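The scopes in the table can be expressed as data to find which extensions cover a given stream. This is an illustrative sketch: the class, the lowercase codec identifiers, and the function name are assumptions, and the ranges are copied from the table.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class Extension:
    name: str
    codecs: FrozenSet[str]
    fps_range: Tuple[int, int]     # inclusive (min, max)
    height_range: Tuple[int, int]  # inclusive (min, max), in pixels


EXTENSIONS = [
    Extension("AVQBits|M0",
              frozenset({"h264", "h265", "vp9", "av1"}), (12, 60), (240, 2160)),
    Extension("Retrained P.1203.1 Coefficients",
              frozenset({"h264", "h265", "vp9", "av1"}), (12, 24), (240, 1080)),
    Extension("Open-Source P.1203.1 Codec Extension",
              frozenset({"h264", "h265", "vp9"}), (12, 24), (240, 1080)),
]


def matching_extensions(codec: str, fps: float, height_px: int) -> List[str]:
    """Return the names of all extensions whose scope covers the stream."""
    return [
        ext.name
        for ext in EXTENSIONS
        if codec.lower() in ext.codecs
        and ext.fps_range[0] <= fps <= ext.fps_range[1]
        and ext.height_range[0] <= height_px <= ext.height_range[1]
    ]
```

For example, a 2160p AV1 stream at 60 fps falls within the scope of AVQBits|M0 only.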

They are described in the following sections.

Video¶

AVQBits|M0 (P.1204.3-based Mode 0 Video Quality Module)¶

AVQBits|M0 is a Mode 0 model (using only metadata) that is based on ITU-T Rec. P.1204.3, a model that operates on the video bitstream itself (Mode 3). It estimates the quantization parameter (QP) from video codec metadata and derives a MOS from it without accessing the bitstream. The module is available on GitHub. It covers the H.264, H.265, and VP9 codecs and, in our AVEQ-supplied variant, AV1, at resolutions up to 2160p and frame rates up to 60 fps.

It is recommended to use this model for any video service that matches the above scope. The model's internal accuracy is very high, and it is also very fast to compute. Compared to the original P.1203.1 model, it is more accurate for low-bitrate encodings and, consequently, for evaluating bitrate ladders.

The following figure shows the behavior of the models for different bitrates (kBit/s), codecs, and resolutions, all assuming a frame rate of 60 fps and a target display resolution of 2160p.

Retrained P.1203.1 Codec Coefficients¶

To add support for H.265, VP9, and AV1, TU Ilmenau retrained the coefficients of the P.1203.1 model functions. Specifically, the modified coefficients were a1 through a4 and q1 through q3 (see Table A.1 and Table B.1 in ITU-T Rec. P.1203.1).

The retraining used a newly generated set of test sequences based on the publicly available AVT-VQDB-UHD1 video database. Seventeen sources were encoded with H.265, VP9, and AV1, for a total of 1581 video sequences, ranging in resolution from 360p to 1080p and in bitrate from 100 kbps to 16 Mbps. The encoders used were libx265, libvpx-vp9, and libaom-av1. Two-pass encoding was enabled with a specific encoder speed preset (medium for H.265, 2 for VP9, 4 for AV1). The frame rate was kept constant at 30 fps. VMAF scores, calculated using the VMAF 0.6.1 model, served as ground truth for retraining the coefficients.

Note

This extension is enabled by default for services that do not use H.264.

Open-Source P.1203.1 Codec Extension¶

To support the H.265 and VP9 codecs, a publicly available extension to ITU-T Rec. P.1203.1 can be used. This implementation was developed by TU Ilmenau and is available on GitHub. It applies a linear mapping to each calculated O.22 score to compensate for the improved efficiency of H.265 and VP9 compared to H.264. The method is described in more detail in the GitHub repository.
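The idea of a per-score linear mapping can be sketched as follows. The function name is hypothetical, the `slope` and `intercept` values must come from the actual extension and are not reproduced here, and clamping the result to the 1–5 MOS range is an assumption of this sketch.

```python
from typing import Iterable, List


def map_o22_scores(scores: Iterable[float], slope: float, intercept: float) -> List[float]:
    """Apply y = slope * x + intercept to each per-second O.22 score.

    The result is clamped to the MOS range [1, 5] (an assumption of this
    sketch, not a documented step of the extension).
    """
    return [min(5.0, max(1.0, slope * score + intercept)) for score in scores]
```

Because the mapping is applied per second, the adjusted scores can be passed to the Pq integration in place of the original O.22 values.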

Note

This extension is currently not enabled by default due to its lower accuracy in comparison to the retrained coefficients.

Audio¶

No quality model is currently available for Opus, so the same coefficients as for HE-AAC are used for this codec. The impact on the overall audio quality, and consequently on the audiovisual quality, is considered negligible.

An updated version of the model with support for Opus is currently being investigated.

Interpretation of MOS Values¶

MOS values reflect the average rating that a group of users would give to the video sequence. In the software, the MOS values are calculated according to the ITU-T Rec. P.1203 algorithm that predicts these ratings. The algorithm was developed based on actual human ratings; however, a perfect correspondence with the subjective experience cannot be guaranteed. Still, such algorithmic models can reach a correlation of 0.75–0.90 with actual human ratings. The accuracy depends on the specific type of model used as well as the sequences it is applied to, so this documentation can only make a general statement.
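To show what such a correlation figure measures, the sketch below computes a correlation coefficient between predicted MOS values and subjective ratings. It assumes the Pearson coefficient is meant, which the text does not state, and the function name is hypothetical.

```python
import math
from typing import Sequence


def pearson(predicted: Sequence[float], subjective: Sequence[float]) -> float:
    """Pearson correlation between model predictions and human ratings."""
    if len(predicted) != len(subjective) or len(predicted) < 2:
        raise ValueError("need two equally long sequences with at least 2 values")
    mean_p = sum(predicted) / len(predicted)
    mean_s = sum(subjective) / len(subjective)
    cov = sum((p - mean_p) * (s - mean_s) for p, s in zip(predicted, subjective))
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    var_s = sum((s - mean_s) ** 2 for s in subjective)
    if var_p == 0 or var_s == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(var_p * var_s)
```

A value of 1 would mean the predictions follow the human ratings perfectly along a straight line; values in the 0.75–0.90 range indicate a strong but imperfect relationship.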

The MOS reflects the overall experience of the user. It includes effects of initial loading, stalling, and quality variations throughout the video. This is the primary value of concern when assessing the streaming quality.

The value range can be interpreted as follows:

1: Bad
2: Poor
3: Fair
4: Good
5: Excellent

In practice, a perfect score of 5 cannot be reached, since even humans cannot always agree that a video sequence is Excellent. This means that the highest possible score will be around 4.7. Any value above 4 is therefore considered very good.

A value between 3 and 4 indicates issues with the streaming performance that are worth looking at, because over time they may impact the user experience.

A value below 3 indicates severe problems with the streaming performance that warrant a detailed investigation.
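The bands described above translate directly into a small helper for dashboards or alerting. The function name and returned labels are assumptions, as is the treatment of exactly 3.0 and 4.0 as part of the middle band.

```python
def interpret_mos(mos: float) -> str:
    """Classify an overall MOS (O.46) into the bands described above."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError("MOS must lie between 1 and 5")
    if mos > 4.0:
        return "very good"
    if mos >= 3.0:
        # Boundary values 3.0 and 4.0 fall here by assumption.
        return "issues worth looking at"
    return "severe problems"
```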

To find out more, see the MOS Troubleshooting Guide.