ITU-T P.1203

The ITU-T Recommendation P.1203 is a family of standards that specifies the world's first model to predict the Quality of Experience (QoE) for HTTP Adaptive Streaming (HAS) services. It consists of one main and three sub-recommendations:

ITU-T P.1203: Parametric bitstream-based quality assessment of progressive download and adaptive audiovisual streaming services over reliable transport
ITU-T P.1203.1: Video quality estimation module (short-term, providing per-one-second output information)
ITU-T P.1203.2: Audio quality estimation module (short-term, providing per-one-second output information)
ITU-T P.1203.3: Audiovisual integration and integration of final score, reflecting remembered quality for viewing sessions between 30 s and 5 min duration

The standard predicts the QoE in terms of Mean Opinion Scores (MOS) on a scale from 1 to 5, where 1 refers to Bad quality and 5 to Excellent. P.1203 is unique among video quality models in that it combines both video quality and delivery quality into an overall session-level QoE score.
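As a rough illustration of how a MOS on this scale can be read, the sketch below maps a score to the nearest category label. The intermediate labels (Poor, Fair, Good) come from the common five-point absolute category rating scale rather than from the text above, and the helper name `mos_to_label` is hypothetical.

```python
# Five-point category labels; 1 = Bad and 5 = Excellent as stated in the text,
# the intermediate labels are taken from the usual five-point ACR scale.
ACR_LABELS = {1: "Bad", 2: "Poor", 3: "Fair", 4: "Good", 5: "Excellent"}


def mos_to_label(mos: float) -> str:
    """Map a MOS in the range 1..5 to the nearest category label."""
    if not 1.0 <= mos <= 5.0:
        raise ValueError(f"MOS must be within 1..5, got {mos}")
    # int(x + 0.5) rounds halves up; round() would round halves to even.
    return ACR_LABELS[int(mos + 0.5)]


if __name__ == "__main__":
    for value in (1.0, 2.4, 3.7, 4.5):
        print(value, mos_to_label(value))
```

Such labels are only a reading aid; the numeric MOS remains the value to store and compare.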

We provide the P.1203 MOS as a statistic value called p1203_overall_mos. In the standard, it is referred to as output O.46 (see further Outputs below).

The models described in the standard have been created by an international consortium of academic and industrial partners; they have been trained and validated on over 1000 audiovisual sequences rated by human viewers, amounting to over 25,000 individual ratings. The ratings were captured in standardized subjective tests conducted in dedicated laboratories. The model development was coordinated within ITU-T Study Group 12, and the competition approach ensured that only the best-performing model candidates were standardized.

Modular Structure of P.1203

P.1203 is composed of several modules that each compute different aspects of the overall quality estimation.

The input streams are analyzed separately for audio and video quality. The Pv (P.1203.1) and Pa (P.1203.2) modules each produce a per-one-second MOS value corresponding to the per-stream video and audio quality. These values are then integrated over time, taking into account any stalling and quality fluctuations that occur during playout. The integration happens in the Pq module, which predicts the final MOS value (O.46). This value corresponds to the quality rating a user would have given had they watched the video.

The modular structure allows the integration module to be used with other video/audio quality models, under the condition that the combination is validated in terms of prediction accuracy.

Modes of Operation

P.1203.1, the video quality estimation module, offers four modes of operation, depending on the available information from the audiovisual stream and the required/available computational resources.

P.1203's simplest mode of operation (Mode 0) takes as input: audio/video bitrate, video resolution, frame rate, and stalling events at the client side. Depending on the available data, higher modes of operation increase prediction accuracy, at the expense of more computation and of requiring input data from more in-depth bitstream inspection.
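To make the Mode 0 inputs concrete, here is a minimal sketch of a container for them. The class name, field names, and units (kbit/s, seconds) are assumptions chosen for illustration; they are not the input format of any particular implementation.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Mode0Input:
    """Hypothetical container for the metadata listed for Mode 0."""

    audio_bitrate_kbps: float
    video_bitrate_kbps: float
    width: int
    height: int
    fps: float
    # Each stalling event as (start time, duration), both in seconds.
    stalling_events: List[Tuple[float, float]] = field(default_factory=list)

    def __post_init__(self) -> None:
        if self.audio_bitrate_kbps < 0 or self.video_bitrate_kbps < 0:
            raise ValueError("bitrates must be non-negative")
        if self.width <= 0 or self.height <= 0 or self.fps <= 0:
            raise ValueError("resolution and frame rate must be positive")
        if any(start < 0 or dur < 0 for start, dur in self.stalling_events):
            raise ValueError("stalling times must be non-negative")

    def total_stalling_s(self) -> float:
        """Sum of all stalling durations in seconds."""
        return sum(dur for _, dur in self.stalling_events)


if __name__ == "__main__":
    session = Mode0Input(128, 4500, 1920, 1080, 25, [(0.0, 1.5), (42.0, 3.0)])
    print(session.total_stalling_s())
```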

While Mode 0 only has access to basic metadata, Mode 1 can inspect the packet headers of the transmitted stream to obtain frame sizes and types. Modes 2 and 3 have access to the bitstream itself; Mode 2 only accesses 2% of the stream to reduce computing effort. Mode 2 is rarely used in practice, since Mode 3 can be calculated rather efficiently on modern hardware.
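The relationship between available data and mode can be sketched as a small decision function. This is an illustration of the description above, not part of the standard; `AvailableInputs`, its fields, and `select_mode` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class AvailableInputs:
    """Hypothetical description of what a measurement probe can observe."""

    has_metadata: bool = True        # bitrate, resolution, frame rate, stalling
    has_frame_headers: bool = False  # frame sizes and types from packet headers
    has_bitstream: bool = False      # access to the encoded bitstream itself
    full_parse_affordable: bool = True  # enough compute to parse all of it


def select_mode(inputs: AvailableInputs) -> int:
    """Return the highest mode of operation the available data supports."""
    if not inputs.has_metadata:
        raise ValueError("even Mode 0 needs basic stream metadata")
    if inputs.has_bitstream:
        # Mode 2 restricts itself to 2% of the stream to save computation.
        return 3 if inputs.full_parse_affordable else 2
    if inputs.has_frame_headers:
        return 1
    return 0


if __name__ == "__main__":
    print(select_mode(AvailableInputs()))
    print(select_mode(AvailableInputs(has_frame_headers=True)))
    print(select_mode(AvailableInputs(has_bitstream=True)))
```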

Outputs

The model produces various outputs that can be used for diagnostic purposes:

O.21: per-second audio quality scores (see statistic value p1203_average_audio_quality)
O.22: per-second video quality scores (see statistic value p1203_average_video_quality)
O.23: the "Perceptual Stalling Indicator", also stalling quality (see statistic value p1203_stalling_quality)
O.35: the overall audiovisual quality, excluding stalling (see statistic value p1203_overall_audiovisual_quality)
O.46: the overall audiovisual quality, including stalling (see statistic value p1203_overall_mos)
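For diagnostics, the outputs listed above can be kept in a small lookup table. The sketch below also derives one illustrative indicator, the gap between O.35 and O.46, as a rough sign of how much stalling lowered the session score. The helper names and the plain-dict representation of the statistic values are assumptions.

```python
from typing import Mapping

# Output identifiers and statistic names as listed in this section.
OUTPUT_TO_STATISTIC = {
    "O.21": "p1203_average_audio_quality",
    "O.22": "p1203_average_video_quality",
    "O.23": "p1203_stalling_quality",
    "O.35": "p1203_overall_audiovisual_quality",
    "O.46": "p1203_overall_mos",
}


def statistic_for(output_id: str) -> str:
    """Return the statistic value name for a P.1203 output identifier."""
    try:
        return OUTPUT_TO_STATISTIC[output_id]
    except KeyError:
        raise KeyError(f"unknown P.1203 output: {output_id}") from None


def stalling_impact(stats: Mapping[str, float]) -> float:
    """Difference between quality excluding (O.35) and including (O.46) stalling."""
    return stats[statistic_for("O.35")] - stats[statistic_for("O.46")]


if __name__ == "__main__":
    example = {
        "p1203_overall_audiovisual_quality": 4.0,
        "p1203_overall_mos": 3.5,
    }
    print(stalling_impact(example))
```

A large gap suggests the delivery (stalling) rather than the encoding dominated the loss in quality; a gap near zero suggests the opposite.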

Scope and Limitations

The following limitations apply for P.1203-related KQIs:

ITU-T Rec. P.1203 has only been validated for video sequences of up to 5 minutes length. Hence, a measurement of a video source that is longer than this duration is technically possible but may be considered invalid with respect to the standard.
ITU-T Rec. P.1203 has only been validated for video up to 25 fps frame rate. Use of the model for video with higher frame rates is technically possible but will not yield higher quality ratings.
ITU-T Rec. P.1203 has only been validated for video up to 1080p resolution. Use of the model for video with higher resolution is technically possible but will not yield higher quality ratings. An Appendix exists for P.1203 that enables the use of up to UHD-1 resolution.
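These limits can be checked before a measurement is interpreted. The following sketch flags sessions outside the validated scope using the thresholds stated above (5 minutes, 25 fps, 1080p); the `Measurement` class and `scope_warnings` function are hypothetical names.

```python
from dataclasses import dataclass
from typing import List

# Validation limits as stated in this section.
MAX_DURATION_S = 300.0
MAX_FPS = 25.0
MAX_HEIGHT = 1080


@dataclass
class Measurement:
    """Hypothetical summary of one measured video session."""

    duration_s: float
    fps: float
    height: int


def scope_warnings(m: Measurement) -> List[str]:
    """List every way in which a measurement exceeds the validated scope."""
    warnings = []
    if m.duration_s > MAX_DURATION_S:
        warnings.append(f"duration {m.duration_s:.0f} s exceeds 5 min")
    if m.fps > MAX_FPS:
        warnings.append(f"frame rate {m.fps:g} fps exceeds 25 fps")
    if m.height > MAX_HEIGHT:
        warnings.append(f"resolution {m.height}p exceeds 1080p")
    return warnings


if __name__ == "__main__":
    for message in scope_warnings(Measurement(600, 60, 2160)):
        print(message)
```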

Amendments for Impact of Stalling and Low Quality

During the deployment of ITU-T Rec. P.1203, it was discovered that the impact of low audiovisual quality and stalling on the overall MOS was lower than users of the model would expect. In certain edge cases, the model predicted too high a MOS for sessions with long initial loading delay or stalling. A set of modifications has been proposed and is available in Surfmeter.

In order to increase the impact of stalling events and very low audiovisual quality, ITU-T Rec. P.1203.3 has been officially updated with Amendment 1 "Adjustment of the audiovisual quality". This amendment is available in the Surfmeter software.

Note

The amendment is enabled by default.

Extensions and Variants for Pv/Pa Modules

ITU-T Rec. P.1203.1 and P.1203.2 (the video and audio modules) have been developed for the H.264 video codec and MPEG-4 AAC (AAC-LC, HE-AAC) and AC-3 audio codecs only.

To obtain a full P.1203-type evaluation (i.e., a session score) for video services that use codecs other than those specified, specific extensions or variants have been developed in collaboration with TU Ilmenau to enhance AVEQ's monitoring features. These variants are summarized in the following table:

Type	Extension/Module	Scope	Used by default?
Video (Pv)	ITU-T P.1204.1 Mode 0 Video Quality Module	Codecs: H.264, H.265, VP9; FPS: 12–60; Resolution: 240p–2160p	✅ *
Video (Pv)	AVQBits|M0 (P.1204.3-based Mode 0 Video Quality Module)	Codecs: H.264, H.265, VP9, AV1; FPS: 12–60; Resolution: 240p–2160p	✅
Video (Pv)	Retrained P.1203.1 Coefficients	Codecs: H.264, H.265, VP9, AV1; FPS: 12–24; Resolution: 240p–1080p	❌
Video (Pv)	Open-Source P.1203.1 Codec Extension	Codecs: H.264, H.265, VP9; FPS: 12–24; Resolution: 240p–1080p	❌

* P.1204.1 is a subset of AVQBits|M0 in terms of supported codecs, so it is, by definition, enabled.

These extensions may be used with the existing Pq component for final quality integration. They are described in the following sections.
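The scopes in the table can be encoded as data so that the applicable video modules for a given stream can be looked up. This is a sketch under the assumption that scope is checked on codec, frame rate, and frame height only; the `PvModule` class, the lowercase codec identifiers, and `applicable_modules` are hypothetical.

```python
from dataclasses import dataclass
from typing import FrozenSet, List, Tuple


@dataclass(frozen=True)
class PvModule:
    """One row of the table: a video module and its scope."""

    name: str
    codecs: FrozenSet[str]
    fps_range: Tuple[int, int]
    height_range: Tuple[int, int]
    default: bool


PV_MODULES = [
    PvModule("P.1204.1 Mode 0", frozenset({"h264", "h265", "vp9"}),
             (12, 60), (240, 2160), True),
    PvModule("AVQBits|M0", frozenset({"h264", "h265", "vp9", "av1"}),
             (12, 60), (240, 2160), True),
    PvModule("Retrained P.1203.1 coefficients",
             frozenset({"h264", "h265", "vp9", "av1"}),
             (12, 24), (240, 1080), False),
    PvModule("Open-source P.1203.1 codec extension",
             frozenset({"h264", "h265", "vp9"}),
             (12, 24), (240, 1080), False),
]


def applicable_modules(codec: str, fps: float, height: int) -> List[PvModule]:
    """Return all modules whose scope covers the given stream."""
    codec = codec.lower()
    return [
        m for m in PV_MODULES
        if codec in m.codecs
        and m.fps_range[0] <= fps <= m.fps_range[1]
        and m.height_range[0] <= height <= m.height_range[1]
    ]


if __name__ == "__main__":
    for module in applicable_modules("av1", 60, 2160):
        print(module.name, "(default)" if module.default else "")
```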

P.1204.1 Mode 0 Video Quality Module

ITU-T Rec. P.1204.1 specifies a metadata-based (Mode 0) video quality model. For technical details on the model, see the sections above. For background on the P.1204 series, see Raake et al.: Multi-Model Standard for Bitstream-, Pixel-Based and Hybrid Video Quality Assessment of UHD/4K: ITU-T P.1204.

Because P.1204.1 uses the AVQBits|M0 model internally, except for the AV1 coefficients, please read the next section for details on the model behavior.

AVQBits|M0 (P.1204.3-based Mode 0 Video Quality Module)

AVQBits|M0 is a Mode 0 model (using only metadata), but it is architecturally based on the ITU-T Rec. P.1204.3 model, which operates on the video bitstream itself (Mode 3). For background on P.1204.3, read Rao et al.'s paper.

AVQBits|M0 synthesizes an assumed quantization parameter (QP) from video codec metadata and derives a MOS from that metadata alone. The module is available on GitHub. It covers the H.264, H.265, and VP9 codecs and, in our AVEQ-supplied variant, AV1, at resolutions up to 2160p and frame rates up to 60 fps.

Details on the model are available in Rao et al.'s paper: AVQBits—Adaptive Video Quality Model Based on Bitstream Information for Various Video Applications.

It is recommended to use this model for any video service that matches the above scope. The model's internal accuracy is very high, yielding a correlation of around 0.890 on the AVT-VQDB-UHD1 database, and it is also very fast to compute. The accuracy is better for low-bitrate encodings (and, consequently, evaluating bitrate ladders) compared to the original P.1203.1 model. More performance statistics are available in the paper linked above.

The following figure shows the behavior of the models for different bitrates (kBit/s), codecs, and resolutions, all assuming a frame rate of 60 fps and a target display resolution of 2160p.

Retrained P.1203.1 Codec Coefficients

To add support for H.265, VP9, and AV1, TU Ilmenau retrained the coefficients of the P.1203.1 model functions. Specifically, the modified coefficients were a1 through a4 and q1 through q3 (see Table A.1 and Table B.1 in ITU-T Rec. P.1203.1).

The retraining used a newly generated set of test sequences based on the publicly available AVT-VQDB-UHD1 video database. 17 sources were encoded with H.265, VP9, and AV1, for a total of 1581 video sequences, at resolutions from 360p to 1080p and bitrates from 100 kbps to 16 Mbps. The encoders were libx265, libvpx-vp9, and libaom-av1. Two-pass encoding was enabled with a specific encoder speed preset (medium for H.265, 2 for VP9, 4 for AV1). The frame rate was kept constant at 30 fps. VMAF scores, calculated with the VMAF 0.6.1 model, served as ground truth for retraining the coefficients.

Note

This extension is not enabled by default (see the table above, which lists AVQBits|M0 as the default video module).

Open-Source P.1203.1 Codec Extension

To support the codecs H.265 and VP9, a publicly available extension to ITU-T Rec. P.1203.1 can be used. This implementation has been developed by TU Ilmenau and is available on GitHub. Here, a linear mapping is applied to each calculated O.22 score to compensate for the improved efficiency of H.265 and VP9 compared to H.264. The method is described in more detail at the given URL.
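The idea of a per-score linear mapping can be sketched as follows. The slope and intercept values are placeholders invented for illustration and are NOT the coefficients of the TU Ilmenau extension; clamping the result to the 1 to 5 MOS range is also an assumption. Consult the published implementation for the actual method.

```python
from typing import Dict, List, Sequence, Tuple

# PLACEHOLDER (slope, intercept) pairs for illustration only.
# These are not the values used by the open-source extension.
PLACEHOLDER_COEFFS: Dict[str, Tuple[float, float]] = {
    "h264": (1.0, 0.0),    # identity: the baseline codec needs no correction
    "h265": (1.05, 0.10),  # hypothetical uplift for a more efficient codec
    "vp9": (1.05, 0.10),   # hypothetical uplift for a more efficient codec
}


def map_o22(
    scores: Sequence[float],
    codec: str,
    coeffs: Dict[str, Tuple[float, float]] = PLACEHOLDER_COEFFS,
) -> List[float]:
    """Apply a linear mapping to each per-second O.22 score, clamped to 1..5."""
    if codec not in coeffs:
        raise ValueError(f"no mapping defined for codec {codec!r}")
    slope, intercept = coeffs[codec]
    return [min(5.0, max(1.0, slope * s + intercept)) for s in scores]


if __name__ == "__main__":
    print(map_o22([1.0, 3.0, 5.0], "h265"))
```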

Note

This extension is currently not enabled by default due to its lower accuracy in comparison to the retrained coefficients.

Audio Extensions

Because no quality model is available for Opus, the same coefficients as for HE-AAC are used for this codec. The impact on the overall audio quality, and consequently on the audiovisual quality, is considered negligible.

An updated version of the model with support for Opus is currently being investigated.
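The Opus fallback described above amounts to a simple alias when selecting a coefficient set. In the sketch below, the codec identifiers (`aaclc`, `heaac`, `ac3`, `opus`) and the function name are hypothetical; only the rule that Opus reuses the HE-AAC coefficients comes from the text.

```python
# Codecs that the audio module has coefficients for (hypothetical identifiers).
NATIVE_AUDIO_CODECS = {"aaclc", "heaac", "ac3"}

# Opus has no coefficients of its own and reuses those of HE-AAC.
AUDIO_COEFFICIENT_FALLBACK = {"opus": "heaac"}


def coefficient_set_for(codec: str) -> str:
    """Return the identifier of the coefficient set to use for an audio codec."""
    key = codec.lower()
    key = AUDIO_COEFFICIENT_FALLBACK.get(key, key)
    if key not in NATIVE_AUDIO_CODECS:
        raise ValueError(f"unsupported audio codec: {codec!r}")
    return key


if __name__ == "__main__":
    print(coefficient_set_for("Opus"))
    print(coefficient_set_for("aaclc"))
```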