Video Quality Assessment
As part of Eyevinn's initiative to share our knowledge around quality, we continue by addressing video quality assessment, from both a subjective and an objective point of view.
In the first article we discussed QoE (Quality of Experience), its definition, and its different impact factors. We described how QoE grew out of the aim to evaluate perceptual quality from an engineering point of view, but we did not discuss how such an assessment can actually be performed. In this article we address the different possibilities.
Looking at the whole video communication chain, there are many stages that influence the perceived quality, or QoE, e.g. capturing, compression, transmission, reconstruction and display. It is then natural to want to estimate this degradation, and we call this Video Quality Assessment (VQA). There are two fundamental ways to do this, subjective or objective quality assessment, and we give an overview of both.
Subjective Quality Assessment
The legitimate judges of visual quality are humans as end users, and their opinions can be obtained through subjective experiments. Subjective experiments involve a panel of participants, usually non-experts referred to as test subjects, who assess the perceptual quality of given test material such as a set of video sequences. Subjective experiments are typically conducted in a controlled laboratory environment, even though crowdsourcing-based quality assessment has shown promising correlation with laboratory-based testing.
A laboratory experiment requires careful planning, and several factors, including the assessment method, selection of test material, viewing conditions, grading scale, and timing of presentation, must be considered before the experiment. To give guidelines and enable comparison between subjective experiments, different recommendations have been standardized. For example, Recommendation ITU-R BT.500 [1] and ITU-T P.910 [2] provide detailed guidelines for conducting various types of subjective quality experiments. These types comprise single stimulus, double stimulus and multi stimulus methods. In single stimulus methods, the subjects are shown variants of the test videos and no reference is provided for comparison, e.g. Absolute Category Rating (ACR). In some situations a hidden reference can be included, but the assessment is still based only on no-reference scoring by the subjects, e.g. Absolute Category Rating with Hidden Reference (ACR-HR). In double stimulus methods, a pair of videos comprising the reference video and a degraded video is presented once or twice, and the subject rates the quality, or the change in quality, between the two video streams, e.g. Degradation Category Rating (DCR) or Double Stimulus Impairment Scale (DSIS). There is also a third class, multi stimulus methods, e.g. Subjective Assessment of Multimedia Video Quality (SAMVIQ), ITU-R Rec. BT.1788 [3], where the subject rates the quality of several test videos, including a reference and a hidden reference, and is allowed to view the videos multiple times.
The outcomes of a subjective experiment are the individual scores given by the test subjects. These scores are used to compute a Mean Opinion Score (MOS) or a Differential Mean Opinion Score (DMOS), depending on how the experiment is designed. The main difference is that MOS is the outcome when subjects rate a stimulus in isolation, while for DMOS the change in quality between two versions of the same stimulus is rated; e.g. MOS is used for ACR and SAMVIQ, while DMOS is used for ACR-HR and DSIS.
The obtained scores (MOS, DMOS) then represent the ground truth of the subjective quality assessment, but are also used as input for the development of different objective quality metrics.
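To make this concrete, here is a minimal sketch, assuming a 5-point ACR scale (5 = Excellent, 1 = Bad) and the ACR-HR differential scoring convention from ITU-T P.910, of how MOS and DMOS can be computed from raw scores; the data and variable names are our own.

```python
import numpy as np

# Raw opinion scores on a 5-point ACR scale (5 = Excellent, 1 = Bad),
# one entry per test subject.
scores_processed = np.array([4, 3, 4, 5, 3, 4])   # degraded video (PVS)
scores_reference = np.array([5, 5, 4, 5, 4, 5])   # hidden reference

# MOS: the mean of the individual opinion scores for one stimulus.
mos = scores_processed.mean()

# DMOS (ACR-HR convention, ITU-T P.910): per-subject differential
# viewer scores, shifted so that a PVS rated equal to its hidden
# reference lands at 5.
dv = scores_processed - scores_reference + 5
dmos = dv.mean()

print(f"MOS  = {mos:.2f}")
print(f"DMOS = {dmos:.2f}")
```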
Objective Quality Assessment
Due to the time-consuming nature of executing subjective experiments, large efforts have been made to develop objective quality metrics, alternatively called objective quality methods. The purpose of such objective quality methods is to automatically predict MOS with high accuracy. Objective quality methods may be classified into psychophysical and engineering approaches [4].
Psychophysical metrics aim at modeling the human visual system (HVS) using aspects such as contrast and orientation sensitivity, frequency selectivity, spatial and temporal patterns, masking, and color perception. These metrics can be applied to a wide variety of video degradations, but the computation is generally demanding, so they are usually not used in practice in the streaming context.
Engineering metrics are usually simplified metrics based on the extraction and analysis of certain features or artifacts in a video. They do not necessarily disregard the attributes of the HVS, as they often consider psychophysical effects as well, but the conceptual basis of their design is the analysis of video content and distortions rather than fundamental vision modeling. A set of features or quality-related parameters of a video is pooled together to form an objective quality method whose output can be mapped to a predicted MOS.
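As an illustration of these two steps, temporal pooling followed by a fitted mapping to MOS, consider the sketch below; the per-frame feature values and the logistic coefficients are entirely hypothetical, though a logistic mapping is a common choice when validating metrics against subjective data.

```python
import numpy as np

# Hypothetical per-frame quality-related feature (e.g. a distortion
# measure computed for each frame of a video).
per_frame_feature = np.array([0.81, 0.79, 0.85, 0.90, 0.83])

# Step 1: temporal pooling; a plain mean is the simplest choice.
pooled = per_frame_feature.mean()

# Step 2: map the pooled feature to a predicted MOS, here with a
# logistic function. The coefficients a, b, c are placeholders; in
# practice they are fitted against subjective MOS data.
a, b, c = 4.0, 12.0, 0.8
predicted_mos = 1.0 + a / (1.0 + np.exp(-b * (pooled - c)))

print(f"pooled feature = {pooled:.3f}, predicted MOS = {predicted_mos:.2f}")
```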
Depending on the degree of information that is available from the original video as a reference, the objective methods are further divided into full reference (FR), reduced reference (RR), and no-reference (NR) methods, as follows (a small interface sketch follows the list):
- FR methods: With this approach, the entire original video is available as a reference. Accordingly, FR methods are based on comparing a distorted video with the original video.
- RR methods: In this case, access to the original video is not required; only representative features describing the characteristics of the original video are provided. Comparing this reduced information with the corresponding information extracted from the distorted video provides the input for RR methods.
- NR methods: This class of objective quality methods does not require access to the original video, but searches for artifacts in the pixel domain of a video, utilizes information embedded in the bitstream of the related video format, or performs quality assessment as a hybrid of pixel-based and bitstream-based approaches.
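In practice, the difference between the three classes is simply what each one needs as input. The function signatures below are a minimal sketch of our own, not from any standard or library, just to make the distinction concrete.

```python
import numpy as np
from typing import Sequence

def fr_score(reference: np.ndarray, distorted: np.ndarray) -> float:
    """Full reference: needs every pixel of the original video."""
    ...

def rr_score(reference_features: Sequence[float], distorted: np.ndarray) -> float:
    """Reduced reference: needs only a compact feature set extracted
    from the original, e.g. transmitted over a side channel."""
    ...

def nr_score(distorted: np.ndarray) -> float:
    """No reference: judges the distorted video (pixels and/or
    bitstream) on its own."""
    ...
```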
Pixel-based metrics
The most used engineering metrics are the pixel-based metrics, of which Peak Signal to Noise Ratio (PSNR) is the most widely used measure. PSNR is a full reference method; it is easily calculated and gives a measure, on a logarithmic scale, of the fidelity between an original and a degraded frame. Even though PSNR's limitations in adequately reflecting human quality perception in specific situations are well documented, it has its benefits. In many situations it can also be an advantage to look at PSNR-Y only, since the chroma components are typically subsampled in 4:2:0 color formats. Further, to adapt towards human perception, modifications of PSNR that add properties of human perception have been proposed in PSNR-HVS [5] and PSNR-HVS-M [6]. PSNR-HVS determines the PSNR of a DCT-transformed version of the frame, weighted by a contrast sensitivity function, and PSNR-HVS-M extends this with an additional masking model.
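PSNR is simple enough to show in full. A minimal sketch, assuming 8-bit frames loaded as NumPy arrays, follows directly from the definition via the mean squared error; applied to the luma plane only, it gives PSNR-Y.

```python
import numpy as np

def psnr(original: np.ndarray, degraded: np.ndarray, max_value: float = 255.0) -> float:
    """Peak Signal to Noise Ratio between two frames, in dB.
    Assumes 8-bit samples (max_value = 255) unless stated otherwise."""
    mse = np.mean((original.astype(np.float64) - degraded.astype(np.float64)) ** 2)
    if mse == 0:
        return float("inf")  # identical frames
    return 10.0 * np.log10(max_value ** 2 / mse)

# Applied to the Y (luma) plane only, this gives PSNR-Y.
```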
Another highly popular full reference metric is the Structural SIMilarity index (SSIM) introduced by Wang et al. [7]. SSIM considers image degradations as changes in structural information, combining measures of the distortion in luminance, contrast and structure between an original and a degraded frame. SSIM was extended to video in the Video Structural SIMilarity index (VSSIM), where SSIM values are calculated for all frames, but in the pooling stage the averaging is weighted based on the motion between consecutive frames [8].
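In practice SSIM is rarely implemented from scratch; a minimal per-frame usage sketch, assuming scikit-image is installed and that the frames are 8-bit grayscale NumPy arrays (the synthetic frames here are stand-ins for decoded video frames), could look like this.

```python
import numpy as np
from skimage.metrics import structural_similarity

# Synthetic 8-bit grayscale frames; in a real pipeline these would be
# decoded frames from the original and the degraded video.
original = np.random.randint(0, 256, (720, 1280), dtype=np.uint8)
noise = np.random.randint(-8, 9, original.shape)
degraded = np.clip(original.astype(int) + noise, 0, 255).astype(np.uint8)

# SSIM close to 1 means the frames are nearly identical;
# data_range must match the bit depth of the samples.
score = structural_similarity(original, degraded, data_range=255)
print(f"SSIM = {score:.3f}")
```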
Other objective quality metrics
One psychophysical metric that has been used both in industry and in academia is Opticom's Perceptual Evaluation of Video Quality (PEVQ), standardized in ITU-T J.247 Annex B [9]. PEVQ is a full reference metric that, after temporal and spatial alignment, classifies distortions based on measures of the perceptual differences in the luminance and chrominance domains between corresponding frames. Together with temporal information, this is then aggregated to form the final result. An updated version for HD, PEVQ-HD, has also been presented and was evaluated by the VQEG HD project in 2010. In 2011, Swissqual's VQuad-HD was standardized in ITU-T J.341 [10]. VQuad-HD is also an FR psychophysical metric, targeting HD content. The model aligns the reference and degraded signals after pre-processing, including noise removal and sub-sampling; after this, the spatial and temporal perceptual degradations are evaluated and a score is predicted.
The latest engineering metric to have been adopted by the streaming community is Netflix's Video Multi-method Assessment Fusion (VMAF), targeting coding distortions. VMAF is also a full reference metric that takes up/down scaling into account and relies on the image metrics Visual Information Fidelity (VIF) [11] and Detail Loss Metric (DLM) [12], together with the temporal difference between consecutive frames. The final score is the output of a Support Vector Machine (SVM) regressor that has been trained on subjective tests performed by Netflix on a representative subset of their catalog. The VMAF framework is general and allows others to retrain it for their own use case, with the inevitable consequence of losing the ability to compare scores with others, should that be desired.
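VMAF does not have to be computed from scratch: a common route is FFmpeg's libvmaf filter. The sketch below, assuming an FFmpeg build with libvmaf enabled and the hypothetical file names distorted.mp4 and reference.mp4, wraps the call from Python; the filter expects the distorted (main) video as the first input.

```python
import subprocess

# Assumes an FFmpeg build configured with --enable-libvmaf.
# The libvmaf filter takes the distorted (main) video as the first
# input and the reference as the second.
cmd = [
    "ffmpeg",
    "-i", "distorted.mp4",   # hypothetical file names
    "-i", "reference.mp4",
    "-lavfi", "libvmaf=log_fmt=json:log_path=vmaf.json",
    "-f", "null", "-",
]
subprocess.run(cmd, check=True)
# The per-frame and pooled VMAF scores end up in vmaf.json.
```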
Content Independent Metrics
As described in the previous article, QoS can be seen as a contributor to QoE. This can be used as a priori information for content independent metrics, also known as parametric models, that extract information about the quality of a streaming session. This is especially useful for service providers in a point-to-multipoint access network, where collecting information about every frame for every user becomes impractical. Knowing the codec and its designed behavior, content independent metrics based on, for example, latency, packet loss, frame size, and frame type can reveal QoE-related information.
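As a purely illustrative sketch (the indicator, weights and thresholds below are hypothetical and not taken from any standardized parametric model), such session-level parameters can be combined into a rough quality estimate without ever touching the pixels.

```python
# Purely illustrative parametric indicator; the weights and form are
# hypothetical, not taken from any standardized model.
def session_quality_estimate(bitrate_kbps: float,
                             packet_loss_pct: float,
                             rebuffer_ratio: float) -> float:
    """Map session-level QoS parameters to a rough 1-5 quality score."""
    score = 5.0
    score -= 20.0 * rebuffer_ratio                         # stalling hurts the most
    score -= 0.5 * packet_loss_pct                         # losses cause visible artifacts
    score -= 1.5 * max(0.0, 1.0 - bitrate_kbps / 3000.0)   # starved bitrate
    return max(1.0, min(5.0, score))

print(session_quality_estimate(bitrate_kbps=2500, packet_loss_pct=0.3, rebuffer_ratio=0.01))
```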
This article was updated on April 12, 2018, with some clarifications regarding PEVQ, the addition of VQuad-HD, and additional references.
References
[1] Methodology for the subjective assessment of the quality of television pictures. Standard ITU-R BT.500, revision 13. ITU-R, January 2012.
[2] Subjective video quality assessment methods for multimedia applications. Standard ITU-T P.910, revision 3. ITU-T, April 2008.
[3] Methodology for the subjective assessment of video quality in multimedia applications. Standard ITU-R BT.1788, revision 1. ITU-R, January 2007.
[4] H. R. Wu and K. R. Rao, Digital Video Image Quality and Perceptual Coding, Signal Processing and Communications, CRC Press, Boca Raton, 2005.
[5] K. Egiazarian, J. Astola, N. Ponomarenko, V. Lukin, F. Battisti, M. Carli, New full-reference quality metrics based on HVS, CD-ROM Proceedings of the Second International Workshop on Video Processing and Quality Metrics, Scottsdale, USA, 2006.
[6] N. Ponomarenko, F. Silvestri, K. Egiazarian, M. Carli, J. Astola, V. Lukin, On between-coefficient contrast masking of DCT basis functions, CD-ROM Proceedings of the Third International Workshop on Video Processing and Quality Metrics for Consumer Electronics VPQM-07, Scottsdale, Arizona, USA, January 2007.
[7] Z. Wang, A. C. Bovik, H. R. Sheikh, and E. P. Simoncelli, Image quality assessment: From error visibility to structural similarity, IEEE Transactions on Image Processing, vol. 13, pp. 600–612, April 2004.
[8] Z. Wang, L. Lu and A. C. Bovik, Video quality assessment based on structural distortion measurement, Signal Processing: Image Communication, Special issue on objective video quality metrics, vol. 19, no. 2, February 2004.
[9] Objective perceptual multimedia video quality measurement in the presence of a full reference. Standard ITU-T Rec. J.247, 2008.
[10] Objective perceptual multimedia video quality measurement of HDTV for digital cable television in the presence of a full reference. Standard ITU-T Rec. J.341, 2011.
[11] H. Sheikh and A. Bovik, Image Information and Visual Quality, IEEE Transactions on Image Processing, vol. 15, no. 2, pp. 430–444, Feb. 2006.
[12] S. Li, F. Zhang, L. Ma, and K. Ngan, Image Quality Assessment by Separately Evaluating Detail Losses and Additive Impairments, IEEE Transactions on Multimedia, vol. 13, no. 5, pp. 935–949, Oct. 2011.