As video streaming grows, there are strong reasons to optimize visual streaming quality and, at the same time, lower CDN costs. This can be done by measuring the visual quality of the transcoding outputs to make sure no bandwidth is wasted.
As more and more streaming services want to compete in the premium segment, video quality is a prime thing to optimize to please the film nerds.
In this article we’ll walk through the use of VMAF for measuring the perceptual visual quality of transcoded video.
Wasting bits? Start measuring your output!
At Eyevinn, we have seen many examples of streaming services that either have poor video quality or waste a lot of bandwidth because they still don’t measure the visual quality of the outputs, and therefore use higher bitrates than needed.
When articles started to pop up in November 2019 around the release of Apple TV+ and its, in some cases, ridiculously high bitrates (like this article from Flatpanels HD), one question that came to mind was whether this was a pure marketing job from Apple, or whether they just don’t measure their output quality at all.
Could it be a direct marketing answer to the Netflix tech blog, where Netflix has written articles like this one about how much effort they put into optimizing the video encoding pipeline to keep bitrates as low as possible without ruining the video quality?
Apple, as a company, has a long history in video compression with iTunes movies, as well as its own compression apps Compressor and QuickTime. Apple should therefore have the right competence to make efficient video encodings without wasting bits.
How do you know if you are wasting bits? Well, it is as easy as starting to measure the output video quality in a way that correlates with the human visual system. To do this at scale, Netflix developed a great tool called VMAF (https://github.com/Netflix/vmaf).
Netflix describes VMAF as “a perceptual video quality assessment algorithm”, which means that it is not just a PSNR comparison, but rather a mix of digital metrics and machine-learned models trained on “real” human eyeballs.
As of today, VMAF mainly comes with three trained models:
- HD: Viewing HD content on a big screen at a distance of 3 times the height of the display
- 4K: Viewing 4K UHD content on a big screen at a distance of 1.5 times the height of the display
- Phone: Viewing on a phone or smaller device at a “normal” distance
VMAF values span from 0 to 100, where 0 is worst and 100 is best. It is hard to find information on what VMAF value to target, but Real Networks wrote a white paper on the subject and found that VMAF 93 is a good target, although everything with an average above 90 will look good. The question is how low VMAF goes in the more complex parts of the video. This is why it is interesting to measure shorter chunks rather than an entire title.
Since the models are trained on Netflix content, there might be reasons to train your own models if your content differs a lot from Netflix’s, for example if you mostly have sports in your service, or “conference content” like TED Talks.
But it’s better to start with the models that come with VMAF than not to measure at all, so let’s get started!
VMAF — Best practice
Depending on the goal of the measurement, there are two main sources of degradation that can be of interest, and where optimizations can be done: video encoding artefacts and scaling artefacts.
- First of all, you need to have access to both the source (reference) video file and the compressed (distorted) video file that should be measured
- The reference file and the distorted file must have the same resolution as raw video. So, if the encoded files are downscaled, they must be upscaled back to the reference resolution without adding any codec artefacts. In some cases, you may instead want to downscale the reference to match the distorted file (see below)
- Files must be frame accurate
- Files must have the same frame rate
- If the reference file is interlaced and the distorted files are progressive, the reference has to be de-interlaced as well
- If the reference file is interlaced and the distorted files have been interpolated into double frame rate, the reference has to be de-interlaced into double frame rate as well
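The checklist above can be sketched as a small validation helper. The function name and the metadata keys below are illustrative assumptions (e.g. values you could parse out of ffprobe output), not part of VMAF itself:

```python
# Illustrative sketch: validate a reference/distorted pair against the
# checklist above. The keys (width, height, fps, interlaced) are
# hypothetical metadata fields, e.g. as parsed from ffprobe output.

def check_pair(ref, dist):
    issues = []
    if (ref["width"], ref["height"]) != (dist["width"], dist["height"]):
        issues.append("resolutions differ: scale one side before measuring")
    if ref["fps"] != dist["fps"]:
        issues.append("frame rates differ: convert the reference first")
    if ref.get("interlaced") and not dist.get("interlaced"):
        issues.append("reference is interlaced: de-interlace it first")
    return issues

ref = {"width": 1920, "height": 1080, "fps": 25, "interlaced": True}
dist = {"width": 1280, "height": 720, "fps": 25, "interlaced": False}
print(check_pair(ref, dist))
```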
How to scale resolutions?
As mentioned above, the reference file and the distorted files must have the same resolution before entering into the VMAF measurement.
Working with adaptive streaming, the bitrate ladder will most likely contain distorted files at resolutions that differ from the reference resolution. Most of the time, the lower resolutions of the distorted files will be upscaled at playback, and in that case it is good to know which upscaling algorithm the player will use. For TV sets this might be impossible to know, and in those cases Netflix recommends the bicubic upscaling algorithm.
There are other cases where it might be useful to re-scale the source file into a reference that matches the distorted files resolution. For example, if you know that the files will be played back on a display with HD resolution, but the original source file was 4K UHD. Then it might be useful to downscale the source to an HD reference to use against the distorted files in the ladder. Make sure to use the same downscaling algorithm for the reference file as used in the distorted files.
NOTE: If you downscale the source file into a new reference file, VMAF will only measure the video encoding artefacts, not the scaling artefacts. This gives a VMAF value that is only valid when playing the distorted files at their native resolutions, so the scope for this case is rather specific.
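A sketch of what building such a reference could look like, here assembled as an FFMPEG command from Python. The file names are placeholders, and writing rawvideo output is one way to avoid adding codec artefacts; adapt as needed:

```python
import shlex

def downscale_cmd(source, reference_out, width=1920, height=1080):
    """Build an FFMPEG command that downscales a UHD source to an HD
    reference with bicubic scaling, writing uncompressed rawvideo so
    that no codec artefacts are added. Paths are placeholders."""
    return [
        "ffmpeg", "-i", source,
        "-vf", f"scale={width}:{height}:flags=bicubic",
        "-c:v", "rawvideo", reference_out,
    ]

cmd = downscale_cmd("source-uhd.mov", "reference-hd.yuv")
print(shlex.join(cmd))
```

Remember to use the same scaling algorithm here as was used when producing the distorted files.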
When measuring the different encodings to find the optimal quality for each resolution in the bitrate ladder, it is good to know which resolution the files will be played back at.
Measuring a scene or the entire title?
When measuring a long-format video, you will most likely use the average VMAF value of the entire title. The average works fine for getting an overview of the distorted files, but it is not that useful if the video contains both easy-to-encode parts and hard-to-encode parts with high complexity. Unfortunately, this is the case for almost all titles. So, the way forward is to analyze shorter parts of the video to find the hard-to-encode parts.
This makes the process a bit more complex, but briefly, scene change detection can help you find parts to measure. Since VMAF measures every frame, it is a good idea to plot the VMAF values in a graph to find the complex parts of a video. Below are two transcodes of the same title, in the same resolution but with different bitrates. Even though the average VMAF scores only differ by about two points, we can see a huge difference in the complex parts of the video, where the lower bitrate sometimes drops below VMAF 60.
Tools to use
There are several ways of using the VMAF libraries. Netflix provides a few different options, and VMAF is also implemented in both open-source and commercial products.
- The VMAF Python libraries
- vmafossexec, a C++ executable
- libvmaf, a C library that can be used from within the media framework FFMPEG
- A VMAF Dockerfile using the Python libraries
Since the VMAF libraries expect the reference and the distorted files to have the same resolution and frame rate, you need to prepare the files before measuring them. This can be done using FFMPEG or other encoding tools. When preparing the files, it is important not to add any compression artefacts, so the files must be encoded with an uncompressed codec, which makes longer formats huge in file size. But since the VMAF libraries are also available from within FFMPEG, and FFMPEG can do the scaling, frame rate conversion, de-interlacing and measurement in one command line, using the decoded video streams without re-encoding, FFMPEG is a very good tool to start with.
On some platforms the static builds of FFMPEG already have the VMAF feature enabled. If not, FFMPEG needs to be built from source and configured with
./configure --enable-libvmaf --enable-version3
Using VMAF within FFMPEG
In FFMPEG, libvmaf can be used via the filter options, such as -lavfi or -filter_complex.
Below are some commands to start off with:
ffmpeg -i distorted-file -i reference-file -lavfi libvmaf -f null -
Logging VMAF values into a JSON-file
ffmpeg -i distorted-file -i reference-file -lavfi libvmaf="log_fmt=json:log_path=/path/to/json/file.json" -f null -
VMAF values can be saved to a file, and the example above saves the values to a JSON file.
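The JSON log can then be post-processed. Note that the exact schema varies between libvmaf versions, so the structure below (a "frames" array with a per-frame "metrics" object, as in recent versions) is an assumption, and the excerpt itself is made up:

```python
import json

# Hypothetical excerpt of a libvmaf JSON log; recent versions emit a
# "frames" array with a per-frame "metrics" object.
log = json.loads("""
{"frames": [
  {"frameNum": 0, "metrics": {"vmaf": 94.2}},
  {"frameNum": 1, "metrics": {"vmaf": 91.7}},
  {"frameNum": 2, "metrics": {"vmaf": 62.3}}
]}
""")

scores = [f["metrics"]["vmaf"] for f in log["frames"]]
print(f"avg {sum(scores) / len(scores):.1f}, min {min(scores):.1f}")
```

Looking at the minimum (or worst chunk) alongside the average is what reveals the hard-to-encode parts discussed earlier.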
Scale the distorted file using filter complex
ffmpeg -i distorted-file -i reference-file-in-HD -filter_complex "[0:v]scale=1920:1080:flags=bicubic[distorted];[distorted][1:v]libvmaf" -f null -
Other open source tools
If you want a graphical interface, Videobench is an open-source solution. It is a Docker image that measures both VMAF and PSNR, as well as the bitrates of the distorted files.
The measuring is performed by FFMPEG, but with a graphical interface for adding files and getting nice, colorful graphs of the values when done.
NOTE: Make sure to use the Eyevinn fork of Videobench, since the original version had older VMAF libraries and a critical bug that has been solved. https://github.com/Eyevinn/videobench
Measuring downstream transcodings from a broadcast signal
If the reference file comes from a linear broadcast, the reference might be interlaced while the distorted files are not. This needs to be taken care of, and it can be tricky since there are many different ways of de-interlacing a signal. Make sure you know which de-interlace filter was used on the distorted encodes, and then apply the same de-interlacing to the reference file without adding encoding artefacts.
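As a sketch, the de-interlacing can be done in the same filter graph as the measurement, so no intermediate file is needed. The example below uses FFMPEG's yadif filter as one common choice of de-interlacer (your distorted encodes may have used another filter, in which case that one should be used instead), and the file names are placeholders:

```python
# Sketch: measure a progressive distorted file against an interlaced
# broadcast reference by de-interlacing the reference in the same
# filter graph. yadif is one common de-interlacer; match whatever
# filter was used on the distorted encodes. Paths are placeholders.
filter_graph = (
    "[1:v]yadif=mode=send_frame[ref];"  # de-interlace reference, same frame rate
    "[0:v][ref]libvmaf"                 # distorted vs prepared reference
)
cmd = ["ffmpeg", "-i", "distorted.mp4", "-i", "reference.ts",
       "-filter_complex", filter_graph, "-f", "null", "-"]
print(" ".join(cmd))
```

For distorted files interpolated to double frame rate, yadif's send_field mode would instead output one frame per field.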
To take control of your outputs, start measuring the files! Using VMAF is an efficient way to make sure the encodings are visually good. Today it’s quite easy to get started with VMAF, and the models are getting better all the time.
One challenge is preparing the reference file, depending on what you want to measure. Interlaced sources in particular are tricky, and unfortunately they are still present in many broadcast environments.
This blog was written by an Eyevinn Special Video-Dev Team. Bring in an Eyevinn Special Video-Dev Team when you need a unique combined skillset with experience from the entire video development domain, to add a feature to your software when you lack the right competence in-house or don’t have enough resources available to do it.
Eyevinn Technology is an independent consultant firm specialized in video and streaming. Independent in a way that we are not commercially tied to any platform or technology vendor.