Video transcoding comparison – Intel Vs AMD

Video transcoding is an integral part of any Video On Demand (VOD) service. At Mobishaala, thousands of videos are uploaded every day. Before these videos are made available for playback, they must first be converted into different formats, bitrates and resolutions such as 1080p, 720p or 360p. This process is called transcoding. It is done to provide smooth video streaming across different user devices with varying network speeds: based on the user’s network speed, the video player automatically switches between video chunks of different quality.
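To make the quality switching concrete, here is a deliberately simplified sketch of how a player might pick a rendition. The function name `pick_rendition` is hypothetical, and the thresholds are an assumption: they simply add the video bitrates used later in this post (700k, 400k, 100k) to the 64k audio track. Real players add safety margins and buffer-based logic.

```shell
#!/bin/sh
# Hypothetical sketch: choose the highest rendition whose total bitrate
# (video + audio, in kbit/s) fits within the measured bandwidth.
# Thresholds assume 700/400/100 kbit/s video plus 64 kbit/s audio.
pick_rendition() {
    bandwidth_kbps=$1
    if [ "$bandwidth_kbps" -ge 764 ]; then
        echo "720p"
    elif [ "$bandwidth_kbps" -ge 464 ]; then
        echo "360p"
    else
        echo "144p"   # lowest rendition is the fallback
    fi
}

pick_rendition 2000   # prints 720p
pick_rendition 500    # prints 360p
pick_rendition 120    # prints 144p
```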

Last year, we overhauled the video transcoding service on the Mobishaala platform to make it more efficient and to reduce its overall operating cost. The service is hosted on AWS and was originally implemented in a very crude form using multiple C5.4xlarge compute instances.

How did we improve the transcoding service?

To improve the video transcoding, we took the following steps:

  • Firstly, redesign and optimise our existing transcode pipeline.
  • Secondly, compare against, and switch to, cheaper instance options available on AWS.
    • The AWS Compute Optimized series is well suited to transcode processing, and AWS offers it in various configurations and processor types. We compared our then-current Intel configuration against other, cheaper instance types.

In the latter part of this article, I present the video transcode comparison conducted on different AWS Compute instance types: the C5 (Intel) Vs C5a (AMD) series.

C5 and C5d instances feature either the 1st or 2nd generation Intel Xeon Platinum 8000 series processor (Skylake-SP or Cascade Lake) with a sustained all core Turbo CPU clock speed of up to 3.6 GHz. 

C5a instances, by contrast, feature custom 2nd generation AMD EPYC 7002 series processors running at up to 3.3 GHz, built on a 7nm process node for increased efficiency. In addition, C5a instances deliver leading x86 price-performance through a combination of high performance processing and 10% lower cost.

Amazon

What did we use for the comparison?

For all video transcode comparisons, the following video file specifications were considered:

1080p raw video files, captured from a video camera:
  • 1920×1080 resolution
  • Timecode, H.264, AAC, stereo channel
  • Video 1:
    • Duration: 30 min
    • File Size: 4 GB
  • Video 2:
    • Duration: 10 min
    • File Size: 1.2 GB

720p video file, captured from our live classroom recording:
  • Video 3:
    • 1280×720 resolution
    • H.264 encoded, AAC, stereo channel
    • Duration: 50 min 30 sec
    • File Size: 608.4 MB

1- Redesign and optimisation of video transcode pipeline:

Prior to the optimisation, this service generated the transcoded videos at different bitrates one after another. For transcoding, we used FFmpeg, a well-known free and open-source tool that provides libraries for audio/video processing. Being a command-line tool, it is also easy to integrate with backend scripts.

Generation of different resolution videos (720p, 360p, 144p)

720p:

ffmpeg -i video.mp4 -r 24 -c:a aac -ac 2 -b:a 192k -ar 48000 -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -b:v 700k -maxrate 700k -bufsize 1000k -vf 'scale=trunc(oh*a/2)*2:720' ./screenshot/temp_720.mp4

360p:

ffmpeg -i video.mp4 -r 24 -c:a aac -ac 2 -b:a 64k -ar 22050 -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -b:v 400k -maxrate 400k -bufsize 400k -vf 'scale=trunc(oh*a/2)*2:360' ./screenshot/temp_360.mp4

144p:

ffmpeg -i video.mp4 -r 24 -c:a aac -ac 2 -b:a 64k -ar 22050 -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -b:v 100k -maxrate 100k -bufsize 150k -vf 'scale=trunc(oh*a/2)*2:144' ./screenshot/temp_144.mp4
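The three commands above differ only in a handful of parameters, so the sequential pipeline can be sketched as a loop over a small parameter table. This is an illustrative wrapper, not our production script: the function name `transcode_sequential` and the `DRY_RUN` switch are assumptions added for the example. With `DRY_RUN=1` (the default here) it only prints the commands.

```shell
#!/bin/sh
# Sketch of the sequential (pre-optimisation) pipeline as a loop.
# Table columns: height video_bitrate maxrate bufsize audio_bitrate audio_rate
# (values copied from the three commands above).
# DRY_RUN=1 only prints the commands; set DRY_RUN=0 to actually run FFmpeg.
DRY_RUN=${DRY_RUN:-1}

transcode_sequential() {
    input=$1
    while read -r height vbr maxrate bufsize abr arate; do
        set -- ffmpeg -i "$input" -r 24 \
            -c:a aac -ac 2 -b:a "$abr" -ar "$arate" \
            -c:v libx264 -x264opts keyint=24:min-keyint=24:no-scenecut \
            -b:v "$vbr" -maxrate "$maxrate" -bufsize "$bufsize" \
            -vf "scale=trunc(oh*a/2)*2:${height}" \
            "./screenshot/temp_${height}.mp4"
        if [ "$DRY_RUN" = 1 ]; then
            printf '%s\n' "$*"
        else
            "$@" < /dev/null   # keep ffmpeg from consuming the loop's input
        fi
    done <<EOF
720 700k 700k 1000k 192k 48000
360 400k 400k 400k 64k 22050
144 100k 100k 150k 64k 22050
EOF
}

transcode_sequential video.mp4
```

Each rendition decodes the input and encodes the audio again, which is exactly the repeated work the optimisation targets.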

As mentioned earlier, we were using the C5.4xlarge instance type, which is based on the Intel Xeon Platinum 8000 series. It has the following configuration:

c5.4xlarge (Intel 1st or 2nd gen, 3.4 GHz): 16 vCPU, 32 GiB memory, up to 10 Gbps network bandwidth, 4750 Mbps EBS bandwidth
Intel Xeon Platinum 8000

Before optimization, transcode time on C5.4x instance (Intel Xeon Platinum 8000):

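| Video file | Original resolution | Duration      | Total transcode time to generate 720p, 360p, 144p | Transcoded file size |
|------------|---------------------|---------------|---------------------------------------------------|----------------------|
| Video 1    | 1080p (4 GB)        | 34 min 26 sec | 594 sec                                           | 317 MB               |
| Video 2    | 1080p (1.2 GB)      | 10 min        | 180 sec                                           | 102 MB               |
| Video 3    | 720p                | 50 min 30 sec | 510 sec                                           | 558 MB               |

C5.4xlarge instance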

Transcode pipeline optimizations implemented:

  • Firstly, you may have noticed that we were generating the 3 resolution videos (720p, 360p, 144p) sequentially for each input video. The obvious first step was to move to parallel transcoding as much as possible.
  • To further speed up the transcoding process, we tried a few FFmpeg tweaks:
    • Changed the preset to fast (default is medium).
    • Changed the constant rate factor (crf) to 20 (default is 23).
    • Kept the frame rate at 24.
  • Lastly, audio transcoding is the slowest step, as it is not multithreaded, and we were transcoding the audio 3 times per video (bitrates: 192 kbps, 64 kbps, 64 kbps). As part of the optimisation, we modified the command to use a single transcoded audio version, with a bitrate of 64 kbps and a sampling rate of 44100 Hz, for all the transcoded versions. We do this by first generating a single audio file and then stream-copying it into the three versions, which saves further processing time.

While considering these optimisations, our main criteria were:

  • Firstly, there should not be any significant degradation in the quality of the generated video / audio files.
  • Also, the transcoded file size should not increase significantly; a smaller file size is always welcome.

Optimised FFmpeg transcode command:

1- First generate audio

ffmpeg -y -i video.mp4 -vn -ar 44100 -ac 2 -b:a 64k output.aac 

2- Generate the required ABR video resolutions in parallel

ffmpeg -y -i video.mp4 -i output.aac -filter_complex "[0]split=3[v0][v1][v2];[v0]scale=trunc(oh*a/2)*2:144[low];[v1]scale=trunc(oh*a/2)*2:360[mid];[v2]scale=trunc(oh*a/2)*2:720[high]" \
        -map '[high]' -map 1:a -c:a copy -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -r 24 -b:v 700k -maxrate 700k -bufsize 1000k -preset fast -crf 20 ./x_720.mp4 \
        -map '[mid]' -map 1:a -c:a copy -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -r 24 -b:v 400k -maxrate 400k -bufsize 400k -preset fast -crf 20 ./x_360.mp4 \
        -map '[low]' -map 1:a -c:a copy -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -r 24 -b:v 100k -maxrate 100k -bufsize 150k -preset fast -crf 20 ./x_144.mp4
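This post does not describe how the transcode times were collected, so the following is only one simple way to do it. The helper name `timed` is hypothetical; it wraps any command and reports the wall-clock time in whole seconds, using a stand-in `sleep 1` where the FFmpeg call would go.

```shell
#!/bin/sh
# Hypothetical helper: run a command and report its wall-clock duration.
# Resolution is one second, which is enough for runs lasting minutes.
timed() {
    label=$1
    shift
    start=$(date +%s)
    "$@"
    status=$?
    end=$(date +%s)
    echo "$label: $((end - start)) sec (exit status $status)"
    return "$status"
}

# Stand-in command; replace `sleep 1` with the two FFmpeg steps above.
timed "demo" sleep 1
```

Timing the audio step and the video step separately would also show how much each one contributes to the total.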

After optimization, transcode time on C5.4x instance (Intel Xeon Platinum 8000):

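| Video file      | Duration      | Total transcode time | Transcoded file size |
|-----------------|---------------|----------------------|----------------------|
| Video 1 (1080p) | 34 min 26 sec | 368 sec              | 279 MB               |
| Video 2 (1080p) | 10 min        | 114 sec              | 96 MB                |
| Video 3 (720p)  | 50 min 30 sec | 319 sec              | 508 MB               |

C5.4x instance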

Here is the comparison between the unoptimized Vs optimized transcode pipelines, on the same C5.4xlarge (Intel) instance:

Video transcode comparison on Intel instances - C5.4xlarge

So, the optimized pipeline already takes 36% – 38% less time than our original implementation, which is a huge improvement.
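The percentages can be reproduced from the two tables above. The small script below is just that arithmetic (the function name `reduction` is made up for the example); it also checks the output file sizes against our file-size criterion.

```shell
#!/bin/sh
# Percentage reduction from a "before" value to an "after" value.
reduction() {
    awk -v before="$1" -v after="$2" \
        'BEGIN { printf "%.1f", (before - after) / before * 100 }'
}

# Columns: name, time before/after (sec), output size before/after (MB),
# copied from the unoptimized and optimized C5.4xlarge tables.
while read -r name t_before t_after s_before s_after; do
    echo "$name: time -$(reduction "$t_before" "$t_after")%, size -$(reduction "$s_before" "$s_after")%"
done <<EOF
video1 594 368 317 279
video2 180 114 102 96
video3 510 319 558 508
EOF
```

This prints time reductions of 38.0%, 36.7% and 37.5%, with output sizes shrinking by 12.0%, 5.9% and 9.0% respectively.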

2- Cheaper AWS Options:

AWS also provides other CPU instance types at a much cheaper rate, namely the AMD and Arm series. Since we faced compatibility issues with the Arm instance type, we could not include it in the comparison. However, we may revisit it in the future.

So, are these cheaper instances really better than, or at least on par with, Intel instances?

Although we tried different instance configurations, here I am showing the data for the C5a.4xlarge version for an apples-to-apples comparison. It has the following configuration:

c5a.4xlarge (AMD EPYC, 3.3 GHz): 16 vCPU, 32 GiB memory, up to 10 Gbps network bandwidth, up to 3170 Mbps EBS bandwidth
AMD 2nd gen EPYC 7002 series

Transcode time using C5a.4x instance (AMD EPYC processor):

| Video file      | Duration      | Total transcode time |
|-----------------|---------------|----------------------|
| Video 1 (1080p) | 34 min 26 sec | 314 sec              |
| Video 2 (1080p) | 10 min        | 97 sec               |
| Video 3 (720p)  | 50 min 30 sec | 290 sec              |

C5a.4x instance

Finally, here are the transcode time comparisons between AMD Vs Intel Vs Intel (unoptimized pipeline):

Video transcode comparison on Intel and AMD instances.
Intel Vs AMD
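For readers who prefer numbers to charts, the same comparison can be derived from the three tables above. The script below is a plain arithmetic sketch (the function name `compare_times` is an assumption for the example): it shows the speed-up of each optimised run over the unoptimized Intel baseline, and the time AMD saves over optimised Intel.

```shell
#!/bin/sh
# Input columns: name, unoptimized Intel, optimized Intel, optimized AMD
# (all in seconds, copied from the tables in this post).
compare_times() {
    awk '
    BEGIN { printf "%-8s %10s %10s %13s\n", "video", "intel_opt", "amd_opt", "amd_vs_intel" }
    { printf "%-8s %9.2fx %9.2fx %12.1f%%\n", $1, $2 / $3, $2 / $4, ($3 - $4) / $3 * 100 }'
}

compare_times <<EOF
video1 594 368 314
video2 180 114 97
video3 510 319 290
EOF
```

The first two columns are speed-up factors relative to the unoptimized pipeline; the last column is the percentage of transcode time saved by moving the optimised pipeline from Intel to AMD.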

Conclusion

With C5.4x (Intel) instance type and optimized FFmpeg command:

  • Transcode is roughly 1.6 times faster than the non-optimized execution on the C5.4xlarge instance.

With C5a.4x (AMD EPYC) instance type and optimized FFmpeg command:

  • Transcode is roughly 1.8 – 1.9 times faster than the non-optimized execution on the C5.4xlarge instance.
  • Transcode is 9% – 15% faster than the optimised execution on the C5.4xlarge instance.
  • In addition, the C5a.4x instance type is available at almost half the rate of the C5.4x instance type.
    • C5.4x is at $0.68/hr Vs C5a.4x at $0.37/hr.
    • As a result, just by switching to these cheaper instances, we are already saving around ₹19k – 20k per month at our current workload. These savings will increase as more video transcoding is done.

In short, AMD instances are slightly faster and, interestingly, cheaper at the same time!
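To put the two effects together, here is a rough per-video cost estimate using the hourly prices and optimised transcode times quoted in this post. It is a sketch under simplifying assumptions that are mine, not measured facts: one video is transcoded at a time and the instance is paid for only while it is transcoding. The function name `cost_per_run` is hypothetical.

```shell
#!/bin/sh
# Cost in USD of one transcode run: seconds / 3600 * hourly price.
cost_per_run() {
    awk -v secs="$1" -v price="$2" 'BEGIN { printf "%.4f", secs / 3600 * price }'
}

# Columns: name, optimized Intel time, optimized AMD time (sec).
# Prices: C5.4x at 0.68 USD/hr, C5a.4x at 0.37 USD/hr (from this post).
while read -r name intel_secs amd_secs; do
    echo "$name: Intel \$$(cost_per_run "$intel_secs" 0.68), AMD \$$(cost_per_run "$amd_secs" 0.37)"
done <<EOF
video1 368 314
video2 114 97
video3 319 290
EOF
```

Under these assumptions, Video 1 costs about $0.0695 to transcode on Intel and about $0.0323 on AMD, so the lower hourly price contributes far more to the saving than the shorter run time.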
