Video transcoding is an integral part of any Video On Demand (VOD) service. At Mobishaala, thousands of videos are uploaded every day. Before these videos can be played back by users, they must first be converted into different formats, bitrates and resolutions, such as 1080p, 720p or 360p. This process is called transcoding. Above all, it ensures smooth video streaming across user devices with varying network speeds: based on the user's network speed, the video player automatically switches between video chunks of different quality.
Last year, we overhauled the video transcoding service on the Mobishaala platform to make it more efficient and, at the same time, reduce the overall operating cost. The service is hosted on AWS and was implemented in a very crude form using multiple c5.4xlarge compute instances.
How did we improve the video transcoding service?
To improve video transcoding, we took the following steps:
- Firstly, redesign and optimize our existing transcode pipeline.
- Secondly, compare and switch to other, cheaper AWS instance options.
- The AWS compute-optimized (C) instance family is well suited to transcoding, and AWS offers it in various configurations and processor types. We compared our then-current Intel configuration against other, cheaper instance types.
In the latter part of this article, I present a video transcoding comparison across AWS compute instance types: the C5 (Intel) series vs. the C5a (AMD) series.
C5 and C5d instances feature either the 1st or 2nd generation Intel Xeon Platinum 8000 series processor (Skylake-SP or Cascade Lake) with a sustained all core Turbo CPU clock speed of up to 3.6 GHz.
Meanwhile, C5a instances feature custom 2nd generation AMD EPYC 7002 series processors running at up to 3.3 GHz, built on a 7nm process node for increased efficiency. In addition, C5a instances deliver leading x86 price-performance through a combination of high-performance processing and 10% lower cost.
Source: Amazon
What did we use for the video transcode comparison?
For all comparisons, we used the following video files:
1080p raw video files, captured from a video camera:
- 1920×1080 resolution
- Timecode, H.264 video, AAC stereo audio
Video 1:
- Duration: 30 min
- File Size: 4 GB
Video 2:
- Duration: 10 min
- File Size: 1.2 GB
720p video file, taken from one of our live classroom recordings:
Video 3:
- 1280×720 resolution
- H.264 video, AAC stereo audio
- Duration: 50 min 30 sec
- File Size: 608.4 MB
1- Redesign and optimization of the video transcode pipeline:
Before the optimization, this service generated the different bitrate versions sequentially, one after another. For transcoding, we used FFmpeg, a well-known free and open-source tool that provides libraries for audio and video processing. Because it is a command-line tool, it is also easy to integrate with backend scripts.
FFmpeg commands for video transcoding
Generating the different resolution videos (720p, 360p, 144p):
720p:
ffmpeg -i video.mp4 -r 24 -c:a aac -ac 2 -b:a 192k -ar 48000 -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -b:v 700k -maxrate 700k -bufsize 1000k -vf 'scale=trunc(oh*a/2)*2:720' ./screenshot/temp_720.mp4
360p:
ffmpeg -i video.mp4 -r 24 -c:a aac -ac 2 -b:a 64k -ar 22050 -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -b:v 400k -maxrate 400k -bufsize 400k -vf 'scale=trunc(oh*a/2)*2:360' ./screenshot/temp_360.mp4
144p:
ffmpeg -i video.mp4 -r 24 -c:a aac -ac 2 -b:a 64k -ar 22050 -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -b:v 100k -maxrate 100k -bufsize 150k -vf 'scale=trunc(oh*a/2)*2:144' ./screenshot/temp_144.mp4
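To show why the sequential approach is slow, here is a minimal bash sketch of the original pipeline written as a loop. It assumes the same parameters as the three commands above. The function name `transcode_sequential`, the `RENDITIONS` array and the `FFMPEG` override are hypothetical and added only for illustration. Each iteration is a separate FFmpeg run, so every rendition decodes the whole input again and re-encodes the audio again.

```bash
#!/usr/bin/env bash
# Hypothetical sketch of the pre-optimization pipeline: one ffmpeg run per rendition.
# Set FFMPEG=echo to print the commands instead of executing them.

# Per rendition: height, video bitrate (also used as maxrate), VBV buffer, audio bitrate, audio sample rate
RENDITIONS=(
  "720 700k 1000k 192k 48000"
  "360 400k 400k 64k 22050"
  "144 100k 150k 64k 22050"
)

transcode_sequential() {
  local input="$1" outdir="$2"
  local r height vbit buf abit arate
  for r in "${RENDITIONS[@]}"; do
    read -r height vbit buf abit arate <<< "$r"
    # Each run decodes the full input again and re-encodes the audio again.
    "${FFMPEG:-ffmpeg}" -i "$input" -r 24 \
      -c:a aac -ac 2 -b:a "$abit" -ar "$arate" \
      -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' \
      -b:v "$vbit" -maxrate "$vbit" -bufsize "$buf" \
      -vf "scale=trunc(oh*a/2)*2:${height}" \
      "${outdir}/temp_${height}.mp4" || return 1
  done
}
```

For example, `transcode_sequential video.mp4 ./screenshot` runs the three commands one after another. `FFMPEG=echo transcode_sequential video.mp4 ./screenshot` only prints them.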
As mentioned earlier, we were using the c5.4xlarge instance type, which runs on Intel Xeon Platinum 8000 series processors. It has the following configuration:
c5.4xlarge (Intel 1st or 2nd gen, 3.4 GHz) | 16 vCPU, 32 GiB, up to 10 Gbps network bandwidth, 4750 Mbps EBS bandwidth |
Transcode times before optimization, on c5.4xlarge (Intel Xeon Platinum 8000):
Video file | Original Resolution | Duration | Total Transcode time to generate 720p, 360p, 144p | Transcoded file size |
Video 1 | 1080p (4 GB) | 34 min 26 sec | 594 sec | 317 MB |
Video 2 | 1080p (1.2 GB) | 10 min | 180 sec | 102 MB |
Video 3 | 720p (608.4 MB) | 50 min 30 sec | 510 sec | 558 MB |
Video transcode pipeline optimizations we implemented:
- Firstly, you may have noticed that we were generating the three resolution videos (720p, 360p, 144p) sequentially for each input video. The obvious first step was to move to parallel transcoding wherever possible.
- To speed up the transcoding process further, we tried a few FFmpeg tweaks:
- Changed the preset to fast (default: medium).
- Set the constant rate factor (CRF) to 20 (default: 23).
- Kept the frame rate at 24 fps.
- Lastly, audio transcoding was the slowest step because it is not multithreaded, and we were transcoding the audio three times per video (at 192 kbps, 64 kbps and 64 kbps). As part of the optimization, we now encode a single audio version once (64 kbps, 44,100 Hz) and stream-copy it into all three transcoded versions. This saves further processing time.
When evaluating these optimizations, our main criteria were:
- Firstly, there should be no significant degradation in the quality of the generated video and audio files.
- Secondly, the transcoded file size should not vary too much, although a smaller file size is always welcome.
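The article does not say how these criteria were checked. One possible approach, sketched below in bash under that assumption, compares each optimized rendition against the pre-optimization rendition of the same height. Quality is measured with FFmpeg's `ssim` filter, and file sizes are compared directly. The function names and the `FFMPEG` override are hypothetical. Both renditions come from the same `scale=trunc(oh*a/2)*2:<height>` filter, so their frame sizes should match, which the `ssim` filter requires.

```bash
#!/usr/bin/env bash
# Hypothetical quality and size checks: optimized rendition vs. the old rendition.

# Run FFmpeg's ssim filter on the two videos; the SSIM summary is logged to stderr.
# Scores close to 1.0 mean the two videos are nearly identical.
compare_ssim() {
  local optimized="$1" reference="$2"
  "${FFMPEG:-ffmpeg}" -hide_banner -i "$optimized" -i "$reference" \
    -lavfi "[0:v][1:v]ssim" -f null -
}

# Print the size change of the optimized file relative to the reference, in percent.
size_change_pct() {
  local new_bytes old_bytes
  new_bytes=$(wc -c < "$1") || return 1
  old_bytes=$(wc -c < "$2") || return 1
  awk -v n="$new_bytes" -v o="$old_bytes" 'BEGIN { printf "%.1f\n", (n - o) / o * 100 }'
}
```

For example, `compare_ssim x_720.mp4 ./screenshot/temp_720.mp4` checks quality, and `size_change_pct x_720.mp4 ./screenshot/temp_720.mp4` checks size. Together they cover both criteria for the 720p rendition.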
Optimized FFmpeg commands for video transcoding:
1- First, generate the audio:
ffmpeg -y -i video.mp4 -vn -ar 44100 -ac 2 -b:a 64k output.aac
2- Then, generate the required ABR (adaptive bitrate) video resolutions in parallel:
ffmpeg -y -i video.mp4 -i output.aac -filter_complex "[0]split=3[v0][v1][v2];[v0]scale=trunc(oh*a/2)*2:144[low];[v1]scale=trunc(oh*a/2)*2:360[mid];[v2]scale=trunc(oh*a/2)*2:720[high]" \
-map '[high]' -map 1:a -c:a copy -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -r 24 -b:v 700k -maxrate 700k -bufsize 1000k -preset fast -crf 20 ./x_720.mp4 \
-map '[mid]' -map 1:a -c:a copy -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -r 24 -b:v 400k -maxrate 400k -bufsize 400k -preset fast -crf 20 ./x_360.mp4 \
-map '[low]' -map 1:a -c:a copy -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut' -r 24 -b:v 100k -maxrate 100k -bufsize 150k -preset fast -crf 20 ./x_144.mp4
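The same two steps can also be generated from a rendition list instead of being written out by hand, which makes it easier to add or drop a resolution. This is a hypothetical bash sketch, not our production script. The names `abr_transcode` and `RENDITIONS` and the `FFMPEG` override are illustrative. The sketch also sets `-c:a aac` explicitly, whereas the audio command above relies on FFmpeg picking the AAC encoder from the `.aac` extension.

The key idea is unchanged. The video is decoded once, `split` fans it out to one `scale` branch per rendition, and every output stream-copies the single audio track.

```bash
#!/usr/bin/env bash
# Hypothetical sketch: build the optimized two-step command from a rendition list.
# Set FFMPEG=echo to print the commands instead of executing them.

# Per rendition: height, video bitrate (also used as maxrate), VBV buffer size
RENDITIONS=("720 700k 1000k" "360 400k 400k" "144 100k 150k")

abr_transcode() {
  local input="$1" outdir="$2"
  local ff="${FFMPEG:-ffmpeg}" audio="${outdir}/output.aac"
  local split="[0]split=${#RENDITIONS[@]}" chains="" r height vbit buf i=0
  local -a outputs=()

  for r in "${RENDITIONS[@]}"; do
    read -r height vbit buf <<< "$r"
    split+="[v${i}]"                                            # one split output per rendition
    chains+=";[v${i}]scale=trunc(oh*a/2)*2:${height}[out${i}]"  # scale that branch
    outputs+=(-map "[out${i}]" -map 1:a -c:a copy
              -c:v libx264 -x264opts 'keyint=24:min-keyint=24:no-scenecut'
              -r 24 -b:v "$vbit" -maxrate "$vbit" -bufsize "$buf"
              -preset fast -crf 20 "${outdir}/x_${height}.mp4")
    i=$((i + 1))
  done

  # Step 1: encode the audio once (AAC, 64 kbps, 44.1 kHz, stereo).
  "$ff" -y -i "$input" -vn -c:a aac -ar 44100 -ac 2 -b:a 64k "$audio" || return 1
  # Step 2: decode the video once and encode all renditions in a single process.
  "$ff" -y -i "$input" -i "$audio" -filter_complex "${split}${chains}" "${outputs[@]}"
}
```

For example, `abr_transcode video.mp4 .` writes `./x_720.mp4`, `./x_360.mp4` and `./x_144.mp4`.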
Transcode times after optimization, on c5.4xlarge (Intel Xeon Platinum 8000):
Video file | Duration | Total Transcode time | Transcoded file size |
Video 1 (1080p) | 34 min 26 sec | 368 sec | 279 MB |
Video 2 (1080p) | 10 min | 114 sec | 96 MB |
Video 3 (720p) | 50 min 30 sec | 319 sec | 508 MB |
Comparing the unoptimized and optimized pipelines on the same c5.4xlarge (Intel) instance, the optimized pipeline cuts transcode time by 36% to 38%. That is a huge improvement over our original implementation.
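These percentages follow directly from the two tables above. Below is a small bash sketch of the arithmetic; the helper names are hypothetical. The time reduction is (before − after) / before, and the speed-up factor is before / after.

```bash
#!/usr/bin/env bash
# Percentage reduction in transcode time: (before - after) / before * 100
pct_less_time() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.1f\n", (b - a) / b * 100 }'
}

# Speed-up factor: how many times faster the new run is (before / after)
speedup() {
  awk -v b="$1" -v a="$2" 'BEGIN { printf "%.2f\n", b / a }'
}

# Unoptimized vs. optimized times (seconds) on c5.4xlarge, copied from the tables above
for pair in "594 368" "180 114" "510 319"; do
  read -r before after <<< "$pair"
  echo "${before}s -> ${after}s: $(pct_less_time "$before" "$after")% less time, $(speedup "$before" "$after")x faster"
done
```

Across the three videos, this prints a 36.7% to 38.0% reduction in transcode time, which is a 1.58x to 1.61x speed-up.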
2- Cheaper AWS options:
AWS also offers other CPU instance types at a much cheaper rate: specifically, the AMD and Arm series. Because we faced compatibility issues with the Arm instance types, we could not include them in this comparison, although we may revisit them in the future.
So, are these cheaper instances really better than, or at least on par with, the Intel instances?
Although we tried several instance configurations, here I show the data for c5a.4xlarge, for an apples-to-apples comparison. It has the following configuration:
c5a.4xlarge (AMD EPYC, 3.3 GHz) | 16 vCPU, 32 GiB, up to 10 Gbps network bandwidth, up to 3170 Mbps EBS bandwidth |
Transcode times on c5a.4xlarge (AMD EPYC processor), using the optimized pipeline:
Video file | Duration | Total Transcode time |
Video 1 (1080p) | 34 min 26 sec | 314 sec |
Video 2 (1080p) | 10 min | 97 sec |
Video 3 (720p) | 50 min 30 sec | 290 sec |
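Finally, here is a comparison of transcode times for AMD (optimized pipeline) vs. Intel (optimized pipeline) vs. Intel (unoptimized pipeline):
Video file | Intel c5.4xlarge, unoptimized | Intel c5.4xlarge, optimized | AMD c5a.4xlarge, optimized |
Video 1 (1080p) | 594 sec | 368 sec | 314 sec |
Video 2 (1080p) | 180 sec | 114 sec | 97 sec |
Video 3 (720p) | 510 sec | 319 sec | 290 sec |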
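The AMD ratios can be recomputed from the three transcode-time tables above. Below is a minimal bash sketch, with the times copied from those tables; the variable and function names are hypothetical.

```bash
#!/usr/bin/env bash
# Columns: video, Intel unoptimized (s), Intel optimized (s), AMD optimized (s)
TIMES='Video1 594 368 314
Video2 180 114 97
Video3 510 319 290'

# For each video: AMD speed-up vs. the unoptimized Intel run, and
# how much less time AMD needs than the optimized Intel run.
amd_summary() {
  printf '%s\n' "$TIMES" | awk '{
    printf "%s: %.2fx faster than unoptimized Intel, %.1f%% less time than optimized Intel\n",
           $1, $2 / $4, ($3 - $4) / $3 * 100
  }'
}

amd_summary
```

Across the three videos, this gives:

- a 1.76x to 1.89x speed-up over the unoptimized Intel pipeline;
- 9.1% to 14.9% less time than the optimized Intel pipeline.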
Conclusion
So which CPU instance is better for video transcoding on AWS?
With the c5.4xlarge (Intel) instance type and the optimized FFmpeg commands:
- Transcoding is 1.58 to 1.61 times faster than the unoptimized pipeline on the same c5.4xlarge instance.
With the c5a.4xlarge (AMD EPYC) instance type and the optimized FFmpeg commands:
- Transcoding is 1.76 to 1.89 times faster than the unoptimized pipeline on c5.4xlarge.
- Transcoding takes 9% to 15% less time than the optimized pipeline on c5.4xlarge.
- In addition, the c5a.4xlarge instance type is available at roughly half the hourly rate of c5.4xlarge.
- c5.4xlarge costs $0.68/hr vs. $0.37/hr for c5a.4xlarge.
- As a result, just by switching to these cheaper instances, we are already saving around ₹19k to ₹20k per month at our current workload, and these savings will grow as we transcode more video.
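Because c5a.4xlarge is both faster and cheaper per hour, the per-video saving is larger than the hourly price gap alone suggests. Below is a rough bash sketch of the per-job cost. It rests on these assumptions:

- It uses the hourly prices quoted above.
- A job is billed only for its transcode time.
- Idle time, storage and data transfer are ignored.
- Actual prices vary by region and purchase option.

The function names are hypothetical.

```bash
#!/usr/bin/env bash
# Hourly prices quoted in this article (USD/hr); assumed, since real prices vary by region and plan
INTEL_PRICE=0.68   # c5.4xlarge
AMD_PRICE=0.37     # c5a.4xlarge

# Cost of one job in USD, assuming billing only for transcode time: price * seconds / 3600
job_cost() {
  awk -v p="$1" -v s="$2" 'BEGIN { printf "%.4f\n", p * s / 3600 }'
}

# Args: label, optimized Intel time (s), AMD time (s)
compare_cost() {
  awk -v l="$1" -v pi="$INTEL_PRICE" -v pa="$AMD_PRICE" -v si="$2" -v sa="$3" 'BEGIN {
    ci = pi * si / 3600   # optimized Intel job cost
    ca = pa * sa / 3600   # AMD job cost
    printf "%s: Intel $%.4f, AMD $%.4f, %.1f%% cheaper\n", l, ci, ca, (1 - ca / ci) * 100
  }'
}

compare_cost "Video 1" 368 314
compare_cost "Video 2" 114 97
compare_cost "Video 3" 319 290
```

Under these assumptions, each video costs roughly 50% to 54% less to transcode on c5a.4xlarge than with the optimized pipeline on c5.4xlarge.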
In short, AMD instances are both slightly faster and considerably cheaper!