Video processing from a front-end perspective

Recently I have been working on the back-end development of a video editing project. Video processing used to be a blank area for me, but through constant exposure, many of the concepts involved in the project have gone from vague to clear.

The word "video" literally refers to the frequency of visual pictures. In its most basic sense, it is the continuous playback of pictures over time, which produces a visually continuous effect, as if reproducing a scene from the real world.

screen update frequency

The picture above shows a sequence of images of a small figure running (partial excerpt). When the sequence is played back continuously and automatically, it forms the simplest, most primitive kind of video.


2fps	4fps	6fps	8fps	10fps

10fps = 10 frames per second, i.e. each image has a visual dwell time of 0.1 second (1/10)

As shown in the picture above, the time each picture stays on screen ranges from 0.5 seconds down to 0.1 seconds, and playing the sequence at different speeds produces different visual effects. This brings us to a very important concept in video: the screen update frequency, that is, the frame rate (frames per second, FPS). Frame rates have ranged from 6 or 8 frames per second in the early days to 120 frames per second today.
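The relationship between frame rate and dwell time can be sketched in a few lines of Python. The function names here are illustrative and not taken from any library.

```python
import math


def frame_duration(fps: float) -> float:
    """Return how long each frame stays on screen, in seconds."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1.0 / fps


def frame_index_at(t: float, fps: float) -> int:
    """Return the index of the frame that is visible at time t (seconds)."""
    return math.floor(t * fps)


# Dwell times for the frame rates shown in the picture sequence.
for fps in (2, 4, 6, 8, 10):
    print(f"{fps} fps: each frame is shown for {frame_duration(fps):.3f} s")
```

For example, at 10 fps the frame visible 0.35 seconds into playback is frame 3 (counting from 0).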

How many frames per second are needed for video to look smooth?

Usually 24 frames per second or more. This relies on two phenomena: the human eye's "persistence of vision" and the brain's "filling in". The former refers to the afterimage that remains on the retina for a short time after the light signal disappears; the latter refers to the brain automatically filling in the intermediate frames between pictures. Together they lead us to perceive pictures played back at 24 frames per second as continuous motion.

It follows that video can come very close to the real scene but can never fully reproduce it. As with a limit in mathematics, what video presents is a piecewise function of time: it never becomes a smooth curve, but it can approximate one more and more closely as the frame rate increases.

Video size and aspect ratio

Common videos are 720P, 1080P, 4K, etc.

P stands for progressive (progressive scan): the lines of each frame are scanned one after another in order. Today's cameras scan line by line in this way. The number in front of the P is the number of pixel lines in the picture, so 1080P means a progressively scanned video with a height of 1080 pixels.

For example, a 1080P video is generally 1920×1080, or about 2 million pixels, while a 720P video is 1280×720, or about 920,000 pixels.

K refers to the horizontal resolution of the video in thousands of pixels, that is, roughly the number of pixels in each line.

For example, 2K video is generally 2048×1080, and 4K video is generally 4096×2160 (or 3840×2160, the standard on consumer TVs and monitors).
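The pixel counts quoted above are simply width times height. A small sketch, using only the resolutions mentioned in this article (the dictionary name is illustrative):

```python
# Resolutions mentioned in the text, as (width, height) in pixels.
RESOLUTIONS = {
    "720P": (1280, 720),
    "1080P": (1920, 1080),
    "2K": (2048, 1080),
    "4K": (4096, 2160),
    "4K (3840 wide)": (3840, 2160),
}


def pixel_count(width: int, height: int) -> int:
    """Return the total number of pixels in one frame."""
    return width * height


for name, (w, h) in RESOLUTIONS.items():
    print(f"{name}: {w}x{h} = {pixel_count(w, h):,} pixels")
```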

The aspect ratio of a video is the ratio of the width to the height of the picture. Common aspect ratios are 4:3 and 16:9.

A 4:3 picture is closer to square than a 16:9 one, which makes it better suited to reading. Most books and e-reader screens use this ratio.

16:9, commonly known as widescreen, is better suited to watching HD TV or DVD video. When a mobile phone is held vertically, the photos it takes generally have a ratio of 9:16.
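An aspect ratio can be derived from a pixel size by dividing both sides by their greatest common divisor. A minimal sketch (the function name is illustrative):

```python
from math import gcd


def aspect_ratio(width: int, height: int) -> tuple[int, int]:
    """Reduce width:height to its simplest whole-number ratio."""
    divisor = gcd(width, height)
    return width // divisor, height // divisor


print(aspect_ratio(1920, 1080))  # landscape 1080P
print(aspect_ratio(1024, 768))   # a classic 4:3 size
print(aspect_ratio(1080, 1920))  # phone held vertically
```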

Tracks

The tracks in a video can be pictured as train tracks running independently of one another. The independent variable is time, and the dependent variable is the parameters of the material on each track. Typical tracks include background, video, audio and subtitles.

As shown in the figure above, tracks are similar to absolutely positioned DIV blocks stacked on top of each other in a web page, or to the layers of an image. The difference is that a video track extends along the timeline. Each track exists independently and can be edited freely on its own. In addition, you can add filters, special effects, decorative text, transitions, captions and other effects.
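One way to picture the track model is as a mapping from track name to a list of clips, each with a start and end time on the shared timeline. The sketch below is only an illustration of that idea; the `Clip` type, the track names and the file names are all hypothetical.

```python
from typing import NamedTuple


class Clip(NamedTuple):
    name: str
    start: float  # seconds on the timeline (inclusive)
    end: float    # seconds on the timeline (exclusive)


def active_clips(tracks: dict[str, list[Clip]], t: float) -> dict[str, str]:
    """Return {track name: clip name} for every track that has a clip at time t."""
    result = {}
    for track_name, clips in tracks.items():
        for clip in clips:
            if clip.start <= t < clip.end:
                result[track_name] = clip.name
                break  # tracks are independent; one clip per track at a time
    return result


# Hypothetical project with three independent tracks.
tracks = {
    "video": [Clip("intro.mp4", 0, 5), Clip("main.mp4", 5, 20)],
    "audio": [Clip("bgm.mp3", 0, 20)],
    "subtitle": [Clip("hello", 1, 3)],
}
print(active_clips(tracks, 2))
print(active_clips(tracks, 10))
```

Time is the only input: asking what is on screen at a given moment means looking up each track separately, which is why a single track can be edited without touching the others.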

A filter has the same meaning as the filter property in CSS: it is applied to an image to achieve special effects such as grayscale, color inversion, black and white, mosaic or sharpening. It can give the picture a different style, and the results can be striking, for example a beauty filter that instantly makes a face look younger. Behind the scenes is a set of filter functions; common ones include scale (zoom), overlay and rotate (rotation).

Text processing is used to implement subtitles, narration, commentary and similar effects.
A transition is easy to understand: in a TV series, the protagonist suddenly has a dream, and the picture cuts to a scene from their childhood.
Special effects include, for example, riding the clouds and mist in Journey to the West, or the hero Qiao Feng's Eighteen Dragon-Subduing Palms in martial arts TV series.
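To make the idea of a filter function concrete, here is a minimal pure-Python sketch of two of the filters mentioned above, grayscale and color inversion, applied to a list of RGB pixels. The luminance weights (0.299, 0.587, 0.114) are the common BT.601 coefficients, chosen here as an assumption; real video tools use their own optimized implementations.

```python
Pixel = tuple[int, int, int]  # (R, G, B), each component 0-255


def grayscale(pixels: list[Pixel]) -> list[Pixel]:
    """Replace each pixel with its weighted luminance on all three channels."""
    result = []
    for r, g, b in pixels:
        y = round(0.299 * r + 0.587 * g + 0.114 * b)
        result.append((y, y, y))
    return result


def invert(pixels: list[Pixel]) -> list[Pixel]:
    """Invert every color component."""
    return [(255 - r, 255 - g, 255 - b) for r, g, b in pixels]


frame = [(255, 0, 0), (0, 0, 0), (255, 128, 10)]
print(grayscale(frame))
print(invert(frame))
```

Applying the same function to every frame of a track is, conceptually, what a video filter does.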

Video codec

Image depth is the storage space occupied by each pixel (BPP, bits per pixel, also called pixel depth), and it determines the display quality of the image. If a color image is represented by the three components R (red), G (green) and B (blue), and each component occupies 8 bits, then one pixel occupies 24 bits, that is, 3 bytes.

Bit rate is the number of bits transmitted per second. The unit is bps (bits per second). The higher the bit rate, the more data is transmitted per second.

Uncompressed video data occupies a very large amount of storage space and is inconvenient to transmit over a network. Suppose a video plays 30 pictures per second, each picture is 300 pixels wide and 200 pixels high, and each pixel requires 24 bits (3 bytes, at 8 bits per byte) of storage. How much space does one second of video take up?

FPS (frame rate)	Size (width × height, pixels)	BPP (image depth, bits)	BPS (bit rate)	File size (KB)
30	300 × 200	24	1 Mbps	5273

The data volume is 30 × 300 × 200 × 3 bytes = 5,400,000 bytes, or about 5273 KB (a little over 5 MB). With a transmission bandwidth of 1 Mbps, which is 131,072 bytes per second (1 Mbps = 131,072 bytes/s = 128 KB/s = 0.125 MB/s), you would need to wait more than 40 seconds for one second of video. And this does not even include the other tracks; videos generally also have audio tracks and subtitles.

Video can be compressed because the original video contains a lot of redundant information. For example, the human visual system has some innate characteristics and is not sensitive to certain details. In theory, removing redundant information based on these characteristics can reduce the size of a video while preserving its perceived quality. Encoding typically involves the following steps.
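The arithmetic above can be checked with a short sketch. The function names are illustrative, and 1 Mbps is taken as 1,048,576 bits per second to match the figures in the text.

```python
def raw_video_bytes(fps: int, width: int, height: int,
                    bits_per_pixel: int, seconds: int = 1) -> int:
    """Size in bytes of uncompressed video with the given parameters."""
    total_bits = fps * width * height * bits_per_pixel * seconds
    return total_bits // 8


def transfer_seconds(size_bytes: int, bandwidth_bits_per_second: int) -> float:
    """Time needed to transfer size_bytes over the given bandwidth."""
    return size_bytes * 8 / bandwidth_bits_per_second


size = raw_video_bytes(fps=30, width=300, height=200, bits_per_pixel=24)
one_mbps = 1024 * 1024  # bits per second, as assumed in the text

print(f"size: {size} bytes = {size / 1024:.0f} KB")
print(f"transfer time at 1 Mbps: {transfer_seconds(size, one_mbps):.1f} s")
```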

Prediction: reduces the spatial and temporal redundancy of video images through intra-frame prediction and inter-frame prediction.
Transformation: by transforming from the spatial domain to the frequency domain, the correlation between adjacent data is removed, that is, spatial redundancy is removed.
Quantization: reduces the amount of encoded data by representing finer data with coarser data, that is, by removing information the human eye is not sensitive to.
Scan: reorganizes the 2D transformed and quantized data into a 1D sequence.
Entropy coding: reduces coding redundancy according to the probabilistic properties of the data to be coded.
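The steps above can be illustrated with a deliberately simplified toy pipeline on a 4×4 block. This is a sketch under several assumptions, not how a real codec works: prediction is just the difference from the previous frame, the transform step is skipped entirely, quantization is plain division by a step size, and a run-length pass stands in for real entropy coding. All function names are illustrative.

```python
from itertools import groupby

Block = list[list[int]]


def residual(prev: Block, curr: Block) -> Block:
    """Inter-frame prediction: keep only the difference from the previous frame."""
    return [[c - p for p, c in zip(prev_row, curr_row)]
            for prev_row, curr_row in zip(prev, curr)]


def quantize(block: Block, step: int) -> Block:
    """Represent values more coarsely (truncating toward zero)."""
    return [[int(v / step) for v in row] for row in block]


def zigzag(block: Block) -> list[int]:
    """Scan a square 2D block into a 1D sequence along anti-diagonals."""
    n = len(block)
    cells = [(r, c) for r in range(n) for c in range(n)]
    # Order by diagonal; alternate the direction on odd and even diagonals.
    cells.sort(key=lambda rc: (rc[0] + rc[1],
                               rc[0] if (rc[0] + rc[1]) % 2 else rc[1]))
    return [block[r][c] for r, c in cells]


def run_length(seq: list[int]) -> list[tuple[int, int]]:
    """Collapse runs of equal values into (value, count) pairs."""
    return [(value, len(list(group))) for value, group in groupby(seq)]


def encode_block(prev: Block, curr: Block, step: int) -> list[tuple[int, int]]:
    return run_length(zigzag(quantize(residual(prev, curr), step)))


prev = [[100] * 4 for _ in range(4)]
curr = [row[:] for row in prev]
curr[0][0], curr[0][1] = 120, 109  # only two pixels changed between frames

print(encode_block(prev, curr, step=8))
```

Because most of the block did not change between frames, the 16 values collapse to three (value, count) pairs, which is the kind of redundancy that makes video compressible.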

Two international standards bodies define video coding standards. ITU-T (the ITU Telecommunication Standardization Sector) names its video coding standards in the form H.26x; they are mainly designed for real-time video communication applications such as video conferencing and video telephony.

ISO/IEC (International Organization for Standardization / International Electrotechnical Commission) standards, named in the form MPEG-x, are primarily designed for video storage (DVD), broadcast video, and video streaming (e.g. online video and wireless video applications).

Evolution of Video Coding Standards

The picture above clearly shows the evolution of the various encodings. The current coding standards are the H.26x series defined by ITU-T and the coding standards defined by the MPEG group. The same standard may go by different names in different organizations. Take AVC (Advanced Video Coding) as an example: you may be more familiar with its other name, H.264. AVC is the name the MPEG group gave it in its standard.

Project Practice

So far I have worked with two open source video processing libraries, OpenCV and FFmpeg. OpenCV is an open source, cross-platform computer vision library that provides C++, Python and Java interfaces. It is mostly used in computer vision scenarios based on machine learning and deep learning. FFmpeg is a set of open source programs that can record and convert digital audio and video and turn them into streams. OpenCV can use FFmpeg internally for video reading and writing, but it focuses more on image processing, while FFmpeg provides powerful video processing capabilities. Both have been used in recent projects.

The above figure shows the business processing flow of the video editing project.

Parse the project's configuration and initialize the project's working directory
Parse the material addresses and download the materials to the local directory
Render the material media using a combination of multi-threading and multi-processing, with a limit on the number of concurrent tasks
Call OpenCV and FFmpeg at the bottom layer to synthesize the video and generate the target format
Add the opening title, ending, watermark, etc., and upload the result to the cloud
Delete the intermediate files and release system resources
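The overall shape of this flow (working directory, bounded concurrency, cleanup) can be sketched with the Python standard library. This is not the project's actual code: `render_material` is a hypothetical stand-in for the real OpenCV/FFmpeg rendering step, the string join stands in for synthesis, and a thread pool alone is used here where the project combines threads and processes.

```python
import tempfile
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def render_material(workdir: Path, name: str) -> Path:
    """Hypothetical placeholder for rendering one piece of material."""
    output = workdir / f"{name}.rendered"
    output.write_text(f"rendered {name}")
    return output


def run_project(materials: list[str], max_workers: int = 2) -> tuple[str, bool]:
    # The temporary directory plays the role of the project's working directory;
    # it is deleted automatically when the block exits.
    with tempfile.TemporaryDirectory() as tmp:
        workdir = Path(tmp)
        # max_workers caps how many materials are rendered at the same time.
        with ThreadPoolExecutor(max_workers=max_workers) as pool:
            rendered = list(pool.map(lambda m: render_material(workdir, m),
                                     materials))
        # Placeholder for the synthesis step: combine the rendered pieces in order.
        result = " | ".join(path.read_text() for path in rendered)
    cleaned_up = not workdir.exists()
    return result, cleaned_up


print(run_project(["a", "b", "c"]))
```

For CPU-bound rendering, `ProcessPoolExecutor` from the same module offers a similar interface, although the function passed to it must be picklable, so the lambda above would need to become a named function.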

Extension to the web front end

The figure above roughly shows how a browser runs a page: from entering the page address to the final rendering of the view, it goes through a series of stages. To a certain extent, the browser can be regarded as a special kind of video player, which also processes the page frame by frame.

When there is network delay or a performance problem on the computer, stuttering occurs. This is mainly caused by dropped frames: some key frames are not displayed, so the viewer cannot form a continuous picture in their mind and the motion feels disjointed.
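Dropped frames can be estimated from the timestamps at which frames were actually displayed: any gap noticeably longer than one refresh interval means that one or more frames were skipped. A minimal sketch, assuming a 60 Hz display and timestamps in milliseconds (both assumptions, as is the function name):

```python
def count_dropped_frames(timestamps_ms: list[float], refresh_hz: float = 60) -> int:
    """Estimate how many frames were skipped between the displayed frames."""
    budget_ms = 1000 / refresh_hz  # time available per frame, about 16.7 ms at 60 Hz
    dropped = 0
    for previous, current in zip(timestamps_ms, timestamps_ms[1:]):
        interval = current - previous
        # An interval spanning n refresh periods means n - 1 frames were missed.
        dropped += max(0, round(interval / budget_ms) - 1)
    return dropped


smooth = [0, 17, 33, 50, 67]
janky = [0, 17, 33, 83, 100]  # a 50 ms gap between the third and fourth frames

print(count_dropped_frames(smooth))
print(count_dropped_frames(janky))
```

In a browser, timestamps like these could be collected from successive `requestAnimationFrame` callbacks.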

FFCreator

FFCreator, released by our team, is a lightweight and flexible short-video production library based on Node.js. You only need to add a few pictures or video clips and a piece of background music to quickly generate a polished video clip.

You can add music, subtitles, text, virtual anchors and more to your videos. It is also convenient for producing data visualization videos, individually or in batches.

Features

Developed entirely on Node.js; easy to use, extend and build on.
Has few dependencies, is easy to install, is cross-platform, and has low hardware requirements.
Video production is very fast: a 5-minute video takes only 1-2 minutes.
Supports nearly 100 scene transition animation effects.
Supports elements such as pictures, sounds, video clips and text.
Supports a subtitle component, which can combine subtitles with TTS speech to produce audio news.
Supports chart components for making data visualization videos.
Supports simple (extensible) virtual anchors, so you can make your own.
Includes 90% of the animate.css animation effects, so CSS animations can be converted to video.

FFCreator official website: https://github.com/tnfe/FFCreator

Summary

This article briefly introduced some basic concepts of video processing and explained them one by one in light of the confusion I ran into in a real project. It then described the processing flow of a business project and its connection to the web front end. Finally, it recommended a video processing tool developed by our team. If you are interested, you are welcome to like and bookmark it.