Tubi data engineer Shen Da was invited to give an open class for the LiveVideoStack live-streaming technology community on January 18, 2022. This article is a written version of that class, sharing technology and experience for analyzing and understanding the growing amount of unstructured video data across industries. To watch the video for free, see the links at the end of the article or head over to Bilibili. Tubi will keep sharing technology; follow our WeChat account “Bitu Technology”!

Hi, everyone. I’m Shen Da, today’s speaker, and I’d like to talk to you about Rikai, a video content understanding engine.

First of all, let me introduce myself. I am a data engineer at Tubi. I have been involved in the GNU TeXmacs project since college; after starting work, I became active in the Scala community, translated and published the Scala Practical Guide, and contributed a fair amount of code to the Spark community. At Tubi I work on video content understanding, most recently on the Rikai project.

Today’s talk is divided into four parts:

Video data

>> Growing video data

Everyone is familiar with video data. Companies and industries such as iQiyi, Youku, Tencent Video, Bilibili, Douyin, and Kuaishou, along with online education, online meetings, and autonomous driving, generate large amounts of video data every day. At Tubi, we have video data for traditional movies and TV shows. Tubi is a streaming company based in Silicon Valley in North America. The biggest difference between Tubi and Hulu or Netflix is that it is free: its profit model is advertising, with no subscription.

>> Tubi’s video content

Tubi’s video content comes in three categories: commercials, movies and TV shows, and live TV. At present we mainly analyze ads and movies/TV shows, with ads first. Tubi’s ads are programmatic: we receive them from advertisers and cannot predict their content in advance. Movies and TV shows differ from ads in that they are longer and their content does not change, so for them we do auxiliary, fine-grained work, such as marking where the opening credits end and where the closing credits begin, so that the player can skip the intro.

For ads, the main problem is whether to target a given ad to a given user. We need to know: 1) what the ad is about, for example cars or beverages; 2) whether the ad has recently been shown to the user. We should not make users watch the same ad repeatedly within a short period, which leads to a poor user experience and a low ad conversion rate.

For movies and TV shows, the problem we need to solve is finding the right moments to insert ads. An insertion point needs to: 1) minimize the disturbance to users, so ads should not be inserted at key moments of the plot; 2) increase the click-through rate of the ads, which is Tubi’s core requirement.

>> Why Rikai?

Without a framework like Rikai, video content analysis and understanding is very complex, because it involves multiple teams working together: at least a video team, a data team, and a machine learning team. The technical stack includes FFmpeg, PySpark, PyTorch, and more; behind these three frameworks alone sit four programming languages: C++, Python, Scala, and SQL. No single person can master all the technical details, and working across multiple teams is inevitably less efficient. So we created the Rikai framework and drive the project in an open-source way. With Rikai, different teams can each work on their own piece, much like the blind men and the elephant: one person touches the trunk, another the tusks, another the legs.

For example, here is a frame from a beverage ad; the focus of the ad is the two beverage bottles on the left. Applying an object detection model to this frame finds the objects in it, such as the bottles, a table, a woman, a man, and a tent.

With Rikai, this is easy to do in SQL, as sketched below. The result of the SQL query is also shown on the right side of the figure. At Tubi we also need to determine, in various ways, whether an ad is a beverage ad.
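The original query was shown on a slide; the following is a minimal sketch of what it looked like, assuming a hypothetical `ad_frames` table with an `image` column and a `yolov5m` model already registered via CREATE MODEL (covered later in this article):

```python
# A minimal sketch, not the exact query from the talk. It assumes a
# Spark session with Rikai's SQL-ML extension enabled, a model named
# yolov5m already registered, and a hypothetical ad_frames table with
# an image column of Rikai's Image type.
detections = spark.sql("""
    SELECT frame_id, ML_PREDICT(yolov5m, image) AS pred
    FROM ad_frames
""")
detections.createOrReplaceTempView("detections")

# Assuming pred is an array of (box, label, score) structs, keep the
# frames where the detector found a bottle with reasonable confidence.
bottles = spark.sql("""
    SELECT frame_id
    FROM detections
    LATERAL VIEW explode(pred) t AS obj
    WHERE obj.label = 'bottle' AND obj.score > 0.5
""")
bottles.show()
```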

Core features of Rikai

Rikai has three core features. The first is ML_PREDICT, which I just introduced: it can invoke a model. The second is visualization, as shown earlier: images can be displayed directly in query results. The third, planned for the future, is a dedicated Rikai storage format.

>> What is Rikai?

The official description: Rikai is a high-performance, Parquet-based AI data format for processing unstructured data. So far we only have a rough design of the data format; for now we focus more on the application layer above it: how to call models and how to process image and video data at scale. As we gain experience, we will design a better format to find the right balance between computation and storage. Current video formats (such as MP4) are designed for playback and compression, but they are CPU-intensive to decode. We do not need that much compression; we would rather trade space for CPU time to get the best performance for AI processing. The name Rikai comes from Japanese (理解) and means “to understand”; in parts of southern China such as Hubei, Zhejiang, and Guangdong (confirmed with some friends), the local pronunciation of “understand” is the same as or similar to Rikai. Rikai’s two authors are Chang, co-author of the renowned pandas framework, and Eddy, a PMC member of Apache Hadoop.
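Even at this early stage, Rikai stores dataframes of images and annotations as Parquet through a Spark data source. A minimal sketch of that round trip, with an illustrative schema and paths (the exact type constructors may differ between Rikai versions):

```python
from pyspark.sql import Row
from rikai.types import Box2d, Image

# A minimal sketch of storing vision data in the Parquet-based Rikai
# format via its Spark data source; schema, URL, and paths are illustrative.
df = spark.createDataFrame([
    Row(
        id=1,
        image=Image("https://example.com/frame.jpg"),
        annotations=[Row(label="bottle",
                         box=Box2d(xmin=10.0, ymin=20.0, xmax=80.0, ymax=180.0))],
    )
])
df.write.format("rikai").mode("overwrite").save("/tmp/rikai_demo")

# Read it back like any other Spark data source.
spark.read.format("rikai").load("/tmp/rikai_demo").show()
```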

>> ML_PREDICT and visualization

ML_PREDICT is Rikai’s most visible feature. Given a model and inputs, the ML_PREDICT UDF is applied to obtain the model’s output. For an object detection model like yolov5m, the job is to find every object in the image: box marks the position of an object with two coordinates (top-left and bottom-right), label identifies what the marked object is, and score is the confidence; the closer it is to 1, the higher the confidence.

In this SQL statement, to_image downloads an image from the network and turns it into a Rikai Image. ML_PREDICT takes the parameters yolov5m and the image and returns the result of applying the model. But we do not really know what the image looks like or what the model will produce; for such unstructured data, we need visualization to see what is actually happening.
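The statement itself was on a slide; a rough reconstruction, with a placeholder URL, looks like this:

```python
# A rough reconstruction of the slide's statement; the URL is a
# placeholder. to_image fetches the image and wraps it as a Rikai Image;
# ML_PREDICT applies the registered yolov5m model and returns the
# (box, label, score) results described above.
result = spark.sql("""
    SELECT ML_PREDICT(
        yolov5m,
        to_image('https://example.com/ad_frame.jpg')
    ) AS pred
""")
result.show(truncate=False)
```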

On the Tubi data team, we maintain exploration platforms such as Jupyter Notebook and Databricks Notebook, paragraph-based environments that execute snippets of SQL and Python code. As shown in the figure below, we use the Image type defined by Rikai: it takes a URL and displays the image in the notebook; here it is the cover of Jay Chou’s single “Mojito”. On top of the Rikai Image there are many easy-to-use operators for transforming images. For example, we can use the | operator to stack layers on an image (| reads as a vertical bar; it is an overloaded operator that here means layer stacking), and we plan to use the * operator for zooming in and out.

Video can be visualized in the same way, as demonstrated later. We have not yet implemented frame replacement: for example, in the guichu (remix) videos on Bilibili, a person’s face can be swapped out. Scenarios like this (in other words, AI face swapping) would be very easy and complete to implement once Rikai supports frame swapping. The figure below shows processing images with an object detection model: a common visualization need is to annotate detected objects on the original image, using the | operator and a Box2d (defined by two points, top-left and bottom-right) to draw boxes around the people in the image.
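A minimal sketch of this overlay in a notebook cell, with an illustrative URL and coordinates (the | overlay follows the behavior described in the talk):

```python
from rikai.types import Box2d, Image

# A minimal sketch of the overlay described above; the URL and box
# coordinates are illustrative. In a notebook cell, the result of the
# last expression renders inline as an image.
img = Image("https://example.com/mojito_cover.jpg")

# Box2d is defined by two points: top-left (xmin, ymin) and
# bottom-right (xmax, ymax).
box = Box2d(xmin=120.0, ymin=40.0, xmax=260.0, ymax=220.0)

# The | operator stacks a layer (here a bounding box) on the image.
img | box
```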

>> Model management

As mentioned earlier, ML_PREDICT involves a key model, yolov5m. We can use CREATE MODEL to create such a model, but Rikai does not manage models directly, because there are already many model management platforms, such as MLflow from Databricks. Rikai adapts to MLflow and provides corresponding SQL statements to create model instances. Here is an example of connecting to a distributed model registry such as TorchHub, which has the advantage of requiring no additional dependencies, because TorchHub works as follows: a file in the AI repository on GitHub specifies where the model is stored (or how to download it). For yolov5m, you can see that I am using TorchHub’s model registry; the part circled in red is the [organization name]/[project name]:[tag name] of yolov5 on GitHub. Yolov5 has a whole family of models; I am using the m variant here.
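A sketch of what such a statement can look like; the torchhub URI scheme here is written from memory and may differ between Rikai versions, but it follows the [organization]/[project]:[tag] pattern just described:

```python
# A sketch of registering a TorchHub-hosted model; the URI scheme is
# illustrative. ultralytics/yolov5 is the GitHub org/project, v6.0 a
# real tag of that repository, and yolov5m the model variant.
spark.sql("""
    CREATE MODEL yolov5m
    FLAVOR pytorch
    USING 'torchhub:///ultralytics/yolov5:v6.0/yolov5m'
""")
```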

For the yolov5m model, we can configure it with OPTIONS:

1) device: which device to run on, GPU or CPU;
2) batch_size: the batch size used when the model is applied;
3) conf_thres: the confidence threshold; for example, I set a threshold here below which detections are ignored; and so on.

In this way, multiple model instances can be derived from a single model through CREATE MODEL. Knowing the parameters, we can create model instances in an automated way and automatically evaluate them to find the most suitable parameters for our production environment, as sketched below.
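A sketch of deriving two instances of the same model with different options; the option names (device, batch_size, conf_thres) follow the talk, and the model URI repeats the illustrative TorchHub reference from above:

```python
# Register two instances of the same weights with different confidence
# thresholds, so they can be evaluated against each other.
for name, thres in [("yolov5m_strict", 0.7), ("yolov5m_loose", 0.3)]:
    spark.sql(f"""
        CREATE MODEL {name}
        FLAVOR pytorch
        OPTIONS (device='gpu', batch_size=16, conf_thres={thres})
        USING 'torchhub:///ultralytics/yolov5:v6.0/yolov5m'
    """)
```

Both instances can then be run through the same ML_PREDICT queries, and the threshold that performs best can be promoted to production.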

Practice: understand images, understand videos

For the hands-on content of the last two parts, please go to Bilibili and watch the open class video. With Google Colab, you can try Rikai directly in the browser without installing any additional software, and the experience is even better! Portal: colab.research.google.com/github/eto-…


To get the open class slides for free, follow the WeChat account “Bitu Technology” and reply “Rikai”.

Search for “Bitu Technology” on Bilibili to watch the video.

Tubi is hiring data engineers; send your resume directly to [email protected]
