Over the past decade, the key technologies of the digital age have moved from early acceptance to broad adoption. Big data and computing have become the watchwords of the era, and computing power has grown increasingly important, developing into a new source of growth for enterprises. Apache Flink (hereinafter "Flink") has attracted wide attention for its fast and accurate computation. How to combine Flink with the broader big data ecosystem, fully tap the potential of data, and truly realize its value is a hard problem facing most enterprises.

From November 28 to 30, Flink Forward Asia will bring together technical experts from Alibaba, Dell Technologies, Intel, Cloudera, Qutoutiao, Baidu, StreamNative, and other organizations to discuss current and future trends in big data around Apache Flink's core ecosystem, and to showcase excellent practices of these technologies in front-line production scenarios.

Click to learn more about the conference and purchase tickets

An overview of some of the highlights

Apache Flink and the Apache Way

Fabian Hueske, Apache Flink PMC member, Co-founder and Software Engineer at Ververica

Apache Flink is a project of the Apache Software Foundation (ASF). The ASF is the world’s largest open source foundation and the home of more than 350 individual projects and initiatives.

Every ASF project is independently governed and managed by its own community but follows the principles of the ASF, the so-called Apache Way. Knowing the Apache Way is important to fully understand how the community of an ASF project works.

In this talk, I’ll briefly explain the Apache Way and how ASF projects organize themselves. I’ll take a look back at how the Apache Flink community started and its journey to where it is today.

Finally, I’ll give you some guidance and advice that will help you to start contributing to Apache Flink and maybe become a committer at some point in the future.

Optimize Apache Flink on Kubernetes with YuniKorn Scheduler

Yang Weiwei, Senior Software Engineer at Cloudera; Yang Tao, Technology Expert at Alibaba

Running Flink on Kubernetes (K8s) is simple, but when we tried to run large-scale Flink jobs on a K8s cluster with strict multi-tenancy and SLA requirements, problems began to emerge. At the scheduling level in particular, Flink job scheduling slowed down and resource allocation became chaotic and unfair, often leading to job starvation or wasted resources. We therefore turned to YuniKorn to solve the scheduling problem on K8s.

YuniKorn is an open source, lightweight, general-purpose resource scheduler that can be easily adapted to K8s. Compared with the native K8s scheduler, YuniKorn provides richer scheduling features, such as hierarchical queues, resource fairness guarantees, preemption, and better performance, making it more suitable for large-scale multi-tenant, long-running, and batch workloads. YuniKorn scheduling takes into account resource usage across dimensions such as applications, users, and queues, and provides flexible capacity allocation based on fairness principles. In this session, we will discuss how to optimize running Flink on K8s with YuniKorn, covering performance, multi-tenancy, resource fairness, and more.
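The hierarchical queues and guaranteed/max resources described above are declared in YuniKorn's queue configuration. A minimal sketch follows; the queue names and resource numbers are illustrative, and the exact schema may differ between YuniKorn versions:

```yaml
partitions:
  - name: default
    queues:
      - name: root
        queues:
          - name: tenant-a          # one child queue per tenant
            resources:
              guaranteed:           # fair-share floor for this tenant
                memory: 102400
                vcore: 100
              max:                  # hard cap to prevent resource hogging
                memory: 204800
                vcore: 200
          - name: tenant-b
            resources:
              guaranteed:
                memory: 51200
                vcore: 50
```

Guaranteed capacities drive the fairness calculation between tenants, while the max limits bound each queue even when the cluster is idle.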

Qutoutiao builds real-time data analysis platform based on Flink+ClickHouse

Wang Jinhai, head of Qutoutiao data platform

Qutoutiao has always been committed to using big data analysis to guide its business development. Flink+ClickHouse is currently used in real-time scenarios such as real-time reporting, ad-hoc real-time queries, event analysis, funnel analysis, and retention analysis, which together underpin refined operational strategies. Overall, 80% of queries complete within 1 second, greatly improving the real-time data retrieval experience and enabling faster business iteration. Main contents of this sharing:

  1. Business scenarios and current status
  2. Flink-to-Hive hour-level scenarios
  3. Flink-to-ClickHouse second-level scenarios
  4. Future planning
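The funnel analysis mentioned above can be sketched outside ClickHouse as well (ClickHouse itself ships a `windowFunnel` aggregate for this). Below is a minimal, self-contained Python illustration of counting ordered step completion per user; the function, field names, and sample events are hypothetical:

```python
from collections import defaultdict

def funnel_conversion(events, steps):
    """Count how many users reached each step of the funnel, in order.

    events: list of (user_id, timestamp, event_name) tuples
    steps:  ordered list of event names defining the funnel
    Returns a list where element k is the number of users who
    completed at least the first k+1 steps.
    """
    # user_id -> index of the next funnel step this user is expected to hit
    progress = defaultdict(int)
    for user, _ts, name in sorted(events, key=lambda e: e[1]):
        i = progress[user]
        if i < len(steps) and name == steps[i]:
            progress[user] = i + 1
    counts = [0] * len(steps)
    for depth in progress.values():
        for k in range(depth):
            counts[k] += 1
    return counts

events = [
    ("u1", 1, "view"), ("u1", 2, "click"), ("u1", 3, "pay"),
    ("u2", 1, "view"), ("u2", 2, "click"),
    ("u3", 1, "view"),
]
print(funnel_conversion(events, ["view", "click", "pay"]))  # [3, 2, 1]
```

The same ordered-sequence idea is what a columnar funnel aggregate evaluates per user at much larger scale.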

Edge streaming computing based on Apache Flink

Yuan Youjun, Senior R&D Engineer at Baidu Cloud; Huang Jiatian, Senior R&D Engineer for real-time computing, IoT Department, Baidu Cloud

With the development of 5G and IoT technology, computing is spreading from the cloud to more and more other places; edge computing is one typical scenario. Compared with powerful computing clusters in the cloud, the computing power of devices in these scenarios is very limited. As a new-generation streaming engine, Apache Flink is widely used in the clouds of many top Internet companies, but there has been no successful case of running a streaming engine on edge devices with extremely limited resources.

We believe Apache Flink should run not only in the cloud, but on any device that needs it. In this talk, we will share Baidu Intelligent Cloud's explorations in running streaming jobs on edge devices, introducing how we reduced job memory consumption to less than 10 MB and eliminated any dependence of jobs on the runtime environment. The talk will focus on Creek, the edge streaming computing framework Baidu developed on top of Flink. Key contents include:

  1. The significance and challenges of streaming computing on edge devices
  2. Creek's technical solution
  3. Creek's performance metrics
  4. Live demonstration of building and running a Creek job

Apache Flink integration with Apache Hive

Rui Li, Apache Hive PMC member, Apache Flink contributor, Technology Expert at Alibaba; Gang Wang, Senior Development Engineer at Alibaba

Hive has become the de facto standard for data warehousing in big data. To enrich Flink's ecosystem, starting with version 1.9.0 we gave Flink the ability to integrate with Hive, allowing users to read and write Hive tables from Flink. Since the 1.9.0 release, we have further improved the Flink-Hive integration, including more comprehensive data type support, better DDL support, and function support.

In the new version, we can support more application scenarios and provide better ease of use. This presentation will cover the design architecture of the Flink-Hive integration, the progress of the project, and new features in subsequent releases. Finally, we’ll demonstrate how to use Flink to interact with Hive.
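To give a flavor of the integration, reading a Hive table from Flink SQL looks roughly like the following. The catalog name, configuration path, and table are illustrative placeholders, and the exact DDL depends on the Flink version in use:

```sql
-- Register a Hive catalog in the Flink SQL client
-- (catalog name and hive-conf-dir are placeholders)
CREATE CATALOG myhive WITH (
  'type' = 'hive',
  'hive-conf-dir' = '/opt/hive/conf'
);

USE CATALOG myhive;

-- Query an existing Hive table directly from Flink
SELECT category, COUNT(*) AS cnt
FROM hive_orders
GROUP BY category;
```

Once the catalog is registered, Hive tables appear to Flink like any other table, so the same query can run in batch or streaming mode.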

Full agenda of the Open Source Big Data Ecosystem track

In addition to the above topics, the Open Source Big Data Ecosystem track also features talks from heavyweight guests at Dell Technologies, Intel, StreamNative, and elsewhere, including Apache Members, Apache Flink PMC members, and Apache Calcite committers. Here's the full agenda:





(Track agenda, afternoon of November 28)





(Track agenda, morning of November 29)

In-depth training to build and improve technical and practical skills

From November 11 to 14, Flink Forward Asia training tickets for the 3-day course are buy one, get one free! Reserve a training ticket, then add WeChat (ID: Candy1764) and provide the list of partners who will attend the training with you. The offer ends at 12:00 noon on November 14; quantities are limited and available while supplies last, so students interested in the training should act quickly!

Led by Apache Flink PMC members, with a stellar lineup of senior technical experts from Alibaba and Flink's founding team as instructors, the training offers developers a comprehensive learning system.

The courses meet different learning needs, whether for beginners or advanced users. Developers can choose course content based on their own background to build and improve their technical and practical skills.

The main outline of the course is as follows:

  • Intermediate 1: Apache Flink Developer Training
Tips: This course is taught in English only, with 2 Chinese technical experts available to answer questions.

This course is a hands-on introduction to Apache Flink for Java and Scala developers who want to learn how to build streaming applications. The training focuses on core concepts such as distributed data flow, event time, and state. The exercises give you a chance to see how these concepts are represented in the API and how they can be combined to solve real-world problems.

  • Introduction to stream computing and Apache Flink
  • DataStream API basics
  • Setting up a Flink development environment (with exercises)
  • Stateful stream processing (with exercises)
  • Time, timers, and ProcessFunction (with exercises)
  • Connecting multiple streams (with exercises)
  • Testing (with exercises)
Note: No knowledge of Apache Flink is required.
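The event-time concepts this course covers can be previewed without Flink. Below is a small, self-contained Python sketch of the bounded-out-of-orderness watermark idea: the watermark trails the maximum event time seen so far by a fixed bound. The class name and numbers are illustrative, not Flink's API:

```python
class BoundedOutOfOrdernessWatermark:
    """Track a watermark that lags the max observed event time
    by a fixed out-of-orderness bound."""

    def __init__(self, max_out_of_orderness):
        self.bound = max_out_of_orderness
        self.max_ts = None  # highest event timestamp seen so far

    def on_event(self, timestamp):
        # Late events (timestamp <= max_ts) never move the watermark back.
        if self.max_ts is None or timestamp > self.max_ts:
            self.max_ts = timestamp
        return self.current_watermark()

    def current_watermark(self):
        if self.max_ts is None:
            return None
        return self.max_ts - self.bound

wm = BoundedOutOfOrdernessWatermark(5)
for ts in [10, 12, 8, 20]:
    print(ts, wm.on_event(ts))  # watermarks: 5, 7, 7, 15
```

Note how the out-of-order event at timestamp 8 leaves the watermark at 7: the watermark only advances, which is what lets event-time windows close deterministically.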

  • Intermediate 2: Apache Flink Operations Training
This course is a practical introduction to the deployment and operation of Apache Flink applications. The target audience includes developers and operations personnel responsible for deploying Flink applications and maintaining Flink clusters. The demo will highlight the core concepts involved in running Flink, as well as the main tools for deploying, upgrading, and monitoring Flink applications.

  • Introduction to stream computing and Apache Flink
  • Flink in the data center
  • Introduction to distributed architecture
  • Containerized deployment (hands-on)
  • State backends and fault tolerance (hands-on)
  • Upgrades and state migration (hands-on)
  • Metrics (hands-on)
  • Capacity planning
Note: Prior knowledge of Apache Flink is not required.

  • Intermediate 3: SQL Developer Training
Apache Flink supports SQL as a unified API for stream and batch processing. SQL can be used in a wide variety of scenarios and is much easier to build and maintain than Flink's lower-level APIs. In this training, you will learn how to write Apache Flink jobs in SQL to their full potential. We'll look at different examples of streaming SQL, including joining streaming data, dimension table association, window aggregation, maintaining materialized views, and pattern matching using the MATCH_RECOGNIZE clause (added in SQL:2016).

  • Introduction to SQL on Flink
  • Querying dynamic tables with SQL
  • Joining dynamic tables
  • Pattern matching with MATCH_RECOGNIZE
  • Ecosystem & writing to external tables
Note: No prior knowledge of Apache Flink is required, but basic SQL knowledge is required.
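For a taste of the pattern-matching material, the following sketch uses MATCH_RECOGNIZE to find a V-shaped price pattern per symbol, in the style of streaming SQL. The table and column names are illustrative:

```sql
-- Find, per symbol, a run of falling prices followed by a recovery
SELECT *
FROM Ticker
MATCH_RECOGNIZE (
  PARTITION BY symbol
  ORDER BY rowtime
  MEASURES
    FIRST(DOWN.rowtime) AS drop_start,
    LAST(UP.rowtime)    AS recovery_end
  ONE ROW PER MATCH
  AFTER MATCH SKIP PAST LAST ROW
  PATTERN (DOWN+ UP+)
  DEFINE
    DOWN AS DOWN.price < PREV(DOWN.price),
    UP   AS UP.price > PREV(UP.price)
) AS T;
```

PARTITION BY and ORDER BY determine the per-key, time-ordered sequence the pattern is matched against, which is what makes the clause natural on unbounded streams.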

  • Advanced: Apache Flink Tuning and Troubleshooting
Over the years, we have worked with many Flink users and learned about the most common challenges in moving stream processing jobs from the early PoC phase to production. In this training, we focus on these challenges and help you eliminate them. We provide a useful set of troubleshooting tools and introduce best practices and tips in areas such as monitoring, watermarks, serialization, state backends, and more. In hands-on sessions, participants will have the opportunity to apply their newly learned knowledge to diagnose problems in misbehaving Flink jobs. We will also summarize common reasons why jobs make no progress, fall short of expected throughput, or suffer delays.

  • Time and watermarks
  • State handling and state backends
  • Flink's fault tolerance mechanism
  • Checkpoints and savepoints
  • DataStream API and ProcessFunction
The training courses are taught in small classes with limited seats; registration closes once a course is fully booked, so students with training needs should reserve as soon as possible. Details:

  • A VIP package is required to attend the training: purchase VIP Package 1 for intermediate training, or VIP Package 2 for advanced training.
  • VIP Package 1 grants access to all intermediate-level courses; VIP Package 2 grants access to all courses, both advanced and intermediate.
If you are curious about Flink's main directions for the future, how Flink pushes big data and computing power to the extreme, and what new scenarios, plans, and best practices Flink offers, come join us on site! We believe this group of front-line technical experts will refresh your understanding of Apache Flink.





The original link

This article is original content from the Yunqi Community and may not be reproduced without permission.