This article briefly reviews the core technologies of federated computing. As federated computing is adopted more widely in industry, it can protect data privacy, break down data silos, and offer a new approach for digital advertising and marketing.

The full text is 4761 words, and the expected reading time is 12 minutes.

1. Introduction

As is well known, data is the fuel of AI: more high-quality data makes it possible to train better-performing models. As IT has gone mobile, however, Internet data has been split into isolated islands. A key bottleneck for AI is how to protect user data privacy while breaking down the data silos between different organizations. With more powerful mobile devices and the spread of 4G/5G, training models on mobile terminals has become feasible. In 2016, a Google team published the paper “Communication-Efficient Learning of Deep Networks from Decentralized Data”, which raised the curtain on federated learning in industry. (Google's Chinese name for the technique translates as “alliance learning”; in China it is customarily called “federated learning”.)

Federated Learning: Collaborative Machine Learning Without Centralized Training Data

Deploying federated learning across millions of different smartphones is essentially about moving models, not data. To avoid leaking user privacy, federated learning does not store user data in the cloud. Each smartphone downloads the current version of the model, improves it by learning from local data, and sends the improvement to the cloud as an encrypted incremental update, where it is merged in real time with other users' updates into a shared model. All training data stays on the end users' devices, and no user data is saved in the cloud.
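The download, local-training and merge loop described above can be sketched roughly as follows. This is a minimal illustration, not Google's implementation: it assumes a tiny linear model y ≈ w·x + b, plain gradient descent on each device, and a weighted average on the server. The function names `local_update` and `federated_round` are hypothetical, and the encryption of updates is omitted.

```python
def local_update(weights, data, lr=0.1, epochs=5):
    """Train on one device's private data and return only the model delta."""
    w, b = weights
    for _ in range(epochs):
        # Gradients of the mean squared error for y ~ w*x + b.
        grad_w = sum((w * x + b - y) * x for x, y in data) / len(data)
        grad_b = sum((w * x + b - y) for x, y in data) / len(data)
        w -= lr * grad_w
        b -= lr * grad_b
    # Only the change in the weights leaves the device, never the raw data.
    return (w - weights[0], b - weights[1])


def federated_round(global_weights, devices):
    """Merge the deltas of all devices, weighted by local sample count."""
    total = sum(len(data) for data in devices)
    merged_w, merged_b = 0.0, 0.0
    for data in devices:
        delta_w, delta_b = local_update(global_weights, data)
        share = len(data) / total
        merged_w += share * delta_w
        merged_b += share * delta_b
    return (global_weights[0] + merged_w, global_weights[1] + merged_b)


if __name__ == "__main__":
    # Two hypothetical devices whose private samples follow y = 2x + 1.
    devices = [
        [(0.0, 1.0), (1.0, 3.0)],
        [(2.0, 5.0), (3.0, 7.0), (4.0, 9.0)],
    ]
    weights = (0.0, 0.0)
    for _ in range(300):
        weights = federated_round(weights, devices)
    print(weights)  # should approach (2.0, 1.0)
```

The server only ever sees weight deltas; in a real deployment these would additionally be encrypted or securely aggregated before being merged.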

As international data privacy regulations (GDPR and others) grow stricter, Google's consumer-facing (ToC) products, such as its input method, opened up a new approach. In China, federated learning has been extended to business-facing (ToB) scenarios to resolve the dilemma of ToB AI: privacy protection and data silos. Risk control and marketing, for example, involve large-scale exchange of user data, so privacy protection is especially valuable there. At the end of 2019, five companies, including Baidu, WeBank, Ant, Fushu and Huavang, obtained secure computing certificates issued by the China Academy of Information and Communications Technology (CAICT), which are the most influential among current secure computing qualifications.

2. Core technologies of federated computing

To connect data islands while keeping participants' data secure and private, “federated computing” in the broad sense includes two families of implementation: cryptography-based secure multi-party computation (MPC) and hardware-based trusted execution environments (TEE).

2.1 MPC: Secure Multi-Party Computation

Secure multi-party computation (MPC) is based on cryptography: it relies on the logic of algorithms and programs to make the computation secure and trustworthy, and its security can be proven mathematically. MPC is independent of hardware and other infrastructure, so it works across heterogeneous system environments and does not depend on specific hardware.

2.1.1 Garbled Circuit

A garbled circuit is a cryptographic protocol. Turing Award winner Andrew Chi-Chih Yao (Yao Qizhi) posed Yao’s Millionaires’ Problem in 1982 and proposed a solution based on garbled circuits. The problem: Alice and Bob want to find out which of them is richer, without relying on a trusted third party and without telling each other how much they own.

The principle is as follows. Any computable function can be converted into a circuit built from adders, multipliers, shifters, selectors and so on. A circuit is in turn composed of logic gates such as AND, OR, NOT and NAND. A garbled circuit encrypts and shuffles the truth tables of these gates to hide information. Alice encrypts the truth table of each gate with keys, shuffles its rows, and sends it to Bob. Bob tries to decrypt every row of the truth table; the protocol guarantees that exactly one row decrypts successfully, and Bob extracts the result from it. Finally, Bob sends the result back to Alice. Throughout the process the two parties exchange only random numbers or ciphertext, so no private data leaks, yet the required business computation is completed at the level of program logic.
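To make the truth-table step concrete, here is a minimal sketch of garbling and evaluating a single gate. It is a simplification under stated assumptions: SHA-256 is used as a one-time pad derived from the two input labels, a block of zero bytes marks the row that decrypts correctly, and the oblivious transfer by which Bob would normally obtain the label for his own input is left out. The names `garble_gate`, `evaluate_gate` and `decode_output` are hypothetical.

```python
import hashlib
import secrets

LABEL_LEN = 16          # bytes per wire label
TAG = bytes(16)         # zero padding that marks a correctly decrypted row


def _pad(label_a, label_b):
    """Derive a 32-byte one-time pad from the two input labels."""
    return hashlib.sha256(label_a + label_b).digest()


def _xor(x, y):
    return bytes(p ^ q for p, q in zip(x, y))


def garble_gate(gate):
    """Alice: pick random labels for each wire value and encrypt the truth table."""
    labels = {
        wire: (secrets.token_bytes(LABEL_LEN), secrets.token_bytes(LABEL_LEN))
        for wire in ("a", "b", "out")
    }
    rows = []
    for bit_a in (0, 1):
        for bit_b in (0, 1):
            out_label = labels["out"][gate(bit_a, bit_b)]
            pad = _pad(labels["a"][bit_a], labels["b"][bit_b])
            rows.append(_xor(pad, out_label + TAG))
    # Shuffle so that the row position reveals nothing about the inputs.
    secrets.SystemRandom().shuffle(rows)
    return labels, rows


def evaluate_gate(rows, label_a, label_b):
    """Bob: try every row; only the row matching his two labels decrypts cleanly."""
    pad = _pad(label_a, label_b)
    for row in rows:
        plain = _xor(pad, row)
        if plain[LABEL_LEN:] == TAG:
            return plain[:LABEL_LEN]
    raise ValueError("no row could be decrypted with these labels")


def decode_output(labels, out_label):
    """Alice: map the output label Bob sends back to the plain output bit."""
    return labels["out"].index(out_label)


if __name__ == "__main__":
    labels, rows = garble_gate(lambda a, b: a & b)   # garbled AND gate
    # Bob holds one label per input wire and never sees the bits behind them.
    out = evaluate_gate(rows, labels["a"][1], labels["b"][1])
    print(decode_output(labels, out))
```

Bob only handles random-looking labels and ciphertext rows, which mirrors the claim that the two parties exchange nothing but random numbers and ciphertext.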

2.1.2 Secret Sharing

Secret sharing (also known as secret splitting) is a method for distributing a secret among a group of participants, each of whom is assigned one share of the secret. The secret can be reconstructed only when a sufficient number of distinct shares are combined; individual shares on their own reveal nothing.

The most classic secret sharing algorithm is Shamir’s Secret Sharing. Its basic design principle is that k points on a plane uniquely determine a polynomial of degree k−1: f(x) = a0 + a1·x + a2·x^2 + ... + a(k−1)·x^(k−1).

For example, two points uniquely determine a line. We take the constant term a0 of the polynomial as the secret S. We then pick n points (i, f(i)) on the curve and give one point to each participant as a share of the secret. Any k participants can then recover the secret S. The polynomial can be reconstructed with Lagrange interpolation, which is not covered in detail here.
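The split-and-recover procedure can be sketched as below. This is an illustrative version that works over a prime field (the usual way to make Shamir's scheme exact and information-theoretically hiding); the modulus 2^127 − 1 and the function names `split_secret` and `recover_secret` are choices made for this example, not taken from the article.

```python
import secrets

PRIME = 2**127 - 1   # a Mersenne prime; all arithmetic is done modulo PRIME


def split_secret(secret, k, n):
    """Create n shares (i, f(i)) of a random degree k-1 polynomial with f(0) = secret."""
    coeffs = [secret] + [secrets.randbelow(PRIME) for _ in range(k - 1)]

    def f(x):
        # Horner evaluation of a0 + a1*x + ... + a(k-1)*x^(k-1) mod PRIME.
        acc = 0
        for c in reversed(coeffs):
            acc = (acc * x + c) % PRIME
        return acc

    return [(i, f(i)) for i in range(1, n + 1)]


def recover_secret(shares):
    """Recover f(0) from k shares with Lagrange interpolation at x = 0."""
    secret = 0
    for j, (xj, yj) in enumerate(shares):
        num, den = 1, 1
        for m, (xm, _) in enumerate(shares):
            if m != j:
                num = num * (-xm) % PRIME
                den = den * (xj - xm) % PRIME
        # pow(den, -1, PRIME) is the modular inverse (Python 3.8+).
        secret = (secret + yj * num * pow(den, -1, PRIME)) % PRIME
    return secret


if __name__ == "__main__":
    shares = split_secret(123456789, k=3, n=5)
    # Any 3 of the 5 shares are enough to rebuild the secret.
    print(recover_secret([shares[0], shares[2], shares[4]]))
```

With fewer than k shares the interpolation yields an unrelated field element, which is why individual shares carry no usable information about S.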

2.1.3 Homomorphic Encryption

The concept of homomorphic encryption was first proposed in 1978, in the context of banking applications, by Ron Rivest and Leonard Adleman (the R and A of the RSA algorithm) together with Michael L. Dertouzos. A concise definition comes from Craig Gentry, a leading authority on homomorphic encryption:

“A way to delegate processing of your data, without giving away access to it.”

That is, data can be processed without access to the original data. In essence, the processor operates directly on the ciphertext, and decrypting the result yields the same plaintext as processing the plaintext directly would. The subtlety is that the data processor never learns the plaintext yet still computes the result the business needs; the data provider does not disclose its original data, so data privacy is effectively protected.

The mathematical definition of homomorphic encryption is: E(m1) * E(m2) = E(m1 * m2), ∀ m1, m2 ∈ M

Here E is the encryption algorithm, M is the set of all messages, and * is the operation. If an encryption algorithm E satisfies the formula above, E is said to be homomorphic with respect to the operation *.
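A quick way to see the property in action is textbook RSA, which is homomorphic with respect to multiplication. The sketch below uses deliberately tiny, insecure parameters (p = 61, q = 53, e = 17) and no padding; it only illustrates the formula and is not how production systems encrypt data. The function names are hypothetical.

```python
# Toy textbook-RSA parameters, far too small for real use.
P_PRIME, Q_PRIME = 61, 53
N = P_PRIME * Q_PRIME                    # public modulus
PHI = (P_PRIME - 1) * (Q_PRIME - 1)
E = 17                                   # public exponent
D = pow(E, -1, PHI)                      # private exponent (Python 3.8+)


def encrypt(m):
    return pow(m, E, N)


def decrypt(c):
    return pow(c, D, N)


def multiply_ciphertexts(c1, c2):
    """Multiply two ciphertexts without ever seeing the plaintexts."""
    return (c1 * c2) % N


if __name__ == "__main__":
    c1, c2 = encrypt(7), encrypt(6)
    # The processor multiplies ciphertexts; only the key owner can decrypt.
    print(decrypt(multiply_ciphertexts(c1, c2)))   # 42, the same as 7 * 6
```

The result matches the plaintext product as long as that product is smaller than the modulus N; here the operation * in the definition is multiplication modulo N on both sides.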

HE schemes can be classified by the types of operations they support and how many times those operations can be applied: partially homomorphic encryption (PHE), somewhat homomorphic encryption (SWHE) and fully homomorphic encryption (FHE). PHE and SWHE have already been applied in industrial production environments, whereas FHE is still inefficient and cannot yet support large-scale computing.

2.2 TEE: Hardware Trusted Execution Environment

A trusted execution environment (TEE) is a secure area of the main processor. It guarantees that the code and data loaded inside it are protected with respect to confidentiality and integrity. As an isolated execution environment, a TEE provides security features such as isolated execution, integrity of the applications running in it, and confidentiality of their assets. The core mechanism behind its security guarantee is an instruction set extension: security is enforced by hardware and does not depend on the security state of firmware or software.

Figure source: Gidon Gershinsky, “Trust Management in Intel SGX Enclaves”

Intel® Software Guard Extensions (Intel® SGX) protect selected code and data from disclosure and modification. Developers can partition an application into CPU-hardened enclaves that improve security even on a compromised platform (operating system or virtual machine). With this application-layer trusted execution environment, developers can protect identity and record privacy, enable secure browsing and digital rights management (DRM), or harden any high-security scenario that requires secure storage of confidential or protected data.

Besides Intel SGX, TEE solutions include ARM’s TrustZone, AMD’s Secure Encrypted Virtualization (SEV) and NVIDIA’s Trusted Little Kernel (TLK). The core of every vendor’s hardware-based solution is to keep the attack surface as small as possible: the CPU boundary becomes the perimeter of the attack surface, and all data, memory and I/O outside that perimeter are encrypted.

2.3 Comparison of MPC and TEE

The MPC scheme and the TEE scheme compare as follows:

2.4 Classification of federated learning

Federated learning is defined as follows: during machine learning, each participant can build a model jointly with the help of other parties' data. No party needs to share its data resources; that is, data never leaves its local environment, yet the parties can train jointly and build a shared machine learning model. Federated learning falls into three categories:

  • Horizontal federated learning (partitioned along the user dimension) suits cases where two datasets share the same feature space but have different sample ID spaces. Training uses the data from both parties that has identical features but non-identical users.

  • Vertical federated learning (partitioned along the feature dimension) suits cases where two datasets share the same sample ID space but have different feature spaces. Training uses the data from both parties that covers the same users but different user features.

  • Federated transfer learning applies when two datasets differ in both the sample space and the feature space. In this scenario the data is not partitioned; instead, transfer learning is used to overcome the shortage of data or labels.
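The three categories differ only in which dimension the two parties' datasets overlap on. As a rough illustration, the sketch below measures the overlap of user IDs and of feature names and suggests a category. The Jaccard measure, the 0.5 threshold and the function names are assumptions made for this example; they are not part of any federated learning framework.

```python
def feature_names(dataset):
    """Collect the feature names used in a {user_id: {feature: value}} dataset."""
    names = set()
    for row in dataset.values():
        names |= set(row)
    return names


def jaccard(x, y):
    """Overlap ratio of two sets: |x & y| / |x | y|."""
    union = x | y
    return len(x & y) / len(union) if union else 0.0


def suggest_setting(party_a, party_b, threshold=0.5):
    """Suggest a federated learning category from ID and feature overlap."""
    user_overlap = jaccard(set(party_a), set(party_b))
    feature_overlap = jaccard(feature_names(party_a), feature_names(party_b))
    if feature_overlap >= threshold and user_overlap < threshold:
        return "horizontal"   # same features, different users
    if user_overlap >= threshold and feature_overlap < threshold:
        return "vertical"     # same users, different features
    if user_overlap < threshold and feature_overlap < threshold:
        return "transfer"     # little overlap in either dimension
    return "horizontal or vertical"   # heavy overlap in both dimensions


if __name__ == "__main__":
    bank = {"u1": {"age": 31, "income": 9000}, "u2": {"age": 45, "income": 12000}}
    shop = {"u1": {"clicks": 12, "orders": 3}, "u2": {"clicks": 4, "orders": 1}}
    print(suggest_setting(bank, shop))   # same users, different features
```

In practice the user-ID overlap itself would be computed privately, for example with a private set intersection, rather than by comparing raw IDs as this toy does.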

3. Baidu's federated computing business

3.1 Features of Baidu federated computing

As described in Baidu’s white paper on secure computing, Baidu’s main product innovations in data security and privacy protection include:

  • Organic integration of multiple technologies, covering multiple data security and privacy protection scenarios

The platform integrates MPC, TEE, differential privacy (DP) and other leading technologies into a data security solution for secure multi-party computation. While protecting enterprise data assets, it effectively guards against the leakage of user privacy and covers a wide range of data security and privacy protection scenarios.

  • A dedicated DSL for secure multi-party computation, with a secure and controllable “electronic contract” execution mechanism

The platform defines a dedicated DSL for secure multi-party computation that describes the complex logic of the entire joint computation over multi-party data. The resulting program acts as an “electronic contract” that can be executed only after every participant has confirmed it. Each participant therefore knows exactly how its data will be used, and the combination with the multi-party security scheme keeps data secure and under control.

  • Deep optimization of secure multi-party computation to support massive data

To meet Baidu’s demand for secure computing at scale, the platform has gone through large-scale engineering work and a variety of performance optimizations, and supports secure multi-party computation over billions of records. It handles a wide range of secure multi-party computation scenarios and provides technical support for taking the technology from academia into industry.

3.2 Typical business scenarios of Baidu federated computing

The federated computing model can be applied to advertising and marketing, where audience targeting is an important branch. Federated precision audience targeting computes a PSI (Private Set Intersection) over IDs drawn from the full data sets of both parties. Ads are then delivered on the media side based on the customer's precise data, so that customer data is “usable but invisible”: the parties “meet without recognizing each other”, which effectively protects the privacy of customer data.
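One common way to build a PSI is Diffie-Hellman-style double blinding: each side raises hashed IDs to its own secret exponent, and because exponentiation commutes, IDs held by both sides end up with identical doubly blinded values. The sketch below runs both roles in one process for illustration. It is not Baidu's protocol; the 127-bit modulus is a toy parameter that is far too small for real use, and every function name is hypothetical.

```python
import hashlib
import math
import secrets

P = 2**127 - 1   # toy prime modulus; real systems use much larger groups or elliptic curves


def hash_to_group(identifier):
    """Map an ID string to an integer in [1, P-1]."""
    digest = hashlib.sha256(identifier.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % (P - 1) + 1


def random_exponent():
    """Pick a secret exponent coprime to P-1 so that blinding is one-to-one."""
    while True:
        e = secrets.randbelow(P - 3) + 2
        if math.gcd(e, P - 1) == 1:
            return e


def blind(values, exponent):
    return [pow(v, exponent, P) for v in values]


def private_set_intersection(ids_a, ids_b):
    """Return the IDs of party A that party B also holds."""
    key_a, key_b = random_exponent(), random_exponent()

    # A -> B: A's hashed IDs, blinded with A's key.
    a_once = blind([hash_to_group(x) for x in ids_a], key_a)

    # B -> A: B's own blinded IDs (shuffled) and A's values blinded a second time.
    b_once = blind([hash_to_group(y) for y in ids_b], key_b)
    secrets.SystemRandom().shuffle(b_once)
    a_twice = blind(a_once, key_b)        # order preserved so A can map back

    # A blinds B's values with its own key and compares doubly blinded values.
    b_twice = set(blind(b_once, key_a))
    return [x for x, v in zip(ids_a, a_twice) if v in b_twice]


if __name__ == "__main__":
    advertiser_ids = ["u1", "u2", "u3"]
    media_ids = ["u3", "u4", "u1"]
    print(private_set_intersection(advertiser_ids, media_ids))
```

Neither side sees the other's raw IDs: only blinded group elements cross the boundary, and IDs outside the intersection stay hidden behind the other party's secret exponent.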

External partners such as advertisers store their data on their own servers or cloud storage, while Baidu’s data stays within the Baidu domain. The BFC (Baidu Federated Computing) primary node coordinates all computing nodes without touching any local data. With no data leaving its domain, the computing nodes exchange encrypted information such as parameters to complete the computation of the business model. Marketing case studies show that federated precision audience models built on the customer's big data can greatly improve the customer's ROI.

To solve data security and compliance problems in joint marketing, the joint data circulation service, built on Baidu Security, uses federated computing to open a “joint marketing green channel” for Starwatch. With the sensitive data of every party guaranteed never to leave its domain, Baidu Starwatch data and advertiser data are securely connected through “usable but invisible” secure computation, enabling joint precision marketing.

Looking ahead, federated computing and federated learning can empower AI marketing across the whole advertising and marketing chain (pre-campaign insight, in-campaign reach and post-campaign analysis), accumulating and activating data assets and maximizing the value of data while staying compliant with data privacy protection, for a three-way win among users, customers and media.

Author | Chong-jie Wang, senior R&D engineer in the R&D department of Baidu Commercial Platform, with a long-term focus on Internet advertising and marketing. Technical interests include big data processing, distributed system architecture, middleware design and network data security.
