Abstract:PSI is the full term for Private Set Intersection (PSI), which means that two parties holding data can calculate the Intersection of their data sets without exposing any data Set information other than the Intersection.

This article was shared from the Huawei Cloud Community “A Brief Discussion on PSI Privacy Collections for Interchange”, originally written by: tics Magic Conch.

1, the introduction of

PSI is the full term for Private Set Intersection (PSI), which means that two parties holding data can calculate the Intersection of their data sets without exposing any data Set information other than the Intersection.

PSI usually has the following three characteristics:

  1. Semi-credible scenario: The data parties do not want to expose all the data and only want to find the intersection of the data sets
  2. Data minimization: Data other than the intersection of data sets cannot be leaked to either party
  3. Secure Two-party Computing: Both parties involved in computing need to jointly implement a set of secure computing protocols to ensure the security of data.

PSI can be implemented in a variety of ways, and the following are some of the common ones and their complexity.

2. Simple cases

Based on the data selected by both parties and the fields that uniquely identify the data (you can understand the primary key, such as ID, ID, cell phone number), find the records that are common to both datasets and store them in the same order as the alignment results.

For example, there are two tables A and B, respectively

Table A Personnel Deposit Form:

Table B Summary of Consumption:

PSI was carried out by both parties through the ID card field, and the final common records were calculated as the three marked in red. The results were as follows:

In this process, Party A does not want Party B to know the bank card deposit of the intersection data, and Party B does not want Party A to know the annual consumption amount of the intersection data. At the same time, Party A should not know that Party B has users with “01234” ID card, and vice versa. Both parties should only know that the ID in the result is the intersection of the data set.

3. Technical Principle

The following is a brief introduction to PSI implemented using pseudo-random functions.

Suppose there are two squares, A and B, with data ID sets of X and Y, respectively.

  1. H() means that A and B hash their data ID set once to ensure that the PSI calculated data of both parties are of the same length
  2. B squared uses the random factor R generated by the pseudo-random function, times its own H of Y, and sends it to A squared
  3. Party A uses the key k generated by the pseudorandom function to multiply by its own H(X) and B1 sent by Party B respectively to get A and B2, and then sends both calculation results to Party B
  4. B squared is taking the inverse of the random factor r minus 1 times B2, canceling the random factor r, and you get B
  5. A and B use the same key k to encrypt, then ciphertext comparison can be performed to calculate the intersection.

4. Application scenarios

Online advertising is an important form of advertising. A common way to measure the effectiveness of an AD is to calculate what’s known as the conversion rate, which is how many of the users who viewed the AD ended up visiting the product page or buying the product or service. A common approach is to calculate the intersection of the user information that viewed the AD (owned by the advertiser) and the user information that completed the corresponding transaction (owned by the merchant) (for example, by calculating the total amount of the transaction, or the total number of transactions).

Looking for Contacts

When a user signs up for a new service (WeChat, WhatsApp, etc.), it is necessary in most cases to find out from the user’s existing contacts who have already signed up for the same service. This can be done effectively by sending the user’s contacts to the service provider, but at the same time the user’s contact information, which would be considered private in most cases, is also exposed to the service provider. Therefore, in this scenario, the PSI protocol with the user’s contact information as input to one party and all the user information of the service provider as input to the other party can complete the function of contact discovery and prevent information outside the intersection from leaking to either party.

Federated learning sample alignment

Before federated learning initiates training, PSI must be conducted based on the data of both parties. User information shared by both parties (such as user ID) is used to find the intersection, so as to correspond to the characteristics and labels of the data of the two parties. Model training on the aligned data sets makes sense.

5, reference

Privacy protection set intersection technology PSI – xiao deer (https://blog.alienx.cn/2020/1… Cui Hongrui Liu Tianyi, yu yu, the stronger the routine, Zhang Yulong, WeiTao: multiparty secure computing hot – privacy protection set intersection technology (PSI) analysis report (https://anquan.baidu.com/uplo…

Click on the attention, the first time to understand Huawei cloud fresh technology ~