Abstract: Private Set Intersection (PSI) lets two parties that each hold a data set compute the intersection of their sets without revealing anything about either set beyond the intersection.

This article is shared from the HUAWEI CLOUD community post "on PSI privacy collection"; original author: tics magic conch.

1. Introduction

PSI stands for Private Set Intersection. It allows two parties that each hold a data set to compute the intersection of the two sets without exposing any information about either set beyond the intersection.

PSI usually has the following three characteristics:

  1. Semi-trusted scenario: neither party is willing to expose its full data set; each only wants to find the intersection of the two sets.
  2. Data minimization: no data outside the intersection may be leaked to either party.
  3. Secure two-party computation: the two parties jointly run a secure computation protocol to ensure data security.

There are many ways to implement PSI. The following figure lists some common approaches and their complexity.
[Figure: common PSI implementation approaches and their complexity]

2. Simple case

Using the data each party selects and a field that uniquely identifies each record (effectively a primary key, such as an ID, ID card number, or mobile phone number), the two parties find the records their data sets have in common, arrange them in the same order, and store them as the alignment result.

For example, parties A and B hold tables a and b, respectively.

Table a, staff deposit table:
[Table image: staff deposit table]

Table b, consumption summary table:
[Table image: consumption summary table]

The two parties run PSI on the ID field and find three shared records (the ones marked in red). The result is as follows:
[Table image: PSI result]

In this process, Party A does not want Party B to learn the bank deposits of the users in the intersection, and Party B does not want Party A to learn their annual consumption or any other data. Party A should also not learn that Party B holds a user with ID "01234", and vice versa. Both parties should learn only that the IDs in the result form the intersection of the two data sets.
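Leaving privacy aside for a moment, the alignment result itself can be sketched in a few lines of Python. This is a plaintext reference that only shows what the output looks like; it is not a PSI protocol, and the rows below are hypothetical because the original tables are only available as images.

```python
# Hypothetical rows; the article's original tables are not reproduced here.
table_a = {  # Party A: ID -> bank deposit
    "01231": 12000,
    "01232": 5300,
    "01233": 800,
    "01235": 4100,
}
table_b = {  # Party B: ID -> annual consumption
    "01231": 30000,
    "01232": 7600,
    "01233": 1500,
    "01234": 9900,
}


def shared_ids(a, b):
    """Return the IDs present in both tables, in one fixed (sorted) order."""
    return sorted(a.keys() & b.keys())


aligned_ids = shared_ids(table_a, table_b)

# Each party keeps only its own columns, arranged in the shared order.
rows_a = [(i, table_a[i]) for i in aligned_ids]
rows_b = [(i, table_b[i]) for i in aligned_ids]
```

A real PSI protocol should produce the same `aligned_ids` without either party ever seeing the other's table, so that, for instance, Party A never learns that "01234" exists on Party B's side.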

3. Technical principle

The following briefly introduces a PSI protocol implemented with pseudo-random functions.
[Figure: PSI protocol based on pseudo-random functions]

Suppose parties A and B hold the data ID sets X and Y, respectively.

  1. H() denotes hashing: both parties hash their own ID sets so that the values used in the PSI computation have the same length.
  2. Party B takes a random factor r generated by the pseudo-random function, multiplies its own H(Y) by r to get B1, and sends B1 to Party A.
  3. Party A takes a key k generated by the pseudo-random function and multiplies both its own H(X) and the B1 received from Party B by k, obtaining the values A and B2, and then sends both results to Party B.
  4. Party B multiplies B2 by r^-1, the inverse of the random factor r, to eliminate r and obtain the value B.
  5. The values A and B are now both encrypted under the same key k, so the ciphertexts can be compared directly to compute the intersection.
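The five steps can be sketched in Python as follows. This is a minimal illustration under several assumptions that are not in the original article: the "multiplication" is realized as modular exponentiation in the multiplicative group modulo the Mersenne prime 2**521 - 1, H() is SHA-512, and the secrets r and k are drawn with the `secrets` module. Both parties run inside one function for brevity. The parameters are toy values chosen for readability, not a vetted secure group.

```python
import hashlib
import math
import secrets

# Toy group (assumption): integers modulo the Mersenne prime 2**521 - 1.
P = 2**521 - 1
ORDER = P - 1  # order of the multiplicative group modulo P


def hash_to_group(item: str) -> int:
    """H(): map an ID to a fixed-length, nonzero group element."""
    digest = hashlib.sha512(item.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % P or 1


def random_exponent() -> int:
    """Pick a secret exponent that is invertible modulo the group order."""
    while True:
        e = secrets.randbelow(ORDER - 2) + 2
        if math.gcd(e, ORDER) == 1:
            return e


def psi(ids_a, ids_b):
    """Return the IDs of party B that also appear in party A's set."""
    ids_b = list(ids_b)

    # Step 2 (party B): blind each H(y) with r to get B1, sent to A.
    r = random_exponent()
    b1 = [pow(hash_to_group(y), r, P) for y in ids_b]

    # Step 3 (party A): apply key k to its own hashes (A) and to B1 (B2).
    k = random_exponent()
    a_vals = {pow(hash_to_group(x), k, P) for x in ids_a}
    b2 = [pow(v, k, P) for v in b1]

    # Step 4 (party B): remove r with its inverse modulo the group order.
    r_inv = pow(r, -1, ORDER)
    b_vals = [pow(v, r_inv, P) for v in b2]

    # Step 5 (party B): values under the same key k can be compared.
    return {y for y, v in zip(ids_b, b_vals) if v in a_vals}
```

For example, `psi(["001", "002", "003"], ["002", "003", "01234"])` yields the set containing "002" and "003". Party A only ever sees the blinded values B1, and Party B only sees A's IDs keyed under k, which is the property the protocol relies on.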

4. Application scenarios

Calculate the actual effect of advertising

Online advertising is an important form of advertising. A common way to measure the effectiveness of an advertisement is the conversion rate: how many of the users who viewed the advertisement went on to view the corresponding product page or to purchase the corresponding product or service. A typical approach is to compute the intersection of the users who viewed the advertisement (information held by the party that served the advertisement) and the users who completed the corresponding transaction (information held by the merchant), and then derive statistics over that intersection, such as the total transaction amount or the total number of transactions.
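The statistics described here can be written down as a short Python sketch. It is a plaintext reference for the quantities only: the function name, the input shapes, and the sample IDs are hypothetical, and in practice the intersection would be obtained through a PSI protocol rather than by sharing raw lists.

```python
def conversion_stats(viewer_ids, purchases):
    """Compute ad conversion statistics over the shared users.

    viewer_ids: IDs of users who viewed the ad (held by the ad side).
    purchases:  mapping of user ID -> transaction amount (held by the merchant).
    Returns (conversion rate, number of converted users, total amount).
    """
    viewers = set(viewer_ids)
    converted = viewers & purchases.keys()
    rate = len(converted) / len(viewers) if viewers else 0.0
    total = sum(purchases[i] for i in converted)
    return rate, len(converted), total


# Hypothetical sample data.
stats = conversion_stats(
    ["u1", "u2", "u3", "u4"],
    {"u2": 50.0, "u4": 20.0, "u9": 99.0},
)
```

Here two of the four viewers appear among the purchasers, so the conversion rate is 0.5 and the total transaction amount over the intersection is 70.0; user "u9" bought without viewing the ad and is not counted.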

Find a contact

When a user registers for a new service (such as WeChat or WhatsApp), it is usually necessary to find out which of the user's existing contacts have already registered for the same service. This can be done simply by sending the user's contacts to the service provider, but doing so exposes the user's contact information, which is considered private in most cases, to the service provider. In this scenario, running a PSI protocol with the user's contact information as one party's input and the service provider's full user list as the other party's input completes contact discovery while preventing any information outside the intersection from being disclosed to either party.

Federated learning sample alignment

Before federated learning starts training, the two parties must run PSI on their data, using user information they have in common (such as user IDs) to find the intersection. This matches up the features and labels held by the two parties so that the model is trained on an aligned data set, which makes this alignment a significant prerequisite for training.

5. References

Xiaolu. Privacy Protection Set Intersection Technology PSI (https://blog.alienx.cn/2020/10/10/E10101535/)
Cui Hongrui, Liu Tianyi, Yu Yu, Cheng Yueqiang, Zhang Yulong, Wei Tao. Multi-party secure computing hotspot: Privacy Preserving Set Intersection Technology (PSI) Analysis Research Report (https://anquan.baidu.com/upload/ue/file/20190814/1565763561975581.pdf)
