Privacy Preserving Vertical Federated Learning for Tree-based Models
4 minute read ∼ Filed in: A paper note

Questions
-
Why use PHE instead of SWHE or FHE?
-
If the active party alone initialized the public and secret keys and sent them to all parties, the active party could decrypt any information.
What is auth server?
-
Why is the adversary assumed to corrupt at most m-1 parties?
If one party could gather the data of all other parties, federated learning would be unnecessary.
-
Why encrypt values and then convert them to secret shares, instead of converting plaintext values such as WX directly into secret shares?
Because all shares are sent to a single party for computation; without PHE encryption, that party could recover the plaintext of all values.
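To make the conversion concrete, here is a minimal sketch of turning an (encrypted) value into additive secret shares by masking: plain modular arithmetic stands in for the homomorphic operations that Pivot actually performs on Paillier ciphertexts, and all names are illustrative, not from the paper's implementation.

```python
import random

Q = 2 ** 61 - 1  # modulus of the secret-sharing domain (illustrative choice)

def cipher_to_shares(x, num_parties):
    """Sketch: each other party adds a random mask (homomorphically, in the real
    protocol); joint decryption then reveals only x + sum(masks), which becomes
    one share, while each masking party keeps -mask as its own share."""
    masks = [random.randrange(Q) for _ in range(num_parties - 1)]
    masked = (x + sum(masks)) % Q          # what joint decryption would reveal
    return [masked] + [(-m) % Q for m in masks]

def reconstruct(shares):
    return sum(shares) % Q

shares = cipher_to_shares(42, 3)
assert reconstruct(shares) == 42
```

The point of the encryption step is that the masking happens on ciphertexts: no single party ever sees the unmasked plaintext, only the randomized value `x + sum(masks)`.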
-
Why use AS3 instead of only PHE?
To support richer computations that PHE cannot express.
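To make "richer computations" concrete, here is a toy textbook Paillier (tiny, insecure parameters, purely for illustration): addition of ciphertexts and multiplication by a plaintext scalar are free, but operations such as comparisons (e.g., picking the best split by impurity gain) cannot be evaluated on ciphertexts alone, which is where secret sharing and MPC come in.

```python
from math import gcd
import random

# Toy Paillier keypair; real deployments use ~1024-bit primes.
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)       # lcm(p-1, q-1)
mu = pow((pow(n + 1, lam, n2) - 1) // n, -1, n)    # decryption constant

def enc(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def dec(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# Additive homomorphism: Enc(a) * Enc(b) decrypts to a + b.
assert dec(enc(17) * enc(25) % n2) == 42
# Scalar multiplication: Enc(a)^k decrypts to k * a.
assert dec(pow(enc(7), 6, n2)) == 42
# But there is no ciphertext operation for "is a < b?" -- hence MPC.
```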
Abstract & Introduction
The paper proposes an algorithm for decision tree training in VFL; it protects against a semi-honest adversary.
Motivation
Existing work mainly focuses on the horizontal setting, but VFL is also needed. Meanwhile, existing VFL algorithms have limitations in either efficiency or data privacy:
- Some work requires labels to be available in plaintext to all parties.
- Some work assumes intermediate results can be revealed in plaintext.
- Some work relies on secure hardware, but the hardware may be trusted by only some parties and can be vulnerable to side-channel attacks.
- Some work uses MPC but assumes clients can outsource their data to non-colluding servers.
Contribution
Proposes Pivot, a system that does not rely on any trusted third party.
It is a hybrid framework that utilizes both threshold partially homomorphic encryption (TPHE) and MPC.
- TPHE: low communication cost, but supports only a limited set of computations.
- MPC: supports arbitrary computations, but has high communication overhead.
It uses TPHE as much as possible for local computations and only invokes MPC in places where TPHE is inadequate in terms of functionality.
The results demonstrate good accuracy comparable to non-private algorithms and high efficiency.
Pivot’s basic and enhanced protocols achieve up to 37.5x and 4.5x speedup (w.r.t. training time) over an MPC baseline.
Preliminaries
TPHE
MPC
MPC allows participants to jointly compute a function over their inputs while keeping those inputs private.
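A minimal sketch of the additive-secret-sharing flavor of MPC used here (illustrative names, toy modulus): two parties learn the sum of their inputs without either seeing the other's input, because addition can be done locally on shares.

```python
import random

Q = 2 ** 61 - 1  # modulus of the sharing domain (illustrative choice)

def share(x, k=2):
    """Split x into k additive shares over Z_Q."""
    parts = [random.randrange(Q) for _ in range(k - 1)]
    return parts + [(x - sum(parts)) % Q]

def reveal(shares):
    return sum(shares) % Q

x_sh = share(30)   # party A secret-shares its input
y_sh = share(12)   # party B secret-shares its input
# Each party adds the two shares it holds -- a purely local step, no messages.
z_sh = [(a + b) % Q for a, b in zip(x_sh, y_sh)]
assert reveal(z_sh) == 42   # only the final result is reconstructed
```

Non-linear operations (multiplications, comparisons) need extra interaction, which is why MPC has the high communication overhead mentioned above.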
Solution Overview
Assume:
- Pivot targets the vertical federated learning scenario, where all parties hold the same sample IDs but different features.
- The label set Y is held only by the active party and cannot be directly shared with other parties.
- Semi-honest model: all parties follow the prespecified protocol exactly, but may try to infer extra information from the messages they see.
Protocol overview.
- Protocol Initialization:
- All parties reach a consensus on aligning the samples.
- All parties jointly generate the keys of TPHE; each party holds the public key and a share of the secret key.
- Begin computation. In each iteration,
Plaintext => ciphertext => secretly shared values => compute => secretly shared values => ciphertext
After training, each client will get a tree model in plaintext in the basic algorithm.
In the enhanced algorithm, the model is in a secretly shared form.
Basic Protocol
Computation
- Encrypted mask vector indicating which samples are available at the current node.
- Local computations
- MPC computations
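The encrypted mask vector idea can be sketched as follows: with the mask bits kept under (toy, insecure-parameter) Paillier encryption, the active party can homomorphically compute a class count at a tree node without learning which of its samples reached that node. All names here are illustrative, not from the paper's code.

```python
from math import gcd
import random

# Toy Paillier setup (insecure parameters, illustration only).
p, q = 293, 433
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)
mu = pow((pow(n + 1, lam, n2) - 1) // n, -1, n)

def enc(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def dec(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

mask = [1, 0, 1, 1, 0]     # which samples reached this node (kept encrypted)
labels = [1, 0, 0, 1, 1]   # plaintext class-1 indicators held by one party
enc_mask = [enc(m) for m in mask]

# Homomorphic dot product: multiplying Enc(m_i)^{y_i} adds m_i * y_i under HE.
acc = enc(0)
for c, y in zip(enc_mask, labels):
    acc = acc * pow(c, y, n2) % n2
assert dec(acc) == 2       # two class-1 samples are present at this node
```

In the real protocol decryption is thresholded, so no single party can open `acc` (or the mask) on its own; such aggregated statistics then feed the MPC step that picks the best split.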
Security guarantees
TPHE => local computations and model updates are secure.
MPC => the MPC computation step is secure.
Enhanced Protocol
In the basic protocol, the final model is in plaintext, which could leak label and feature-value information: a client can partition the sample set using the split information in the model together with its own dataset.
The enhanced protocol keeps the model in secretly shared form and provides methods for prediction with such a model.
Experiments
Datasets:
Baselines:
- Accuracy: compared with sklearn, using MSE/precision.
- Efficiency: compared with a pure secret-sharing-based algorithm implemented with MPC, measuring
  - the total running time of the model training stage, and
  - the running time per sample of the model prediction stage.
Accuracy
Efficiency of training
m: number of clients
n: number of samples
d: number of features
b: number of splits
h: tree depth
W: number of trees
Efficiency of inference
Further protection
- Against malicious adversaries, the system can be extended with zero-knowledge proofs (ZKP) and SPDZ with authenticated shares.