Privacy Preserving Vertical Federated Learning for Tree-based Models

Posted on December 6, 2021 · 4 minute read

Questions

  1. Why use PHE instead of SWHE or FHE?

  2. If the active party initialized the public and secret keys and sent them to all parties, the active party could decrypt any information.

    What is an auth server?

  3. Why is the adversary allowed to corrupt up to m-1 parties?

    Because if one party already gathered all data from all other parties, FL would be unnecessary.

  4. Why encrypt values and then convert them to secret shares, instead of directly converting plaintext values such as WX into secret shares?

    Because all shares are sent to a single party for computation; without PHE encryption, that party could recover the plaintext of all values.

  5. Why use AS3 instead of only PHE?

    To support richer computations: PHE supports only addition and scalar multiplication, while secret sharing-based MPC supports arbitrary computations (e.g., comparison and division).

Abstract & Introduction

The paper proposes Pivot, an algorithm for decision tree training in VFL that protects against a semi-honest adversary.

Motivation

Existing work mainly focuses on the horizontal setting, while VFL is also needed. Meanwhile, existing VFL algorithms have limitations in either efficiency or data privacy:

  1. Some work requires labels to be available in plaintext to all parties.
  2. Some work assumes intermediate results can be revealed in plaintext.
  3. Some work relies on secure hardware, but the hardware may only be trusted by some parties and can be vulnerable to side-channel attacks.
  4. Some work uses MPC but assumes that clients can outsource their data to non-colluding servers.

Contribution

Propose a system that does not rely on any trusted third party.

It is a hybrid framework that utilizes both threshold partially homomorphic encryption (TPHE) and MPC.

  1. TPHE: low communication cost, but supports only a limited set of computations (addition and multiplication by a plaintext scalar).
  2. MPC: supports arbitrary computations, but has high communication overhead.

It uses TPHE as much as possible for local computations and only invokes MPC in places where TPHE is inadequate in terms of functionality.
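To make this trade-off concrete, here is a minimal sketch of what PHE supports, using the python-paillier package (plain, non-threshold Paillier; Pivot's threshold variant is not provided by this package):

```python
# Minimal sketch of PHE's supported operations, using python-paillier.
# Note: this is plain Paillier; Pivot uses a *threshold* variant in which
# decryption requires all parties' key shares.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

x, y = 12, 30
enc_x = public_key.encrypt(x)
enc_y = public_key.encrypt(y)

# Supported: ciphertext addition and multiplication by a plaintext scalar.
enc_sum = enc_x + enc_y      # decrypts to x + y
enc_scaled = enc_x * 5       # decrypts to 5 * x

assert private_key.decrypt(enc_sum) == x + y
assert private_key.decrypt(enc_scaled) == 5 * x

# Not supported: enc_x * enc_y (ciphertext-ciphertext multiplication)
# or comparisons -- exactly where Pivot falls back to MPC.
```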

The results demonstrate accuracy comparable to non-private algorithms, with high efficiency.

Pivot’s basic and enhanced protocols achieve up to 37.5x and 4.5x speedup (w.r.t. training time) over an MPC baseline.

Preliminaries

TPHE

image-20220623151445071

MPC

MPC allows participants to jointly compute a function over their inputs while keeping those inputs private.

[Figure: MPC overview]
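A minimal sketch of additive secret sharing, the primitive behind this kind of secret sharing-based MPC (the modulus P and all names are illustrative):

```python
# Additive secret sharing over a prime field: a value is split into random
# shares that sum to it; no proper subset of shares reveals anything.
import secrets

P = 2**61 - 1  # illustrative prime modulus; real frameworks fix their own field

def share(x, n_parties):
    shares = [secrets.randbelow(P) for _ in range(n_parties - 1)]
    shares.append((x - sum(shares)) % P)  # last share makes the sum equal x
    return shares

def reconstruct(shares):
    return sum(shares) % P

x_shares = share(42, 3)
y_shares = share(100, 3)

# Addition of two shared values is purely local: each party adds its shares.
z_shares = [(a + b) % P for a, b in zip(x_shares, y_shares)]
assert reconstruct(z_shares) == 142
```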

Solution Overview

Assume:

  1. Pivot focuses on the vertical federated learning scenario, where all parties hold the same set of samples (aligned by sample ID) but different features.
  2. The label set Y is held only by the active party and cannot be directly shared with the other parties.
  3. Semi-honest model: all parties follow the prespecified protocol exactly, but may try to infer extra information from the messages they receive.

Protocol overview.

  1. Protocol Initialization:
    • All parties reach a consensus on the sample alignment.
    • All parties jointly generate the TPHE keys; each party holds the public key and a share of the secret key.
  2. Begin computation. In each iteration: plaintext => ciphertext => secretly shared values => MPC computation => secretly shared values => ciphertext (see the conversion sketch below).
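A sketch of the ciphertext-to-secret-shares conversion behind step 2. Plain Paillier (python-paillier) stands in for Pivot's threshold scheme, so the single private key below plays the role of joint threshold decryption; the modulus and names are illustrative:

```python
# Converting an encrypted value Enc(x) into additive secret shares:
# each party subtracts a random mask homomorphically, the masked value is
# (threshold-)decrypted, and the masks plus the masked value form the shares.
import secrets
from phe import paillier

P = 2**61 - 1  # illustrative modulus for the shares
public_key, private_key = paillier.generate_paillier_keypair()

x = 42
enc_x = public_key.encrypt(x)  # ciphertext known to all parties

m = 3  # number of parties
masks = [secrets.randbelow(P) for _ in range(m)]  # each party's private mask

enc_masked = enc_x
for r in masks:
    enc_masked = enc_masked - r  # homomorphic: Enc(x - r_1 - ... - r_m)

# Decryption reveals only the masked value (in Pivot this step would
# require all parties' key shares, so no one can decrypt unmasked values).
masked_share = private_key.decrypt(enc_masked) % P

shares = masks + [masked_share]
assert sum(shares) % P == x  # the shares reconstruct x
```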

After training, in the basic protocol each party obtains the tree model in plaintext.

In the enhanced protocol, the model is kept in a secretly shared form.

Basic Protocol

Computation

  1. An encrypted mask vector indicates which samples are available at the current tree node.
  2. Local computations (under TPHE; see the sketch after the figures below).
  3. MPC computations.

[Figures: basic protocol computation workflow]
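To illustrate the local computation step, the sketch below shows how a passive party could aggregate encrypted statistics for one candidate split using only PHE additions, without seeing the labels or the mask in plaintext. The data, split point, and names are illustrative, and plain Paillier again stands in for TPHE:

```python
# A passive party aggregates encrypted label/mask statistics for one
# candidate split. It tests its OWN feature in plaintext, but only ever
# adds ciphertexts received from the active party.
from phe import paillier

public_key, private_key = paillier.generate_paillier_keypair()

# Active party's side: sample mask (is the sample at this tree node?)
# and binary labels, both sent to the passive party in encrypted form.
mask   = [1, 1, 0, 1, 1]
labels = [1, 0, 1, 1, 1]
enc_mask          = [public_key.encrypt(v) for v in mask]
enc_masked_labels = [public_key.encrypt(v * y) for v, y in zip(mask, labels)]

# Passive party's side: plaintext access only to its own feature column.
feature   = [2.0, 7.5, 3.1, 9.9, 1.2]
threshold = 5.0  # one candidate split point

enc_left_count = public_key.encrypt(0)  # encrypted #samples going left
enc_left_pos   = public_key.encrypt(0)  # encrypted #positive labels going left
for f, em, ey in zip(feature, enc_mask, enc_masked_labels):
    if f < threshold:
        enc_left_count = enc_left_count + em
        enc_left_pos   = enc_left_pos + ey

# These encrypted statistics are then converted to secret shares so MPC
# can compute impurity scores and pick the best split.
assert private_key.decrypt(enc_left_count) == 2
assert private_key.decrypt(enc_left_pos) == 2
```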

Security guarantees

TPHE => local computations and model updates are secure

MPC => MPC computation step is secure.

Enhanced protocol

In the basic protocol, the final model is in plaintext, which could leak label and feature value information.

A party can partition the sample set using the split information in the model together with its own dataset, and thereby infer information about other parties' data.

The enhanced protocol therefore keeps the model in a secretly shared form and provides methods for making predictions on the shared model (the idea is sketched below).
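The core idea behind distributed prediction can be sketched as follows: each party computes, over its own features only, an indicator vector marking which leaves are consistent with the splits it owns; the elementwise product of all parties' vectors is one-hot at the true leaf. In the enhanced protocol this product and the final dot product would run under MPC over the shared model; the toy below shows the logic in plaintext with an illustrative two-party tree:

```python
# Toy two-party tree with 3 leaves. Party A owns the root split (a < 5);
# party B owns the right child's split (b < 3). All values illustrative.
#
#           [a < 5]            (party A)
#           /      \
#       leaf0    [b < 3]       (party B)
#                /     \
#            leaf1    leaf2
import numpy as np

leaf_values = np.array([0.0, 1.0, 2.0])  # secret-shared in the enhanced protocol

def party_a_indicator(a):
    # Leaves consistent with party A's split, given its own feature value.
    return np.array([1, 0, 0]) if a < 5 else np.array([0, 1, 1])

def party_b_indicator(b):
    # leaf0's path contains no split owned by B, so it is always consistent.
    return np.array([1, 1, 0]) if b < 3 else np.array([1, 0, 1])

# The elementwise product of all parties' indicators is one-hot at the
# leaf the sample actually reaches.
onehot = party_a_indicator(7.0) * party_b_indicator(2.0)  # -> [0, 1, 0]
prediction = onehot @ leaf_values                          # -> 1.0
assert prediction == 1.0
```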

Experiments

Datasets:

[Table: dataset statistics]

Baselines:

  1. Accuracy: compared against non-private sklearn models, measured by MSE/precision.
  2. Efficiency: compared against a pure secret sharing-based algorithm implemented entirely with MPC, measuring
    • the total running time of the model training stage
    • the per-sample running time of the model prediction stage

Accuracy

[Table: accuracy results]

Efficiency of training

m: number of clients

n: number of samples

d: number of features

b: number of splits

h: tree depth

W: number of trees

[Figure: training efficiency results]

Efficiency of inference

[Figure: inference efficiency results]

Further protection

  1. Against malicious adversaries, the system can be extended with zero-knowledge proofs (ZKP) and SPDZ with authenticated shares.



