IntegriDB Verifiable SQL for Outsourced Databases
7 minute read ∼ Filed in : A paper noteIntegriDB: Verifiable SQL for Outsourced Databases
INTRODUCTION
Problems
An authenticated data structure (ADS) could be used to verify queried data.
Prior work
used to construct ADS handling some subset of SQL queries can be classified into two categories
:
- Generic approaches:
- Data owners first compile database into a program (circuit / RAM-model of computation.) which accepts query as input and return result.
- Size of circuit is as large as database.
- Circuit-based approach not allow efficient updates.
- Specific approaches:
- Tree-based (Merkel tree) / Signature-based (data in server is pre-signed by data owner)
- Both of them only support single-dimensional range queries.
- Both of them don’t support updates.
Generic approach taks large memory space, Specific approach is limited in dimension. Both of them don't support short proofs
for (join, sum, max, min, count)
Contribution
The paper designing an ADS and an associated system that supports native SQL queries over a relational database, Such a system would be suitable for integration into the most prevalent applications running in the cloud today and could be offered as a software layer on top of any SQL implementation.
- expressive: supports multidimensional range queries and nesting queries.
- efficient:
small proofs
(few kb), low verification time, feasible server computation. - scalable: can execute on DB table with 6 million rows.
- cryptographic assumptions: based on crypto-graphic assumptions
PRELIMINARIES
Authenticated Data Structure
Three roles, server, data owner, client
ADS for query Q and U consists of efficient algorithms init, setup, prove, verify, UpdateO, UpdateS
D: data, D’: encrypted data, R: Result, pi: Proof, Q: Query, lambda: secure parameter
(data owner)
Init: lambda => <pk, sk>(data owner)
Setup: ( D, sk ) => ( digest, D’ ), and send D’ to server, publish (digest, pk)(Server)
Prove: ( digest, D’, Q) => (R, pi)(client)
Verify: ( digest, Q, R, pi) => 0 or 1UpdateO/UpdateS
: Interactive algorithm run by data owner and server.- UpdateO: ( sk, digest, update ) => ( digest, 0/1 indicting accept or reject )
- UpdateS: D’ => D’
SQL queries supported by IntegriDB
Multidimensional range query
multiple columns and range filter for each column
SELECT * FROM A WHERE (age BETWEEN 22 AND 24) AND (student_ID > 10730)
Join, Sum, Max, Min, COunt, Like, Nested Query.
BUILDING BLOCKS
two authenticated data structures used as building blocks in INTEGRIDB:
one for set operations (that we call ASO
),
one for interval trees (that we call AIT
)
ADS for set operations with Summation
set operations
includes: Union, Intersection, Sum.
Authenticated data structure
Mainly use bilnear accumulator primitive
as ADS.
Data owner firstly init (s, g:public key), and then use bilnear accumulator primitive
to calculate the digest
for each set.
Data Owner can publish g and acc(s).
Verify set operation
- Client get above digest and g
- server compute:
-
the server then releases a1 and a0, compute w1, w2 with g and then realse them also.
-
client check follows:(
SBDH assumption)
Complexity
Proofs are constant size. eg,. query involving d set operations, proof size is O(d)
ADS for interval Trees
Interval tree
Random function f, binary tree T
- Leaf node stores element of dataset S = {(k1,v1)…(kn,vn)}
- Internal node u stores kv pair,
- k = maximum key stored at any node in the left subtree of u
- v = f(leftNode, rightNode)
Support queries: Search, RangeCover, Insert, Delete
Authenticated data structure
Mainly use Merkle tree, where each leaf node is hashed using kv pair. And the root of Merkle tree is a digest of the tree.
Complexity
The size of the minimal covering set output by RangeCover is O(log n)
RnageCover is O(logn), size of proof and complexity of verification for Search, RangeCover, Insert, Delete is O(n).
OUR CONSTRUCTION
ADS used in IntegriDB for different queries
Setup
For each table, each pair of columns in the table, data owner
create an authenticated interval tree.
Leaf Node: store kv of Sij
Internal node U
- Key: minimum key stored at the leaves on the left.
- Value:
The accumulation value in each node will be used to handle JOIN and multidimensional range queries.
Complexity
mi = columns and ni = rows in table i.
size of the secret key is O(1)
Join Queries
Scenario
The Join return result: C* = Ci intersect with Cj.
select ... from T join T' on T.Ci = T'.Cj
Server
- For each value x in C*, the server returns the entire row.
- Return acc(ci) and acc(cj)
Client
- Hash each row received from server and check each row’s hash value == digest’s hash value.
- Check acc(ci) or acc(cj) == acc in digest
proof size and verification time are O( | R | ) |
Multidimensional Range Queries
Scenario
2 dimensional range query
select ... from ... where w in [w−, w+] and z in [z−, z+];
Server
return value
Client
Verify bu using acc(cn) , and check it acc == acc in digest.
Check row hashes to guarantee that the returned rows have not been altered.
Complexity
The proof size is O(d·log n)
Verification complexity O(d log n + | R | ). |
SQL Functions
Summation
For any column j, client can verify using Acc(cj) in root, The proof size and verification time are both O(1).
Max and min
server compute max value and then convert it to range query.
select max(j) from T => select * from T where j >= j_max
Count
server convert count to sum operation.
Updates
the data owner and server jointly update the corresponding AIT and ASO using their respective interfaces.
There is a rotation (to maintain a balanced tree
) after an update in the interval tree
Nested Queries
ADDITIONAL DISCUSSION
Extensions and Optimizations
Improve setup time and reduce server storage space.
To do this, we construct a homomorphic Merkle tree
over each row, and add a column c to each table that stores the root of the Merkle tree
for the corresponding row.
Limitations
cannot support agregations, comparsion and join with deplicates in nested query such as
SELECT c1 +2∗c2 FROM . . .
IMPLEMENTATION
EVALUATION
Env
Awazon EC2, TPC Benchmark.
Evaluation on TPC Benchmark
Performance on Synthetic Data