Distributed Deep Learning on Data Systems A Comparative Analysis of Approaches
1 minute read ∼ Filed in : A paper noteIntroduction
Motivation
In-RDBMS ML (Greenplum as exmaple) is used ofen, and provide good usability. In-RDBMS ML requires scalable execution, and is compatible with DL toos.
And there are four metrics to evaluate performance of in-db-ms: 1.Runtime efficiency. 2. Easy-of-governace. 3. Implementation difficulty. 4. Portability.
Challenges
Integrating parallely model selection alg into RDBMS:
- Gettting partial results from the UDF is not trivial.
- UDFs communication requires pipes provided by RDMBS
- Only one query is permitted at anytime in one DB session, Mulitple session + split big query into small => enable parallelism. But it may not make full use of query optimization.
- Data is usally compressed and stored on disk as pagefiles. Frequently access involve decompression and accessing.
The paper introduce 4 design if integrating AI to DB: Fully in-RDBMS, Partially in-DBMS, Access data in DB with Direct Access, and out-of-RDBMS
Goal
The paper then integrating MOP based model selection into RDBMS since MOP requires no communications and workds on sharded data.