Multimodal dialogue response generation
Propose a new task, multimodal dialogue response generation: given the dialogue context, the model should not only generate a pure text response but also have the capacity to generate a multimodal response (e.g., one containing both image and text).
Challenges:
- the model easily overfits to the training dataset and cannot generalize to a new domain.
- it is not easy to collect enough multimodal training data for a new domain.
Ideas:
Make the parameters that rely on multimodal dialogues small and independent by disentangling textual response generation from image response generation. The major part of the generation model can then be learned from text-only dialogues and <image description, image> pairs, both of which are much easier to obtain (see the sketch below).
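A minimal sketch of what this disentangled design could look like, under my own assumptions (the class names, the `[DST]` marker for "this response is an image description", and the toy outputs are hypothetical placeholders, not the paper's implementation):

```python
# Sketch: a text response generator handles dialogues; a separate
# text-to-image generator handles <description, image> pairs.
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Response:
    text: str
    image_tokens: Optional[List[int]] = None  # discrete image codes, if any


class TextResponseGenerator:
    """Trained on text-only dialogues (image descriptions treated as text)."""

    def generate(self, context: List[str]) -> str:
        # Placeholder: a real model would be a seq2seq / decoder-only LM.
        return "[DST] a photo of a golden retriever on the beach"


class TextToImageGenerator:
    """Trained on <image description, image> pairs, independently of dialogues."""

    def generate(self, description: str) -> List[int]:
        # Placeholder: a real model would emit discrete image tokens that an
        # image decoder (e.g. a discrete auto-encoder) turns back into pixels.
        return [101, 57, 891, 23]


def respond(context: List[str]) -> Response:
    text = TextResponseGenerator().generate(context)
    if text.startswith("[DST]"):  # marker: this text is an image description
        description = text[len("[DST]"):].strip()
        return Response(text=description,
                        image_tokens=TextToImageGenerator().generate(description))
    return Response(text=text)


if __name__ == "__main__":
    print(respond(["Do you have a dog?", "Yes! Want to see a picture?"]))
```

Only the small piece that decides when to emit an image description depends on multimodal dialogues; the two generators can be pre-trained on their own, abundant data.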
Problem formulation:
(dialogue context U, response R) => learned model P(R | U; \theta)
Both U and R may contain images.
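Spelled out, a hedged reading of this formulation (the dataset symbol D, index i, and size n are my notation, not from the note):

```latex
% Training data: dialogues whose context and response may each contain
% text tokens, image tokens, or both.
D = \{(U_i, R_i)\}_{i=1}^{n}

% Goal: learn parameters \theta that maximize the likelihood of the
% observed responses given their contexts.
\theta^{\ast} = \arg\max_{\theta} \sum_{i=1}^{n} \log P(R_i \mid U_i; \theta)
```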
- Unified representations of both text and images => express an image as a sequence of tokens.
- Texts => BPE-encoded tokens.
- Images => discrete tokens produced by a discrete auto-encoder (each image maps to a sequence of codebook indices).
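A toy sketch of this unified token view, assuming a BPE text tokenizer and a discrete auto-encoder image tokenizer; both encoders below are hash-based stand-ins (and the vocabulary/codebook sizes are assumptions), meant only to show how text and image tokens can share one index space:

```python
from typing import List

TEXT_VOCAB_SIZE = 50_000       # assumed BPE vocabulary size
IMAGE_CODEBOOK_SIZE = 8_192    # assumed discrete auto-encoder codebook size


def bpe_encode(text: str) -> List[int]:
    # Toy stand-in: hash each whitespace token into the text vocabulary.
    return [hash(tok) % TEXT_VOCAB_SIZE for tok in text.split()]


def image_encode(image_pixels: List[List[int]]) -> List[int]:
    # Toy stand-in: a real discrete auto-encoder maps image patches to
    # codebook indices; here each "patch" (row) is hashed into the codebook.
    return [hash(tuple(row)) % IMAGE_CODEBOOK_SIZE for row in image_pixels]


def to_unified_sequence(text: str, image_pixels: List[List[int]]) -> List[int]:
    # Offset image codes by the text vocabulary size so text tokens and
    # image tokens live in one shared index space.
    return bpe_encode(text) + [TEXT_VOCAB_SIZE + c for c in image_encode(image_pixels)]


if __name__ == "__main__":
    fake_image = [[0, 1, 2], [3, 4, 5]]  # tiny fake "image" with two patches
    print(to_unified_sequence("here is my dog", fake_image))
```

With this shared representation, one sequence model can read and generate responses that mix text tokens and image tokens.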