Multimodal dialogue response generation

Posted on March 18, 2024 · 1 minute read

Propose a new task, multimodal dialogue response generation: given the dialogue context, the model should not only generate a pure-text response but also be able to generate a multimodal response (e.g., one containing both an image and text).

Challenges:

  1. Training overfits the training datasets, so the model cannot generalize to a new domain.
  2. It is not easy to collect enough multimodal dialogue data for a new domain.

Ideas:

Make the parameters that depend on multimodal dialogues small and independent by disentangling textual response generation from image response generation. The major part of the generation model can then be learned from text-only dialogues and from image-description + image pairs, both of which are much easier to obtain (a minimal sketch of this disentangled pipeline follows).
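To make the disentanglement concrete, here is a minimal runnable sketch. It is my own illustration under assumptions, not the paper's actual architecture: every name in it (`TextDialogueModel`, `TextToImageModel`, the `[IMG]` prefix) is hypothetical. A text model, trainable on text-only dialogues, decides whether to answer with text or with a textual image description; a separate text-to-image model, trainable on image-description + image pairs, is the only component that ever touches images.

```python
# Toy sketch of the disentangled pipeline; all names are hypothetical.

class TextDialogueModel:
    """Stands in for a seq2seq model trained on text-only dialogues.

    It emits either a plain text response or, when an image response is
    appropriate, a textual image description prefixed with "[IMG]".
    """
    def generate(self, context: str) -> str:
        if "show me" in context.lower():
            return "[IMG] a golden retriever playing in the snow"
        return "Sure, sounds good!"

class TextToImageModel:
    """Stands in for a text-to-image generator trained only on
    (image description, image) pairs -- the sole component that needs
    image supervision, so the multimodal parameters stay small."""
    def generate(self, description: str) -> bytes:
        return f"<image rendered from: {description}>".encode()

def respond(context, text_model, image_model):
    """Disentangled generation: text first, image only if requested."""
    out = text_model.generate(context)
    if out.startswith("[IMG]"):
        # The text model produced a description; the image model
        # renders it into pixels.
        description = out[len("[IMG]"):].strip()
        return description, image_model.generate(description)
    return out, None  # pure-text response

text_part, image_part = respond(
    "Show me your dog!", TextDialogueModel(), TextToImageModel())
print(text_part, image_part)
```

The point of this design is that the dialogue model never predicts pixels: the image pathway sits behind a plain-text interface (the description), which is what lets its parameters stay small and be trained from non-dialogue data.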


Problem formulation:

(dialogue context $U$, response $R$) => learn a model $P(R \mid U; \theta)$

Both $U$ and $R$ may contain images.
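The post does not spell out the objective, but assuming the standard autoregressive formulation over a unified token sequence (the setup described next), the model factorizes as:

```latex
% R = (r_1, \dots, r_T): one sequence mixing text tokens (BPE) and
% image tokens (discrete auto-encoder codes); U: dialogue context.
P(R \mid U; \theta) = \prod_{t=1}^{T} P\!\left(r_t \mid r_{<t},\, U;\, \theta\right)
```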

  1. Unified representation of both text and images => express an image as a sequence of discrete tokens, so text and images share one vocabulary (a toy tokenization sketch follows this list).
    1. Text => BPE-encoded tokens.
    2. Images => discrete tokens produced by a discrete auto-encoder (e.g., a VQ-VAE-style image tokenizer), i.e., each image token is a codebook index.
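As a toy illustration of the unified token space (my own sketch; the tokenizers, vocabulary sizes, and helper names are all made up), text ids come from a BPE stand-in, image ids come from a discrete auto-encoder stand-in, and image ids are offset past the text vocabulary so one decoder can model both:

```python
# Toy unified tokenization; all numbers and stubs are hypothetical.
TEXT_VOCAB = 50_000      # assumed BPE vocabulary size
IMAGE_CODEBOOK = 8_192   # assumed discrete auto-encoder codebook size
BOI = TEXT_VOCAB + IMAGE_CODEBOOK       # begin-of-image marker
EOI = TEXT_VOCAB + IMAGE_CODEBOOK + 1   # end-of-image marker

def bpe_encode(text: str) -> list[int]:
    # Stand-in for a real BPE tokenizer; hash-based ids for the demo.
    return [hash(tok) % TEXT_VOCAB for tok in text.split()]

def quantize_image(image) -> list[int]:
    # Stand-in for a discrete auto-encoder (e.g., VQ-VAE) mapping an
    # image to a grid of codebook indices, flattened into a sequence.
    return [idx % IMAGE_CODEBOOK for idx in range(32 * 32)]

def unify(text: str, image=None) -> list[int]:
    """Merge text tokens and optional image tokens into one id space:
    image ids are shifted by TEXT_VOCAB and wrapped in BOI/EOI."""
    ids = bpe_encode(text)
    if image is not None:
        ids += [BOI] + [TEXT_VOCAB + c for c in quantize_image(image)] + [EOI]
    return ids

print(unify("nice photo!", image=object())[:8])
```

Offsetting the image codes past the text vocabulary and bracketing them with begin/end markers is one common way to let a single autoregressive decoder emit either modality in the same response.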







