Attention Is All You Need
Reference implementation: https://github.com/hyunwoongko/transformer
The decoder is auto-regressive: at each step it consumes the tokens it has already generated to predict the next one. A sketch of that loop follows.
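A minimal greedy-decoding sketch of the auto-regressive loop, assuming a PyTorch model callable as `model(src, tgt)` that returns per-position vocabulary logits (the signature and token ids are illustrative, not from the paper):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id: int, eos_id: int, max_len: int = 50):
    # Feed the decoder its own previous outputs, one token at a time.
    ys = torch.tensor([[bos_id]])                # start with <bos>
    for _ in range(max_len):
        logits = model(src, ys)                  # (1, len(ys), vocab_size)
        next_id = logits[0, -1].argmax()         # most likely next token
        ys = torch.cat([ys, next_id.view(1, 1)], dim=1)
        if next_id.item() == eos_id:             # stop at <eos>
            break
    return ys
```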
Masked multi-head attention: the mask prevents the output at position t from attending to subsequent positions, so each prediction depends only on positions ≤ t.
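A small PyTorch sketch of how such a causal mask can be applied inside scaled dot-product attention (a generic illustration, not the linked repo's exact code):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal = future positions that must be hidden
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def masked_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq, seq)
    scores = scores.masked_fill(causal_mask(q.size(-2)), float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # row t attends only to ≤ t
    return weights @ v
```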
Key hyperparameters: N, the number of identical layers in each of the encoder and decoder stacks (6 in the base model), and d_model, the embedding/representation width (512 in the base model).
LayerNorm vs. BatchNorm: the Transformer uses LayerNorm, which normalizes each token's feature vector independently, so it does not depend on batch statistics or sequence length; BatchNorm instead normalizes each feature across the batch, which is awkward for variable-length sequences.
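A quick PyTorch comparison of the two normalization axes (shapes are illustrative):

```python
import torch
from torch import nn

x = torch.randn(4, 10, 512)  # (batch, seq_len, d_model)

# LayerNorm: normalizes each token's 512 features on its own
ln = nn.LayerNorm(512)
print(ln(x).shape)  # torch.Size([4, 10, 512])

# BatchNorm1d expects (batch, channels, length) and normalizes each
# feature across the whole batch, so it depends on batch statistics
bn = nn.BatchNorm1d(512)
print(bn(x.transpose(1, 2)).transpose(1, 2).shape)  # torch.Size([4, 10, 512])
```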