Attention Is All You Need
Reference implementation: https://github.com/hyunwoongko/transformer
The decoder is auto-regressive: at each step it consumes the tokens it has already generated to predict the next one. A sketch of that loop follows.
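A minimal greedy-decoding sketch of the auto-regressive loop, assuming a PyTorch model callable as `model(src, tgt)` that returns per-position vocabulary logits (the signature and token ids are illustrative, not from the paper):

```python
import torch

@torch.no_grad()
def greedy_decode(model, src, bos_id: int, eos_id: int, max_len: int = 50):
    # Feed the decoder its own previous outputs, one token at a time.
    ys = torch.tensor([[bos_id]])                # start with <bos>
    for _ in range(max_len):
        logits = model(src, ys)                  # (1, len(ys), vocab_size)
        next_id = logits[0, -1].argmax()         # most likely next token
        ys = torch.cat([ys, next_id.view(1, 1)], dim=1)
        if next_id.item() == eos_id:             # stop at <eos>
            break
    return ys
```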
Masked multi-head attention: the mask prevents the output at position t from attending to subsequent positions, so each prediction depends only on positions ≤ t.
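A small PyTorch sketch of how such a causal mask can be applied inside scaled dot-product attention (a generic illustration, not the linked repo's exact code):

```python
import torch

def causal_mask(seq_len: int) -> torch.Tensor:
    # True above the diagonal = future positions that must be hidden
    return torch.triu(torch.ones(seq_len, seq_len, dtype=torch.bool), diagonal=1)

def masked_attention(q, k, v):
    # q, k, v: (batch, seq_len, d_k)
    d_k = q.size(-1)
    scores = q @ k.transpose(-2, -1) / d_k ** 0.5        # (batch, seq, seq)
    scores = scores.masked_fill(causal_mask(q.size(-2)), float("-inf"))
    weights = torch.softmax(scores, dim=-1)              # row t attends only to ≤ t
    return weights @ v
```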
Key hyperparameters: N, the number of identical layers in each of the encoder and decoder stacks (6 in the base model), and d_model, the embedding/representation width (512 in the base model).
LayerNorm vs. BatchNorm: the Transformer uses LayerNorm, which normalizes each token's feature vector independently, so it does not depend on batch statistics or sequence length; BatchNorm instead normalizes each feature across the batch, which is awkward for variable-length sequences.
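A quick PyTorch comparison of the two normalization axes (shapes are illustrative):

```python
import torch
from torch import nn

x = torch.randn(4, 10, 512)  # (batch, seq_len, d_model)

# LayerNorm: normalizes each token's 512 features on its own
ln = nn.LayerNorm(512)
print(ln(x).shape)  # torch.Size([4, 10, 512])

# BatchNorm1d expects (batch, channels, length) and normalizes each
# feature across the whole batch, so it depends on batch statistics
bn = nn.BatchNorm1d(512)
print(bn(x.transpose(1, 2)).transpose(1, 2).shape)  # torch.Size([4, 10, 512])
```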