Attention Is All You Need

Posted on May 9, 2023 · 1 minute read

Reference PyTorch implementation: https://github.com/hyunwoongko/transformer

The decoder is auto-regressive: at step t it consumes the tokens generated so far and predicts the next one.
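
A minimal sketch of greedy auto-regressive decoding in PyTorch. The `model(src, ys)` interface returning per-position logits is an assumption for illustration, not the linked repo's exact API:

```python
import torch

def greedy_decode(model, src, bos_id, eos_id, max_len=50):
    # Hypothetical interface: model(src, tgt) -> (1, len(tgt), vocab) logits.
    ys = torch.tensor([[bos_id]], dtype=torch.long)   # start with <bos>
    for _ in range(max_len):
        logits = model(src, ys)                       # re-feed everything generated so far
        next_id = logits[0, -1].argmax().item()       # greedy pick of the next token
        ys = torch.cat([ys, torch.tensor([[next_id]])], dim=1)
        if next_id == eos_id:                         # stop once <eos> is emitted
            break
    return ys
```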

Masked multi-head attention: the mask ensures the output at position t cannot attend to subsequent positions, so training matches auto-regressive inference.
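
A minimal, runnable sketch of the causal ("subsequent") mask in PyTorch: row i of the attention weights ends up non-zero only for columns 0..i, so position t never sees later positions:

```python
import torch

t = 4
mask = torch.tril(torch.ones(t, t, dtype=torch.bool))  # (i, j) is True iff j <= i
scores = torch.randn(1, 1, t, t)                       # raw (batch, head, t, t) logits
scores = scores.masked_fill(~mask, float("-inf"))      # hide future positions
attn = scores.softmax(dim=-1)                          # masked weights are exactly 0
print(attn[0, 0])                                      # lower-triangular weight matrix
```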

Key hyperparameters: N, the number of stacked encoder/decoder layers (6 in the paper), and d_model, the model width (512 in the paper).
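
For concreteness, the base-model values from the paper, written as a plain Python dict; the key names are illustrative, not the linked repo's config keys:

```python
# "Base" Transformer hyperparameters from the paper.
base_config = {
    "N": 6,          # stacked layers in encoder and decoder each
    "d_model": 512,  # embedding / residual width
    "h": 8,          # attention heads
    "d_k": 64,       # per-head query/key dim (d_model / h)
    "d_ff": 2048,    # inner width of the position-wise feed-forward
    "dropout": 0.1,
}
```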

LayerNorm vs. BatchNorm: LayerNorm normalizes each token's feature vector independently, so it works with variable-length sequences and small batches; BatchNorm normalizes each feature across the batch. The Transformer uses LayerNorm.
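
A minimal PyTorch sketch of the difference in normalization axes (shapes only, nothing model-specific):

```python
import torch
import torch.nn as nn

x = torch.randn(32, 10, 512)  # (batch, seq_len, d_model)

ln = nn.LayerNorm(512)        # stats over d_model, computed per token
y_ln = ln(x)

bn = nn.BatchNorm1d(512)      # stats over batch (and positions), per feature
y_bn = bn(x.transpose(1, 2)).transpose(1, 2)  # BatchNorm1d expects (N, C, L)

print(y_ln.shape, y_bn.shape)  # both (32, 10, 512)
```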
