Large Language Models
1 minute read ∼ Filed in: A paper note · LLM
LLMs are Transformer-based neural networks used as language models.
Fine-tuning is not additive; it may break knowledge the model already learned during pre-training.
Prompt engineering: few-shot prompting.
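
As a concrete illustration (not from any particular paper), the sketch below builds a few-shot sentiment-classification prompt: a handful of labeled examples is prepended to the new query so the model can infer the task format without any weight updates. The reviews, labels, and the choice to just print the prompt instead of calling a specific API are all made up for the example.

```python
# Minimal few-shot prompt: a few labeled examples are prepended to the query
# so the model can infer the task format without any fine-tuning.
examples = [
    ("The movie was fantastic!", "positive"),
    ("I want my money back.", "negative"),
    ("It was fine, nothing special.", "neutral"),
]
query = "The acting was brilliant but the plot dragged."

prompt = "Classify the sentiment of each review.\n\n"
for review, label in examples:
    prompt += f"Review: {review}\nSentiment: {label}\n\n"
prompt += f"Review: {query}\nSentiment:"

print(prompt)  # this string would be sent to whatever LLM completion endpoint is in use
```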
RAG (retrieval-augmented generation) pipeline.
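
The note only names the RAG pipeline, so here is a minimal, hypothetical sketch of the retrieve-then-augment idea: candidate documents are ranked by a crude word-overlap score (a stand-in for embedding similarity in a real system) and the top hits are pasted into the prompt ahead of the question. The `score` and `build_rag_prompt` functions and the tiny corpus are invented for illustration.

```python
# Toy RAG pipeline: retrieve the most relevant documents for a question,
# then stuff them into the prompt so the model answers from that context.
def score(query: str, doc: str) -> float:
    """Crude relevance score: word overlap (a real system would use embeddings)."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

def build_rag_prompt(question: str, corpus: list, k: int = 2) -> str:
    top_docs = sorted(corpus, key=lambda doc: score(question, doc), reverse=True)[:k]
    context = "\n".join(f"- {doc}" for doc in top_docs)
    return (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\nAnswer:"
    )

corpus = [
    "LLaMA replaces LayerNorm with RMSNorm.",
    "RAG retrieves documents and adds them to the prompt.",
    "Few-shot prompting adds examples to the prompt.",
]
print(build_rag_prompt("What does RAG add to the prompt?", corpus))
```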

LLaMA

Differences between LLaMA and the original Transformer:
- Internal covariate shift makes training slower, so layer normalization is used to reduce it.
    - Layer norm works by normalizing each activation vector: subtracting its mean and dividing by its standard deviation.
    - Computing the mean is costly, so Root Mean Square Layer Norm (RMSNorm) drops the re-centering step and only rescales by the root mean square (see the sketch after this list).
- LLaMA uses relative position representations (rotary position embeddings, RoPE) instead of absolute positional encodings: the attention score between two tokens depends on the distance between them (see the second sketch below).
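
To make the LayerNorm-versus-RMSNorm point concrete, here is a small NumPy sketch (the learnable gain and bias parameters are omitted, and the shapes are arbitrary): LayerNorm subtracts the per-vector mean and divides by the standard deviation, while RMSNorm skips the mean entirely and only rescales by the root mean square.

```python
import numpy as np

def layer_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """LayerNorm: re-center (subtract the mean) and re-scale (divide by the std)."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def rms_norm(x: np.ndarray, eps: float = 1e-6) -> np.ndarray:
    """RMSNorm: skip the mean, divide only by the root mean square."""
    rms = np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)
    return x / rms

x = np.random.randn(2, 8)          # (batch, hidden) activations
print(layer_norm(x).std(axis=-1))  # ~1 after re-centering and re-scaling
print(rms_norm(x)[0, :3])          # same scale, but the mean is not removed
```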

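The original Transformer adds absolute sinusoidal position encodings to the token embeddings; LLaMA instead applies rotary position embeddings (RoPE), which make the query-key dot product depend on the relative distance between tokens. The sketch below is a plain-NumPy illustration of the core rotation only, with no learnable projections or caching; the shapes and the `base` value are conventional assumptions rather than a faithful reimplementation of LLaMA.

```python
import numpy as np

def rope(x: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate each (even, odd) feature pair by an angle proportional to the
    token's position; the dot product of two rotated vectors then depends
    only on their contents and their relative distance."""
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-np.arange(half) / half)      # theta_i = base^(-2i/dim)
    angles = np.outer(np.arange(seq_len), freqs)   # (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x_even * cos - x_odd * sin
    out[:, 1::2] = x_even * sin + x_odd * cos
    return out

q = np.random.randn(6, 8)      # (seq_len, head_dim) query vectors
k = np.random.randn(6, 8)      # (seq_len, head_dim) key vectors
scores = rope(q) @ rope(k).T   # attention logits now encode relative position
print(scores.shape)            # (6, 6)
```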