Skip to content

cyan-ide/nn_models

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

nn_models

Neural network / AI models

My experiments with various neural network models via coding their implementations from scratch, just using pytorch.

For all architectures below I share my results of pre-training up to similar level of accuracy as in the original research papers (I think that's the key difference to multitude of other repos/articles that usually just pre-train on toy data, so unknown if they really work or not). I listed in the architectures order I was implementing it.

  • Transformer - everything from zero, following excelent tutorial by Peter Bloem

  • GPT-2 - adapted the earlier done transformer code for Masked Attention and later followed Karpathy awesome tutorial on writing GPT-2 from scratch

  • BERT - took what I learned from earlier GPT-2 tutorial and replicated the entire process to write BERT from scratch. Quite a bit harder, as there aren't any tutorials available that would go all the way with pre-training. Most publications/videos/code just play with tiny data or dont really follow through til the end to show the results. I did the entire thing and managed to get better results than BERT pre-trained by HuggingFace (which I considered as my target in this exercise).

  • Llama2 - modified my GPT-2 code with some small architectural changes applied in Llama series (RMS norm, removal of Dropout, SiLU activation instead of GELU, vocabulary size and biggest change being ROPE/Rotary positional embeddings). I kept the size of network similar as GPT-2 and done training with same data (fineweb_edu). Interestingly, with same amount of training steps, I got 0.36 Hellaswag accuracy compared to 0.306 for GPT-2. The Llama changes I did based on a very nice tutorial by Sebastian Raschka done for his book "Build a Large Language Model From Scratch".

  • T5 - similar principle as my Llama implementation, except that here I used BERT code as starting point (due to T5 authors using same MLM training routine). Key difference is usage of encoder/decoder architecture (same as original Transformer paper) instead of encoder only like BERT (or decoder only like GPT-2/Llama). The interesting part of this exercise is comparison to BERT, with almost identical training setup I got quite a bit of increase in validation accuracy (from 0.63 to 0.68).

About

Neural network / AI models / LLM models - implementations from scratch in pytorch

Topics

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

 
 
 

Contributors