The Paleo Cyborg Podcast: Caio Temer Episode 13 part 1

Update: 2024-01-17

Description

Caio Temer explains LLMs and machine learning. He found that generalized learning happens after about 10k epochs with a considerable amount of data per epoch, and asks: are there studies on this?

ChatGPT

Yes, there are studies related to generalized learning after many epochs with a large amount of data per epoch. A notable example is the Deep Bootstrap framework, presented at ICLR 2021. This framework draws a connection between generalization and online optimization. It compares a "real world", where a model is trained on a finite number of samples for multiple epochs, with an "ideal world", where each training step uses fresh samples from a virtually infinite dataset. The research found that models that optimize quickly in the ideal (infinite-data) world are the same models that generalize well in the real (finite-data) world. This study offers a new perspective on practical design choices and provides a theoretical path toward understanding generalization.
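A minimal sketch of that real-world vs. ideal-world comparison, assuming a toy data generator and model (nothing below is taken from the paper's code):

```python
# Minimal sketch of the "real world" vs "ideal world" comparison from the
# Deep Bootstrap framework (ICLR 2021). Dataset, model, and hyperparameters
# are illustrative only.
import torch
from torch import nn

def make_batch(n=64, d=20):
    # Toy generator standing in for an effectively infinite data distribution.
    x = torch.randn(n, d)
    y = (x.sum(dim=1, keepdim=True) > 0).float()
    return x, y

model_real = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 1))
model_ideal = nn.Sequential(nn.Linear(20, 64), nn.Tanh(), nn.Linear(64, 1))
opt_real = torch.optim.SGD(model_real.parameters(), lr=0.1)
opt_ideal = torch.optim.SGD(model_ideal.parameters(), lr=0.1)
loss_fn = nn.BCEWithLogitsLoss()

# Real world: one finite training set reused for many epochs.
finite_x, finite_y = make_batch(n=512)
for step in range(1000):
    opt_real.zero_grad()
    loss_fn(model_real(finite_x), finite_y).backward()
    opt_real.step()

    # Ideal world: every step sees fresh samples, so there is nothing to overfit.
    fresh_x, fresh_y = make_batch(n=512)
    opt_ideal.zero_grad()
    loss_fn(model_ideal(fresh_x), fresh_y).backward()
    opt_ideal.step()

# The framework's claim: architectures whose ideal-world loss drops fast tend
# to be the ones whose real-world test error is low after many epochs.
```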




Caio Temer

show notes:

Part One summary

Interesting topics for trading:
- how I created the first bots
- how models combine language models with autonomous robots
- why conventional methods don't work

About language models:
- language models are simpler to obtain with other structures
- the problem in the current structure and how I solved it
- development of intuitive neural networks using hash abstraction

Caio developed a way for neural networks, using cryptography-style techniques based on summation, to learn to abstract and to understand deeply and quickly.
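The exact construction is not described here, so the following is only a loose, purely illustrative guess at what a sum-based hashing abstraction could look like; every name and size is invented and none of it is taken from Caio's implementation:

```python
# Purely illustrative: windows of token ids are folded into fixed buckets by
# summation, giving the network a cheap, coarse "abstraction" of context.
import numpy as np

def additive_hash_features(token_ids, window=4, n_buckets=1024):
    feats = np.zeros((len(token_ids), n_buckets), dtype=np.float32)
    for i in range(len(token_ids)):
        ctx = token_ids[max(0, i - window + 1): i + 1]
        bucket = int(sum(ctx)) % n_buckets      # additive hash of the window
        feats[i, bucket] = 1.0
    return feats

feats = additive_hash_features([101, 2023, 2003, 1037, 3231])
print(feats.shape)  # (5, 1024)
```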

He created a library with a tiny language model that learns to predict all tokens at an absurd speed by using distance-vector embeddings instead of tokens.
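One way to read "distance-vector embeddings instead of tokens" is to regress onto embedding vectors and decode by nearest-neighbour distance rather than through a vocabulary softmax. The sketch below assumes that reading and is not taken from the fast-js-language-model code:

```python
# Hedged sketch: predict an embedding vector and decode by nearest neighbour,
# instead of predicting a token id through a softmax over the vocabulary.
import torch
from torch import nn

vocab_size, emb_dim = 5000, 64
emb = nn.Embedding(vocab_size, emb_dim)
predictor = nn.Sequential(nn.Linear(emb_dim, 256), nn.Tanh(), nn.Linear(256, emb_dim))
opt = torch.optim.Adam(list(emb.parameters()) + list(predictor.parameters()), lr=1e-3)

def train_step(context_ids, target_ids):
    ctx = emb(context_ids).mean(dim=1)                       # (batch, emb_dim) context summary
    pred = predictor(ctx)
    loss = ((pred - emb(target_ids).detach()) ** 2).mean()   # regress onto the target embedding
    opt.zero_grad(); loss.backward(); opt.step()
    return loss.item()

def decode(pred_vec):
    # Nearest embedding by Euclidean distance picks the output token.
    dists = torch.cdist(pred_vec.unsqueeze(0), emb.weight)
    return int(dists.argmin())

ctx_ids = torch.randint(0, vocab_size, (32, 8))   # (batch, context length)
tgt_ids = torch.randint(0, vocab_size, (32,))
train_step(ctx_ids, tgt_ids)
print(decode(predictor(emb(ctx_ids).mean(dim=1))[0]))
```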

Transformers:

https://arxiv.org/abs/2205.13504

For example, if the network is exploding, either a component is wrong or the update rate is wrong. If the rate is right, the nonlinearities are wrong: swap a linear or mostly linear connection such as ReLU for tanh, or add a tanh. A softmax in deep layers, used as a projector, reduces convergence time by forcing layer alignment. Training the transformer one layer at a time, adding a copy of the first layer and freezing the first, accelerates training and improves retention.
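A hedged PyTorch sketch of three of these tips: the ReLU-to-tanh swap, a softmax in a deep layer acting as a projector, and the "add a copy of the first layer and freeze the first" growth step. Block sizes and structure are assumptions, not the exact recipe:

```python
# Illustrative only: bounded tanh activation plus a deep-layer softmax "projector".
import copy
import torch
from torch import nn

class ProjectedBlock(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.fc = nn.Linear(dim, dim)
        self.act = nn.Tanh()                 # tanh instead of ReLU: bounded outputs
        self.project = nn.Softmax(dim=-1)    # softmax as a projector in a deep layer

    def forward(self, x):
        h = self.act(self.fc(x))
        return self.project(h)               # rows sum to 1, keeping layer scales aligned

net = nn.Sequential(ProjectedBlock(128), ProjectedBlock(128), nn.Linear(128, 10))
print(net(torch.randn(4, 128)).shape)        # torch.Size([4, 10])

# Layer-wise growth: freeze the existing first block and train a fresh copy of it.
first = net[0]
for p in first.parameters():
    p.requires_grad = False                  # freeze the original layer
grown_copy = copy.deepcopy(first)
for p in grown_copy.parameters():
    p.requires_grad = True                   # only the copy keeps training
grown = nn.Sequential(first, grown_copy, nn.Linear(128, 10))
```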

For sequence-to-sequence tasks, a permuted linear dense layer solves 99% of problems in less time and better than the transformer. Giant embeddings work like entire networks, needing only one activation at the end after reshaping.
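One plausible reading of a "permuted linear dense" is a plain Linear applied along the time axis after a transpose (similar in spirit to MLP-Mixer token mixing); the sketch below assumes that reading rather than Caio's exact layer:

```python
# Hedged sketch: permute so a plain Linear mixes across positions, then mix features.
import torch
from torch import nn

class PermutedDense(nn.Module):
    def __init__(self, seq_len, dim):
        super().__init__()
        self.time_mix = nn.Linear(seq_len, seq_len)   # mixes positions
        self.chan_mix = nn.Linear(dim, dim)           # mixes features

    def forward(self, x):                             # x: (batch, seq_len, dim)
        x = self.time_mix(x.transpose(1, 2)).transpose(1, 2)
        return torch.tanh(self.chan_mix(x))

layer = PermutedDense(seq_len=32, dim=64)
print(layer(torch.randn(8, 32, 64)).shape)            # torch.Size([8, 32, 64])
```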

“1 tanh layer of 1 million units solves everything better than any other network with the same number of parameters, at least in the tasks I use, I always start with it, then see how to reduce”
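Taking the quote at face value, the starting baseline is a single wide tanh hidden layer; the dimensions below are illustrative:

```python
# Illustrative baseline: one very wide tanh hidden layer, shrunk later if possible.
import torch
from torch import nn

def wide_tanh_baseline(in_dim, out_dim, hidden=1_000_000):
    return nn.Sequential(
        nn.Linear(in_dim, hidden),
        nn.Tanh(),
        nn.Linear(hidden, out_dim),
    )

# With in_dim=512 and hidden=1_000_000 this is over 500M parameters,
# so in practice start smaller and scale up only if the task needs it.
net = wide_tanh_baseline(512, 10, hidden=4096)   # smaller hidden for a quick test
print(net(torch.randn(2, 512)).shape)            # torch.Size([2, 10])
```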



Colabs:

Making a language model using Facebook's FastText: https://colab.research.google.com/drive/1wVQrpzyY-SkCZTRZCcP6xJGRdg1ik0jR#scrollTo=pIW6-VwMuRlz

https://colab.research.google.com/drive/1oABIZr1xiIu7DKc7AbbZfeBJFLkZW6Ep#scrollTo=XfR-3PpPYoFU

https://github.com/rtkclouds/fast-js-language-model




https://x.com/canalCCore2/status/1735044356535472278?s=20



GitHub gists:

https://gist.github.com/rtkclouds/50b81d10736793f07cdca354516e8757

https://gist.github.com/rtkclouds/a6ee9afd96461ca94b3e9c22f78bda3a