Pretraining Without Attention

[2212.10544] Pretraining Without Attention

This is an older paper, from 2022. It proposes BiGS, an architecture that replaces the Transformer's attention mechanism with bidirectional state-space model (SSM) layers combined with multiplicative gating. At a similar parameter count, an attention-free architecture like BiGS can match the performance of attention-based models like BERT.
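
As a concrete illustration, here is a minimal sketch of a BiGS-style block in PyTorch. This is not the paper's implementation: BiGS builds on S4-style state-space kernels, while this sketch substitutes a naive per-channel linear recurrence, and the class names (SimpleDiagonalSSM, GatedBiSSMBlock) are my own. What it shows is the structural idea: token mixing comes from forward and backward sequence passes combined through multiplicative gating, with no attention anywhere.

```python
# Sketch of a gated bidirectional SSM block, BiGS-style (not the paper's code).
import torch
import torch.nn as nn

class SimpleDiagonalSSM(nn.Module):
    """Per-channel linear recurrence h_t = a * h_{t-1} + b * x_t, with a in (0, 1).
    A stand-in for the S4-style state-space kernels used in the actual paper."""
    def __init__(self, dim):
        super().__init__()
        self.log_a = nn.Parameter(torch.zeros(dim))  # a = sigmoid(log_a)
        self.b = nn.Parameter(torch.ones(dim))

    def forward(self, x):  # x: (batch, seq, dim)
        a = torch.sigmoid(self.log_a)
        h = torch.zeros_like(x[:, 0])
        out = []
        for t in range(x.size(1)):  # O(L) sequential scan; real SSMs use FFT kernels
            h = a * h + self.b * x[:, t]
            out.append(h)
        return torch.stack(out, dim=1)

class GatedBiSSMBlock(nn.Module):
    """Multiplicative gating over forward + backward SSM passes."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.fwd = SimpleDiagonalSSM(dim)
        self.bwd = SimpleDiagonalSSM(dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x):
        residual = x
        x = self.norm(x)
        # Run the sequence model in both directions -- no attention anywhere.
        fwd = self.fwd(x)
        bwd = self.bwd(torch.flip(x, dims=[1])).flip(dims=[1])
        gated = torch.sigmoid(self.gate(x)) * (fwd + bwd)  # multiplicative gate
        return residual + self.proj(gated)

block = GatedBiSSMBlock(dim=64)
y = block(torch.randn(2, 16, 64))
print(y.shape)  # torch.Size([2, 16, 64])
```

The sketch keeps the same (batch, length, dim) interface as a Transformer encoder layer, which is what lets an SSM block stand in for attention in a BERT-style pretraining stack.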

It may suggest that, given an ample number of parameters and a well-conditioned setup (with an appropriate mix of nonlinearities and linearities), the specific arrangement of components in a neural architecture becomes less critical, provided the parameter count is well matched to the dataset scale.

Note

Inductive Biases: The paper also discusses how the inductive biases of BiGS differ from those of BERT, particularly in terms of token interactions and syntactic representations.

The “it” in AI models is the dataset. – Non_Interactive – Software & ML

Posted on June 10, 2023 by jbetker

I’ve been at OpenAI for almost a year now. In that time, I’ve trained a lot of generative models. More than anyone really has any right to train. As I’ve spent these hours observing the effects of tweaking various model configurations and hyperparameters, one thing that has struck me is the similarities between all the training runs.

It’s becoming awfully clear to me that these models are truly approximating their datasets to an incredible degree. What that means is not only that they learn what it means to be a dog or a cat, but also the interstitial frequencies between distributions that don’t matter, like what photos humans are likely to take or words humans commonly write down.

What this manifests as is that, trained on the same dataset for long enough, pretty much every model with enough weights and training time converges to the same point. Sufficiently large diffusion conv-unets produce the same images as ViT generators. AR sampling produces the same images as diffusion.

This is a surprising observation! It implies that model behavior is not determined by architecture, hyperparameters, or optimizer choices. It’s determined by your dataset, nothing else. Everything else is a means to an end in efficiently delivering compute to approximating that dataset.

Then, when you refer to “LaMDA”, “ChatGPT”, “Bard”, or “Claude”, it’s not the model weights that you are referring to. It’s the dataset.