Evaluating the fidelity of knowledge distillation. The effect of enlarging the CIFAR-100 distillation dataset with GAN-generated samples. (a): The student and teacher are both single ResNet-56 networks. Student fidelity increases as the dataset grows, but test accuracy decreases. (b): The student is a single ResNet-56 network and the teacher is a 3-component ensemble. Student fidelity again increases as the dataset grows, but test accuracy now slightly increases. The shaded region corresponds to µ ± σ, estimated over 3 trials. The image is taken from page 2 of the paper.

The motivating problem for this paper is model compression, and the authors found something really striking: according to the download counts of five BiT models from TensorFlow Hub, the smallest ResNet-50 model has been downloaded significantly more times than the larger ones. As a result, many recent improvements in vision do not translate to real-world applications. …
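To make the distillation setup concrete, here is a minimal sketch of the standard knowledge-distillation loss (Hinton-style soft targets combined with the hard-label cross-entropy). The temperature `T` and weight `alpha` values are illustrative assumptions, not the paper's exact settings.

```python
import numpy as np

def softmax(z, T=1.0):
    # temperature-softened softmax, stabilized by subtracting the max
    e = np.exp(z / T - np.max(z / T, axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def distill_loss(student_logits, teacher_logits, label, T=4.0, alpha=0.5):
    # soft targets: KL(teacher || student) on temperature-softened distributions
    p_t = softmax(teacher_logits, T)
    p_s = softmax(student_logits, T)
    kd = np.sum(p_t * (np.log(p_t) - np.log(p_s)))
    # hard targets: cross-entropy with the true label
    ce = -np.log(softmax(student_logits)[label])
    # T^2 compensates for the gradient scale of the softened targets
    return alpha * ce + (1 - alpha) * T**2 * kd

loss = distill_loss(np.array([2.0, 0.5, -1.0]),
                    np.array([1.5, 0.2, -0.5]), label=0)
```

Fidelity, in the figure above, asks how closely the student's predictions track the teacher's soft targets, not just whether test accuracy matches.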

MLP-Mixer consists of per-patch linear embeddings, Mixer layers, and a classifier head. Mixer layers contain one token-mixing MLP and one channel-mixing MLP, each consisting of two fully-connected layers and a GELU nonlinearity. Other components include: skip-connections, dropout, layer norm on the channels, and linear classifier head. The image is taken from page 2 of the paper.

This paper presents a neural network that is just a feed-forward multi-layer perceptron (MLP), meaning there is no convolution, no attention mechanism, no lambda layers, nothing of those sorts. It’s just matrix multiplications, non-linearities, normalization, and skip connections (adapted from ResNets). This paper builds on abstractions similar to those in the recent SOTA paper known as ‘Vision Transformers’. I have written a blog post explaining Vision Transformers meticulously, you can check it out here. 😌

MLP Mixer Architecture

The authors have proposed a classification architecture. Like in Vision Transformers, we divide the input image into small patches (preferably of size 16×16). The…
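A minimal NumPy sketch of one Mixer layer as described in the caption above: a token-mixing MLP that acts across patches, then a channel-mixing MLP that acts across channels, each wrapped with layer norm and a skip connection. The dimensions and weight initialization are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def gelu(x):
    # tanh approximation of the GELU nonlinearity
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def layer_norm(x, eps=1e-6):
    # normalize over the channel (last) axis
    mu = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mu) / np.sqrt(var + eps)

def mlp(x, w1, w2):
    # two fully-connected layers with a GELU in between (biases omitted)
    return gelu(x @ w1) @ w2

def mixer_block(x, tok_w1, tok_w2, ch_w1, ch_w2):
    # x: (num_patches, channels)
    # token mixing: transpose so the MLP mixes information across patches
    y = x + mlp(layer_norm(x).T, tok_w1, tok_w2).T
    # channel mixing: the MLP mixes information across channels
    return y + mlp(layer_norm(y), ch_w1, ch_w2)

rng = np.random.default_rng(0)
P, C, Dt, Dc = 196, 512, 256, 2048   # 14x14 patches of a 224x224 image
x = rng.standard_normal((P, C))
out = mixer_block(x,
                  rng.standard_normal((P, Dt)) * 0.02,
                  rng.standard_normal((Dt, P)) * 0.02,
                  rng.standard_normal((C, Dc)) * 0.02,
                  rng.standard_normal((Dc, C)) * 0.02)
```

Note that the same token-mixing weights are shared across all channels, and the same channel-mixing weights across all patches; that weight tying is what keeps the parameter count manageable.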

NFNet-F1 model achieves comparable accuracy to an EffNet-B7 while being 8.7× faster to train. The image is taken from page 1 of the paper.

Introduction & Overview

So the point of this paper is to build networks, in this case specifically convolutional residual-style networks, that have no batch normalization built into them. Without batch normalization, these networks usually do not perform as well or cannot scale to large batch sizes. However, this paper builds networks that scale to large batch sizes and are more efficient than previous state-of-the-art methods (like LambdaNets; I have also written a detailed article on it, click right here to check it out!!!🤞). The training latency vs. accuracy graph shows that NFNets are 8.7× faster to train than EffNet-B7.
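One of the key tricks NFNets use in place of batch norm is Adaptive Gradient Clipping (AGC): each gradient is clipped unit-wise relative to the norm of the corresponding weights, rather than by a fixed global threshold. Here is a rough NumPy sketch under the simplifying assumption of 2-D weight matrices with row-wise units; the `clip` and `eps` defaults are illustrative.

```python
import numpy as np

def agc(param, grad, clip=0.01, eps=1e-3):
    # unit-wise (per-row) norms of the weights and their gradients
    w_norm = np.maximum(np.linalg.norm(param, axis=-1, keepdims=True), eps)
    g_norm = np.linalg.norm(grad, axis=-1, keepdims=True)
    # rescale only the units whose gradient norm exceeds clip * weight norm
    factor = np.where(g_norm > clip * w_norm,
                      clip * w_norm / np.maximum(g_norm, 1e-12),
                      1.0)
    return grad * factor

p = np.ones((2, 4))                     # each weight row has norm 2
big = agc(p, np.ones((2, 4)) * 10.0)    # row grad norm 20 >> 0.01 * 2, clipped
tiny = agc(p, np.ones((2, 4)) * 1e-4)   # below threshold, passes through
```

The intuition: a gradient step is dangerous in proportion to how large it is *relative to the weights it updates*, which is why a single ratio works across layers of very different scales.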

Approximation of the regular attention mechanism AV (before D⁻¹ -renormalization) via (random) feature maps. Dashed-blocks indicate the order of computation with corresponding time complexities attached. The image is taken from the paper.

Introduction & Overview

Performers are a new class of models that approximate Transformers. They do so without running into the classic Transformer bottleneck: the attention matrix has space and compute requirements that are quadratic in the size of the input, which limits how much input (text or images) you can feed into the model. Performers get around this problem with a technique called Fast Attention Via Positive Orthogonal Random Features (FAVOR+). Performers are linear architectures fully compatible with regular Transformers and with strong theoretical guarantees: unbiased or nearly-unbiased estimation of the attention matrix, uniform…
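A minimal NumPy sketch of the idea behind FAVOR+: replace exp(q·k) with a dot product of positive random features φ(q)·φ(k), so attention can be computed by multiplying (N×m) and (m×d) matrices instead of ever forming the (N×N) attention matrix. The feature count `m` and the query/key scaling here are illustrative assumptions, and the real method additionally orthogonalizes the random projections to reduce variance.

```python
import numpy as np

def positive_features(x, w):
    # phi(x) = exp(w.x - ||x||^2 / 2) / sqrt(m): positive random features
    # whose dot products are unbiased estimates of the softmax kernel
    m = w.shape[0]
    return np.exp(x @ w.T - 0.5 * np.sum(x**2, axis=-1, keepdims=True)) / np.sqrt(m)

def favor_attention(q, k, v, w):
    qp, kp = positive_features(q, w), positive_features(k, w)
    kv = kp.T @ v                    # (m, d): context summary, linear in N
    z = 1.0 / (qp @ kp.sum(axis=0))  # the D^-1 renormalization from the figure
    return (qp @ kv) * z[:, None]

rng = np.random.default_rng(0)
N, d, m = 128, 16, 256
q = rng.standard_normal((N, d)) / d**0.25
k = rng.standard_normal((N, d)) / d**0.25
v = rng.standard_normal((N, d))
w = rng.standard_normal((m, d))      # random projection directions
approx = favor_attention(q, k, v, w)

# exact softmax attention, O(N^2), for comparison
a = np.exp(q @ k.T)
exact = (a / a.sum(axis=-1, keepdims=True)) @ v
```

The dashed blocks in the figure correspond exactly to the order of operations in `favor_attention`: computing `kp.T @ v` first is what turns the quadratic cost into a linear one.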

Model Overview. The image is taken from the paper.

The Limitation with Transformers For Images

Transformers work really well for NLP, but they are limited by the memory and compute requirements of the expensive quadratic attention computation in the encoder block. Images are therefore much harder for Transformers, because an image is a raster of pixels and there are many, many pixels to an image. The rasterization of images is a problem in itself even for Convolutional Neural Networks. To feed an image into a Transformer, every single pixel has to attend to every single other pixel (that is just how the attention mechanism works); the image itself is 255² big, so the attention for an…
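To see why this blows up, here is some back-of-the-envelope arithmetic (using a 256×256 image as an illustrative assumption):

```python
# tokens if every pixel of a 256x256 image is its own token
n = 256 * 256                 # 65,536 tokens
entries = n ** 2              # entries in the full attention matrix
gib = entries * 4 / 2**30     # fp32 storage for a single attention head
```

Over four billion entries, about 16 GiB in fp32, for *one* head of *one* layer — which is why models in this space shrink the image or restrict attention before anything else.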

The illustration snip is taken from OpenAI’s official blog post. Visit the website for even more crazy things!!!

Introduction & Overview

Honestly, my heart just skipped a beat; the power of this neural network is just unbelievably good. These images are so cool, and the point here is that they aren't photoshopped or human-created: they are AI-generated by this new model called DALL·E. It can take a piece of text and output a picture that matches that text. What is super astounding is the quality of these images, and what’s even more astounding is the range of capabilities this model has. …

Underspecification in a simple epidemiological model. The image is taken from the paper.

Introduction & Overview

The authors believe that underspecification is one of the key reasons for the poor behavior of Machine Learning models when deployed in real-world domains. Think of it like this: you have a big training set, you train your model on it, then you test it on the testing set, and usually the two come from the same distribution. However, there is a caveat: when you deploy your trained model to production in the real world, the distribution of the data can be very different, and the model might not perform as well. So, the underspecification problem the authors identify…

Overview of the proposed approach MAMA. MAMA constructs an open knowledge graph (KG) with a single forward pass of the pre-trained Language model (LM) (without fine-tuning) over the corpus. The image is taken from the paper.

Introduction & Overview

On a high level, this paper proposes to construct Knowledge Graphs, structured objects that are usually built by human experts, automatically: simply use a pre-trained language model together with a corpus to extract the knowledge graph without human supervision. The cool thing about this paper is that there is no training involved; the entire knowledge graph is extracted by running the corpus through the model once. So one forward pass through the pre-trained language model constructs the Knowledge Graph.

In this paper, the authors design an unsupervised approach called MAMA that successfully…
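MAMA's "Match" stage searches over the language model's attention scores to connect a candidate head entity to a candidate tail entity through intermediate relation tokens. Here is a toy, greedy simplification of that idea; the tokens, the attention matrix, and the greedy walk are all illustrative assumptions (the actual MAMA runs a beam search over a real pre-trained LM's attention).

```python
import numpy as np

def match_triple(tokens, attn, head_idx, tail_idx):
    # Greedily walk from the head to the tail: at each step, move to the
    # unvisited token (at or after the head) that the current token attends
    # to most. The tokens passed through become the relation phrase.
    cur, relation, visited = head_idx, [], {head_idx}
    while cur != tail_idx:
        scores = attn[cur].copy()
        scores[list(visited)] = -np.inf   # no revisiting
        scores[:head_idx] = -np.inf       # only move rightward of the head
        nxt = int(np.argmax(scores))
        if nxt != tail_idx:
            relation.append(tokens[nxt])
        visited.add(nxt)
        cur = nxt
    return (tokens[head_idx], " ".join(relation), tokens[tail_idx])

tokens = ["Dylan", "is", "a", "songwriter"]
attn = np.array([[0.0, 0.9, 0.05, 0.05],   # hypothetical attention scores
                 [0.1, 0.0, 0.2,  0.7],
                 [0.2, 0.3, 0.0,  0.5],
                 [0.4, 0.3, 0.3,  0.0]])
triple = match_triple(tokens, attn, 0, 3)
```

Under this toy attention matrix the walk recovers ("Dylan", "is", "songwriter"); a second "Map" stage would then link such open triples to a fixed schema where possible.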

Comparison between attention and lambda layers. (Left) An example of 3 queries and their local contexts within a global context. (Middle) The attention operation associates each query with an attention distribution over its context. (Right) The lambda layer transforms each context into a linear function lambda that is applied to the corresponding query. The image is taken from the paper.

Introduction & Overview

In recent times we have seen Transformers take over image classification (do check out my Medium post on Vision Transformers ❤️), but this came either at the cost of downsampling the image into 16×16 patches or by throwing massive amounts of data at the problem. LambdaNetworks are computationally efficient and simple to implement using direct calls to operations available in modern neural network libraries. The attention mechanism is a very general computational framework, like a dynamic routing of information, but the authors avoid materializing attention maps because they are computationally very expensive.

Lambda Layers Vs Attention Layers

The Lambda Layers take the global…
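A minimal NumPy sketch of a content-only lambda layer, following the figure above: the context is summarized once into a small linear map (a "lambda") that is then applied to every query, so no per-query attention distribution over the context is ever formed. Position lambdas and multi-query heads are omitted for brevity, and all shapes are illustrative assumptions.

```python
import numpy as np

def softmax(x, axis):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def lambda_layer(q, k, v):
    # q: (n, dk) queries; k: (m, dk) keys and v: (m, dv) values from context
    # keys are normalized across the m context positions, then the whole
    # context is summarized into one (dk, dv) linear function
    lam = softmax(k, axis=0).T @ v   # no (n, m) attention map is formed
    return q @ lam                   # apply the same lambda to each query

rng = np.random.default_rng(0)
y = lambda_layer(rng.standard_normal((8, 4)),    # 8 queries
                 rng.standard_normal((16, 4)),   # context of 16 positions
                 rng.standard_normal((16, 5)))   # context values
```

The contrast with attention is visible in the shapes: attention stores an (n × m) map per head, while the lambda is a fixed (dk × dv) matrix regardless of how many queries or context positions there are.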

Nakshatra Singh

A Machine Learning, Deep Learning, and Natural Language Processing enthusiast. Making life easy for beginners to read SOTA research papers🤞❤️
