Long Short Transformers

Nakshatra Singh
6 min read · Aug 23, 2021


Image taken from page 4 of the paper.

Introduction & Overview

A quick overview of one of the latest efficient transformers, the Long-Short Transformer. The motivation is that we want to run the attention operation over longer input sequences. Most transformers for natural language processing can only take 512 tokens as input, and we would like to expand that to, say, an entire scientific paper, an entire legal document, or the entire pixel grid of an image. Attending over more than 512 tokens is problematic because of the n² computation of the standard attention layer.

Researchers have come up with different designs to get around this. There is strided attention, as in the Sparse Transformer, where you use a local window or some alternating pattern to mask out most of the attention grid. There are low-rank projections: either something like singular value decomposition (SVD), which decomposes a matrix into its most salient components, or a learned parameterization, as in this paper, that compresses the key and value matrices into a smaller space. Finally, there is recurrence, as in Transformer-XL or the Compressive Transformer, where you attend over the last 512 tokens, compress them into a hidden state at step t, and then let the next 512 tokens at step t+1 attend back over that hidden state. These are some of the ideas for attending over more than 512 tokens so that transformers can take in longer inputs. Let's discuss the methods mentioned above.
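To make the n² problem concrete, here is a minimal sketch (in PyTorch, not from the paper) of plain scaled dot-product attention: the score matrix alone has n × n entries, which is exactly what blows up for long inputs.

```python
# Minimal sketch of full self-attention and its quadratic cost (not from the paper).
import torch
import torch.nn.functional as F

n, d = 4096, 64                      # sequence length, head dimension
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)

scores = q @ k.T / d ** 0.5          # (n, n) matrix: 4096 × 4096 ≈ 16.8M entries
out = F.softmax(scores, dim=-1) @ v  # (n, d)
print(scores.shape, out.shape)       # torch.Size([4096, 4096]) torch.Size([4096, 64])
```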

Strided/Sparse Attention

Image is taken from page 3 of this paper.

This idea of strided, sparse attention comes from the Sparse Transformer paper from OpenAI. In the autoregressive setting you mask out the future positions, and on top of that you either set a local window, so each query only looks at, say, the last few tokens, or you follow a strided pattern across the indices of the attention grid. There are also other ideas, like sparser patterns that are not contiguous. These are different ways of designing the sparse attention: a restricted local window (or pattern) to apply the attention to, rather than doing the full matrix multiplication.
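Here is a rough sketch (my own mask construction, not OpenAI's actual kernels) of what such a causal mask can look like when you combine a local window with a strided pattern.

```python
# A rough sketch of a causal mask combining a local window with a strided pattern,
# in the spirit of the Sparse Transformer (not the exact published kernels).
import torch

def sparse_causal_mask(n: int, window: int = 4, stride: int = 4) -> torch.Tensor:
    i = torch.arange(n).unsqueeze(1)           # query positions
    j = torch.arange(n).unsqueeze(0)           # key positions
    causal = j <= i                            # no peeking at future tokens
    local = (i - j) < window                   # attend to the last `window` tokens
    strided = (i - j) % stride == 0            # plus every `stride`-th earlier token
    return causal & (local | strided)          # True = allowed to attend

mask = sparse_causal_mask(12)
print(mask.int())                              # visualise which positions each token sees
```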

Low Rank Projections

Visualisation taken from here.

In attention we have parameterizations that blow up the input sequence into query, key, and value matrices, and those matrices are multiplied together to do the attention computation. So we might be able to take the query, key, or value matrices and decompose them into their most salient components, for example with singular value decomposition, where you compress a matrix into its leading singular values and vectors and work with that low-rank approximation instead of the full matrix.
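As a rough illustration of the low-rank intuition (a toy example, not any paper's actual code): take a key matrix, keep only its top-r singular values, and you get a much smaller representation that still approximates the original well.

```python
# Minimal sketch: a rank-r SVD approximation of a key matrix.
import torch

n, d, r = 1024, 64, 16
K = torch.randn(n, d)

U, S, Vh = torch.linalg.svd(K, full_matrices=False)
K_lowrank = U[:, :r] @ torch.diag(S[:r]) @ Vh[:r, :]   # rank-r approximation of K

err = torch.norm(K - K_lowrank) / torch.norm(K)
print(f"relative error of rank-{r} approximation: {err:.3f}")
```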

Recurrence

Image is taken from page 3 of this paper.

A transformer architecture using recurrence is the Transformer-XL (meaning extra long), this is a Transformer architecture that introduces the notion of recurrence to the deep self-attention network. Instead of computing the hidden states from scratch for each new segment, Transformer-XL reuses the hidden states obtained in previous segments. The reused hidden states serve as memory for the current segment, which builds up a recurrent connection between the segments. As a result, modeling very long-term dependency becomes possible because information can be propagated through the recurrent connections. As an additional contribution, the Transformer-XL uses a new relative positional encoding formulation that generalizes to attention lengths longer than the one observed during training.
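Here is a minimal, single-head sketch of the recurrence idea (causal masking and the relative positional encoding are omitted, and the function and variable names are my own, not the official implementation): each new segment's queries attend over the concatenation of a cached, gradient-detached memory and the current hidden states.

```python
# Simplified sketch of Transformer-XL-style segment recurrence (single head,
# no causal mask, no relative positions).
import torch
import torch.nn.functional as F

d = 64

def segment_attention(h: torch.Tensor, memory: torch.Tensor) -> torch.Tensor:
    context = torch.cat([memory, h], dim=0)        # (mem_len + seg_len, d)
    q, k, v = h, context, context                  # queries come only from the new segment
    scores = q @ k.T / d ** 0.5
    return F.softmax(scores, dim=-1) @ v

seg_len = 512
memory = torch.zeros(0, d)                         # no memory before the first segment
for step in range(3):                              # three consecutive segments
    h = torch.randn(seg_len, d)
    out = segment_attention(h, memory)
    memory = h.detach()                            # reuse hidden states, no gradient through them
    print(step, out.shape)
```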

Long Short Term Attention

Image taken from page 4 of the paper.

So the idea behind this new long-short term attention is to combine the output of a short-term, strided (local-window) attention with a long-term attention based on a dynamic low-rank projection. The strided attention only attends over a local window rather than the entire key-value sequence, while the dynamic projection takes the key (and value) matrix and compresses it with a learned weight matrix, mapping the n × d matrix down to r × d, where r is much smaller than n. The queries then attend over this compressed representation. It is a parametric compression operation, a kind of downsampling, like when a strided convolution shrinks a feature map from, say, 32 × 32 down to 16 × 16: you compress the representation with a weighted matrix multiplication. The next big idea is a dual layer normalization strategy to combine the two different outputs.
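Here is a simplified, single-head sketch of the two branches (the parameter names like w_p and the specific window size are my own placeholders, not the paper's code): the long-term branch compresses the keys and values through the dynamic projection, and the short-term branch restricts attention to a local window.

```python
# Simplified sketch of the two branches of long-short term attention (single head).
import torch
import torch.nn.functional as F

n, d, r, window = 1024, 64, 32, 16
q = torch.randn(n, d)
k = torch.randn(n, d)
v = torch.randn(n, d)

# Long-term branch: a dynamic low-rank projection compresses K and V from n rows to r rows.
w_p = torch.randn(d, r)                         # learned projection parameters (placeholder)
p = F.softmax(k @ w_p, dim=0)                   # (n, r), mixing weights over the sequence
k_bar, v_bar = p.T @ k, p.T @ v                 # (r, d) compressed keys / values
global_out = F.softmax(q @ k_bar.T / d ** 0.5, dim=-1) @ v_bar   # (n, d)

# Short-term branch: each query only attends inside a local window (dense mask here
# for clarity; a real implementation would use an efficient sliding-window kernel).
i = torch.arange(n).unsqueeze(1)
j = torch.arange(n).unsqueeze(0)
local_mask = (i - j).abs() <= window
scores = (q @ k.T / d ** 0.5).masked_fill(~local_mask, float("-inf"))
local_out = F.softmax(scores, dim=-1) @ v       # (n, d)

print(global_out.shape, local_out.shape)
```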

This snippet is taken from page 4 of the paper.

In the paper they describe how the outputs from the short-term attention and the low-rank projection (long-term) attention have different scales, so it is hard to just concatenate them (as shown above) without a mismatch in the overall mean and variance of the features. Because the scales of the two branches end up too different, you have to apply separate layer normalizations (LN_G and LN_L) to unify the scale of these features before the further computation of stacking this together and making a gigantic transformer out of it.
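Here is a minimal sketch of that dual layer normalization idea (simplified relative to the paper's DualLN, which normalizes each branch's key-value pairs): give each branch its own LayerNorm so the two feature sets land on a comparable scale before they are combined.

```python
# Minimal sketch of dual layer normalization before combining the two branches.
import torch
import torch.nn as nn

d = 64
ln_local = nn.LayerNorm(d)    # LN_L for the short-term (windowed) features
ln_global = nn.LayerNorm(d)   # LN_G for the long-term (low-rank projected) features

local_feats = torch.randn(1024, d) * 5.0    # pretend the two branches come out on
global_feats = torch.randn(32, d) * 0.2     # very different scales

combined = torch.cat([ln_local(local_feats), ln_global(global_feats)], dim=0)
print(combined.shape)                        # (1024 + 32, d), now on a unified scale
```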

Long Range Arena Benchmark

This snippet is taken from page 7 of the paper.

We see the Long-Short Transformer performing better than other models like the Reformer on the Long Range Arena tasks, as well as on language modeling benchmarks such as the enwik8 dataset, and the tables also report the parameter counts for Transformer-LS and the complexity it achieves.

This snippet is taken from page 8 of the paper.

Conclusion

I hope this blog gave you a quick sense of the idea of pairing strided, local-window attention (which the authors treat as short-term attention) with the dynamic projection (the long-term attention), and of combining the two through the dual layer normalization. Overall, the state of efficient transformer design continues to advance, and it is a very exciting area of research: being able to attend over longer inputs would open up many more applications. Thanks for reading and please stay tuned for more.

If you enjoyed this article and gained insightful knowledge, consider buying me a coffee ☕️ by clicking here. 🤤

References

  1. Long-Short Transformer: Efficient Transformers for Language and Vision.
  2. Generating Long Sequences with Sparse Transformers.
  3. Transformer-XL: Attentive Language Models Beyond a Fixed-Length Context.

If you liked this post, please make sure to clap 👏. 💬 Connect? Let’s get social: http://myurls.co/nakshatrasinghh.
