Introduction

Last Updated on July 25, 2023 by Editorial Team

Author(s): Sherwin Chen

Originally published on Towards AI.

Introduction: A climbing snail trying to see the outside world | Source: Pinterest

Diving Into SNAIL | Towards AI

A Simple Neural Attentive Meta-Learner — SNAIL

Traditional reinforcement learning algorithms train an agent to solve a single task, expecting it to generalize well to unseen samples from a similar data distribution. Meta-learning instead trains a meta-learner on a distribution of similar tasks, in the hope that it generalizes to novel but related tasks by learning a high-level strategy that captures the essence of the problem it is asked to solve.

In 2016, Yan Duan et al. structured a meta-learner, RL², as a recurrent neural network that receives past rewards, actions, and termination flags as inputs in addition to the usual observations. Despite its simplicity and universality, this approach is barely satisfactory in practice. Mishra et al. hypothesize that this is because traditional RNN architectures propagate information by keeping it in their hidden state from one timestep to the next; this temporally-linear dependency bottlenecks their capacity to perform sophisticated computation on a stream of inputs. Instead, they propose a Simple Neural AttentIve meta-Learner (SNAIL), which combines temporal convolutions and self-attention to distill useful information from the experience it gathers. This general-purpose model has shown its efficacy in a variety of experiments, including few-shot image classification and reinforcement learning tasks.

In this article, we first introduce the structural components of SNAIL, specifically temporal convolutions and attention. Then we discuss their pros and cons and see how they complement each other. As usual, the article ends with a discussion of my own thoughts.

Simple Neural Attentive Meta-Learner

We start with the overall architecture of SNAIL.

Figure 1. Green nodes represent attention blocks and orange nodes denote temporal convolution blocks. Source: A Simple Neural Attentive Meta-Learner

Now let us take a deeper look at each component.

Temporal Convolutions

Figure 2. Dilated Causal Convolution. Source: WaveNet: A Generative Model for Raw Audio

Before discussing the structure of temporal convolutions (TC), we first introduce the dense block, which applies a single causal 1D convolution with kernel size 2, dilation rate R, and D (e.g., 16) filters, and then concatenates the result with its input.

Causal 1D-convolution filters are illustrated by the red triangles in Figure 2, with dilation rates 8, 4, 2, 1 from the top down. Note that the 1D convolution is applied along the sequence dimension, and the data dimension is treated as the channel dimension. The causal convolution helps summarize temporal information just as 2D convolutions summarize spatial information. The dense block applies a gated activation function, of the kind widely used in LSTMs and GRUs, to the convolution outputs.
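To make the dense block concrete, here is a minimal pure-Python sketch of what the description above implies. It is an illustration under stated assumptions, not the authors' implementation: the function names (causal_conv1d, dense_block, random_params), the bias-free weights, the zero padding before the start of the sequence, and the random initialization are all hypothetical choices made for this example.

```python
import math
import random


def causal_conv1d(x, w_past, w_now, dilation):
    """Causal 1D convolution with kernel size 2 (assumed bias-free).

    x: list of T input vectors, each of length C.
    w_past, w_now: D x C weight matrices applied to x[t - dilation] and x[t].
    Positions before the start of the sequence are treated as zeros.
    """
    zero = [0.0] * len(x[0])
    out = []
    for t in range(len(x)):
        past = x[t - dilation] if t >= dilation else zero
        out.append([
            sum(wp * p + wn * n for wp, wn, p, n in zip(row_p, row_n, past, x[t]))
            for row_p, row_n in zip(w_past, w_now)
        ])
    return out


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def dense_block(x, params, dilation):
    """Gated causal convolution whose output is concatenated with its input."""
    xf = causal_conv1d(x, params["f_past"], params["f_now"], dilation)
    xg = causal_conv1d(x, params["g_past"], params["g_now"], dilation)
    # Gated activation: tanh branch modulated by a sigmoid gate.
    activations = [[math.tanh(f) * sigmoid(g) for f, g in zip(rf, rg)]
                   for rf, rg in zip(xf, xg)]
    # Dense connection: each timestep now has C + D channels.
    return [xi + ai for xi, ai in zip(x, activations)]


def random_params(C, D, rng):
    """Hypothetical initializer: four D x C matrices with small uniform weights."""
    def mat():
        return [[rng.uniform(-0.5, 0.5) for _ in range(C)] for _ in range(D)]
    return {name: mat() for name in ("f_past", "f_now", "g_past", "g_now")}


rng = random.Random(0)
T, C, D = 8, 3, 4
x = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(T)]
y = dense_block(x, random_params(C, D, rng), dilation=2)
print(len(y), len(y[0]))  # 8 7
```

Because the output keeps the original C channels and appends D new ones, stacking dense blocks grows the channel dimension by D each time while the sequence length stays fixed.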

A TC block consists of a series of dense blocks whose dilation rates increase exponentially until their receptive field exceeds the desired sequence length T, so that nodes in the last layer capture all past information.
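As a rough illustration of this schedule, the sketch below computes how many dense blocks are needed for a given sequence length. It assumes dilation rates that start at 1 and double each layer, as in Figure 2, and stops once the receptive field covers T steps; the function name tc_dilations is hypothetical, and the exact starting rate in the paper may differ.

```python
def tc_dilations(seq_len):
    """Dilation rates 1, 2, 4, ... until the receptive field covers seq_len steps."""
    dilations, receptive_field = [], 1
    while receptive_field < seq_len:
        d = 2 ** len(dilations)
        dilations.append(d)
        # A kernel of size 2 extends the receptive field by (2 - 1) * dilation.
        receptive_field += d
    return dilations, receptive_field


for T in (8, 100, 1000):
    dilations, rf = tc_dilations(T)
    print(T, len(dilations), rf)
# 8 3 8
# 100 7 128
# 1000 10 1024
```

Under these assumptions the number of layers grows as the ceiling of log2(T), so a 1000-step sequence needs only 10 dense blocks.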

Attention

An attention block performs a key-value lookup; we style this operation after the scaled dot-product attention, which was covered in the previous article. Here, we only provide pseudocode for completeness.

Notice that SNAIL uses dense connections (concatenating x and y at the end of dense and attention blocks) to prevent the vanishing gradient problem.
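A minimal pure-Python sketch of such a block follows. It is an illustration rather than the authors' code: the helper names (affine, causal_softmax, attention_block, rand_matrix), the bias-free projections, and the random weights are assumptions made for this example. The causal mask ensures that each timestep attends only to itself and to earlier timesteps.

```python
import math
import random


def affine(x, w):
    """Apply a bias-free linear map to every timestep; w has shape C x out_dim."""
    return [[sum(xi * wij for xi, wij in zip(row, col)) for col in zip(*w)]
            for row in x]


def causal_softmax(logits):
    """Row-wise softmax in which step t only sees steps 0..t."""
    probs = []
    for t, row in enumerate(logits):
        visible = row[: t + 1]
        m = max(visible)  # subtract the max for numerical stability
        exps = [math.exp(v - m) for v in visible]
        total = sum(exps)
        probs.append([e / total for e in exps] + [0.0] * (len(row) - t - 1))
    return probs


def attention_block(x, w_q, w_k, w_v):
    """Causal scaled dot-product attention, concatenated with its input."""
    key_size = len(w_k[0])
    queries, keys, values = affine(x, w_q), affine(x, w_k), affine(x, w_v)
    logits = [[sum(q * k for q, k in zip(q_row, k_row)) / math.sqrt(key_size)
               for k_row in keys] for q_row in queries]
    probs = causal_softmax(logits)
    read = [[sum(p * v for p, v in zip(p_row, v_col)) for v_col in zip(*values)]
            for p_row in probs]
    # Dense connection: each timestep now has C + V channels.
    return [xi + ri for xi, ri in zip(x, read)]


def rand_matrix(rows, cols, rng):
    """Hypothetical initializer with small uniform weights."""
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
T, C, K, V = 5, 3, 4, 2
x = [[rng.uniform(-1, 1) for _ in range(C)] for _ in range(T)]
w_q, w_k, w_v = rand_matrix(C, K, rng), rand_matrix(C, K, rng), rand_matrix(C, V, rng)
y = attention_block(x, w_q, w_k, w_v)
print(len(y), len(y[0]))  # 5 5
```

At the first timestep the mask leaves only one visible position, so the value read there is simply that timestep's own value vector.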

Cooperation between Temporal Convolutions and Attention

Thanks to dilated causal convolutions, which support exponentially expanding receptive fields without losing resolution or coverage, temporal convolutions offer more direct, high-bandwidth access to past information, compared to traditional RNNs. This allows them to perform more sophisticated computation over a temporal context of fixed size. However, to scale to long sequences, the dilation rates generally increase exponentially, so that the required number of layers scales logarithmically with the sequence length. Their bounded capacity and positional dependence can be undesirable in a meta-learner, which should be able to fully utilize increasingly large amounts of experience.

In contrast, soft attention allows a model to pinpoint a specific piece of information, from a potentially infinitely-large context. However, the lack of positional dependence can also be undesirable, especially in reinforcement learning, where the observations, actions, and rewards are intrinsically sequential.

Despite their individual shortcomings, temporal convolutions and attention complement each other: while the former provide high-bandwidth access at the expense of finite context size, the latter provides pinpoint access over an infinitely large context. By interleaving TC layers with causal attention layers, SNAIL can have high-bandwidth access over its past experience without constraints on the amount of experience it can effectively use. By using attention at multiple stages within a model that is trained end-to-end, SNAIL can learn what pieces of information to pick out from the experience it gathers, as well as a feature representation that is amenable to doing so easily. In short, temporal convolutions learn how to aggregate contextual information, from which attention learns how to distill specific pieces of information.

Discussion

How does SNAIL make a decision?

I personally think that SNAIL makes decisions using a minibatch of size T, which includes the current observation in addition to observation-action pairs from the previous episode. What I do not understand is the authors' claim that SNAIL maintains an internal state:

Crucially, following existing work in meta-RL (Duan et al., 2016; Wang et al., 2016), we preserve the internal state of a SNAIL across episode boundaries, which allows it to have memory that spans multiple episodes. The observations also contain a binary input that indicates episode termination.

You are welcome to discuss this on StackOverflow.

References

Yan Duan et al. RL²: Fast Reinforcement Learning via Slow Reinforcement Learning

Fisher Yu et al. Multi-Scale Context Aggregation by Dilated Convolutions

Nikhil Mishra et al. A Simple Neural Attentive Meta-Learner

Ashish Vaswani et al. Attention Is All You Need

Aaron van den Oord et al. WaveNet: A Generative Model for Raw Audio

