We conduct an in-depth analysis of attention in video diffusion transformers (VDiTs) and report a number of novel findings. We identify three key properties of attention in VDiTs: Structure, Sparsity, and Sinks. Structure: We observe that attention patterns in different VDiTs share a similar structure across prompts, and that this similarity can be exploited to unlock video editing via self-attention map transfer. Sparsity: We study attention sparsity in VDiTs, finding that existing sparsity methods do not work for all VDiTs because some layers that appear sparse cannot actually be sparsified. Sinks: We present the first study of attention sinks in VDiTs, comparing and contrasting them with attention sinks in language models. We propose a number of future directions that can use these insights to improve the efficiency-quality Pareto frontier for VDiTs.
The attention maps shown above are very similar, and for good reason: attention in VDiTs is structured by spatio-temporal locality. Each map shows a clear diagonal stripe plus off-diagonal stripes of varying strength. The diagonal stripe reflects spatial locality: zooming into one band on the diagonal reveals a visually distinct square block for each frame, because tokens that are spatially close within a frame attend to each other with high weight. The off-diagonal stripes reflect temporal locality, and they weaken the further they sit from the diagonal because frames farther from the current frame are less relevant.
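To make the spatio-temporal reading of these maps concrete, here is a minimal illustrative sketch (ours, not the models' code) that lays video tokens out frame by frame and measures how much attention mass falls at each temporal offset; on real VDiT maps this profile decays as the offset grows, which is exactly what the fading off-diagonal stripes show. The latent dimensions and the random map below are placeholders.

```python
import torch

# Placeholder latent dimensions (much smaller than a real VDiT's) and a random
# stand-in attention map; tokens are laid out frame by frame, as in the figures.
T, H, W = 8, 8, 8
N = T * H * W
attn = torch.rand(N, N).softmax(dim=-1)

frame_id = torch.arange(N) // (H * W)                   # frame index of each token
offset = (frame_id[:, None] - frame_id[None, :]).abs()  # temporal offset per (query, key) pair

# Mean attention weight at each temporal offset. On real VDiT attention maps
# this profile falls off with the offset, mirroring the fading stripes.
profile = [attn[offset == d].mean().item() for d in range(T)]
print(profile)
```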
Can a single attention map transfer to multiple prompts? We use Wan2.1-T2V-1.3B because of its architectural design, which applies self-attention over vision tokens before performing cross-attention with text tokens. This separation allows for targeted attention transfer experiments on the self-attention layers that operate over the visual input. During generation with a target prompt, we directly replace its self-attention maps with the maps recorded while generating from a source prompt.
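As a minimal sketch of the mechanism, consider a generic PyTorch self-attention layer that can record its attention map and, on a later pass, reuse an externally supplied one. This is our illustration, not the Wan2.1 implementation, and for brevity it keeps only the most recent map rather than one per denoising step and layer as a faithful transfer would.

```python
import torch

class TransferableSelfAttention(torch.nn.Module):
    """Illustrative self-attention layer whose map can be recorded or overridden."""

    def __init__(self, dim: int, num_heads: int):
        super().__init__()
        self.num_heads = num_heads
        self.qkv = torch.nn.Linear(dim, 3 * dim)
        self.proj = torch.nn.Linear(dim, dim)
        self.recorded_attn = None   # filled while generating from the source prompt
        self.injected_attn = None   # set before generating from the target prompt

    def forward(self, x):
        B, N, C = x.shape
        q, k, v = self.qkv(x).chunk(3, dim=-1)
        q, k, v = (t.view(B, N, self.num_heads, -1).transpose(1, 2) for t in (q, k, v))

        if self.injected_attn is not None:
            attn = self.injected_attn               # transfer: reuse the source-prompt map
        else:
            attn = (q @ k.transpose(-2, -1)) * q.shape[-1] ** -0.5
            attn = attn.softmax(dim=-1)
            self.recorded_attn = attn.detach()      # record for a later transfer

        out = (attn @ v).transpose(1, 2).reshape(B, N, C)
        return self.proj(out)
```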
Below, we transfer the self-attention maps from a car-driving prompt to a generation with a dog-running prompt. Surprisingly, the result resembles the original car video: it preserves the source's structure while largely ignoring the new prompt. This highlights how strongly the attention maps encode prompt-specific structure.
(a): “A car is driving on the highway.”
(b): “A dog is running on the grass.”
Attention Map Transfer from (a) to (b)
Motivated by this, we explore the limits of fine-grained video editing. In the videos below, we conduct the same attention transfer experiment with a new prompt that is very similar to the original: “A car is driving on the highway.” vs. “A red car is driving on the highway.” As shown below, without attention transfer the two generated videos differ significantly, not just in the color of the car but also in the camera perspective and how it changes over time. From a user's perspective, if we were happy with the cinematography of the original video and only wanted to change the color of the car, we would not be pleased that generating with the same seed and a slightly different prompt produces such a drastically different video. With attention map transfer, however, the resulting video is nearly identical to the original except that the car is now red. This is a new capability for fine-grained video editing, where only the attribute the user wants to change is modified; a sketch of this editing recipe follows the clips below.
(a): “A car is driving on the highway.”
(b): “A red car is driving on the highway.”
Attention Map Transfer from (a) to (b)
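The editing recipe referenced above can be sketched end to end as follows. Everything here is hypothetical glue code around the layer sketched earlier: `pipe` stands for a text-to-video pipeline whose transformer uses `TransferableSelfAttention` layers, and is not the actual Wan2.1 API.

```python
import torch

def self_attn_layers(transformer):
    # Collect the (assumed) TransferableSelfAttention modules, in layer order.
    return [m for m in transformer.modules() if isinstance(m, TransferableSelfAttention)]

def transfer_edit(pipe, source_prompt, target_prompt, seed=42):
    layers = self_attn_layers(pipe.transformer)

    # Pass 1: generate from the source prompt; each layer records its maps.
    g = torch.Generator(device="cuda").manual_seed(seed)
    _ = pipe(source_prompt, generator=g)

    # Pass 2: same seed, edited prompt, with the recorded maps injected.
    for layer in layers:
        layer.injected_attn = layer.recorded_attn
    g = torch.Generator(device="cuda").manual_seed(seed)
    edited = pipe(target_prompt, generator=g)

    for layer in layers:   # reset so later generations are unaffected
        layer.injected_attn = None
    return edited

# Hypothetical usage: keep the cinematography, change only the car's color.
# edited = transfer_edit(pipe, "A car is driving on the highway.",
#                        "A red car is driving on the highway.")
```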
We can also try to change elements of the background or the vehicle itself: we change the background to winter and the car to a truck. As shown below, the transfer works well in the “winter” case. However, when the variation becomes larger, such as changing the car to a truck, the quality of the generation degrades.
“A car is driving on the highway in the winter.”
“A truck is driving on the highway.”
Below, we perform attention transfer for each individual layer to identify which layers have the most significant impact. As shown below, transferring the attention maps of layers such as layer 0 and layer 19 produces videos that closely resemble the original (i.e., the target-prompt generation without any transfer), so those layers have little individual effect. Interestingly, layer 3 stands out as an exception: its generation differs noticeably from that baseline and instead closely resembles the output from the source prompt. This suggests that layer 3 may play a key role in controlling the structural aspects of the generation; no other layer exhibits this behavior.
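Reusing the helpers from the sketches above, a per-layer sweep can be written as follows (again an illustration of the experiment design, not the actual code): only layer i's recorded map is injected on each run, and each output is compared against the no-transfer baseline.

```python
import torch

def per_layer_sweep(pipe, source_prompt, target_prompt, seed=42):
    layers = self_attn_layers(pipe.transformer)

    # Record maps for every layer by generating once from the source prompt.
    g = torch.Generator(device="cuda").manual_seed(seed)
    _ = pipe(source_prompt, generator=g)
    recorded = [layer.recorded_attn for layer in layers]

    results = []
    for i, layer in enumerate(layers):
        layer.injected_attn = recorded[i]     # transfer only layer i's map
        g = torch.Generator(device="cuda").manual_seed(seed)
        results.append(pipe(target_prompt, generator=g))
        layer.injected_attn = None            # restore before the next run
    return results
```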