[Flux.1] improve pos embed for ascend npu by computing on npu #12897

zhangtao0408 · 2025-12-27T10:09:50Z

What does this PR do?

Moving pos_embed computation from CPU back to NPU results in a 1.07x speedup in Flux.1's end-to-end latency.

Since CANN 8.3.RC1, the poor performance of the torch.repeat_interleave operator on NPU has been fixed. Results are shown below:

Model	pos_embed device	Resolution	Steps	e2e latency (s)
FLUX.1-DEV	npu	1024 x 1024	50	25.54
FLUX.1-DEV	cpu	1024 x 1024	50	27.41
FLUX.2-DEV	npu	1024 x 1024	28	101.49
FLUX.2-DEV	cpu	1024 x 1024	28	118.22
LongCat-Image	npu	768 x 1344	28	31.87
LongCat-Image	cpu	768 x 1344	28	36.19
Ovis-Image	npu	1024 x 1024	28	27.16
Ovis-Image	cpu	1024 x 1024	28	40.47
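The per-model speedups implied by this table can be checked with a few lines of arithmetic. The snippet below is only a sketch: the dictionary layout and variable names are illustrative, and it assumes the latencies are in seconds, as the repro scripts print them.

```python
# Latencies copied from the table above (assumed to be seconds).
# "cpu" / "npu" is the device on which pos_embed was computed.
latency = {
    "FLUX.1-DEV": {"npu": 25.54, "cpu": 27.41},
    "FLUX.2-DEV": {"npu": 101.49, "cpu": 118.22},
    "LongCat-Image": {"npu": 31.87, "cpu": 36.19},
    "Ovis-Image": {"npu": 27.16, "cpu": 40.47},
}

# Speedup = latency with pos_embed on CPU / latency with pos_embed on NPU.
speedups = {model: t["cpu"] / t["npu"] for model, t in latency.items()}

for model, speedup in speedups.items():
    print(f"{model}: {speedup:.2f}x")
```

The FLUX.1-DEV ratio matches the 1.07x figure quoted in the description.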

Tested Hardware

Ascend 910B3

Repro Code

1. FLUX.1-dev

import time

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu

from diffusers import FluxPipeline
pipe = FluxPipeline.from_pretrained("black-forest-labs/FLUX.1-dev", torch_dtype=torch.bfloat16).to("npu")

prompt = "A cat holding a sign that says hello world"

# Warmup
_ = pipe(prompt, height=1024, width=1024, guidance_scale=3.5, num_inference_steps=2, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0))

# Inference
start_time = time.time()

image = pipe(prompt, height=1024, width=1024, guidance_scale=3.5, num_inference_steps=50, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0)).images[0]
image.save("flux-dev.png")

end_time = time.time()
print(f"Time: {end_time - start_time:.2f}s")

2. FLUX.2-DEV

import time

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu

from diffusers import Flux2Pipeline
pipe = Flux2Pipeline.from_pretrained("black-forest-labs/FLUX.2-dev", torch_dtype=torch.bfloat16)
pipe.enable_group_offload(
    onload_device=torch.device("npu"),
    offload_device=torch.device("cpu"),
    offload_type="leaf_level",
    use_stream=True
)

prompt = "A cat holding a sign that says hello world"

# Warmup
_ = pipe(prompt, height=1024, width=1024, guidance_scale=3.5, num_inference_steps=2, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0))

# Inference
start_time = time.time()

image = pipe(prompt, height=1024, width=1024, guidance_scale=3.5, num_inference_steps=28, max_sequence_length=512, generator=torch.Generator("cpu").manual_seed(0)).images[0]
image.save("flux.2-dev.png")

end_time = time.time()
print(f"Time: {end_time - start_time:.2f}s")

3. LongCat-Image

import time

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu

from diffusers import LongCatImagePipeline
pipe = LongCatImagePipeline.from_pretrained("meituan-longcat/LongCat-Image/", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = '一个年轻的亚裔女性，身穿黄色针织衫，搭配白色项链。她的双手放在膝盖上，表情恬静。背景是一堵粗糙的砖墙，午后的阳光温暖地洒在她身上，营造出一种宁静而温馨的氛围。镜头采用中距离视角，突出她的神态和服饰的细节。光线柔和地打在她的脸上，强调她的五官和饰品的质感，增加画面的层次感与亲和力。整个画面构图简洁，砖墙的纹理与阳光的光影效果相得益彰，突显出人物的优雅与从容。'

# Warmup
image = pipe(prompt, height=768, width=1344, guidance_scale=4.0, num_inference_steps=2, num_images_per_prompt=1, generator=torch.Generator("cpu").manual_seed(43), enable_cfg_renorm=True, enable_prompt_rewrite=True).images[0]

# Inference
start_time = time.time()

image = pipe(prompt, height=768, width=1344, guidance_scale=4.0, num_inference_steps=28, num_images_per_prompt=1, generator=torch.Generator("cpu").manual_seed(43), enable_cfg_renorm=True, enable_prompt_rewrite=True).images[0]

image.save("longcat.png")

end_time = time.time()
print(f"Time: {end_time - start_time:.2f}s")

4. Ovis-Image

import time

import torch
import torch_npu
from torch_npu.contrib import transfer_to_npu

from diffusers import OvisImagePipeline
pipe = OvisImagePipeline.from_pretrained("AIDC-AI/Ovis-Image-7B", torch_dtype=torch.bfloat16)
pipe.enable_model_cpu_offload()

prompt = "A creative 3D artistic render where the text \"OVIS-IMAGE\" is written in a bold, expressive handwritten brush style using thick, wet oil paint. The paint is a mix of vibrant rainbow colors (red, blue, yellow) swirling together like toothpaste or impasto art. You can see the ridges of the brush bristles and the glossy, wet texture of the paint. The background is a clean artist's canvas. Dynamic lighting creates soft shadows behind the floating paint strokes. Colorful, expressive, tactile texture, 4k detail."

# Warmup
image = pipe(prompt, negative_prompt="", num_inference_steps=2, guidance_scale=5.0).images[0]

# Inference
start_time = time.time()
image = pipe(prompt, negative_prompt="", num_inference_steps=28, guidance_scale=5.0).images[0]
image.save("ovis.png")

end_time = time.time()
print(f"Time: {end_time - start_time:.2f}s")
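All four scripts repeat the same warmup-then-time pattern. As a sketch only, that pattern could be factored into a small helper; `time_call` is a hypothetical name and is not part of diffusers or of this PR, and the workload here is a stand-in for the `pipe(...)` call.

```python
import time


def time_call(fn, warmup=1):
    """Run `fn` `warmup` times untimed, then once timed.

    Returns a (result, elapsed_seconds) tuple for the timed run.
    """
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    result = fn()
    return result, time.perf_counter() - start


# Stand-in workload; in the scripts above this would be a lambda
# wrapping the pipe(...) call with the desired arguments.
result, elapsed = time_call(lambda: sum(range(100_000)))
print(f"Time: {elapsed:.2f}s")
```

Using `time.perf_counter()` rather than `time.time()` is a design choice of this sketch; the original scripts use `time.time()`.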

Before submitting

This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
Did you read the contributor guideline?
Did you read our philosophy doc (important for complex PRs)?
Was this discussed/approved via a GitHub issue or the forum? Please add a link to it if that's the case.
Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
Did you write any new necessary tests?

Who can review?

Anyone in the community is free to review the PR once the tests have passed. Feel free to tag
members/contributors who may be interested in your PR.


zhangtao0408 · 2025-12-27T10:27:04Z

@sayakpaul Please review this PR, thanks.

sayakpaul

Thanks! We should also take care of others that follow this pattern. For example:

diffusers/src/diffusers/models/transformers/transformer_flux2.py

Lines 838 to 845 in f6b6a71

    
        if is_torch_npu_available():
            freqs_cos_image, freqs_sin_image = self.pos_embed(img_ids.cpu())
            image_rotary_emb = (freqs_cos_image.npu(), freqs_sin_image.npu())
            freqs_cos_text, freqs_sin_text = self.pos_embed(txt_ids.cpu())
            text_rotary_emb = (freqs_cos_text.npu(), freqs_sin_text.npu())
        else:
            image_rotary_emb = self.pos_embed(img_ids)
            text_rotary_emb = self.pos_embed(txt_ids)
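For illustration, here is a minimal sketch of the two code paths: the host round trip used in the snippet above, and computing directly on the device that holds the ids. `rope_cos_sin` is a toy 1-D rotary embedding written for this example and is not the diffusers `pos_embed` implementation. The device defaults to "cpu" so the sketch runs anywhere; "npu" would require torch_npu.

```python
import torch


def rope_cos_sin(pos, dim, theta=10000.0):
    """Toy 1-D rotary embedding (illustrative, not the diffusers code)."""
    # Frequencies are created on the same device as `pos`.
    exponents = torch.arange(0, dim, 2, device=pos.device, dtype=torch.float32) / dim
    inv_freq = 1.0 / (theta ** exponents)
    angles = torch.outer(pos.to(torch.float32), inv_freq)  # [S, dim // 2]
    # repeat_interleave duplicates each frequency for its channel pair.
    cos = angles.cos().repeat_interleave(2, dim=1)  # [S, dim]
    sin = angles.sin().repeat_interleave(2, dim=1)  # [S, dim]
    return cos, sin


def pos_embed_via_cpu(ids, dim):
    # Pattern from the snippet above: compute on the host, then copy back.
    cos, sin = rope_cos_sin(ids.cpu(), dim)
    return cos.to(ids.device), sin.to(ids.device)


def pos_embed_on_device(ids, dim):
    # Pattern after the change: compute where the ids already live.
    return rope_cos_sin(ids, dim)


device = torch.device("cpu")  # assumption: use "npu" on Ascend with torch_npu installed
ids = torch.arange(16, device=device)
cos_a, sin_a = pos_embed_via_cpu(ids, 8)
cos_b, sin_b = pos_embed_on_device(ids, 8)
paths_match = torch.allclose(cos_a, cos_b) and torch.allclose(sin_a, sin_b)
print(paths_match)
```

Both paths should produce the same values; the difference is only where the work runs and whether two device transfers are paid per call.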


zhangtao0408 · 2026-01-03T17:15:01Z

Thanks for your suggestion. I tested the FLUX.2-Dev, LongCat-Image, and Ovis-Image models on the Ascend platform, and the performance of all three improved after switching the position embedding computation from the CPU back to the NPU.

[Flux.1] improve pos embed for ascend npu by setting it back to npu computation.

edf0a69

sayakpaul reviewed Dec 29, 2025


TaoZhang-Work added 3 commits January 3, 2026 15:47

[Flux.2] improve pos embed for ascend npu by setting it back to npu computation.

23c70fd

[LongCat-Image] improve pos embed for ascend npu by setting it back to npu computation.

6efcd0a

[Ovis-Image] improve pos embed for ascend npu by setting it back to npu computation.

a0f7b63
