Local AI Video Generation with LTX-2 on an RTX 3090

Still frame from an AI-generated video — LTX-2 19B Distilled on an RTX 3090

Text-to-video generation has crossed the “runnable on consumer hardware” threshold. Lightricks’ LTX-2 19B Distilled model generates 5-second, 1080p videos with audio on a single RTX 3090 — no cloud GPU rental, no API costs, no uploaded data. The catch is that it takes some work to get there, and the model has non-obvious constraints that will waste hours if you do not know about them upfront.

I ran 12 prompts across different genres to stress-test the model. Here are three of the results, each paired with the exact prompt that generated it:

Convincing

Horror / Atmospheric

Victorian Hallway

A narrow Victorian hallway stretches into shadow, lit by a single flickering gas lamp on the wall. Dust motes drift through the amber light. Faded wallpaper peels at the edges. The camera creeps forward slowly as floorboards creak beneath unseen footsteps. A distant clock ticks in another room.

Analysis: Minimal subject motion, strong atmospheric textures, slow deliberate camera movement. The flickering light gives the scene life without requiring complex object interaction.

1920x1088, 5s @ 24fps, 758 KB, FP8 Distilled

Convincing

Macro / Nature

Spider Web Dew

Morning dew clings to the delicate threads of a spider web stretched between two blades of grass. The camera captures an extreme macro shot as a single dewdrop trembles, refracting the sunrise into tiny rainbows. A gentle breeze causes the web to sway. The soft chirping of early morning birds fills the air.

Analysis: Macro subjects with natural subtle motion are LTX-2's sweet spot. The trembling dewdrops and swaying web are physics-consistent motion the model handles well.

1920x1088, 5s @ 24fps, 3.4 MB, FP8 Distilled

Convincing

Music / Close-up

Jazz Pianist

A jazz pianist's weathered hands move across ivory keys in a dimly lit smoke-filled club. Close-up on the fingers as they glide through a melancholy chord progression. Warm amber stage light catches the worn edges of the keys. The camera slowly pulls back to reveal the piano. A mellow saxophone accompanies the melody.

Analysis: Hands-on-keys is a focused, repetitive motion the model reproduces well. Minor face warping only in the final frames as the camera pulls back.

1920x1088, 5s @ 24fps, 2.2 MB, FP8 Distilled

Shared settings: LTX-2 19B Distilled FP8, euler_ancestral sampler, CFG 1.0, 8 Pass 1 steps + 3 Pass 2 refinement steps, 121 frames at 24 fps. Only the prompts differ.

This guide covers everything I learned setting up LTX-2 on an RTX 3090 (24 GB VRAM, 64 GB RAM) with ComfyUI: the weight streaming architecture that makes this possible, a head-to-head FP8 vs BF16 checkpoint comparison, the distilled model constraints that most online guides get wrong, and the prompting lessons that make the difference between a cinematic clip and a Ken Burns slideshow.

Table of contents

Hardware Requirements
How Weight Streaming Works
The Two-Pass Pipeline
FP8 vs BF16: The Checkpoint That Changes Everything
The Distilled Model Trap
Prompting: Video Style, Not Photography Style
- Prompting Rules
- What Works and What Doesn’t: 12-Prompt Batch Test
Image-to-Video Pipeline
API-Driven Generation with ComfyUI
- Batch Generation
Performance Summary
What I Would Do Differently
Setup Checklist

Hardware Requirements

LTX-2 19B is a large model. The checkpoint alone is 27-40 GB depending on precision, and it needs a text encoder (Gemma 3 12B, 8.8 GB) and a spatial upscaler (950 MB) alongside it. You cannot fit this in 24 GB of VRAM.

The minimum practical setup:

| Component | Requirement | Why |
| --- | --- | --- |
| GPU VRAM | 24 GB (RTX 3090/4090) | Model layers stream in and out during inference, peaking at ~23 GB |
| System RAM | 64 GB | Offloaded model weights live here (~22-28 GB depending on checkpoint) |
| Disk swap | 32 GB recommended | Safety net during model loading when RAM temporarily overflows |
| Storage | ~70 GB free | Checkpoint + text encoder + upscaler + output files |

With less than 64 GB RAM, you will likely run out of memory during model loading. With less than 24 GB VRAM, you would need ComfyUI's most aggressive offloading mode (--novram) or CPU-only inference (--cpu), which is too slow to be practical.
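
To make the "cannot fit in 24 GB" claim concrete, here is a small sketch that adds up the file sizes quoted above. It is a naive upper bound: it assumes every component would have to sit in VRAM at once and ignores activations, so treat the output as a rough budget rather than a measurement.

```python
# Component sizes in GB, taken from the figures quoted in this section.
CHECKPOINT_GB = {"fp8": 27.1, "bf16": 40.3}
TEXT_ENCODER_GB = 8.8   # Gemma 3 12B
UPSCALER_GB = 0.95      # spatial upscaler (950 MB)
VRAM_GB = 24.0          # RTX 3090


def total_weights_gb(precision: str) -> float:
    """Sum of all model files needed for one generation."""
    return CHECKPOINT_GB[precision] + TEXT_ENCODER_GB + UPSCALER_GB


def overflow_gb(precision: str, vram_gb: float = VRAM_GB) -> float:
    """GB of weights that cannot stay resident in VRAM at the same time."""
    return max(0.0, total_weights_gb(precision) - vram_gb)


if __name__ == "__main__":
    for precision in CHECKPOINT_GB:
        print(f"{precision}: {total_weights_gb(precision):.2f} GB total, "
              f"{overflow_gb(precision):.2f} GB over a {VRAM_GB:.0f} GB card")
```

Even the FP8 set exceeds the card by more than 12 GB under this assumption, which is why the offloaded weights need somewhere to live in system RAM.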

How Weight Streaming Works

ComfyUI’s weight streaming is what makes consumer GPU video generation possible. The concept is straightforward: the model is too large for VRAM, so ComfyUI loads what fits on the GPU and parks the rest in system RAM. During inference, it streams layers back and forth as needed.

flowchart TD
    subgraph DISK["Disk Storage"]
        LTX["LTX-2 19B Distilled FP8<br/>Checkpoint: 27.1 GB<br/>Text Encoder: 8.8 GB<br/>Upscaler: 950 MB"]
    end

    subgraph RAM["System RAM (64 GB)"]
        OS["OS + Apps ~15 GB"]
        OFFLOAD["Offloaded Weights ~22 GB"]
    end

    subgraph SWAP["Disk Swap (32 GB)"]
        SPILL["Overflow during load ~10 GB"]
    end

    subgraph GPU["RTX 3090 VRAM (24 GB)"]
        LOADED["Active Layers ~23 GB"]
    end

    DISK -->|"Sequential read"| RAM
    OFFLOAD <-->|"Weight streaming<br/>during inference"| LOADED
    RAM -->|"Spills during load"| SWAP

This is not free — streaming adds latency compared to having the full model in VRAM. But it transforms “impossible on this hardware” into “takes a few minutes.” The practical impact is that cold starts (first generation after loading) take longer than warm runs (models already in memory).
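
The idea can be illustrated with a toy model. This is not ComfyUI's actual scheduler; the function names, the staging buffer, and the ten 1 GB layers are all hypothetical, chosen only to show why streamed layers add transfer cost on every sampling step.

```python
def plan_streaming(layer_sizes_gb, vram_budget_gb, staging_gb=1.0):
    """Split layers into a resident set and a streamed set.

    Layers are pinned in VRAM in order until the budget (minus a staging
    buffer reserved for incoming layers) is used up; the rest stay in RAM.
    """
    resident, streamed = [], []
    used = 0.0
    pinned_budget = vram_budget_gb - staging_gb
    for index, size in enumerate(layer_sizes_gb):
        if used + size <= pinned_budget:
            resident.append(index)
            used += size
        else:
            streamed.append(index)
    return resident, streamed


def streamed_gb_per_run(layer_sizes_gb, streamed, steps):
    """GB copied from RAM to VRAM if every streamed layer is re-sent each step."""
    return steps * sum(layer_sizes_gb[i] for i in streamed)


if __name__ == "__main__":
    layers = [1.0] * 10  # hypothetical: ten 1 GB layers, not LTX-2's real layout
    resident, streamed = plan_streaming(layers, vram_budget_gb=6.0)
    print(resident, streamed, streamed_gb_per_run(layers, streamed, steps=11))
```

In this toy setup half the layers stay on the GPU and the other half cross the bus on each of the 11 sampling steps, which is the latency cost described above.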

The Two-Pass Pipeline

LTX-2 uses a clever two-pass architecture to generate high-quality video:

Pass 1: Generate at half resolution (e.g., 640x360 for a 1280x720 target) with 8 sampling steps
Latent Upscale: 2x upscale using a dedicated spatial upscaler model in latent space
Pass 2: Refine at full resolution with 3 additional sampling steps

This is more efficient than generating at full resolution from scratch, and it produces better results because the upscaler preserves temporal coherence across frames.

The default output is 121 frames at 24 fps — about 5 seconds of video with generated audio.
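
The numbers above can be checked with a few lines of arithmetic. This sketch only restates the figures from this section (half-resolution first pass, 8 + 3 steps, 121 frames at 24 fps); the function name and dictionary keys are made up for illustration.

```python
def two_pass_plan(width, height, frames=121, fps=24):
    """Summarize the two-pass schedule for a target resolution."""
    return {
        "pass1_resolution": (width // 2, height // 2),  # half-resolution draft
        "pass2_resolution": (width, height),            # after the 2x latent upscale
        "total_steps": 8 + 3,                           # Pass 1 steps + Pass 2 steps
        "duration_s": frames / fps,
    }


if __name__ == "__main__":
    plan = two_pass_plan(1280, 720)
    print(plan["pass1_resolution"], f"{plan['duration_s']:.2f} s")
```

For a 1280x720 target this gives a 640x360 first pass and a clip length of just over 5 seconds.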

FP8 vs BF16: The Checkpoint That Changes Everything

Lightricks provides two versions of the distilled checkpoint:

BF16 (ltx-2-19b-distilled.safetensors): 40.3 GB, full 16-bit precision
FP8 (ltx-2-19b-distilled-fp8.safetensors): 27.1 GB, 8-bit quantized

I ran both on the same hardware and the difference is dramatic:

| Metric | BF16 | FP8 | Improvement |
| --- | --- | --- | --- |
| Cold start time | ~13 min | ~3.5 min | 3.7x faster |
| Checkpoint size | 40.3 GB | 27.1 GB | -33% |
| Peak swap usage | 33 GB | ~10 GB | -70% |
| Peak RAM | ~50 GB | ~45 GB | -10% |
| VRAM peak | 22.9 GB | 23.3 GB | Similar |
| Video quality | Baseline | Comparable | Minimal loss |

The cold start improvement alone is worth the switch. Going from 13 minutes to 3.5 minutes of staring at a progress bar changes the experience from “go make coffee” to “that was quick.” And the swap reduction means the system stays responsive during model loading instead of thrashing the disk for several minutes.

Quality difference? In side-by-side comparisons at 1080p, I could not consistently tell which was FP8 and which was BF16. For video output where individual frame sharpness matters less than temporal coherence, the quantization loss is effectively invisible.

Recommendation: Start with FP8. Keep the BF16 checkpoint as a backup for quality-critical work, but FP8 should be your default.
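
For anyone who wants to verify the "Improvement" column, the percentages follow directly from the measured values in the table. The dictionaries below just copy those values; the helper names are illustrative.

```python
# Measurements copied from the comparison table (approximate values).
BF16 = {"cold_start_min": 13.0, "checkpoint_gb": 40.3,
        "peak_swap_gb": 33.0, "peak_ram_gb": 50.0}
FP8 = {"cold_start_min": 3.5, "checkpoint_gb": 27.1,
       "peak_swap_gb": 10.0, "peak_ram_gb": 45.0}


def speedup(before, after):
    """How many times faster the 'after' measurement is."""
    return before / after


def percent_change(before, after):
    """Signed change relative to the 'before' value, in percent."""
    return (after - before) / before * 100.0


if __name__ == "__main__":
    print(f"cold start: {speedup(BF16['cold_start_min'], FP8['cold_start_min']):.1f}x faster")
    for key in ("checkpoint_gb", "peak_swap_gb", "peak_ram_gb"):
        print(f"{key}: {percent_change(BF16[key], FP8[key]):.0f}%")
```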

The Distilled Model Trap

This is where most online guides will lead you astray. If you search for “LTX-2 best settings ComfyUI,” you will find recommendations like:

CFG scale 3.5-4.0
DPM++ 2M Karras sampler
Negative prompts for quality
5+ refinement steps in Pass 2

These settings are for the full (non-distilled) model. They will break the distilled variant.

I tested each of these changes systematically. Here is what happened:

| Change | Result |
| --- | --- |
| CFG 4.0 (from 1.0) | Severe color distortion, unusable output |
| DPM++ 2M Karras sampler (from euler_ancestral) | Heavy filter-like artifacts, looked like a “1990s Photoshop filter on a music video” |
| Negative prompt added | Zero effect (mathematically cancelled at CFG 1.0) |
| Pass 2 steps increased to 5 (from 3) | Scene reinterpretation: backwards heads, flipped body orientation |
| Pass 2 starting sigma increased to 0.975 (from 0.909) | Same scene reinterpretation artifacts |

The distilled model has guidance baked into its weights through the distillation process. When you apply external CFG on top, you are double-applying guidance — like adjusting the brightness twice. The math explains why negative prompts do nothing: at CFG 1.0, the classifier-free guidance formula output = uncond + CFG * (cond - uncond) simplifies to just cond, completely cancelling the negative.
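
The cancellation is easy to confirm numerically. The sketch below applies the formula from the paragraph above to plain Python lists; the three-element vectors are arbitrary stand-ins for model predictions, not real LTX-2 outputs.

```python
def apply_cfg(cond, uncond, scale):
    """Classifier-free guidance: uncond + scale * (cond - uncond), element-wise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [0.25, -1.0, 3.5]    # prediction for the positive prompt
    uncond = [8.0, -4.0, 0.5]   # prediction for the negative/empty prompt
    print(apply_cfg(cond, uncond, 1.0))  # identical to cond
    print(apply_cfg(cond, uncond, 4.0))  # pushed far away from cond
```

At scale 1.0 the unconditional term drops out, so whatever the negative prompt contributes never reaches the output. At 4.0 the result is extrapolated well past the conditional prediction, which is the extra guidance the distilled weights were not trained to receive.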

The safe parameter space for the distilled model:

| Parameter | Required Value | Explanation |
| --- | --- | --- |
| Sampler | `euler_ancestral` | The model was distilled for this sampler specifically |
| CFG | 1.0 | Guidance is baked into weights |
| Pass 2 starting sigma | 0.909375 max | Higher values add too much noise, causing regeneration |
| Pass 2 steps | 3 | More steps render artifacts in higher detail rather than fixing them |
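
If you script your generations, a small guard can catch unsafe settings before a job is queued. This is a minimal sketch: the dictionary keys are hypothetical names of my own, not ComfyUI node inputs, and the limits are simply the values from the table above.

```python
MAX_PASS2_START_SIGMA = 0.909375


def check_distilled_settings(settings):
    """Return a list of problems; an empty list means the settings are safe."""
    problems = []
    if settings.get("sampler") != "euler_ancestral":
        problems.append("sampler must be euler_ancestral")
    if settings.get("cfg") != 1.0:
        problems.append("cfg must be 1.0")
    if settings.get("pass2_start_sigma", 0.0) > MAX_PASS2_START_SIGMA:
        problems.append("pass2_start_sigma must not exceed 0.909375")
    if settings.get("pass2_steps") != 3:
        problems.append("pass2_steps must be 3")
    return problems


if __name__ == "__main__":
    community_advice = {"sampler": "dpmpp_2m", "cfg": 4.0,
                        "pass2_start_sigma": 0.975, "pass2_steps": 5}
    for problem in check_distilled_settings(community_advice):
        print("unsafe:", problem)
```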

Resolution is the quality lever. The most visible quality improvement comes from going from 720p to 1080p (about 2.3x as many pixels). The two presets worth using:

| Preset | Resolution | Warm Time | Use Case |
| --- | --- | --- | --- |
| Standard | 1280x720 | ~96s | Quick tests, iteration |
| High | 1920x1088 | ~198s | Final output |

Resolution must be divisible by 32 for the latent space encoder, which is why “1080p” is actually 1920x1088.
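
Here is the arithmetic behind the 1088 figure and the 2.3x pixel ratio. The helper names are illustrative. Note that I only reproduce the 1080 to 1088 case here: the Standard preset is listed as 1280x720 in the table above, and I have not checked how the template treats that height internally.

```python
def round_up_to_multiple(value, multiple=32):
    """Smallest multiple of `multiple` that is greater than or equal to `value`."""
    return -(-value // multiple) * multiple


def pixel_ratio(high, low):
    """Ratio of pixel counts between two (width, height) resolutions."""
    return (high[0] * high[1]) / (low[0] * low[1])


if __name__ == "__main__":
    print(round_up_to_multiple(1920), round_up_to_multiple(1080))
    print(f"{pixel_ratio((1920, 1088), (1280, 720)):.2f}x")
```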

Prompting: Video Style, Not Photography Style

This was the most frustrating lesson to learn. My first attempts used photography-style prompts:

“A woman in ornate fantasy plate armor, photorealistic, shot on 35mm film, natural lighting, shallow depth of field, fine film grain”

The result? A still image with a slow Ken Burns pan across it. Every single time.

The problem is that LTX-2 interprets photography terms literally. “Shot on 35mm film” means a photograph. “Shallow depth of field” describes a still camera setup. The model obliges by generating a static image and adding subtle camera drift.

What works is video-style prompting — describing events unfolding over time:

“A woman in ornate fantasy plate armor stands on a windswept cliff edge at golden hour. The wind catches her dark hair, sending strands across her face as her crimson cape billows behind her. She gazes out over a vast misty canyon below. The camera slowly pushes in toward her profile as distant thunder echoes through the valley. The golden light catches the intricate engravings on her steel pauldrons. A faint metallic clinking of armor accompanies the howling wind.”

The difference is night and day. The second prompt produces actual motion — hair blowing, cape moving, camera pushing in, ambient audio.

Prompting Rules

What works well:

Temporal descriptions: Describe what happens over the 5 seconds. “The wind catches…”, “The camera slowly pushes in…”, “Smoke rises from…”
Camera movement: Be explicit — “camera slowly dollies in”, “tracking shot from the side.” Simpler movements produce better results.
Audio cues: LTX-2 generates audio alongside video. Including sound descriptions (“howling wind”, “distant thunder”) gives the model context for the scene even though the audio quality itself is poor — expect generic ambient noise rather than accurate sound design. The audio generation is best treated as a proof-of-concept; plan to replace it in post-production for anything you share.
Textures and materials: Specific details improve output — “weathered steel plate armor”, “scarred textured hide”
Subtle motion over complex action: A person standing with wind in their hair looks far better than someone swinging a weapon

What to avoid:

Photography terms: “Shot on 35mm film”, “shallow depth of field”, “bokeh” — these produce stills
“Static shot” or “no camera movement”: Produces a Ken Burns pan on a frozen frame
Complex multi-character action: Multiple people fighting or running causes severe warping artifacts
High subject deformation: Complex body motion (rearing back, swinging weapons, running) causes warping that no amount of quality settings can fix

The general principle: the less things move on screen, the better they look. A static subject with atmospheric motion (wind, smoke, light changes) and a slow camera move will always beat a complex action scene at this model size.
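
A simple way to enforce the "what to avoid" list is to lint prompts before queueing them. The sketch below is a plain substring check against the phrases named in this section; the term list and function name are my own, and a real filter would probably need more phrases.

```python
# Phrases this section reports as producing still images or frozen frames.
STILL_IMAGE_TERMS = (
    "shot on 35mm film",
    "shallow depth of field",
    "bokeh",
    "static shot",
    "no camera movement",
)


def lint_prompt(prompt):
    """Return the problematic phrases found in the prompt (case-insensitive)."""
    lowered = prompt.lower()
    return [term for term in STILL_IMAGE_TERMS if term in lowered]


if __name__ == "__main__":
    photo_style = ("A woman in ornate fantasy plate armor, photorealistic, "
                   "shot on 35mm film, natural lighting, shallow depth of field")
    print(lint_prompt(photo_style))
```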

What Works and What Doesn’t: 12-Prompt Batch Test

I ran 12 prompts across different genres at 1080p to stress-test the model. The results split cleanly:

Convincing output:

Atmospheric/ambient scenes dominated — a flickering Victorian hallway, morning dew on a spider web, and a jazz pianist’s hands on keys all looked great. Minimal motion, strong textures, slow camera moves.
Macro/close-up subjects with natural motion (dew drops catching light, fingers on piano keys) were the most photorealistic results.

Unconvincing output:

Object interaction broke down — a grizzly bear “catching” a salmon produced disconnected motion where the fish and bear moved independently.
Crowd/fleet scenes looked like a mediocre video game — a Viking longship fleet had too many ships crammed together with flat, unconvincing water.
Zero-gravity/unusual physics produced toy-like results — an astronaut floating outside a space station looked like a plastic figure on a string.
Face close-ups during camera movement caused late-frame warping: the jazz pianist clip was clean until the camera pulled back and the musician’s face came into frame at the end.

Universal issue: audio quality is poor. Every clip produced generic ambient noise regardless of the audio cues in the prompt. Include sound descriptions for scene context, but plan to replace the audio track entirely for anything you share.

The file size of each clip correlated with motion complexity: the calm Victorian hallway was 758 KB while the splashing grizzly bear was 5.0 MB at the same resolution. Less motion means more efficient encoding and more quality budget per pixel.
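
To put the file sizes in encoder terms, this sketch converts them to average bitrates. It assumes KB and MB are decimal units (1000-based) and that each clip is 121 frames at 24 fps; the files also contain an audio track, so the video-only bitrate is slightly lower than what is computed here.

```python
def average_bitrate_kbps(size_bytes, frames=121, fps=24):
    """Average bitrate of a clip in kilobits per second."""
    duration_s = frames / fps
    return size_bytes * 8 / duration_s / 1000


if __name__ == "__main__":
    hallway = average_bitrate_kbps(758_000)     # Victorian hallway, 758 KB
    bear = average_bitrate_kbps(5_000_000)      # grizzly bear, 5.0 MB
    print(f"hallway: {hallway:.0f} kbps, bear: {bear:.0f} kbps, "
          f"ratio: {bear / hallway:.1f}x")
```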

Image-to-Video Pipeline

LTX-2 supports animating still images, which opens up a powerful two-step workflow:

Generate a high-quality still with Flux.2 Dev (or any image model)
Animate it with LTX-2 image-to-video

This gives you composition control that text-to-video cannot match — you can iterate on the image until it is exactly right, then add motion.

The key difference in prompting: for image-to-video, the prompt should describe what happens next, not the scene. The image already defines the visual content. Focus on camera movement, subject motion, and ambient sounds.

flowchart LR
    subgraph Step1["Generate Image"]
        P1["Text Prompt"] --> FLUX["Flux.2 Dev<br/>~2 min warm"]
        FLUX --> IMG["Source Image"]
    end

    subgraph Step2["Review & Iterate"]
        IMG --> REVIEW{"Satisfied?"}
        REVIEW -->|No| P1
        REVIEW -->|Yes| NEXT["Continue"]
    end

    subgraph Step3["Animate"]
        NEXT --> LTX["LTX-2 I2V<br/>~198s warm"]
        P2["Motion Prompt"] --> LTX
        LTX --> VID["Output Video + Audio"]
    end

In practice, text-to-video sometimes outperforms image-to-video because it has more freedom to create motion-optimized first frames. Test both approaches for important outputs.

API-Driven Generation with ComfyUI

ComfyUI exposes a REST API that lets you queue workflows without the browser UI. This is essential for batch generation and integration with other tools.

The workflow:

Export a workflow template from ComfyUI’s browser interface (app.graphToPrompt())
Modify parameters in the JSON (prompt text, resolution, seeds, filenames)
POST to /api/prompt to queue
Poll /api/queue to monitor progress

import json
import random
import urllib.request

# Load the workflow exported from ComfyUI
with open("ltx2-text-to-video.json") as f:
    workflow = json.load(f)

# Node IDs such as "177:109" come from this particular exported workflow;
# check your own JSON, since they can differ between templates.

# Set your prompt
workflow["177:109"]["inputs"]["value"] = "Your video prompt here..."

# Set resolution (1920x1088 for high quality)
workflow["177:131"]["inputs"]["width"] = 1920
workflow["177:131"]["inputs"]["height"] = 1088

# Randomize both noise seeds for variety
workflow["177:123"]["inputs"]["noise_seed"] = random.randint(1, 2**32)
workflow["177:118"]["inputs"]["noise_seed"] = random.randint(1, 2**32)

# Queue it: POST the workflow as JSON and read back the prompt ID
payload = json.dumps({"prompt": workflow}).encode()
req = urllib.request.Request("http://localhost:8188/api/prompt",
    data=payload, headers={"Content-Type": "application/json"})
with urllib.request.urlopen(req) as response:
    resp = json.load(response)
print(f"Queued: {resp['prompt_id']}")
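
The polling step can be sketched the same way. This assumes the queue endpoint returns JSON with "queue_running" and "queue_pending" lists whose entries carry the prompt ID in the second position; verify that shape against your ComfyUI version before relying on it. The function names are my own.

```python
import json
import time
import urllib.request


def prompt_in_queue(queue_state, prompt_id):
    """True while the prompt is still listed as running or pending."""
    for key in ("queue_running", "queue_pending"):
        for entry in queue_state.get(key, []):
            # Assumed entry layout: [number, prompt_id, ...]
            if len(entry) > 1 and entry[1] == prompt_id:
                return True
    return False


def wait_for_prompt(prompt_id, base_url="http://localhost:8188", interval_s=5.0):
    """Block until the prompt no longer appears in the queue."""
    while True:
        with urllib.request.urlopen(f"{base_url}/api/queue") as response:
            state = json.load(response)
        if not prompt_in_queue(state, prompt_id):
            return
        time.sleep(interval_s)
```

Calling wait_for_prompt(resp["prompt_id"]) after queueing returns once the job has left the queue. Leaving the queue does not by itself prove success, so check the output folder or the history for the result.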

Batch Generation

ComfyUI processes its queue in FIFO order and keeps models loaded between consecutive jobs that use the same models. Queue all your videos at once:

Optimal:  [Video1, Video2, Video3, Video4]
                    ↑ models load once

Wasteful: [Image1, Video1, Image2, Video2]
               ↑ load  ↑ swap  ↑ swap  ↑ swap

Group same-model jobs together. If you are generating both images (Flux.2) and videos (LTX-2), do all images first, then all videos. Each model swap costs 3-13 minutes depending on the checkpoint.
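
A short sketch of that grouping, assuming each job is a (name, model) pair and that a model load happens whenever the model differs from the previous job's. The job names and model labels are placeholders.

```python
def count_model_loads(jobs):
    """Count how often the active model changes while walking the queue."""
    loads, current = 0, None
    for _, model in jobs:
        if model != current:
            loads += 1
            current = model
    return loads


def group_by_model(jobs):
    """Stable reorder so that jobs sharing a model sit next to each other."""
    first_seen = {}
    for _, model in jobs:
        first_seen.setdefault(model, len(first_seen))
    return sorted(jobs, key=lambda job: first_seen[job[1]])


if __name__ == "__main__":
    mixed = [("Image1", "flux"), ("Video1", "ltx2"),
             ("Image2", "flux"), ("Video2", "ltx2")]
    grouped = group_by_model(mixed)
    print(count_model_loads(mixed), "loads interleaved,",
          count_model_loads(grouped), "loads grouped")
```

Because the sort is stable, jobs keep their relative order within each model group, so images still come out in the order you queued them.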

Performance Summary

With FP8 checkpoint on RTX 3090 (24 GB VRAM, 64 GB RAM):

| Scenario | Time | Notes |
| --- | --- | --- |
| Cold start (720p) | ~3.5 min | Models loading from disk |
| Warm standard (720p) | ~96s | Models already in memory |
| Warm high (1080p) | ~198s | Best quality, 2.3x more pixels |
| Batch (warm, per video) | ~180s avg | Between-job overhead ~30s |

GPU temperature peaks at 79°C during Pass 2 refinement. The RTX 3090 stays well within thermal limits even during extended batch runs.

What I Would Do Differently

If I were starting from scratch:

Skip BF16, go straight to FP8. The quality difference is imperceptible and you avoid most of the swap pressure (peak swap dropped from 33 GB to ~10 GB).
Read the built-in template prompts before writing your own. ComfyUI’s example prompts for LTX-2 use temporal descriptions with camera movements and audio cues. They are a better tutorial than any guide.
Start at 720p for prompt iteration. It takes half the time and the motion quality is the same — only sharpness differs. Switch to 1080p once you have a prompt you like.
Do not trust community “best settings” posts. Most target the full (non-distilled) model. If someone recommends CFG > 1.0 or a sampler other than euler_ancestral for the distilled checkpoint, their advice will produce artifacts.
Keep scenes simple. The model’s quality ceiling is high for calm scenes with subtle motion. Push it toward complex action and quality falls off a cliff regardless of settings.

Setup Checklist

For anyone setting up LTX-2 on similar hardware:

Install ComfyUI with PyTorch 2.6+ and CUDA 12.4+
Install ComfyUI-LTXVideo custom node from Lightricks
Download ltx-2-19b-distilled-fp8.safetensors (27.1 GB) to models/checkpoints/
Download Gemma 3 12B text encoder (8.8 GB) to models/text_encoders/
Download spatial upscaler (950 MB) to models/latent_upscale_models/
Create a 32 GB swap file as a safety net (sudo fallocate -l 32G /swapfile-ai && sudo chmod 600 /swapfile-ai && sudo mkswap /swapfile-ai && sudo swapon /swapfile-ai)
Use the built-in “LTX-2 Text to Video (Distilled)” template
Do not change CFG, sampler, or sigma schedule from defaults
Write video-style prompts with temporal descriptions and audio cues
