Stable Diffusion is one of the most powerful and flexible open-source AI image generation systems available. Unlike closed-platform tools, it exposes nearly every generation parameter, which means the prompts you craft matter enormously. When you extract prompts from video using VideoToPrompt.org, you get a detailed starting point for replicating real-world visual styles. This guide walks you through the complete workflow from video analysis to finished Stable Diffusion generation.
Prerequisites: This guide assumes you have Stable Diffusion installed via AUTOMATIC1111 or ComfyUI. If you're new to Stable Diffusion, read our Beginner's Guide first, then return here for the advanced SD-specific techniques.
SDXL vs SD 1.5: Understanding Prompt Differences
The single most important decision you'll make before pasting a video-extracted prompt into Stable Diffusion is choosing the right model version. SDXL (Stable Diffusion XL) and SD 1.5 respond very differently to the same prompt text, and understanding why will save you hours of frustration.
SD 1.5 Prompt Characteristics
SD 1.5 uses an older CLIP text encoder with a 77-token limit. This means concise, keyword-dense prompts often outperform long, descriptive sentences. When VideoToPrompt.org generates a prompt from your video, you may need to trim it for SD 1.5 compatibility.
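- Token limit: 77 tokens per pass, two of which are start and end markers, so about 75 are usable (roughly 55-60 words); text beyond that is either truncated or, in interfaces such as AUTOMATIC1111, processed as a separate chunk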
- Keyword weighting: Place your most important descriptors at the beginning of the prompt
- Parenthetical emphasis: Use (keyword:1.3) syntax to boost specific elements
- Comma-separated style: SD 1.5 responds well to comma-separated keyword lists
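The trimming and emphasis advice above can be sketched in a few lines of Python. This is a minimal illustration, not part of any Stable Diffusion tool: the helper names `emphasize` and `trim_prompt` are hypothetical, and the word count is only a rough stand-in for real CLIP token counts, which require the model's tokenizer.

```python
def emphasize(keyword: str, weight: float) -> str:
    """Wrap a keyword in the (keyword:weight) emphasis syntax."""
    return f"({keyword}:{weight:g})"


def trim_prompt(prompt: str, max_words: int = 60) -> str:
    """Keep whole comma-separated phrases, in order, until the word budget is used.

    Assumption: words are used as a rough proxy for CLIP tokens.
    """
    kept = []
    used = 0
    for phrase in (p.strip() for p in prompt.split(",")):
        if not phrase:
            continue  # skip empty fragments such as a trailing comma
        words = len(phrase.split())
        if used + words > max_words:
            break  # stop at the first phrase that would exceed the budget
        kept.append(phrase)
        used += words
    return ", ".join(kept)


extracted = "cinematic portrait, young woman, dramatic side lighting, dark background"
print(trim_prompt(extracted, max_words=6))
print(emphasize("dramatic side lighting", 1.3))
```

Because phrases are kept in their original order, the most important descriptors should already sit at the front of the prompt before trimming.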
SDXL Prompt Characteristics
SDXL uses two text encoders, OpenAI's CLIP ViT-L and the larger OpenCLIP ViT-bigG (often abbreviated CLIP-L and CLIP-G), and handles longer, more natural language descriptions better than SD 1.5's single encoder. Video-extracted prompts often work better on SDXL out-of-the-box because they're written in flowing, descriptive language.
- Natural language friendly: Full sentences and paragraphs work well
- Better style understanding: References to artistic movements, directors, and photographers are more reliably interpreted
- Dual prompt fields: Depending on the interface and workflow, SDXL can take a separate prompt for the refiner stage in addition to the main prompt
- Higher base resolution: Designed for 1024x1024; generating well below that resolution can reduce quality
| Feature | SD 1.5 | SDXL |
|---|---|---|
| Optimal resolution | 512x512 or 768x768 | 1024x1024 |
| Token limit | 77 per chunk (about 75 usable); longer prompts depend on the interface's chunking | 77 per chunk for each of its two encoders |
| Prompt style | Keywords, comma-separated | Natural language sentences |
| Style accuracy | Good with trained styles | Excellent general style transfer |
| VRAM requirement | 4GB minimum | 8GB minimum (12GB recommended) |
| Generation speed | Faster on modest hardware | Slower, but higher quality |
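To show how the base resolutions in the table translate to other aspect ratios, here is a small sketch that keeps the total pixel count near the model's base area. The function name `suggest_size` is hypothetical, and snapping to multiples of 64 is an assumption based on common practice, not a rule stated in this guide.

```python
import math

# Base resolutions taken from the comparison table above.
BASE_RESOLUTION = {"sd15": 512, "sdxl": 1024}


def suggest_size(model: str, aspect_w: int, aspect_h: int, multiple: int = 64):
    """Return (width, height) with roughly the model's base pixel area.

    Assumption: dimensions are rounded to the nearest multiple of 64.
    """
    base = BASE_RESOLUTION[model]
    area = base * base
    ratio = aspect_w / aspect_h
    width = math.sqrt(area * ratio)
    height = math.sqrt(area / ratio)

    def snap(value: float) -> int:
        return max(multiple, round(value / multiple) * multiple)

    return snap(width), snap(height)


print(suggest_size("sdxl", 16, 9))
print(suggest_size("sd15", 1, 1))
```

For a 16:9 video frame on SDXL this lands on a widescreen size with about the same pixel count as 1024x1024, rather than simply shrinking one side.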
Mastering Negative Prompts
Negative prompts are one of Stable Diffusion's most powerful features and one that video-extracted prompts don't automatically provide. After you get your positive prompt from VideoToPrompt.org, pairing it with a well-crafted negative prompt dramatically improves output quality.
Universal Negative Prompt Baseline
Start with this foundation and customize based on your subject matter:
Universal Negative Prompt: ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, out of frame, extra limbs, disfigured, deformed, body out of frame, bad anatomy, watermark, signature, cut off, low contrast, underexposed, overexposed, bad art, beginner, amateur, distorted face, blurry, draft, grainy
Subject-Specific Negative Prompts
- Portraits: Add cross-eyed, asymmetrical eyes, extra fingers, fused fingers, cloned face
- Landscapes: Add oversaturated, plastic, fake, cartoon, illustration if you want photorealism
- Architecture: Add distorted perspective, impossible geometry, floating elements
- Night scenes: Add flat lighting, washed out, overexposed highlights
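One way to keep the baseline and the subject-specific additions organized is a small helper that merges them and drops duplicates. This is a sketch only; `build_negative` and the dictionary keys are hypothetical names, while the terms themselves come from the lists above.

```python
from typing import Optional

BASELINE = (
    "ugly, tiling, poorly drawn hands, poorly drawn feet, poorly drawn face, "
    "out of frame, extra limbs, disfigured, deformed, body out of frame, "
    "bad anatomy, watermark, signature, cut off, low contrast, underexposed, "
    "overexposed, bad art, beginner, amateur, distorted face, blurry, draft, grainy"
)

SUBJECT_EXTRAS = {
    "portraits": ["cross-eyed", "asymmetrical eyes", "extra fingers", "fused fingers", "cloned face"],
    "landscapes": ["oversaturated", "plastic", "fake", "cartoon", "illustration"],
    "architecture": ["distorted perspective", "impossible geometry", "floating elements"],
    "night scenes": ["flat lighting", "washed out", "overexposed highlights"],
}


def build_negative(subject: Optional[str] = None) -> str:
    """Combine the baseline with subject-specific terms, preserving order."""
    terms = [t.strip() for t in BASELINE.split(",")]
    terms += SUBJECT_EXTRAS.get(subject, [])
    seen = set()
    unique = []
    for term in terms:
        key = term.lower()
        if term and key not in seen:
            seen.add(key)
            unique.append(term)
    return ", ".join(unique)


print(build_negative("portraits"))
```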
CFG Scale and Sampling Methods
The CFG (Classifier-Free Guidance) scale controls how strictly the model follows your prompt. Video-extracted prompts are usually rich in detail, which has implications for optimal CFG settings.
CFG Scale Recommendations
- CFG 4-6: Looser interpretation, more artistic freedom — good for abstract or painterly styles extracted from artistic videos
- CFG 7-8: Balanced default — works well for most video-extracted prompts
- CFG 9-12: Strict prompt adherence — ideal when your extracted prompt contains very specific details you need preserved
- CFG 13+: Often causes oversaturation and artifacts; avoid unless testing specific effects
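To see why higher values push the image harder toward the prompt, it helps to look at the standard classifier-free guidance formula: guided = unconditional + scale × (conditional − unconditional). The sketch below applies it to plain Python lists as a toy stand-in for the model's noise predictions; real implementations operate on tensors inside the sampler.

```python
def apply_cfg(uncond, cond, scale):
    """Classifier-free guidance on toy 'noise predictions' (lists of floats).

    uncond: prediction without the prompt (or with the negative prompt)
    cond:   prediction with the positive prompt
    scale:  the CFG scale
    """
    return [u + scale * (c - u) for u, c in zip(uncond, cond)]


uncond = [0.0, 1.0]
cond = [1.0, 3.0]

for scale in (1.0, 7.0, 13.0):
    print(scale, apply_cfg(uncond, cond, scale))
```

At scale 1 the result equals the conditional prediction; larger scales extrapolate further along the difference between the two predictions, which is consistent with the oversaturation and artifacts seen at very high values.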
DPM++ 2M Karras and Other Samplers
The sampler controls how the denoising process works during image generation. For video-extracted prompts that target photorealistic or cinematic outputs, certain samplers consistently outperform others.
| Sampler | Best For | Steps | Speed |
|---|---|---|---|
| DPM++ 2M Karras | Photorealism, general purpose | 20-30 | Fast |
| DPM++ SDE Karras | Fine details, portraits | 20-25 | Moderate |
| DDIM | Consistency, img2img | 30-50 | Moderate |
| Euler a | Artistic, varied results | 20-30 | Very fast |
| UniPC | Architecture, landscapes | 15-20 | Very fast |
Recommended starting point: DPM++ 2M Karras at 25 steps with CFG 7 is a reasonable default for most video-extracted prompts; adjust from there based on your results.
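If you drive AUTOMATIC1111 from a script, the starting point above might be captured as a request payload like the one below. This is a sketch under assumptions: it presumes the web UI was launched with the `--api` flag and that the payload is sent to the `/sdapi/v1/txt2img` endpoint, and the field names reflect that API as commonly documented. Sampler naming can differ between versions, so check the `/docs` page of your own installation. The helper name `build_txt2img_payload` is hypothetical, and no network request is made here.

```python
import json


def build_txt2img_payload(prompt, negative_prompt="", steps=25, cfg_scale=7.0,
                          sampler_name="DPM++ 2M Karras", width=512, height=512):
    """Assemble a txt2img request body using the guide's default settings."""
    return {
        "prompt": prompt,
        "negative_prompt": negative_prompt,
        "steps": steps,
        "cfg_scale": cfg_scale,
        "sampler_name": sampler_name,  # assumption: name accepted by your version
        "width": width,
        "height": height,
    }


payload = build_txt2img_payload(
    "cinematic still, golden hour, woman standing in wheat field",
    negative_prompt="ugly, blurry, watermark",
)
print(json.dumps(payload, indent=2))
```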
ControlNet Integration with Video Frames
ControlNet transforms your workflow from "text-only" to "text + visual structure," making it the most powerful tool for faithfully recreating scenes from video. Instead of relying on the prompt alone, ControlNet uses extracted video frames as structural guides.
ControlNet Video Frame Workflow
- Extract the key frame: Use VideoToPrompt.org to identify and download the specific frame you want to recreate
- Choose your ControlNet model: OpenPose for figures, Canny for edges, Depth for spatial relationships, Lineart for artistic styles
- Set ControlNet weight: Start at 0.7-0.8; higher values preserve more structure, lower values allow more creative interpretation
- Combine with your extracted prompt: The text prompt guides style and content; ControlNet guides composition and structure
- Adjust guidance end: Setting guidance end to 0.8 stops ControlNet from applying during the last 20% of sampling steps, so the model can resolve final details without structural constraint
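The interaction between step count and guidance end is simple arithmetic, sketched here. `ControlNetSettings` is a hypothetical container for the values discussed above, not the ControlNet extension's API, and the exact rounding a given implementation uses is an assumption.

```python
from dataclasses import dataclass


@dataclass
class ControlNetSettings:
    """Hypothetical holder for the workflow settings described above."""
    model: str
    weight: float = 0.75       # within the suggested 0.7-0.8 starting range
    guidance_end: float = 0.8  # fraction of sampling during which ControlNet applies

    def __post_init__(self):
        if not 0.0 <= self.guidance_end <= 1.0:
            raise ValueError("guidance_end must be between 0 and 1")

    def guided_steps(self, total_steps: int) -> int:
        """Approximate number of steps that receive structural guidance."""
        return round(total_steps * self.guidance_end)


settings = ControlNetSettings(model="canny")
total = 25
guided = settings.guided_steps(total)
print(f"{guided} guided steps, {total - guided} free steps")
```

With 25 steps and guidance end at 0.8, roughly the first 20 steps follow the video frame's structure and the last 5 are left to the prompt alone.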
Which ControlNet to Use
- Canny: Detects edges — excellent for architecture, products, and scenes with strong geometric elements
- Depth: Estimates spatial depth — ideal for landscapes and scenes with foreground/background separation
- OpenPose: Detects human body positions — essential for portrait and action scenes
- Lineart: Preserves line drawings — great for stylized or illustrated video sources
- IP-Adapter: Style transfer from a reference image. Strictly speaking it is not a ControlNet model, but it is typically used alongside one; use a video frame as a style source while applying your extracted prompt
LoRA Models for Style Enhancement
LoRA (Low-Rank Adaptation) models are small fine-tuning files that teach Stable Diffusion specific styles, characters, or concepts. When your video-extracted prompt identifies a particular artistic style, finding a matching LoRA can dramatically enhance the output.
LoRA Prompt Syntax
Add LoRAs to your prompt using the angle bracket syntax: <lora:filename:weight>
Example with LoRA: cinematic still, golden hour, woman standing in wheat field, warm volumetric light, shallow depth of field, film grain <lora:cinematicStyle:0.8> <lora:filmGrain:0.4>
Stacking Multiple LoRAs
You can stack multiple LoRAs, but as a rule of thumb keep the combined weights at or below about 1.5 to avoid artifacts. For video-extracted prompts, a common effective combination is:
- Style LoRA at 0.6-0.8 (handles the overall look)
- Detail enhancer LoRA at 0.3-0.4 (improves texture and sharpness)
- Specific element LoRA at 0.3-0.5 (if needed for lighting or specific objects)
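The stacking guideline can be checked mechanically before you generate. The sketch below builds tags in the angle bracket syntax and sums the weights found in a prompt; the helper names are hypothetical, the LoRA filenames are the placeholder names from the example above, and the 1.5 limit is the rule of thumb from this section rather than a hard constraint of Stable Diffusion.

```python
import re

# Matches tags of the form <lora:filename:weight>
LORA_TAG = re.compile(r"<lora:([^:>]+):([0-9]*\.?[0-9]+)>")


def lora_tag(name: str, weight: float) -> str:
    """Format a LoRA tag in the angle bracket syntax."""
    return f"<lora:{name}:{weight:g}>"


def total_lora_weight(prompt: str) -> float:
    """Sum the weights of all LoRA tags found in a prompt."""
    return sum(float(weight) for _, weight in LORA_TAG.findall(prompt))


def within_stack_limit(prompt: str, limit: float = 1.5) -> bool:
    """Check the rule-of-thumb limit, with a small tolerance for float rounding."""
    return total_lora_weight(prompt) <= limit + 1e-9


prompt = (
    "cinematic still, golden hour, woman standing in wheat field "
    + lora_tag("cinematicStyle", 0.8) + " " + lora_tag("filmGrain", 0.4)
)
print(prompt)
print(round(total_lora_weight(prompt), 2), within_stack_limit(prompt))
```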
AUTOMATIC1111 vs ComfyUI Workflow Differences
Both interfaces support the same underlying Stable Diffusion models, but they organize the workflow very differently. Your choice affects how you apply video-extracted prompts.
AUTOMATIC1111 Workflow
AUTOMATIC1111 (A1111) uses a form-based interface. Paste your video-extracted prompt directly into the positive prompt box. Key settings to configure:
- Hires fix: Enable to generate at the base resolution and then upscale (2x is a common setting); especially useful for getting sharper, higher-resolution results from SD 1.5 models
- Restore faces: Enable for portrait prompts extracted from video footage of people
- Tiling: Disable unless your video source was a repeating pattern/texture
- Script: X/Y/Z plot: Use this to test your extracted prompt across different CFG values and samplers in a single comparison grid
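Conceptually, an X/Y/Z plot is a grid over every combination of the values you list. The following sketch enumerates such a grid in plain Python to show how many images a test will produce; it does not call the A1111 script itself, and `build_grid` is a hypothetical helper.

```python
from itertools import product


def build_grid(cfg_values, samplers):
    """Return one settings dict per (sampler, CFG) combination."""
    return [
        {"sampler": sampler, "cfg_scale": cfg}
        for sampler, cfg in product(samplers, cfg_values)
    ]


grid = build_grid(cfg_values=[5, 7, 9], samplers=["DPM++ 2M Karras", "Euler a"])
print(f"{len(grid)} images to generate")
for cell in grid:
    print(cell)
```

The grid grows multiplicatively, so three CFG values across two samplers already means six generations per prompt.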
ComfyUI Workflow
ComfyUI uses a node-based graph editor that offers more flexibility but a steeper learning curve. For video-extracted prompts, ComfyUI excels because you can:
- Build dedicated "video to image" pipelines that automatically extract frames and apply prompts
- Use the SDXL base + refiner workflow natively for maximum quality
- Chain multiple generations (e.g., extract prompt → generate → upscale → face restore) in one automated graph
- Use the VideoHelperSuite extension to process entire video files frame-by-frame
Example Prompts with SD-Specific Syntax
Here are complete, ready-to-use prompts adapted from common video sources, formatted for Stable Diffusion:
Cinematic Portrait (SD 1.5 Style)
Positive: (masterpiece:1.2), (best quality:1.1), cinematic portrait, young woman, dramatic side lighting, dark background, film noir aesthetic, shallow depth of field, 35mm film, subtle grain, (detailed skin:1.1), high contrast, moody atmosphere, professional photography
Negative: ugly, blurry, low quality, cartoon, anime, overexposed, flat lighting, extra limbs, bad anatomy
Settings: CFG 7.5, DPM++ 2M Karras, 28 steps, 768x1024
Golden Hour Landscape (SDXL Style)
Positive: Breathtaking landscape photograph at golden hour, rolling hills covered in tall grass, warm amber and orange light casting long shadows, distant mountains with atmospheric haze, dramatic cloud formations, shot with a wide-angle lens, photorealistic, hyperdetailed, National Geographic quality, volumetric lighting, 8K resolution
Negative: CGI, artificial, oversaturated, cartoon, painting, low resolution, watermark
Settings: CFG 6, DPM++ SDE Karras, 30 steps, 1024x768
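If you keep a library of prompts in this format, the Settings lines can be turned into structured values with a short parser. This is a sketch that assumes the exact layout used in the two examples above (CFG value, sampler name, step count, then dimensions read as width x height); `parse_settings` is a hypothetical helper, not part of any Stable Diffusion interface.

```python
import re


def parse_settings(line: str) -> dict:
    """Parse a line such as 'Settings: CFG 7.5, DPM++ 2M Karras, 28 steps, 768x1024'."""
    text = line.strip()
    if text.startswith("Settings:"):
        text = text[len("Settings:"):]
    settings = {}
    for part in (p.strip() for p in text.split(",")):
        cfg = re.fullmatch(r"CFG\s+([\d.]+)", part)
        steps = re.fullmatch(r"(\d+)\s+steps", part)
        size = re.fullmatch(r"(\d+)x(\d+)", part)
        if cfg:
            settings["cfg_scale"] = float(cfg.group(1))
        elif steps:
            settings["steps"] = int(steps.group(1))
        elif size:
            # Assumption: dimensions are written as width x height.
            settings["width"] = int(size.group(1))
            settings["height"] = int(size.group(2))
        elif part:
            settings["sampler"] = part  # anything else is treated as the sampler name
    return settings


print(parse_settings("Settings: CFG 7.5, DPM++ 2M Karras, 28 steps, 768x1024"))
print(parse_settings("Settings: CFG 6, DPM++ SDE Karras, 30 steps, 1024x768"))
```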
Common Issues and Fixes
Even with perfect video-extracted prompts, Stable Diffusion can produce unexpected results. Here are the most common problems and their solutions:
| Problem | Likely Cause | Fix |
|---|---|---|
| Faces look distorted | Model struggles with faces at small resolution | Enable ADetailer or Restore Faces; increase resolution |
| Colors look washed out | CFG too low or wrong sampler | Increase CFG to 7-8; try DPM++ 2M Karras |
| Prompt ignored partially | Token limit exceeded (SD 1.5) | Trim prompt to 60 words; move key terms to the front |
| Artifacts and noise | CFG too high | Reduce CFG to 6-7.5 |
| Wrong style despite correct prompt | Wrong checkpoint model | Choose a model fine-tuned for the target style |
| ControlNet ignoring structure | Weight too low | Increase ControlNet weight to 0.8-1.0 |
The video-to-prompt workflow for Stable Diffusion rewards patience and experimentation. Start with the extracted prompt as-is, make small adjustments based on your output, and build a personal library of settings that work for your preferred visual styles. With ControlNet and LoRA stacking, you can achieve results that closely mirror the original video source while expressing your unique creative vision.