Getting Started
This guide will help you get up and running with LuisaCompute, from installation to your first GPU-accelerated program.
Overview
LuisaCompute is a high-performance cross-platform computing framework designed for modern graphics and general-purpose GPU computing. It balances three key goals:
Unification: Write code once, run on CUDA, DirectX, Metal, or CPU
Programmability: Modern C++ embedded DSL with intuitive syntax
Performance: Optimized backends with automatic command scheduling
The framework consists of three major components:
Embedded DSL: Write GPU kernels directly in C++ using Var<T>, Kernel, and Callable
Unified Runtime: Resource management and command scheduling with Context, Device, and Stream
Multiple Backends: CUDA, DirectX, Metal, and CPU implementations
Prerequisites
Hardware Requirements
| Backend | Requirements |
|---|---|
| CUDA | NVIDIA GPU with RTX support (RTX 20 series or newer recommended) |
| DirectX | DirectX 12.1 & Shader Model 6.5 compatible GPU |
| Metal | macOS device with Metal support |
| CPU | Any modern x86_64 or ARM64 processor |
Software Requirements
Operating System: Windows 10/11, Linux (Ubuntu 20.04+), or macOS 12+
Compiler: C++20 compatible compiler (MSVC 2019+, GCC 11+, Clang 14+)
Build System: CMake 3.20+ or XMake 2.7+
Python: 3.10+ (for bootstrap script and Python frontend)
Backend-Specific Requirements
CUDA: CUDA Toolkit 11.7+, NVIDIA driver R535+
DirectX: Windows SDK 10.0.19041.0+
Metal: Xcode 14+ (macOS)
Building from Source
Clone the Repository
git clone -b next https://github.com/LuisaGroup/LuisaCompute.git --recursive
cd LuisaCompute
Important: The --recursive flag is required to fetch all submodules.
Build Using the Bootstrap Script (Recommended)
The bootstrap script automates dependency installation and building:
# Build with CUDA backend using CMake
python bootstrap.py cmake -f cuda -b
# Build with all available backends
python bootstrap.py cmake -f all -b
# Build with specific configuration
python bootstrap.py cmake -f cuda -b -- -DCMAKE_BUILD_TYPE=Release
# Generate IDE project without building
python bootstrap.py cmake -f cuda -c -o cmake-build-release
Install missing dependencies:
python bootstrap.py -i rust # Install Rust (for CPU backend)
python bootstrap.py -i cmake # Install/upgrade CMake
python bootstrap.py -i xmake # Install XMake
Build Using CMake Directly
mkdir build && cd build
cmake .. -DLUISA_BACKEND_CUDA=ON -DCMAKE_BUILD_TYPE=Release
cmake --build . --parallel
Build Using XMake
xmake config --cuda=true
xmake build
CMake Integration in Your Project
# Add LuisaCompute as a subdirectory
add_subdirectory(LuisaCompute)
# Link to your target
target_link_libraries(your_target PRIVATE luisa::compute)
Or use the starter template for quick setup.
Your First Program
Let’s create a simple program that fills an image with a gradient color on the GPU.
Complete Example
#include <luisa/luisa-compute.h>
#include <luisa/dsl/sugar.h> // For $if, $while, etc.
#include <stb_image_write.h> // For saving the result
using namespace luisa;
using namespace luisa::compute;
int main(int argc, char *argv[]) {
// Step 1: Initialize the runtime context
Context context{argv[0]};
// Step 2: Create a device (GPU) to run computations on
Device device = context.create_device("cuda");
// Step 3: Create a command stream for submitting work
Stream stream = device.create_stream();
// Step 4: Create a 1024x1024 image with 4-channel 8-bit storage
// The template argument <float> means pixel values are automatically
// converted between byte4 and float4 when reading/writing
Image<float> image = device.create_image<float>(
PixelStorage::BYTE4, // Internal storage format
1024, 1024 // Width and height
);
// Step 5: Define a callable (reusable function)
Callable linear_to_srgb = [](Float4 linear) noexcept {
Float3 rgb = linear.xyz();
// Apply gamma correction per channel. The comparison rgb <= 0.00031308f
// yields a boolean vector, so use select() rather than a scalar $if branch
rgb = select(1.055f * pow(rgb, 1.0f / 2.4f) - 0.055f, 12.92f * rgb, rgb <= 0.00031308f);
return make_float4(rgb, linear.w);
};
// Step 6: Define a 2D kernel (entry point for GPU execution)
Kernel2D fill_kernel = [&](ImageFloat image) noexcept {
// Get the current thread's position in the grid
Var coord = dispatch_id().xy();
Var size = dispatch_size().xy();
// Calculate normalized UV coordinates (0 to 1)
Float2 uv = make_float2(coord) / make_float2(size);
// Create a gradient color
Float4 color = make_float4(uv, 0.5f, 1.0f);
// Apply sRGB conversion and write to image
image->write(coord, linear_to_srgb(color));
};
// Step 7: Compile the kernel into a shader
auto shader = device.compile(fill_kernel);
// Step 8: Prepare host memory for downloading the result
std::vector<std::byte> pixels(1024 * 1024 * 4);
// Step 9: Execute commands on the GPU
stream << shader(image.view(0)).dispatch(1024, 1024) // Launch kernel
<< image.copy_to(pixels.data()) // Download to CPU
<< synchronize(); // Wait for completion
// Step 10: Save the result
stbi_write_png("output.png", 1024, 1024, 4, pixels.data(), 0);
return 0;
}
Building Your Program
# Compile with your build system, linking against LuisaCompute
g++ -std=c++20 myprogram.cpp -I/path/to/luisa/include \
-L/path/to/luisa/lib -lluisa-compute-runtime \
-o myprogram
Core Concepts
1. Context and Device
The Context is the entry point for the runtime. It manages backend plugins and global configuration:
// Create context with executable path
Context context{argv[0]};
// Create a specific backend device
Device cuda_device = context.create_device("cuda");
Device dx_device = context.create_device("dx"); // Windows only
Device metal_device = context.create_device("metal"); // macOS only
Device cpu_device = context.create_device("cpu");
// Or use the default available backend
Device device = context.create_default_device();
2. Streams and Commands
Streams are command queues for asynchronous execution:
// Create a compute stream
Stream stream = device.create_stream(StreamTag::COMPUTE);
// Commands are submitted using operator<<
stream << command1 << command2 << command3;
// Commands are automatically batched and executed
// Explicit synchronization:
stream << synchronize();
// Or:
stream.synchronize();
3. Resources
Buffers
Linear memory for structured data:
// Create a buffer of 1000 float4 elements
Buffer<float4> buffer = device.create_buffer<float4>(1000);
// Upload data from host
std::vector<float4> host_data(1000);
stream << buffer.copy_from(host_data.data());
// Download data to host
stream << buffer.copy_to(host_data.data());
Images
2D textures with hardware-accelerated sampling:
// Create a 1920x1080 image with RGBA8 format
Image<float> image = device.create_image<float>(
PixelStorage::BYTE4, 1920, 1080);
// Create with mipmaps
Image<float> image_with_mips = device.create_image<float>(
PixelStorage::BYTE4, 1920, 1080, 10); // 10 mipmap levels
4. The DSL
Variables
Use Var<T> to represent device-side variables:
// Scalar types
Var<float> x = 1.0f; // Or: Float x = 1.0f;
Var<int> count = 100; // Or: Int count = 100;
// Vector types
Float2 pos = make_float2(1.0f, 2.0f);
Float3 color = make_float3(1.0f, 0.5f, 0.0f);
Float4 rgba = make_float4(color, 1.0f);
// Swizzling
Float2 xy = pos.xy(); // Extract x and y
Float3 xyz = rgba.xyz(); // Extract first three components
Float4 repeated = pos.xxxx(); // (x, x, x, x)
Control Flow
Use $-prefixed macros for device-side control flow:
$if (condition) {
// GPU-side branch
} $elif (other_condition) {
// Another branch
} $else {
// Default branch
};
$while (condition) {
// Loop on GPU
};
$for (i, 0, 100) {
// Loop from 0 to 99
// i is Var<int>
};
$for (i, 0, 100, 2) {
// Loop from 0 to 98 with step 2
};
Note: Use native C++ if for host-side (compile-time) decisions that affect what gets recorded into the kernel.
Kernels and Callables
// A Callable is a reusable device function
Callable lerp = [](Float a, Float b, Float t) noexcept {
return a * (1.0f - t) + b * t;
};
// A Kernel is the entry point for GPU execution
Kernel2D render_kernel = [&](ImageFloat image, BufferFloat4 colors) noexcept {
Var coord = dispatch_id().xy();
Var index = coord.y * dispatch_size().x + coord.x;
Float4 color = colors.read(index);
image->write(coord, color);
};
// Compile and dispatch
auto shader = device.compile(render_kernel);
stream << shader(image, colors).dispatch(1024, 1024);
Next Steps
Learn more about the DSL for advanced kernel programming
Explore Resources and Runtime for detailed resource management
Understand the Architecture for implementation details
Check out the API Reference for complete class documentation
Browse example programs in src/tests for real-world usage patterns
Troubleshooting
Backend Not Found
Error: Backend 'cuda' not found
Ensure the backend plugin exists:
luisa-backend-cuda.dll (Windows) or luisa-backend-cuda.so (Linux)
Check that the plugin is in the same directory as your executable
Verify CUDA toolkit is installed and drivers are up to date
Compilation Errors
Ensure C++20 is enabled (-std=c++20 for GCC/Clang, /std:c++20 for MSVC)
Check that all submodules are cloned (git submodule update --init --recursive)
Runtime Validation
Enable validation layer for debugging:
Device device = context.create_device("cuda", nullptr, true);
// Or via environment variable:
// LUISA_ENABLE_VALIDATION=1 ./myprogram