Project Architecture

This document describes the internal architecture of LuisaCompute, including the compilation pipeline, runtime system, and backend implementations.

Overview

LuisaCompute is structured in three main layers beneath the application code:

┌─────────────────────────────────────────────────────────────┐
│                    Application Layer                        │
│         (User code, kernels, resource management)           │
├─────────────────────────────────────────────────────────────┤
│                      Frontend (DSL)                         │
│    (C++ template metaprogramming, AST construction)         │
├─────────────────────────────────────────────────────────────┤
│                    Middle-end (Runtime)                     │
│   (Resource wrappers, command encoding, device interface)   │
├─────────────────────────────────────────────────────────────┤
│              Backend (Platform-specific)                    │
│   (CUDA, DirectX, Metal, CPU code generation & execution)   │
└─────────────────────────────────────────────────────────────┘

Frontend: The Embedded DSL

The embedded domain-specific language (DSL) layer lets users write GPU kernels directly in standard C++.

AST Construction

When you write a kernel:

Kernel2D my_kernel = [&](ImageFloat img) noexcept {
    Var coord = dispatch_id().xy();
    img->write(coord, make_float4(1.0f));
};

The following happens at C++ runtime:

  1. FunctionBuilder Activation: A FunctionBuilder instance is pushed onto a thread-local stack

  2. Argument Creation: DSL variables (Var<T>) are created as AST expression nodes

  3. Operator Overloading: Each operation (+, *, function calls) records AST nodes

  4. AST Finalization: The builder is popped and the function is finalized

Key Components

FunctionBuilder

The FunctionBuilder class in include/luisa/ast/function_builder.h is the core of AST construction:

// Simplified concept
class FunctionBuilder {
    // Records expressions
    const Expression* literal(const Type* type, const void* data);
    const Expression* binary(BinaryOp op, const Expression* lhs, const Expression* rhs);
    const Expression* call(const Function* func, luisa::span<const Expression* const> args);
    
    // Records statements
    void assign(const Expression* lhs, const Expression* rhs);
    void if_(const Expression* cond, const Statement* true_branch, const Statement* false_branch);
    void for_(const Statement* init, const Expression* cond, const Expression* step, const Statement* body);
};

Var and Expression

Var<T> wraps an AST expression pointer:

template<typename T>
struct Var : public Ref<T> {
    Var() : Ref<T>{FunctionBuilder::current()->local(Type::of<T>())} {}
    // Operations delegate to FunctionBuilder
};

When you write Float a = b + c, it roughly translates to:

// a = b + c becomes:
auto expr = FunctionBuilder::current()->binary(
    BinaryOp::ADD, 
    b.expression(), 
    c.expression()
);
FunctionBuilder::current()->assign(a.expression(), expr);

Type System

The type system supports:

  • Scalar types: bool, int, uint, float, short, ushort, slong, ulong, half

  • Vector types: Vector<T, N> for N = 2, 3, 4

  • Matrix types: Matrix<N> for N = 2, 3, 4

  • Arrays: std::array<T, N>

  • Structures: User-defined with LUISA_STRUCT macro
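
For example, a plain C++ struct can be registered for kernel use with the LUISA_STRUCT macro (a minimal sketch; the listed fields must match the struct's members):

// Hypothetical user-defined structure
struct Ray {
    float3 origin;
    float3 direction;
};
LUISA_STRUCT(Ray, origin, direction) {};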

Type reflection is implemented in include/luisa/ast/type.h:

// Type registry ensures unique type instances
const Type* type = Type::of<float3>();  // Returns singleton
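
The returned Type also answers reflection queries; a brief sketch of typical accessors (names as declared in type.h):

const Type* t = Type::of<float3>();
// e.g., t->is_vector(), t->dimension(), and t->size()
// describe the kind, arity, and byte size of the type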

Middle-end: Runtime and Resources

The runtime layer provides a unified interface over different GPU APIs.

DeviceInterface

DeviceInterface (in include/luisa/runtime/rhi/device_interface.h) is the abstraction over backends:

// Simplified interface
class DeviceInterface {
public:
    // Resource creation
    virtual ResourceCreationInfo create_buffer(size_t size) = 0;
    virtual ResourceCreationInfo create_image(...) = 0;
    
    // Shader compilation
    virtual uint64_t create_shader(...) = 0;
    
    // Command execution
    virtual void dispatch(Stream* stream, const CommandList& commands) = 0;
    
    // Synchronization
    virtual void synchronize_event(uint64_t event) = 0;
};
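
Applications rarely talk to DeviceInterface directly; the Device wrapper obtains an implementation from a Context. A sketch of the usual setup:

Context context{argv[0]};
Device device = context.create_device("cuda");  // or "dx", "metal", "cpu", ...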

Resource Management

Resources follow a handle-based design:

// Resource is a lightweight, handle-owning wrapper
template<typename T>
class Resource {
    DeviceInterface* _device{nullptr};
    uint64_t _handle{};
public:
    // RAII: the destructor releases the backend object
    // (the null check guards against moved-from wrappers)
    ~Resource() {
        if (_device != nullptr) { _device->destroy_resource(_handle); }
    }
};

The actual GPU memory is managed by the backend implementation.
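
In user code, resources come from typed factory methods on Device; for instance (a sketch):

auto buffer = device.create_buffer<float>(1024u);  // 1024 floats on the device
auto image = device.create_image<float>(PixelStorage::FLOAT4, 512u, 512u);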

Command System

Commands are encoded as lightweight description objects:

// Command hierarchy (simplified)
class Command {
public:
    virtual ~Command() noexcept = default;
    virtual void accept(CommandVisitor& visitor) const = 0;
};

class ShaderDispatchCommand : public Command {
    uint64_t shader_handle;
    luisa::vector<Argument> arguments;
    uint3 dispatch_size;
};

class BufferUploadCommand : public Command {
    uint64_t buffer_handle;
    void* host_data;
    size_t offset;
    size_t size;
};
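
Commands are usually created indirectly through resource methods rather than constructed by hand; for example, copying host data into a buffer enqueues an upload command:

stream << buffer.copy_from(host_data);  // encodes a BufferUploadCommand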

Command Scheduling

A key feature is automatic dependency tracking:

// Commands are analyzed for resource usage
void CommandReorderVisitor::visit(const ShaderDispatchCommand& cmd) {
    for (auto& arg : cmd.arguments) {
        if (arg.type == Argument::BUFFER) {
            auto usage = cmd.get_resource_usage(arg.buffer);
            // Track read-after-write, write-after-read dependencies
            _dependencies.emplace_back(arg.buffer, usage, cmd);
        }
    }
}

The scheduler builds a DAG of commands and can:

  • Reorder independent commands for better throughput

  • Insert necessary memory barriers

  • Batch compatible commands
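
As an illustration (hedged; shader_a/shader_b/shader_c and the buffers are hypothetical), two dispatches touching the same buffer keep their order, while a dispatch on a disjoint resource may be reordered or overlapped:

stream << shader_a(buf_x).dispatch(n)   // writes buf_x
       << shader_b(buf_x).dispatch(n)   // reads buf_x: RAW dependency, order kept
       << shader_c(buf_y).dispatch(n);  // independent resource: free to reorder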

Backend: Code Generation and Execution

Each backend translates the AST to platform-specific code.

CUDA Backend

Directory: src/backends/cuda/

The CUDA backend:

  1. Generates PTX or CUDA C++ from the AST

  2. Uses NVCC or NVRTC for compilation

  3. Manages CUmodule, CUfunction, and CUdeviceptr

// Simplified code generation
void CUDACodegenAST::visit(const BinaryExpr* expr) {
    emit("(");
    expr->lhs()->accept(*this);
    emit(" %s ", binary_op_name(expr->op()));
    expr->rhs()->accept(*this);
    emit(")");
}
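
For the b + c expression from the earlier DSL example, this visitor would emit the string (b + c); statements, declarations, and built-in calls are handled by analogous visit overloads.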

DirectX Backend

Directory: src/backends/dx/

The DirectX backend:

  1. Generates HLSL from the AST

  2. Uses DXC for compilation to DXIL

  3. Manages ID3D12PipelineState, ID3D12Resource

Key features:

  • Root signature generation for descriptor binding

  • Resource barrier tracking for state transitions

Metal Backend

Directory: src/backends/metal/

The Metal backend:

  1. Generates Metal Shading Language from the AST

  2. Uses the Metal compiler framework

  3. Manages MTLLibrary, MTLFunction, MTLBuffer

CPU Backend

Directory: src/rust/

The CPU backend is implemented in Rust using:

  • LLVM for code generation

  • Embree for ray tracing

  • Custom threading for parallel execution

IR (Intermediate Representation)

LuisaCompute v2 introduces a new IR for more advanced optimizations.

AST to IR Conversion

AST (FunctionBuilder) -> IR (luisa::ir) -> Backend Code

The IR provides:

  • SSA (Static Single Assignment) form for easier analysis

  • Explicit control flow graph

  • Type-preserving transformations

IR Passes

Located in include/luisa/xir/passes/:

  • DCE (Dead Code Elimination): Removes unused computations

  • Mem2Reg: Promotes stack variables to registers

  • Reg2Mem: Converts registers back to memory for complex control flow

  • Outline: Extracts code into separate functions

  • Autodiff: Automatic differentiation transformation

// Example: Mem2Reg pass (conceptual sketch)
void mem2reg_pass(Function* func) {
    // 1. Find alloca sites whose address never escapes
    // 2. Rewrite their loads/stores as SSA register values
    // 3. Insert phi nodes at control-flow merge points
}

Shader Compilation Pipeline

The full compilation flow:

1. User defines Kernel/Callable in C++
   ↓
2. Lambda executes, records AST via FunctionBuilder
   ↓
3. AST is finalized into a Function object
   ↓
4. device.compile(kernel) is called
   ↓
5. Backend converts AST to platform source (PTX, HLSL, MSL, LLVM IR)
   ↓
6. Platform compiler generates machine code
   ↓
7. Backend creates Shader object with compiled binary
   ↓
8. Shader is ready for dispatch
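
In code, steps 4 through 8 reduce to a compile call followed by a dispatch (a sketch reusing my_kernel and image from the DSL example above):

auto stream = device.create_stream();
auto shader = device.compile(my_kernel);         // steps 4-7
stream << shader(image).dispatch(1024u, 1024u);  // step 8
stream.synchronize();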

Shader Caching

Compiled shaders are cached to disk:

<build-dir>/bin/.cache/
├── cuda/
│   └── <hash>.ptx
├── dx/
│   └── <hash>.dxil
└── metal/
    └── <hash>.metallib

Cache key includes:

  • AST hash

  • Backend version

  • Compilation options (fast math, debug info, etc.)

Memory Management

GPU Memory

Backends use different allocation strategies:

Backend    Allocator                Strategy
CUDA       cuMemAlloc / cuMemPool   Memory pools for efficiency
DirectX    D3D12MA                  TLSF-based custom allocator
Metal      MTLHeap                  Heap-based sub-allocation
CPU        mimalloc                 Per-thread heap allocation

Host Memory

Pinned memory for efficient transfers:

// Pinned memory (page-locked) for fast GPU upload
void* pinned = allocate_pinned(size);
cudaMemcpyAsync(gpu_ptr, pinned, size, cudaMemcpyHostToDevice, stream);

Threading Model

Host-Side

  • Stream execution: Commands are queued and executed asynchronously

  • Callback system: Host functions can be scheduled to run after GPU work

  • Fiber support: Integration with marl for coroutine-style programming

Device-Side

Kernels execute in a 3D grid:

Grid (dispatch_size)
├── Block (block_size)
│   ├── Warp (SIMD width, e.g., 32 threads)
│   │   └── Threads execute in lockstep
│   └── Multiple warps per block
└── Multiple blocks per grid
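
The block size can be selected per kernel from inside the DSL while the kernel is being recorded (a sketch using set_block_size):

Kernel2D my_kernel = [&](ImageFloat img) noexcept {
    set_block_size(16u, 16u, 1u);  // 256 threads per block
    // ... kernel body ...
};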

Extension System

Backends can expose platform-specific features:

// Extension interface
struct DenoiserExt : DeviceExtension {
    static constexpr string_view name = "DenoiserExt";
    virtual void denoise(...) = 0;
};

// Backend implementation
class CUDADenoiser : public DenoiserExt {
    void denoise(...) override;
};

// Usage
if (auto* denoiser = device.extension<DenoiserExt>()) {
    denoiser->denoise(...);
}

Build System Integration

CMake Integration

LuisaCompute uses modern CMake (3.23+) with the following configuration options:

Main Build Options

Option                        Default   Description
LUISA_COMPUTE_ENABLE_CUDA     ON        Enable CUDA backend
LUISA_COMPUTE_ENABLE_DX       ON        Enable DirectX backend
LUISA_COMPUTE_ENABLE_METAL    ON        Enable Metal backend
LUISA_COMPUTE_ENABLE_CPU      ON        Enable CPU backend
LUISA_COMPUTE_ENABLE_VULKAN   ON        Enable Vulkan backend
LUISA_COMPUTE_ENABLE_HIP      OFF       Enable HIP backend (WIP)
LUISA_COMPUTE_ENABLE_DSL      ON        Enable C++ DSL
LUISA_COMPUTE_ENABLE_GUI      ON        Enable GUI support
LUISA_COMPUTE_ENABLE_TENSOR   OFF       Enable tensor extension
LUISA_COMPUTE_BUILD_TESTS     ON        Build test suite

Build Commands

# Basic build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# With specific backends
cmake -S . -B build \
    -DLUISA_COMPUTE_ENABLE_CUDA=ON \
    -DLUISA_COMPUTE_ENABLE_DX=OFF \
    -DLUISA_COMPUTE_ENABLE_METAL=OFF \
    -DCMAKE_BUILD_TYPE=Release

# Using Ninja (recommended)
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build

Using LuisaCompute in Your Project (CMake)

Method 1: add_subdirectory

# Clone with submodules
git clone --recursive https://github.com/LuisaGroup/LuisaCompute.git

# In your CMakeLists.txt:
add_subdirectory(LuisaCompute)
target_link_libraries(your_target PRIVATE luisa::compute)

Method 2: FetchContent

include(FetchContent)
FetchContent_Declare(
    luisacompute
    GIT_REPOSITORY https://github.com/LuisaGroup/LuisaCompute.git
    GIT_TAG next
    GIT_SUBMODULES_RECURSE TRUE
)
FetchContent_MakeAvailable(luisacompute)

target_link_libraries(your_target PRIVATE luisa::compute)

Method 3: find_package (after installation)

# Install LuisaCompute first
cmake --build build --target install

# In your CMakeLists.txt
find_package(LuisaCompute REQUIRED)
target_link_libraries(your_target PRIVATE LuisaCompute::compute)

XMake Integration

LuisaCompute uses XMake (3.0.6+) as an alternative build system with a more streamlined workflow.

Main Build Options

Backend Options

Option                Default   Description
lc_cuda_backend       true      Enable NVIDIA CUDA backend
lc_dx_backend         true      Enable DirectX 12 backend
lc_vk_backend         true      Enable Vulkan backend
lc_metal_backend      true      Enable Metal backend
lc_fallback_backend   false     Enable fallback backend
lc_toy_c_backend      false     Enable toy C backend (experimental)

Backend Extensions

Option               Default   Description
lc_cuda_ext_lcub     false     Enable NVIDIA CUB extension (long compile time)
lc_dx_cuda_interop   false     Enable DirectX-CUDA interop
lc_vk_cuda_interop   false     Enable Vulkan-CUDA interop

Module Options

Option               Default   Description
lc_enable_dsl        true      Enable C++ DSL module
lc_enable_gui        true      Enable GUI module
lc_enable_imgui      true      Enable ImGui support
lc_enable_osl        true      Enable OSL (Open Shading Language) support
lc_enable_py         true      Enable Python bindings
lc_enable_clangcxx   false     Enable Clang C++ module
lc_enable_xir        false     Enable XIR (experimental IR)

Build Configuration

Option                    Default   Description
lc_enable_mimalloc        true      Use mimalloc as default allocator
lc_enable_custom_malloc   false     Enable custom malloc
lc_enable_unity_build     true      Enable unity (jumbo) build for faster compilation
lc_enable_simd            true      Enable SSE and SSE2 SIMD
lc_use_lto                false     Enable Link Time Optimization
lc_rtti                   false     Enable C++ RTTI
lc_cxx_standard           cxx20     C++ standard (cxx20, cxx23, etc.)
lc_c_standard             clatest   C standard
lc_enable_tests           true      Enable test suite
lc_external_marl          false     Use external marl library
lc_use_system_stl         false     Use system STL instead of EASTL

Python Configuration

Option          Default   Description
lc_py_include   false     Python include path
lc_py_linkdir   false     Python library directory
lc_py_libs      false     Python libraries to link

Path Configuration

Option           Default   Description
lc_bin_dir       bin       Custom binary output directory
lc_sdk_dir       false     Custom SDK directory
lc_llvm_path     false     LLVM installation path (for CPU backend)
lc_embree_path   false     Embree path (for CPU ray tracing)
lc_toolchain     false     Custom toolchain
lc_win_runtime   false     Windows runtime library
lc_optimize      false     Additional optimization flags

Third-Party Source

Option                Default   Description
lc_spdlog_use_xrepo   false     Use xrepo for spdlog
lc_reproc_use_xrepo   false     Use xrepo for reproc
lc_lmdb_use_xrepo     false     Use xrepo for lmdb
lc_imgui_use_xrepo    false     Use xrepo for imgui
lc_glfw_use_xrepo     false     Use xrepo for glfw
lc_yyjson_use_xrepo   false     Use xrepo for yyjson

Build Commands

# Basic release build
xmake f -m release -c
xmake

# With specific backends
xmake f -m release --lc_cuda_backend=true --lc_dx_backend=false --lc_metal_backend=false -c
xmake

# Debug build
xmake f -m debug -c
xmake

# Using ClangCL toolchain (recommended on Windows)
xmake f -m release --toolchain=clang-cl -c
xmake

Local Configuration with options.lua

You can create scripts/options.lua to save default configuration for your local environment:

# Generate default options.lua
xmake lua scripts/write_options.lua

Example scripts/options.lua:

lc_options = {
    toolchain = "clang-cl",           -- Use LLVM clang-cl compiler
    lc_enable_tests = true,           -- Enable test-case compilation
    lc_enable_gui = false             -- Disable GUI targets
}

Options in options.lua can be overridden by command-line arguments:

xmake f --lc_enable_tests=false -c

Using LuisaCompute in Your Project (XMake)

Method 1: Git Submodule + includes

# Add LuisaCompute as a git submodule
git submodule add https://github.com/LuisaGroup/LuisaCompute.git third_party/LuisaCompute
git submodule update --init --recursive

-- In your xmake.lua, include LuisaCompute's build scripts
includes("third_party/LuisaCompute")

target("your_app")
    set_kind("binary")
    add_files("src/*.cpp")
    add_deps("lc-dsl")

Method 2: External Package with xmake-repo

-- xmake.lua for your project
set_languages("c++20")

-- Require LuisaCompute as a package
add_requires("luisa-compute")

target("your_app")
    set_kind("binary")
    add_files("src/*.cpp")
    add_packages("luisa-compute")

Integration Best Practices

Based on LuisaRender and LuisaComputeGaussianSplatting:

  1. Always use submodules to ensure consistent versions:

    git submodule add https://github.com/LuisaGroup/LuisaCompute.git third_party/LuisaCompute
    git submodule update --init --recursive
    
  2. Set RPATH for portable binaries:

    set(CMAKE_BUILD_RPATH_USE_ORIGIN ON)
    set(CMAKE_INSTALL_RPATH "$ORIGIN;$ORIGIN/../lib")
    
  3. Handle backends gracefully:

    if(LUISA_COMPUTE_ENABLE_CUDA AND CUDA_FOUND)
        target_compile_definitions(your_target PRIVATE ENABLE_CUDA)
    endif()
    
  4. Use unity build for faster compilation:

    xmake f --lc_enable_unity_build=true -c
    
  5. Use -c flag for clean configuration when switching options:

    xmake f --lc_cuda_backend=false --lc_vk_backend=true -c
    

Performance Considerations

Kernel Optimization

  1. Memory coalescing: Ensure threads access consecutive memory

  2. Occupancy: Balance register usage and block size

  3. Branch divergence: Minimize divergent execution within warps

  4. Texture caching: Use images for 2D spatial locality

Runtime Optimization

  1. Command batching: Submit multiple commands at once

  2. Resource reuse: Avoid repeated allocations

  3. Async transfers: Overlap compute and data transfer

  4. Stream parallelism: Use multiple streams for independent work
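
These points map directly onto the stream API sketched earlier; for instance, batching an upload, a dispatch, and a readback into one submission overlaps transfer with compute (hedged; buffer, shader, and the host pointers are placeholders):

stream << buffer.copy_from(host_input)   // async upload
       << shader(buffer).dispatch(n)     // compute
       << buffer.copy_to(host_output);   // async readback
stream.synchronize();                    // block only once, at the end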

Debugging and Profiling

Validation Layer

Enabled via LUISA_ENABLE_VALIDATION=1:

  • Resource lifetime tracking

  • Memory access validation

  • Command buffer consistency checks

Profiling

Backend-specific profiling:

// CUDA: Nsight Systems/Compute integration
// DirectX: PIX markers
// Metal: Xcode GPU debugger

Future Directions

Planned architectural improvements:

  1. Graph-based execution: Explicit compute graphs for better optimization

  2. Multi-device support: Seamless multi-GPU scaling

  3. Task graph API: Higher-level task description

  4. JIT specialization: Runtime kernel specialization based on parameters