# Project Architecture

This document describes the internal architecture of LuisaCompute, including the compilation pipeline, runtime system, and backend implementations.

## Overview

LuisaCompute is structured in three main layers beneath the application:
```
┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                      │
│          (User code, kernels, resource management)          │
├─────────────────────────────────────────────────────────────┤
│                       Frontend (DSL)                        │
│      (C++ template metaprogramming, AST construction)       │
├─────────────────────────────────────────────────────────────┤
│                     Middle-end (Runtime)                    │
│   (Resource wrappers, command encoding, device interface)   │
├─────────────────────────────────────────────────────────────┤
│                 Backend (Platform-specific)                 │
│   (CUDA, DirectX, Metal, CPU code generation & execution)   │
└─────────────────────────────────────────────────────────────┘
```
## Frontend: The Embedded DSL

The domain-specific language (DSL) layer allows users to write GPU kernels in C++.

### AST Construction

When you write a kernel:
```cpp
Kernel2D my_kernel = [&](ImageFloat img) noexcept {
    Var coord = dispatch_id().xy();
    img->write(coord, make_float4(1.0f));
};
```
The following happens at C++ runtime:

1. **FunctionBuilder activation**: a `FunctionBuilder` singleton is pushed onto a thread-local stack
2. **Argument creation**: DSL variables (`Var<T>`) are created as AST expression nodes
3. **Operator overloading**: each operation (`+`, `*`, function calls) records AST nodes
4. **AST finalization**: the builder is popped and the function is finalized
### Key Components

#### FunctionBuilder

The `FunctionBuilder` class in `include/luisa/ast/function_builder.h` is the core of AST construction:
```cpp
// Simplified concept
class FunctionBuilder {
public:
    // Records expressions
    const Expression* literal(const Type* type, const void* data);
    const Expression* binary(BinaryOp op, const Expression* lhs, const Expression* rhs);
    const Expression* call(const Function* func, luisa::span<const Expression* const> args);
    // Records statements
    void assign(const Expression* lhs, const Expression* rhs);
    void if_(const Expression* cond, const Statement* true_branch, const Statement* false_branch);
    void for_(const Statement* init, const Expression* cond, const Expression* step, const Statement* body);
};
```
#### Var and Expression

`Var<T>` wraps an AST expression pointer:
```cpp
template<typename T>
struct Var : public Ref<T> {
    Var() : Ref<T>{FunctionBuilder::current()->local(Type::of<T>())} {}
    // Operations delegate to FunctionBuilder
};
```
When you write `Float a = b + c`, it roughly translates to:
```cpp
// a = b + c becomes:
auto expr = FunctionBuilder::current()->binary(
    BinaryOp::ADD,
    b.expression(),
    c.expression());
FunctionBuilder::current()->assign(a.expression(), expr);
```
### Type System

The type system supports:

- **Scalar types**: `bool`, `int`, `uint`, `float`, `short`, `ushort`, `slong`, `ulong`, `half`
- **Vector types**: `Vector<T, N>` for N = 2, 3, 4
- **Matrix types**: `Matrix<N>` for N = 2, 3, 4
- **Arrays**: `std::array<T, N>`
- **Structures**: user-defined types registered with the `LUISA_STRUCT` macro
Type reflection is implemented in `include/luisa/ast/type.h`:

```cpp
// The type registry ensures unique type instances
const Type* type = Type::of<float3>(); // returns a singleton
```
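As a minimal sketch of the structure case (the `Ray` type and its fields here are illustrative, not part of the library), a user-defined struct is registered with `LUISA_STRUCT` and then becomes reflectable like any built-in type:

```cpp
// Hypothetical user type; LUISA_STRUCT lists the members exposed to the DSL.
struct Ray {
    float3 origin;
    float t_max;
};
LUISA_STRUCT(Ray, origin, t_max) {};

// After registration, the type registry can reflect it:
const Type* ray_type = Type::of<Ray>(); // singleton, same as Type::of<float3>()
```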
## Middle-end: Runtime and Resources

The runtime layer provides a unified interface over different GPU APIs.

### DeviceInterface

`DeviceInterface` (in `include/luisa/runtime/rhi/device_interface.h`) is the abstraction over backends:
```cpp
class DeviceInterface {
public:
    // Resource creation
    virtual ResourceCreationInfo create_buffer(size_t size) = 0;
    virtual ResourceCreationInfo create_image(...) = 0;
    // Shader compilation
    virtual uint64_t create_shader(...) = 0;
    // Command execution
    virtual void dispatch(Stream* stream, const CommandList& commands) = 0;
    // Synchronization
    virtual void synchronize_event(uint64_t event) = 0;
};
```
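For orientation, application code does not call `DeviceInterface` directly; it goes through the `Context` and `Device` wrappers. A minimal sketch of common usage (assuming the umbrella header path below):

```cpp
#include <luisa/luisa-compute.h> // umbrella header (assumed path)

using namespace luisa;
using namespace luisa::compute;

int main(int argc, char* argv[]) {
    Context context{argv[0]};                        // locates backend plugins
    Device device = context.create_device("cuda");   // loads a DeviceInterface implementation
    auto buffer = device.create_buffer<float>(1024); // forwards to create_buffer() above
    auto stream = device.create_stream();
}
```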
### Resource Management

Resources follow a handle-based design:
```cpp
// Resource is a lightweight wrapper over a backend handle
template<typename T>
class Resource {
    DeviceInterface* _device;
    uint64_t _handle;
public:
    // RAII: the destructor releases the backend-side object
    ~Resource() { _device->destroy_resource(_handle); }
};
```
The actual GPU memory is managed by the backend implementation.
### Command System

Commands are encoded as lightweight description objects:
```cpp
// Command hierarchy
class Command {
public:
    virtual void accept(CommandVisitor& visitor) = 0;
};

class ShaderDispatchCommand : public Command {
    uint64_t shader_handle;
    luisa::vector<Argument> arguments;
    uint3 dispatch_size;
};

class BufferUploadCommand : public Command {
    uint64_t buffer_handle;
    void* host_data;
    size_t offset;
    size_t size;
};
```
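In user code these command objects are rarely constructed by hand; resources and shaders create them, and a stream consumes them. A sketch of typical usage (`stream`, `buffer`, `shader`, and `host_data` are assumed to exist):

```cpp
stream << buffer.copy_from(host_data)    // records a BufferUploadCommand
       << shader(buffer).dispatch(1024u) // records a ShaderDispatchCommand
       << synchronize();                 // blocks until the queued work completes
```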
### Command Scheduling

A key feature is automatic dependency tracking:
```cpp
// Commands are analyzed for resource usage
void CommandReorderVisitor::visit(const ShaderDispatchCommand& cmd) {
    for (auto& arg : cmd.arguments) {
        if (arg.type == Argument::BUFFER) {
            auto usage = cmd.get_resource_usage(arg.buffer);
            // Track read-after-write, write-after-read dependencies
            _dependencies.emplace_back(arg.buffer, usage, cmd);
        }
    }
}
```
The scheduler builds a DAG of commands and can (as sketched below):

- Reorder independent commands for better throughput
- Insert necessary memory barriers
- Batch compatible commands
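For example (hypothetical shaders and buffers; the exact reordering is backend-dependent), commands touching disjoint resources may overlap, while a read of a freshly written buffer forces ordering:

```cpp
stream << shader_a(buf_a).dispatch(n)  // writes buf_a
       << shader_b(buf_b).dispatch(n)  // writes buf_b: independent, may run concurrently
       << shader_c(buf_a).dispatch(n); // reads buf_a: ordered after shader_a via a barrier
```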
## Backend: Code Generation and Execution

Each backend translates the AST to platform-specific code.

### CUDA Backend

Location: `src/backends/cuda/`

The CUDA backend:

- Generates PTX or CUDA C++ from the AST
- Uses NVCC or NVRTC for compilation
- Manages `CUmodule`, `CUfunction`, and `CUdeviceptr` handles
```cpp
// Simplified code generation
void CUDACodegenAST::visit(const BinaryExpr* expr) {
    emit("(");
    expr->lhs()->accept(*this);
    emit(" %s ", binary_op_name(expr->op()));
    expr->rhs()->accept(*this);
    emit(")");
}
```
### DirectX Backend

Location: `src/backends/dx/`

The DirectX backend:

- Generates HLSL from the AST
- Uses DXC for compilation to DXIL
- Manages `ID3D12PipelineState`, `ID3D12Resource`, and related objects

Key features:

- Root signature generation for descriptor binding
- Resource barrier tracking for state transitions
### Metal Backend

Location: `src/backends/metal/`

The Metal backend:

- Generates Metal Shading Language (MSL) from the AST
- Uses the Metal compiler framework
- Manages `MTLLibrary`, `MTLFunction`, and `MTLBuffer` objects

### CPU Backend

Location: `src/rust/`

The CPU backend is implemented in Rust using:

- LLVM for code generation
- Embree for ray tracing
- Custom threading for parallel execution
## IR (Intermediate Representation)

LuisaCompute v2 introduces a new IR for more advanced optimizations.

### AST to IR Conversion

```
AST (FunctionBuilder) -> IR (luisa::ir) -> Backend Code
```

The IR provides:

- SSA (static single assignment) form for easier analysis
- An explicit control-flow graph
- Type-preserving transformations
### IR Passes

Located in `include/luisa/xir/passes/`:

- **DCE (dead-code elimination)**: removes unused computations
- **Mem2Reg**: promotes stack variables to registers
- **Reg2Mem**: converts registers back to memory for complex control flow
- **Outline**: extracts code into separate functions
- **Autodiff**: automatic-differentiation transformation
```cpp
// Example: the Mem2Reg pass (conceptual outline)
void mem2reg_pass(Function* func) {
    // 1. Analyze alloca sites
    // 2. Promote to registers where possible
    // 3. Insert phi nodes at merge points
}
```
## Shader Compilation Pipeline

The full compilation flow:

```
1. User defines Kernel/Callable in C++
        ↓
2. Lambda executes, records AST via FunctionBuilder
        ↓
3. AST is finalized into a Function object
        ↓
4. device.compile(kernel) is called
        ↓
5. Backend converts AST to platform source (PTX, HLSL, MSL, LLVM IR)
        ↓
6. Platform compiler generates machine code
        ↓
7. Backend creates Shader object with compiled binary
        ↓
8. Shader is ready for dispatch
```
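From the user's side, steps 1-8 collapse into a few lines. A sketch (`BufferFloat` is the buffer analogue of the `ImageFloat` alias used earlier; `device`, `stream`, and `buffer` are assumed to exist):

```cpp
Kernel1D fill = [](BufferFloat buf) noexcept {
    buf->write(dispatch_x(), 1.0f);      // steps 1-3: recorded into the AST
};
auto shader = device.compile(fill);      // steps 4-7: codegen + platform compile
stream << shader(buffer).dispatch(1024u) // step 8: dispatch the compiled shader
       << synchronize();
```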
### Shader Caching

Compiled shaders are cached to disk:

```
<build-dir>/bin/.cache/
├── cuda/
│   └── <hash>.ptx
├── dx/
│   └── <hash>.dxil
└── metal/
    └── <hash>.metallib
```
The cache key includes:

- The AST hash
- The backend version
- Compilation options (fast math, debug info, etc.)
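A hypothetical sketch of how such a key could be combined (illustrative only; this is not the library's actual hashing code):

```cpp
#include <cstdint>
#include <string_view>

// FNV-1a-style mixing of the key components listed above (illustration only).
uint64_t cache_key(uint64_t ast_hash,
                   std::string_view backend_version,
                   std::string_view options) {
    uint64_t h = ast_hash;
    auto mix = [&h](std::string_view s) {
        for (unsigned char c : s) { h ^= c; h *= 0x100000001b3ull; }
    };
    mix(backend_version);
    mix(options);
    return h; // used as <hash> in the cache-file name
}
```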
## Memory Management

### GPU Memory

Backends use different allocation strategies:

| Backend | Allocator | Strategy |
|---|---|---|
| CUDA | `cuMemAlloc` / `cuMemPool` | Memory pools for efficiency |
| DirectX | D3D12MA | TLSF-based custom allocator |
| Metal | `MTLHeap` | Heap-based sub-allocation |
| CPU | mimalloc | Per-thread heap allocation |
### Host Memory

Pinned (page-locked) memory is used for efficient transfers:

```cpp
// Pinned memory (page-locked) for fast GPU upload
void* pinned = allocate_pinned(size);
cudaMemcpyAsync(gpu_ptr, pinned, size, cudaMemcpyHostToDevice, stream);
```
## Threading Model

### Host-Side

- **Stream execution**: commands are queued and executed asynchronously
- **Callback system**: host functions can be scheduled to run after GPU work
- **Fiber support**: integration with marl for coroutine-style programming
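A sketch of the first two points together (the host-callback `operator<<` overload is assumed here; the exact syntax may differ):

```cpp
stream << shader(buffer).dispatch(1024u)           // queued; returns immediately
       << [] { LUISA_INFO("GPU work finished"); }; // host callback after the dispatch
stream << synchronize();                           // block the host until the stream drains
```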
### Device-Side

Kernels execute in a 3D grid:

```
Grid (dispatch_size)
├── Block (block_size)
│   ├── Warp (SIMD width, e.g., 32 threads)
│   │   └── Threads execute in lockstep
│   └── Multiple warps per block
└── Multiple blocks per grid
```
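The block size can be chosen per kernel from the DSL. A minimal sketch (the copy kernel is hypothetical, and 16×16 is an assumed rather than prescribed choice):

```cpp
Kernel2D copy = [&](ImageFloat src, ImageFloat dst) noexcept {
    set_block_size(16u, 16u, 1u); // 256 threads per block
    auto p = dispatch_id().xy();
    dst->write(p, src->read(p));
};
```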
## Extension System

Backends can expose platform-specific features:
```cpp
// Extension interface
struct DenoiserExt : DeviceExtension {
    static constexpr string_view name = "DenoiserExt";
    virtual void denoise(...) = 0;
};

// Backend implementation
class CUDADenoiser : public DenoiserExt {
    void denoise(...) override;
};

// Usage
if (auto* denoiser = device.extension<DenoiserExt>()) {
    denoiser->denoise(...);
}
```
## Build System Integration

### CMake Integration

LuisaCompute uses modern CMake (3.23+) with the following configuration options.

#### Main Build Options
| Option | Default | Description |
|---|---|---|
| `LUISA_COMPUTE_ENABLE_CUDA` | ON | Enable CUDA backend |
| `LUISA_COMPUTE_ENABLE_DX` | ON | Enable DirectX backend |
| `LUISA_COMPUTE_ENABLE_METAL` | ON | Enable Metal backend |
| `LUISA_COMPUTE_ENABLE_CPU` | ON | Enable CPU backend |
| `LUISA_COMPUTE_ENABLE_VULKAN` | ON | Enable Vulkan backend |
| | OFF | Enable HIP backend (WIP) |
| `LUISA_COMPUTE_ENABLE_DSL` | ON | Enable C++ DSL |
| `LUISA_COMPUTE_ENABLE_GUI` | ON | Enable GUI support |
| `LUISA_COMPUTE_ENABLE_TENSOR` | OFF | Enable tensor extension |
| `LUISA_COMPUTE_BUILD_TESTS` | ON | Build test suite |
#### Build Commands

```bash
# Basic build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# With specific backends
cmake -S . -B build \
    -DLUISA_COMPUTE_ENABLE_CUDA=ON \
    -DLUISA_COMPUTE_ENABLE_DX=OFF \
    -DLUISA_COMPUTE_ENABLE_METAL=OFF \
    -DCMAKE_BUILD_TYPE=Release

# Using Ninja (recommended)
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build
```
#### Using LuisaCompute in Your Project

##### Method 1: add_subdirectory

```bash
# Clone with submodules
git clone --recursive https://github.com/LuisaGroup/LuisaCompute.git
```

```cmake
# In your CMakeLists.txt
add_subdirectory(LuisaCompute)
target_link_libraries(your_target PRIVATE luisa::compute)
```
##### Method 2: FetchContent

```cmake
include(FetchContent)
FetchContent_Declare(
    luisacompute
    GIT_REPOSITORY https://github.com/LuisaGroup/LuisaCompute.git
    GIT_TAG next
    GIT_SUBMODULES_RECURSE TRUE
)
FetchContent_MakeAvailable(luisacompute)
target_link_libraries(your_target PRIVATE luisa::compute)
```
##### Method 3: find_package (after installation)

```bash
# Install LuisaCompute first
cmake --build build --target install
```

```cmake
# In your CMakeLists.txt
find_package(LuisaCompute REQUIRED)
target_link_libraries(your_target PRIVATE LuisaCompute::compute)
```
### XMake Integration

LuisaCompute supports XMake (3.0.6+) as an alternative build system with a more streamlined workflow.

#### Main Build Options
| Option | Default | Description |
|---|---|---|
| **Backend Options** | | |
| `lc_cuda_backend` | true | Enable NVIDIA CUDA backend |
| `lc_dx_backend` | true | Enable DirectX 12 backend |
| `lc_vk_backend` | true | Enable Vulkan backend |
| `lc_metal_backend` | true | Enable Metal backend |
| | false | Enable fallback backend |
| | false | Enable toy C backend (experimental) |
| **Backend Extensions** | | |
| | false | Enable NVIDIA CUB extension (long compile time) |
| | false | Enable DirectX-CUDA interop |
| | false | Enable Vulkan-CUDA interop |
| **Module Options** | | |
| | true | Enable C++ DSL module |
| `lc_enable_gui` | true | Enable GUI module |
| | true | Enable ImGui support |
| | true | Enable OSL (Open Shading Language) support |
| | true | Enable Python bindings |
| | false | Enable Clang C++ module |
| | false | Enable XIR (experimental IR) |
| **Build Configuration** | | |
| | true | Use mimalloc as default allocator |
| | false | Enable custom malloc |
| `lc_enable_unity_build` | true | Enable unity (jumbo) build for faster compilation |
| | true | Enable SSE and SSE2 SIMD |
| | false | Enable link-time optimization (LTO) |
| | false | Enable C++ RTTI |
| | cxx20 | C++ standard (cxx20, cxx23, etc.) |
| | clatest | C standard |
| `lc_enable_tests` | true | Enable test suite |
| | false | Use external marl library |
| | false | Use system STL instead of EASTL |
| **Python Configuration** | | |
| | false | Python include path |
| | false | Python library directory |
| | false | Python libraries to link |
| **Path Configuration** | | |
| | bin | Custom binary output directory |
| | false | Custom SDK directory |
| | false | LLVM installation path (for CPU backend) |
| | false | Embree path (for CPU ray tracing) |
| `toolchain` | false | Custom toolchain |
| | false | Windows runtime library |
| | false | Additional optimization flags |
| **Third-Party Source** | | |
| | false | Use xrepo for spdlog |
| | false | Use xrepo for reproc |
| | false | Use xrepo for lmdb |
| | false | Use xrepo for imgui |
| | false | Use xrepo for glfw |
| | false | Use xrepo for yyjson |
#### Build Commands

```bash
# Basic (release) build
xmake f -m release -c
xmake

# With specific backends
xmake f -m release --lc_cuda_backend=true --lc_dx_backend=false --lc_metal_backend=false -c
xmake

# Debug build
xmake f -m debug -c
xmake

# Using the clang-cl toolchain (recommended on Windows)
xmake f -m release --toolchain=clang-cl -c
xmake
```
#### Local Configuration with options.lua

You can create `scripts/options.lua` to save a default configuration for your local environment:

```bash
# Generate a default options.lua
xmake lua scripts/write_options.lua
```

Example `scripts/options.lua`:

```lua
lc_options = {
    toolchain = "clang-cl",  -- use the LLVM clang-cl compiler
    lc_enable_tests = true,  -- enable test-case compilation
    lc_enable_gui = false    -- disable GUI targets
}
```

Options in `options.lua` can be overridden by command-line arguments:

```bash
xmake f --lc_enable_tests=false -c
```
#### Using LuisaCompute in Your Project

##### Method 1: Git Submodule + includes

```bash
# Clone with submodules
git submodule add https://github.com/LuisaGroup/LuisaCompute.git third_party/LuisaCompute
git submodule update --init --recursive
```

```lua
-- In your xmake.lua: include LuisaCompute's build scripts
includes("third_party/LuisaCompute")

target("your_app")
    set_kind("binary")
    add_deps("lc-dsl")
```
##### Method 2: External Project with xmake.repo

```lua
-- xmake.lua for your project
set_languages("c++20")

-- Require LuisaCompute as a package
add_requires("luisa-compute")

target("your_app")
    set_kind("binary")
    add_files("src/*.cpp")
    add_packages("luisa-compute")
```
### Integration Best Practices

Based on LuisaRender and LuisaComputeGaussianSplatting:

1. **Always use submodules** to ensure consistent versions:

   ```bash
   git submodule add https://github.com/LuisaGroup/LuisaCompute.git third_party/LuisaCompute
   git submodule update --init --recursive
   ```

2. **Set RPATH** for portable binaries:

   ```cmake
   set(CMAKE_BUILD_RPATH_USE_ORIGIN ON)
   set(CMAKE_INSTALL_RPATH "$ORIGIN;$ORIGIN/../lib")
   ```

3. **Handle backends gracefully**:

   ```cmake
   if(LUISA_COMPUTE_ENABLE_CUDA AND CUDA_FOUND)
       target_compile_definitions(your_target PRIVATE ENABLE_CUDA)
   endif()
   ```

4. **Use unity build** for faster compilation:

   ```bash
   xmake f --lc_enable_unity_build=true -c
   ```

5. **Use the `-c` flag** for a clean configuration when switching options:

   ```bash
   xmake f --lc_cuda_backend=false --lc_vk_backend=true -c
   ```
## Performance Considerations

### Kernel Optimization

- **Memory coalescing**: ensure threads access consecutive memory (see the sketch below)
- **Occupancy**: balance register usage and block size
- **Branch divergence**: minimize divergent execution within warps
- **Texture caching**: use images for 2D spatial locality
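A sketch of the coalescing point in the DSL (the kernel is hypothetical): consecutive threads touch consecutive buffer elements, so their loads and stores coalesce into wide memory transactions.

```cpp
Kernel1D scale = [&](BufferFloat buf) noexcept {
    auto i = dispatch_x();              // thread i touches element i, so
    buf->write(i, buf->read(i) * 2.0f); // neighboring threads hit neighboring addresses
};
```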
### Runtime Optimization

- **Command batching**: submit multiple commands at once
- **Resource reuse**: avoid repeated allocations
- **Async transfers**: overlap compute and data transfer (sketched below)
- **Stream parallelism**: use multiple streams for independent work
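A sketch combining the last two points (assumed typical usage; `upload_buffer`, `other_buffer`, and `compute_shader` are hypothetical):

```cpp
auto copy_stream = device.create_stream();
auto compute_stream = device.create_stream();

copy_stream << upload_buffer.copy_from(host_data);          // transfer on one stream...
compute_stream << compute_shader(other_buffer).dispatch(n); // ...compute on another

copy_stream << synchronize();
compute_stream << synchronize();
```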
## Debugging and Profiling

### Validation Layer

Enabled via `LUISA_ENABLE_VALIDATION=1`, the validation layer provides:

- Resource lifetime tracking
- Memory access validation
- Command buffer consistency checks
### Profiling

Backend-specific profiling:

- CUDA: Nsight Systems / Nsight Compute integration
- DirectX: PIX markers
- Metal: Xcode GPU debugger
## Future Directions

Planned architectural improvements:

- **Graph-based execution**: explicit compute graphs for better optimization
- **Multi-device support**: seamless multi-GPU scaling
- **Task graph API**: higher-level task description
- **JIT specialization**: runtime kernel specialization based on parameters