# Project Architecture

This document describes the internal architecture of LuisaCompute, including the compilation pipeline, runtime system, and backend implementations.

## Overview

LuisaCompute is organized into three main layers beneath the application code:

```
┌─────────────────────────────────────────────────────────────┐
│                      Application Layer                      │
│          (User code, kernels, resource management)          │
├─────────────────────────────────────────────────────────────┤
│                       Frontend (DSL)                        │
│      (C++ template metaprogramming, AST construction)       │
├─────────────────────────────────────────────────────────────┤
│                    Middle-end (Runtime)                     │
│  (Resource wrappers, command encoding, device interface)    │
├─────────────────────────────────────────────────────────────┤
│                 Backend (Platform-specific)                 │
│  (CUDA, DirectX, Metal, CPU code generation & execution)    │
└─────────────────────────────────────────────────────────────┘
```

## Frontend: The Embedded DSL

The Domain-Specific Language (DSL) layer allows users to write GPU kernels directly in C++.

### AST Construction

When you write a kernel:

```cpp
Kernel2D my_kernel = [&](ImageFloat img) noexcept {
    Var coord = dispatch_id().xy();
    img->write(coord, make_float4(1.0f));
};
```

the following happens at C++ runtime:

1. **FunctionBuilder Activation**: A `FunctionBuilder` instance is pushed onto a thread-local stack
2. **Argument Creation**: DSL variables (`Var`) are created as AST expression nodes
3. **Operator Overloading**: Each operation (`+`, `*`, function calls) records AST nodes
4. **AST Finalization**: The builder is popped and the function is finalized

### Key Components

#### FunctionBuilder

The `FunctionBuilder` class in `include/luisa/ast/function_builder.h` is the core of AST construction:

```cpp
// Simplified concept
class FunctionBuilder {
    // Records expressions
    const Expression* literal(const Type* type, const void* data);
    const Expression* binary(BinaryOp op, const Expression* lhs, const Expression* rhs);
    const Expression* call(const Function* func, luisa::span<const Expression*> args);

    // Records statements
    void assign(const Expression* lhs, const Expression* rhs);
    void if_(const Expression* cond, const Statement* true_branch, const Statement* false_branch);
    void for_(const Statement* init, const Expression* cond, const Expression* step, const Statement* body);
};
```

#### Var and Expression

`Var<T>` wraps an AST expression pointer:

```cpp
template<typename T>
struct Var : public Ref<T> {
    Var() : Ref<T>{FunctionBuilder::current()->local(Type::of<T>())} {}
    // Operations delegate to FunctionBuilder
};
```

When you write `Float a = b + c`, it roughly translates to:

```cpp
// a = b + c becomes:
auto expr = FunctionBuilder::current()->binary(
    BinaryOp::ADD, b.expression(), c.expression());
FunctionBuilder::current()->assign(a.expression(), expr);
```

### Type System

The type system supports:

- **Scalar types**: `bool`, `int`, `uint`, `float`, `short`, `ushort`, `slong`, `ulong`, `half`
- **Vector types**: `Vector<T, N>` for N = 2, 3, 4
- **Matrix types**: `Matrix<N>` for N = 2, 3, 4
- **Arrays**: `std::array<T, N>`
- **Structures**: User-defined with the `LUISA_STRUCT` macro

Type reflection is implemented in `include/luisa/ast/type.h`:

```cpp
// The type registry ensures unique type instances
const Type* type = Type::of<T>();  // returns a singleton per type
```

## Middle-end: Runtime and Resources

The runtime layer provides a unified interface over different GPU APIs.
### DeviceInterface

`DeviceInterface` (in `include/luisa/runtime/rhi/device_interface.h`) is the abstraction over backends:

```cpp
class DeviceInterface {
public:
    // Resource creation
    virtual ResourceCreationInfo create_buffer(size_t size) = 0;
    virtual ResourceCreationInfo create_image(...) = 0;

    // Shader compilation
    virtual uint64_t create_shader(...) = 0;

    // Command execution
    virtual void dispatch(Stream* stream, const CommandList& commands) = 0;

    // Synchronization
    virtual void synchronize_event(uint64_t event) = 0;
};
```

### Resource Management

Resources follow a handle-based design:

```cpp
// Resource is a lightweight wrapper
template<typename T>
class Resource {
    DeviceInterface* _device;
    uint64_t _handle;

public:
    // RAII: the destructor releases the backend object
    ~Resource() { _device->destroy_resource(_handle); }
};
```

The actual GPU memory is managed by the backend implementation.

### Command System

Commands are encoded as lightweight description objects:

```cpp
// Command hierarchy
class Command {
    virtual void accept(CommandVisitor& visitor) = 0;
};

class ShaderDispatchCommand : public Command {
    uint64_t shader_handle;
    luisa::vector<Argument> arguments;
    uint3 dispatch_size;
};

class BufferUploadCommand : public Command {
    uint64_t buffer_handle;
    void* host_data;
    size_t offset;
    size_t size;
};
```

### Command Scheduling

A key feature is automatic dependency tracking:

```cpp
// Commands are analyzed for resource usage
void CommandReorderVisitor::visit(const ShaderDispatchCommand& cmd) {
    for (auto& arg : cmd.arguments) {
        if (arg.type == Argument::BUFFER) {
            auto usage = cmd.get_resource_usage(arg.buffer);
            // Track read-after-write and write-after-read dependencies
            _dependencies.emplace_back(arg.buffer, usage, cmd);
        }
    }
}
```

The scheduler builds a DAG of commands and can:

- Reorder independent commands for better throughput
- Insert necessary memory barriers
- Batch compatible commands

## Backend: Code Generation and Execution

Each backend translates the AST to platform-specific code.

### CUDA Backend

**Location**: `src/backends/cuda/`

The CUDA backend:

1. Generates PTX or CUDA C++ from the AST
2. Uses NVCC or NVRTC for compilation
3. Manages `CUmodule`, `CUfunction`, and `CUdeviceptr` objects

```cpp
// Simplified code generation
void CUDACodegenAST::visit(const BinaryExpr* expr) {
    emit("(");
    expr->lhs()->accept(*this);
    emit(" %s ", binary_op_name(expr->op()));
    expr->rhs()->accept(*this);
    emit(")");
}
```

### DirectX Backend

**Location**: `src/backends/dx/`

The DirectX backend:

1. Generates HLSL from the AST
2. Uses DXC to compile HLSL to DXIL
3. Manages `ID3D12PipelineState` and `ID3D12Resource` objects

Key features:

- Root signature generation for descriptor binding
- Resource barrier tracking for state transitions

### Metal Backend

**Location**: `src/backends/metal/`

The Metal backend:

1. Generates Metal Shading Language from the AST
2. Uses the Metal compiler framework
3. Manages `MTLLibrary`, `MTLFunction`, and `MTLBuffer` objects

### CPU Backend

**Location**: `src/rust/`

The CPU backend is implemented in Rust and uses:

- LLVM for code generation
- Embree for ray tracing
- Custom threading for parallel execution

## IR (Intermediate Representation)

LuisaCompute v2 introduces a new IR that enables more advanced optimizations.
### AST to IR Conversion

```
AST (FunctionBuilder) -> IR (luisa::ir) -> Backend Code
```

The IR provides:

- SSA (Static Single Assignment) form for easier analysis
- An explicit control-flow graph
- Type-preserving transformations

### IR Passes

Located in `include/luisa/xir/passes/`:

- **DCE (Dead Code Elimination)**: Removes unused computations
- **Mem2Reg**: Promotes stack variables to registers
- **Reg2Mem**: Converts registers back to memory for complex control flow
- **Outline**: Extracts code into separate functions
- **Autodiff**: Automatic differentiation transformation

```cpp
// Example: Mem2Reg pass
void mem2reg_pass(Function* func) {
    // Analyze alloca sites
    // Promote to registers where possible
    // Insert phi nodes at merge points
}
```

## Shader Compilation Pipeline

The full compilation flow:

```
1. User defines a Kernel/Callable in C++
        ↓
2. The lambda executes, recording the AST via FunctionBuilder
        ↓
3. The AST is finalized into a Function object
        ↓
4. device.compile(kernel) is called
        ↓
5. The backend converts the AST to platform source (PTX, HLSL, MSL, LLVM IR)
        ↓
6. The platform compiler generates machine code
        ↓
7. The backend creates a Shader object with the compiled binary
        ↓
8. The shader is ready for dispatch
```

### Shader Caching

Compiled shaders are cached to disk:

```
/bin/.cache/
├── cuda/
│   └── <hash>.ptx
├── dx/
│   └── <hash>.dxil
└── metal/
    └── <hash>.metallib
```

The cache key includes:

- The AST hash
- The backend version
- Compilation options (fast math, debug info, etc.)
## Memory Management

### GPU Memory

Backends use different allocation strategies:

| Backend | Allocator | Strategy |
|---------|-----------|----------|
| CUDA | cuMemAlloc/cuMemPool | Memory pools for efficiency |
| DirectX | D3D12MA | TLSF-based custom allocator |
| Metal | MTLHeap | Heap-based sub-allocation |
| CPU | mimalloc | Per-thread heap allocation |

### Host Memory

Pinned memory is used for efficient transfers:

```cpp
// Pinned (page-locked) memory for fast GPU upload
void* pinned = allocate_pinned(size);
cudaMemcpyAsync(gpu_ptr, pinned, size, cudaMemcpyHostToDevice, stream);
```

## Threading Model

### Host-Side

- **Stream execution**: Commands are queued and executed asynchronously
- **Callback system**: Host functions can be scheduled to run after GPU work
- **Fiber support**: Integration with marl for coroutine-style programming

### Device-Side

Kernels execute in a 3D grid:

```
Grid (dispatch_size)
├── Block (block_size)
│   ├── Warp (SIMD width, e.g., 32 threads)
│   │   └── Threads execute in lockstep
│   └── Multiple warps per block
└── Multiple blocks per grid
```

## Extension System

Backends can expose platform-specific features:

```cpp
// Extension interface
struct DenoiserExt : DeviceExtension {
    static constexpr string_view name = "DenoiserExt";
    virtual void denoise(...) = 0;
};

// Backend implementation
class CUDADenoiser : public DenoiserExt {
    void denoise(...) override;
};

// Usage
if (auto* denoiser = device.extension<DenoiserExt>()) {
    denoiser->denoise(...);
}
```

## Build System Integration

### CMake Integration

LuisaCompute uses modern CMake (3.23+) with the following configuration options:

#### Main Build Options

| Option | Default | Description |
|--------|---------|-------------|
| `LUISA_COMPUTE_ENABLE_CUDA` | ON | Enable CUDA backend |
| `LUISA_COMPUTE_ENABLE_DX` | ON | Enable DirectX backend |
| `LUISA_COMPUTE_ENABLE_METAL` | ON | Enable Metal backend |
| `LUISA_COMPUTE_ENABLE_CPU` | ON | Enable CPU backend |
| `LUISA_COMPUTE_ENABLE_VULKAN` | ON | Enable Vulkan backend |
| `LUISA_COMPUTE_ENABLE_HIP` | OFF | Enable HIP backend (WIP) |
| `LUISA_COMPUTE_ENABLE_DSL` | ON | Enable C++ DSL |
| `LUISA_COMPUTE_ENABLE_GUI` | ON | Enable GUI support |
| `LUISA_COMPUTE_ENABLE_TENSOR` | OFF | Enable tensor extension |
| `LUISA_COMPUTE_BUILD_TESTS` | ON | Build test suite |

#### Build Commands

```bash
# Basic build
cmake -S . -B build -DCMAKE_BUILD_TYPE=Release
cmake --build build

# With specific backends
cmake -S . -B build \
    -DLUISA_COMPUTE_ENABLE_CUDA=ON \
    -DLUISA_COMPUTE_ENABLE_DX=OFF \
    -DLUISA_COMPUTE_ENABLE_METAL=OFF \
    -DCMAKE_BUILD_TYPE=Release

# Using Ninja (recommended)
cmake -S . -B build -G Ninja -DCMAKE_BUILD_TYPE=Release
cmake --build build
```

#### Using LuisaCompute in Your Project

**Method 1: add_subdirectory**

```bash
# Clone with submodules
git clone --recursive https://github.com/LuisaGroup/LuisaCompute.git

# In your CMakeLists.txt, add:
#   add_subdirectory(LuisaCompute)
#   target_link_libraries(your_target PRIVATE luisa::compute)
```

**Method 2: FetchContent**

```cmake
include(FetchContent)
FetchContent_Declare(
    luisacompute
    GIT_REPOSITORY https://github.com/LuisaGroup/LuisaCompute.git
    GIT_TAG next
    GIT_SUBMODULES_RECURSE TRUE
)
FetchContent_MakeAvailable(luisacompute)
target_link_libraries(your_target PRIVATE luisa::compute)
```

**Method 3: find_package (after installation)**

```bash
# Install LuisaCompute first
cmake --build build --target install

# Then, in your CMakeLists.txt:
#   find_package(LuisaCompute REQUIRED)
#   target_link_libraries(your_target PRIVATE LuisaCompute::compute)
```

### XMake Integration

LuisaCompute uses XMake (3.0.6+) as an alternative build system with a more streamlined workflow.
#### Main Build Options

| Option | Default | Description |
|--------|---------|-------------|
| **Backend Options** |||
| `lc_cuda_backend` | true | Enable NVIDIA CUDA backend |
| `lc_dx_backend` | true | Enable DirectX 12 backend |
| `lc_vk_backend` | true | Enable Vulkan backend |
| `lc_metal_backend` | true | Enable Metal backend |
| `lc_fallback_backend` | false | Enable fallback backend |
| `lc_toy_c_backend` | false | Enable toy C backend (experimental) |
| **Backend Extensions** |||
| `lc_cuda_ext_lcub` | false | Enable NVIDIA CUB extension (long compile time) |
| `lc_dx_cuda_interop` | false | Enable DirectX-CUDA interop |
| `lc_vk_cuda_interop` | false | Enable Vulkan-CUDA interop |
| **Module Options** |||
| `lc_enable_dsl` | true | Enable C++ DSL module |
| `lc_enable_gui` | true | Enable GUI module |
| `lc_enable_imgui` | true | Enable ImGui support |
| `lc_enable_osl` | true | Enable OSL (Open Shading Language) support |
| `lc_enable_py` | true | Enable Python bindings |
| `lc_enable_clangcxx` | false | Enable Clang C++ module |
| `lc_enable_xir` | false | Enable XIR (experimental IR) |
| **Build Configuration** |||
| `lc_enable_mimalloc` | true | Use mimalloc as the default allocator |
| `lc_enable_custom_malloc` | false | Enable custom malloc |
| `lc_enable_unity_build` | true | Enable unity (jumbo) build for faster compilation |
| `lc_enable_simd` | true | Enable SSE and SSE2 SIMD |
| `lc_use_lto` | false | Enable Link-Time Optimization |
| `lc_rtti` | false | Enable C++ RTTI |
| `lc_cxx_standard` | cxx20 | C++ standard (cxx20, cxx23, etc.) |
| `lc_c_standard` | clatest | C standard |
| `lc_enable_tests` | true | Enable test suite |
| `lc_external_marl` | false | Use external marl library |
| `lc_use_system_stl` | false | Use system STL instead of EASTL |
| **Python Configuration** |||
| `lc_py_include` | false | Python include path |
| `lc_py_linkdir` | false | Python library directory |
| `lc_py_libs` | false | Python libraries to link |
| **Path Configuration** |||
| `lc_bin_dir` | bin | Custom binary output directory |
| `lc_sdk_dir` | false | Custom SDK directory |
| `lc_llvm_path` | false | LLVM installation path (for the CPU backend) |
| `lc_embree_path` | false | Embree path (for CPU ray tracing) |
| `lc_toolchain` | false | Custom toolchain |
| `lc_win_runtime` | false | Windows runtime library |
| `lc_optimize` | false | Additional optimization flags |
| **Third-Party Sources** |||
| `lc_spdlog_use_xrepo` | false | Use xrepo for spdlog |
| `lc_reproc_use_xrepo` | false | Use xrepo for reproc |
| `lc_lmdb_use_xrepo` | false | Use xrepo for lmdb |
| `lc_imgui_use_xrepo` | false | Use xrepo for imgui |
| `lc_glfw_use_xrepo` | false | Use xrepo for glfw |
| `lc_yyjson_use_xrepo` | false | Use xrepo for yyjson |

#### Build Commands

```bash
# Basic (Release) build
xmake f -m release -c
xmake

# With specific backends
xmake f -m release --lc_cuda_backend=true --lc_dx_backend=false --lc_metal_backend=false -c
xmake

# Debug build
xmake f -m debug -c
xmake

# Using the ClangCL toolchain (recommended on Windows)
xmake f -m release --toolchain=clang-cl -c
xmake
```

#### Local Configuration with options.lua

You can create `scripts/options.lua` to save a default configuration for your local environment:

```bash
# Generate a default options.lua
xmake lua scripts/write_options.lua
```

Example `scripts/options.lua`:

```lua
lc_options = {
    toolchain = "clang-cl",  -- use the LLVM clang-cl compiler
    lc_enable_tests = true,  -- enable test-case compilation
    lc_enable_gui = false    -- disable GUI targets
}
```

Options in `options.lua` can be overridden by command-line arguments:

```bash
xmake f --lc_enable_tests=false -c
```

#### Using LuisaCompute in Your Project

**Method 1: Git Submodule + includes**

```bash
# Clone with submodules
git submodule add https://github.com/LuisaGroup/LuisaCompute.git third_party/LuisaCompute
git submodule update --init --recursive
```

```lua
-- Include LuisaCompute's build scripts
includes("third_party/LuisaCompute")

target("your_app")
    set_kind("binary")
    add_deps("lc-dsl")
```

**Method 2: External Project with xmake.repo**

```lua
-- xmake.lua for your project
set_languages("c++20")

-- Require LuisaCompute as a package
add_requires("luisa-compute")

target("your_app")
    set_kind("binary")
    add_files("src/*.cpp")
    add_packages("luisa-compute")
```

### Integration Best Practices

Based on [LuisaRender](https://github.com/LuisaGroup/LuisaRender) and [LuisaComputeGaussianSplatting](https://github.com/LuisaGroup/LuisaComputeGaussianSplatting):

1. **Always use submodules** to ensure consistent versions:

   ```bash
   git submodule add https://github.com/LuisaGroup/LuisaCompute.git third_party/LuisaCompute
   git submodule update --init --recursive
   ```

2. **Set RPATH for portable binaries**:

   ```cmake
   set(CMAKE_BUILD_RPATH_USE_ORIGIN ON)
   set(CMAKE_INSTALL_RPATH "$ORIGIN;$ORIGIN/../lib")
   ```

3. **Handle backends gracefully**:

   ```cmake
   if(LUISA_COMPUTE_ENABLE_CUDA AND CUDA_FOUND)
       target_compile_definitions(your_target PRIVATE ENABLE_CUDA)
   endif()
   ```

4. **Use unity build for faster compilation**:

   ```bash
   xmake f --lc_enable_unity_build=true -c
   ```

5. **Use the `-c` flag for a clean configuration when switching options**:

   ```bash
   xmake f --lc_cuda_backend=false --lc_vk_backend=true -c
   ```

## Performance Considerations

### Kernel Optimization

1. **Memory coalescing**: Ensure threads access consecutive memory
2. **Occupancy**: Balance register usage and block size
3. **Branch divergence**: Minimize divergent execution within warps
4. **Texture caching**: Use images for 2D spatial locality

### Runtime Optimization

1. **Command batching**: Submit multiple commands at once
2. **Resource reuse**: Avoid repeated allocations
3. **Async transfers**: Overlap compute and data transfer
4. **Stream parallelism**: Use multiple streams for independent work

## Debugging and Profiling

### Validation Layer

Enabled via `LUISA_ENABLE_VALIDATION=1`:

- Resource lifetime tracking
- Memory access validation
- Command buffer consistency checks

### Profiling

Backend-specific profiling support:

- **CUDA**: Nsight Systems/Compute integration
- **DirectX**: PIX markers
- **Metal**: Xcode GPU debugger

## Future Directions

Planned architectural improvements:

1. **Graph-based execution**: Explicit compute graphs for better optimization
2. **Multi-device support**: Seamless multi-GPU scaling
3. **Task graph API**: Higher-level task description
4. **JIT specialization**: Runtime kernel specialization based on parameters