
GPUStack 0.2: Out-of-the-box Distributed Inference, CPU Inference, and Scheduling Strategies

GPUStack is an open-source GPU cluster manager designed for running Large Language Models (LLMs). It can build a unified compute cluster from heterogeneous GPUs of any brand, whether they run in Apple Macs, Windows PCs, or Linux servers. Administrators can easily deploy any LLM from popular model repositories such as Hugging Face, and developers can access the deployed private LLMs through OpenAI-compatible APIs just as easily as they would access public LLM services from vendors such as OpenAI or Microsoft Azure.
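
For example, once a model has been deployed in GPUStack, a developer can call it from any OpenAI-compatible client. Below is a minimal sketch using the official openai Python package; the server address, endpoint path, API key, and model name are placeholders to be replaced with the values from your own deployment.

```python
# Minimal sketch: calling a model served by GPUStack through its
# OpenAI-compatible API. The base URL, API key, and model name are
# placeholders -- substitute the values from your own GPUStack deployment.
from openai import OpenAI

client = OpenAI(
    base_url="http://your-gpustack-server/v1",  # placeholder; see the GPUStack docs for the exact endpoint path
    api_key="your-gpustack-api-key",            # placeholder API key
)

response = client.chat.completions.create(
    model="llama3.1-8b-instruct",  # the name the model was deployed under
    messages=[{"role": "user", "content": "Hello, GPUStack!"}],
)
print(response.choices[0].message.content)
```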

Since GPUStack's release at the end of July, the response from communities at home and abroad has been very enthusiastic, and the R&D team has received a great deal of suggestions and feedback. After weighing community needs against the GPUStack roadmap, we quickly released GPUStack version 0.2. This version adds core features such as single-machine multi-GPU distributed inference, cross-host distributed inference, CPU-only inference, Binpack and Spread placement strategies, scheduling to specified workers, and manually specified GPU scheduling. We have also further expanded support for Nvidia GPUs and made enhancements and fixes in response to community feedback, to better meet the needs of a variety of usage scenarios.

For more information about GPUStack, you can visit:

GitHub repository: /gpustack/gpustack

GPUStack User Documentation.

New Features

Distributed Inference

The key feature of GPUStack version 0.2 is out-of-the-box support for single-machine multi-GPU distributed inference and cross-node distributed inference. It allows administrators to run large models across multiple GPUs on a single machine, or across multiple nodes, without any complex configuration, meeting the need to run large-parameter models that a single GPU cannot support.

Single-Machine Multi-GPU Distributed Inference
In version 0.1, when no GPU in GPUStack could meet a model's resource requirements, GPUStack fell back to a partial-offloading scheme that ran the model with mixed CPU and GPU inference. Because part of the inference relies on the CPU, overall performance suffers, and this approach cannot satisfy the demand for high-performance inference.

To address this, GPUStack version 0.2 introduces single-machine multi-GPU distributed inference. This feature offloads different layers of a model to multiple GPUs on the same machine so that they all take part in inference. In this way, administrators can not only run models with larger parameter counts, but also keep performance and efficiency at a higher level.
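
Conceptually, this layer-level offloading splits a model's transformer layers across the available GPUs, for example in proportion to each GPU's free memory. The following is only a rough sketch of such a split, not GPUStack's actual placement logic.

```python
# Conceptual sketch (not GPUStack's internal implementation): distribute a
# model's layers across several GPUs in proportion to their free memory.
def split_layers(total_layers: int, free_vram_mib: list[int]) -> list[int]:
    total_vram = sum(free_vram_mib)
    # Provisional share per GPU, rounded down.
    shares = [total_layers * free // total_vram for free in free_vram_mib]
    # Hand any leftover layers to the GPUs with the most free memory.
    leftover = total_layers - sum(shares)
    for i in sorted(range(len(shares)), key=lambda i: free_vram_mib[i], reverse=True)[:leftover]:
        shares[i] += 1
    return shares

# e.g. an 80-layer model on two 24 GiB cards and one 12 GiB card
print(split_layers(80, [24576, 24576, 12288]))  # -> [32, 32, 16]
```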

Cross-Node Distributed Inference
To support models with very large parameter counts such as Llama 3.1 405B, Llama 3.1 70B, and Qwen2 72B, GPUStack version 0.2 introduces cross-node distributed inference. When a single worker cannot meet a model's resource requirements, GPUStack can offload the model to multiple workers and run inference across hosts.

In this case, the model's inference performance is limited by the cross-host network bandwidth and may degrade significantly. To ensure better inference performance, it is therefore recommended to combine this feature with high-performance networking such as NVLink / NVSwitch or RDMA. In consumer-grade scenarios, Thunderbolt interconnects can also be considered.

In addition, because the model files of large-parameter models in the Hugging Face model repository are often too large to ship as a single file and are split into shards, version 0.2 adds support for downloading, merging, and running sharded models.
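
As an illustration of what a sharded model repository looks like on the client side (independent of GPUStack's own downloader), the huggingface_hub package can fetch all shards of a repository in one call. The repository name and file pattern below are examples only.

```python
# Illustration only: downloading all GGUF shards of a large model repository
# from Hugging Face. GPUStack performs the download and merge of sharded
# models itself; this snippet just shows the idea.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="Qwen/Qwen2-72B-Instruct-GGUF",  # example repository; replace with your own
    allow_patterns=["*.gguf"],               # fetch only the GGUF shard files
)
print(local_dir)
```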

CPU Inference

GPUStack version 0.2 now supports CPU inference. When GPU resources are unavailable or insufficient, GPUStack can fall back to the CPU, loading the model fully into system memory and running inference on the CPU. This lets administrators run LLMs with smaller parameter counts even in environments without GPUs, further improving GPUStack's applicability in edge and resource-constrained environments.

Scheduling Strategies

Binpack Placement Strategy for Reducing Compute Resource Fragmentation
The Binpack placement strategy is a compact scheduling strategy and was the default for model deployment in GPUStack 0.1. When a single GPU can meet a model's resource requirements, this strategy tries to schedule multiple model instances onto that GPU to maximize its utilization; only when the GPU's remaining resources can no longer support a new instance are other GPUs selected.

The Binpack strategy reduces resource fragmentation and improves overall GPU utilization. Fragmentation refers to small amounts of unused memory left on individual GPUs, too little to host a new model, which wastes compute resources. With the Binpack strategy, models are packed onto as few GPUs as possible, leaving the other GPUs with their full compute resources available for larger models.
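
In scheduling terms, Binpack means: among the GPUs that can still fit the request, pick the one with the least remaining free memory. The following is a minimal conceptual sketch of that selection rule, not GPUStack's actual scheduler code.

```python
# Conceptual sketch of Binpack placement: among GPUs with enough free memory,
# pick the one with the LEAST free memory so models are packed tightly.
def binpack_pick(free_vram_mib: list[int], required_mib: int) -> int | None:
    candidates = [i for i, free in enumerate(free_vram_mib) if free >= required_mib]
    if not candidates:
        return None  # no single GPU fits; other strategies are needed
    return min(candidates, key=lambda i: free_vram_mib[i])

print(binpack_pick([20000, 8000, 16000], required_mib=6000))  # -> 1, the fullest GPU that still fits
```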

Spread Placement Strategy for Improved Compute Load Balancing
While the Binpack strategy reduces compute fragmentation and maximizes the utilization of individual GPUs, in some cases it concentrates the load on a few GPUs and leaves others idle. To address this, GPUStack 0.2 adds support for the Spread placement strategy.

In contrast to Binpack's compact scheduling, the Spread strategy aims to distribute models evenly across multiple GPUs, avoiding over-concentration on a single GPU and keeping the load on each GPU more balanced. This reduces performance bottlenecks caused by resource contention and improves the overall performance and stability of the models.

Under the Spread strategy, models are scheduled preferentially onto GPUs with lower load so that all GPUs can participate in inference. This strategy is especially suitable for high-concurrency or high-performance scenarios: when resources are sufficient, it improves cluster resilience and avoids overloading individual GPUs. GPUStack 0.2 uses the Spread strategy by default, and administrators can choose the strategy that fits their actual needs.
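
The Spread rule is the mirror image of Binpack: among the GPUs that can fit the request, pick the one with the most headroom. Again, this is only a conceptual sketch of the rule, not the real scheduler.

```python
# Conceptual sketch of Spread placement: among GPUs with enough free memory,
# pick the one with the MOST free memory so the load stays balanced.
def spread_pick(free_vram_mib: list[int], required_mib: int) -> int | None:
    candidates = [i for i, free in enumerate(free_vram_mib) if free >= required_mib]
    if not candidates:
        return None
    return max(candidates, key=lambda i: free_vram_mib[i])

print(spread_pick([20000, 8000, 16000], required_mib=6000))  # -> 0, the emptiest GPU
```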

Specifying Worker Scheduling

In GPUStack version 0.2, administrators can set labels for different Workers and specify that model instances be dispatched to Workers with specific labels via the Worker selector at model deployment time. This enables administrators to control model deployment more precisely, optimize resource allocation, and meet specific requirements or policies.

This capability is particularly suitable for scenarios that require fine-grained management of compute resources, such as scheduling models to a particular GPU vendor or a particular GPU model in a heterogeneous environment. Through the label selection mechanism, GPUStack allows for more efficient resource management in complex computing environments, increasing the flexibility and precision of model deployment.
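
A worker selector is essentially a set of key/value labels that a candidate worker must match. The label keys and values below (gpu-vendor, region) are made-up examples rather than names prescribed by GPUStack; the sketch only illustrates the matching rule.

```python
# Conceptual sketch of label-based worker selection. The label keys and values
# are hypothetical examples, not names defined by GPUStack.
def matching_workers(workers: dict[str, dict[str, str]], selector: dict[str, str]) -> list[str]:
    return [
        name for name, labels in workers.items()
        if all(labels.get(key) == value for key, value in selector.items())
    ]

workers = {
    "worker-a": {"gpu-vendor": "nvidia", "region": "lab"},
    "worker-b": {"gpu-vendor": "amd", "region": "lab"},
}
print(matching_workers(workers, {"gpu-vendor": "nvidia"}))  # -> ['worker-a']
```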

Manually Specifying GPU Scheduling

One of GPUStack's core capabilities is automatically calculating a model's resource requirements and scheduling it accordingly, so administrators do not need to work out how to allocate resources or schedule models by hand. GPUStack version 0.2 also supports a variety of scheduling options, such as the Binpack and Spread placement strategies, single-machine multi-GPU distributed inference, cross-host distributed inference, and scheduling to specified workers, which give administrators control over how models are scheduled.

To cover more usage scenarios, GPUStack's scheduling features are continuously being enriched and improved. To meet certain specific scheduling needs, version 0.2 also provides a manual scheduling option: administrators can manually schedule a model to run on a specific GPU for even more precise control over its scheduling behavior.

Controlling Whether CPU Offloading Is Allowed

In version 0.1, when a model could not be fully offloaded to the GPU because of insufficient GPU memory, GPUStack automatically offloaded as many layers as the GPU could hold for acceleration and loaded the remaining layers into system memory for inference on the CPU. This is called CPU offloading, semi-offloading, or partial offloading, i.e., mixed CPU and GPU inference.

This meets the need to run larger-parameter models with limited GPU memory. However, because part of the inference depends on the CPU, overall performance is affected. In scenarios with high performance requirements, administrators could not directly tell whether a model was fully loaded onto the GPU, making it hard to decide whether GPU resources needed to be scaled up to improve performance.

In version 0.2, administrators can choose whether to allow partial offloading. The option is off by default, in which case only pure GPU inference is allowed: if no GPU meets the model's resource requirements, the model is not deployed and stays in a Pending state until a suitable GPU becomes available. Administrators who can accept the performance loss of partial offloading can enable the option to allow mixed CPU and GPU inference.
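
The underlying idea, familiar from llama.cpp-style backends, is a single knob for how many layers run on the GPU, with the remaining layers computed on the CPU. The sketch below uses the llama-cpp-python package purely to illustrate that knob; it is not how GPUStack itself is configured, and the model path is a placeholder.

```python
# Illustration of GPU/CPU layer offloading with llama-cpp-python (shown only
# to explain the concept; GPUStack manages offloading on its own).
from llama_cpp import Llama

# Partial offloading: 20 layers on the GPU, the rest on the CPU.
llm_partial = Llama(model_path="model.gguf", n_gpu_layers=20)  # placeholder model path

# n_gpu_layers=0 keeps everything on the CPU; -1 offloads every layer to the GPU.
llm_cpu_only = Llama(model_path="model.gguf", n_gpu_layers=0)
```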

Other Key Features

Added support for Nvidia GPU models with Compute Capability 6.0, 6.1, 7.0, and 7.5

In version 0.2, GPUStack further expands support for Nvidia GPUs by adding support for Nvidia GPU models with Compute Capability 6.0, 6.1, 7.0, and 7.5, including the NVIDIA T4, V100, Tesla P100, P40, and P4, as well as the GeForce GTX 10 series and RTX 20 series models. This allows GPUStack to cover more data center and consumer scenarios.

Currently, GPUStack supports all Nvidia GPU models with Compute Capability 6.0 ~ 8.9. For the specific supported models, refer to Nvidia's official GPU Compute Capability documentation: /cuda-gpus
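
To check where a local card falls in that range, one GPUStack-independent option is PyTorch's torch.cuda.get_device_capability, for example:

```python
# Quick, GPUStack-independent check of each local Nvidia GPU's compute capability.
import torch

for i in range(torch.cuda.device_count()):
    major, minor = torch.cuda.get_device_capability(i)
    supported = (6, 0) <= (major, minor) <= (8, 9)
    print(f"GPU {i}: {torch.cuda.get_device_name(i)} "
          f"- compute capability {major}.{minor}, within GPUStack's supported range: {supported}")
```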

See the full changelog for other enhancements and fixes: /gpustack/gpustack/releases/tag/0.2.0

Join the Community

To learn more about GPUStack, you can visit the GitHub repository and user documentation linked above.

If you encounter any problems while using GPUStack, or have any suggestions, feel free to join our Discord community [/VXYJzuaqwD], or add the GPUStack WeChat assistant (WeChat ID: GPUStack) to join the GPUStack WeChat exchange group, where you can get technical support from the GPUStack team or chat with community enthusiasts.

We are iterating rapidly on the GPUStack project. Before you start trying out GPUStack, you are very welcome to star (⭐️) our gpustack/gpustack GitHub repository so that you receive instant notifications of future GPUStack releases. You are also more than welcome to contribute to this open source project.