
Playing with Phi-3 SLM with C# and ONNX

Published: 2024-08-04 17:33:21

After LLMs swept the world and reshaped everyone's understanding of AI, SLMs gradually gained attention, because the hardware requirements of LLMs are too high for them to run on ordinary devices. Phi-3 is an SLM developed by Microsoft that can run on your computer, phone, and other devices, and the combination of SLM and ONNX is a game changer for AI interoperability. This article shows how to leverage the power of the Phi-3 model in a .NET application using C# and ONNX; Microsoft maintains the Phi-3 Cookbook repository on GitHub.

Introduction to Phi-3 SLM
The Phi family is Microsoft's line of Small Language Models (SLMs), and Phi-3 is the latest version. It emphasizes language understanding, reasoning, mathematics, and programming capabilities while balancing performance and power consumption, giving us the opportunity to put language models directly onto everyday devices.

The Phi-3 models (including Phi-3-mini, Phi-3-small, and Phi-3-medium) are also optimized for the ONNX Runtime, not only for Windows, but also cross-platform, and are even optimized for NVIDIA GPUs, making the model more flexible and portable.

Introduction to ONNX Runtime

ONNX, or Open Neural Network Exchange, is an open standard for representing machine learning models and interoperating between different frameworks, allowing AI models to be portable across frameworks and hardware. It enables developers to use the same model with a wide range of tools, runtimes, and compilers, making it a cornerstone of AI development. ONNX supports a wide range of operators and provides scalability, which is critical for evolving AI requirements.

ONNX Runtime is a cross-platform machine learning model accelerator with a flexible interface for integrating hardware-specific libraries. ONNX Runtime works with models from PyTorch, TensorFlow/Keras, TFLite, scikit-learn, and other frameworks; for more information see the ONNX Runtime documentation.

Local AI development benefits greatly from ONNX because it simplifies model deployment and enhances performance. ONNX provides a common format for machine learning models, facilitates communication between different frameworks, and is optimized for a variety of hardware environments.

For C# developers this is especially useful, because there is a set of libraries created specifically for handling ONNX models. For example, the Generative AI library for the ONNX Runtime ships as the following four packages, each with its own purpose:

  1. Microsoft.ML.OnnxRuntimeGenAI:
    • This is the generic ONNX Runtime GenAI package, containing the core functionality needed to run ONNX models.
    • Supports CPU execution and can be extended to support other hardware acceleration (e.g., GPUs).
  2. Microsoft.ML.OnnxRuntimeGenAI.Managed:
    • This is the fully managed version for pure .NET environments.
    • Does not rely on native libraries, ensuring cross-platform consistency; suitable for scenarios that do not require specific hardware acceleration.
  3. Microsoft.ML.OnnxRuntimeGenAI.Cuda:
    • This version is specifically designed for hardware acceleration using NVIDIA CUDA GPUs.
    • Ideal for deep learning models that require high-performance computing, with significant performance gains on NVIDIA GPUs.
  4. Microsoft.ML.OnnxRuntimeGenAI.DirectML:
    • This version uses Microsoft's DirectML API and is designed for the Windows platform.
    • Supports a wide range of hardware acceleration devices, including NVIDIA and AMD GPUs, for high-performance computing needs in Windows environments.

The main difference between these packages is that they are optimized for different hardware acceleration needs and environments, and which package you choose depends on your application scenario and hardware setup. In general, use the Managed version for pure .NET environments, or choose the CUDA or DirectML version if you have a GPU and need to use GPU acceleration.

Download LLM models from HuggingFace

Currently Phi-3 is available for download as follows:

  • Phi-3 Mini
    • Phi-3-mini-4k-instruct-onnx (cpu, cuda, directml)
    • Phi-3-mini-4k-instruct-onnx-web (web)
    • Phi-3-mini-128k-instruct-onnx (cpu, cuda, directml)
  • Phi-3 Small
    • Phi-3-small-8k-instruct-onnx (cuda)
    • Phi-3-small-128k-instruct-onnx (cuda)
  • Phi-3 Medium
    • Phi-3-medium-4k-instruct-onnx (cpu, cuda, directml)
    • Phi-3-medium-128k-instruct-onnx (cpu, cuda, directml)

The model names above are labeled 4k, 8k, and 128k, which indicates the context length in tokens: a 4k model takes fewer resources to run, while a 128k model supports a much larger context. We can think of HuggingFace as a kind of GitHub for models, and we can download models from HuggingFace with git commands. Before downloading, make sure you have git-lfs installed; you can enable it with the command git lfs install.

Suppose the model we want to download is microsoft/Phi-3-mini-4k-instruct-onnx; the download command is as follows:

git clone https://huggingface.co/microsoft/Phi-3-mini-4k-instruct-onnx


Sample Console Application Using ONNX Models

The main steps for using the model with ONNX in a C# application are:

  • The Phi-3 model stored in modelPath is loaded into a Model object.
  • A Tokenizer is then created from the model; it is responsible for converting our text input into a format the model can understand.
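The two steps above can be sketched as follows, using the Microsoft.ML.OnnxRuntimeGenAI package; the model path is a placeholder you would adjust to wherever you downloaded the model:

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// placeholder path: point this at your local ONNX Phi-3 model folder
var modelPath = @"C:\models\Phi-3-mini-4k-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32";

// load the Phi-3 model and create the tokenizer that turns text into tokens
var model = new Model(modelPath);
var tokenizer = new Tokenizer(model);
```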

For example, here is the chatbot implementation from /src/LabsPhi301/:

  • The chatbot runs in a continuous loop, waiting for user input.
  • When the user types in a question, that question is combined with the system prompt to form a complete prompt.
  • The complete prompt is then tokenized and passed to the Generator object.
  • The generator is configured with specific parameters to generate responses one token at a time.
  • Each token is decoded back into text and printed to the console to form the chatbot's response.
  • The loop will continue until the user decides to exit by entering an empty string.

The code is divided into three main sections:

  • Load Model: load the model with the Model class and convert text into tokens with the Tokenizer class.
  • Set Prompt: set the System Prompt and the User Prompt and combine them into a complete prompt.
  • One Question, One Answer: generate a response through the Generator class and decode the generated tokens back into text for output.

The generation parameters are set through the GeneratorParams class. Here we only set two parameters, max_length and temperature: max_length is the maximum length of the generated response, and temperature controls the diversity of the generated response. The ONNX Runtime provides quite a few other settings; for the full list of parameters please refer to the official documentation.
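Putting the three sections together, a minimal chatbot loop might look like the sketch below, based on the Microsoft.ML.OnnxRuntimeGenAI API; the model path, system prompt, and the max_length/temperature values are assumptions to adjust for your own setup:

```csharp
using Microsoft.ML.OnnxRuntimeGenAI;

// Load Model: placeholder path to a local ONNX Phi-3 model folder
var modelPath = @"C:\models\Phi-3-mini-4k-instruct-onnx\cpu_and_mobile\cpu-int4-rtn-block-32";
var model = new Model(modelPath);
var tokenizer = new Tokenizer(model);

var systemPrompt = "You are an AI assistant. Answer questions in a direct style.";

// chat loop: an empty input exits
while (true)
{
    Console.Write("Q: ");
    var userQ = Console.ReadLine();
    if (string.IsNullOrEmpty(userQ))
        break;

    // Set Prompt: combine the system prompt and the user question using Phi-3's chat template
    var fullPrompt = $"<|system|>{systemPrompt}<|end|><|user|>{userQ}<|end|><|assistant|>";
    var tokens = tokenizer.Encode(fullPrompt);

    var generatorParams = new GeneratorParams(model);
    generatorParams.SetSearchOption("max_length", 2048);  // cap on total token length
    generatorParams.SetSearchOption("temperature", 0.7);  // diversity of the response
    generatorParams.SetInputSequences(tokens);

    // One Question, One Answer: generate, decode, and print one token at a time
    using var generator = new Generator(model, generatorParams);
    Console.Write("Phi-3: ");
    while (!generator.IsDone())
    {
        generator.ComputeLogits();
        generator.GenerateNextToken();
        var outputTokens = generator.GetSequence(0);
        var newToken = outputTokens.Slice(outputTokens.Length - 1, 1);
        Console.Write(tokenizer.Decode(newToken));
    }
    Console.WriteLine();
}
```

Printing each newly generated token as it arrives gives the familiar streaming effect instead of making the user wait for the full response.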


C#, ONNX, Phi-3, and Phi-3 Vision

The Phi-3 Cookbook repository shows how these powerful models can be used to perform tasks such as question answering and image analysis in a .NET environment. It includes labs and sample projects demonstrating the use of the Phi-3-mini and Phi-3-Vision models in .NET.

Project descriptions:

  • LabsPhi301: a sample project that asks questions of a local Phi-3 model. The project uses a library to load the local ONNX Phi-3 model.
  • LabsPhi302: a sample project that implements a console chat using Semantic Kernel.
  • LabsPhi303: a sample project that analyzes an image using a local Phi-3 Vision model. The project uses a library to load the local ONNX Phi-3 Vision model.
  • LabsPhi304: a sample project that analyzes an image using a local Phi-3 Vision model. The project uses a library to load the local ONNX Phi-3 Vision model, and also provides a menu with different options for interacting with the user.


Overall, the model answers questions correctly in English, but switching to Chinese produces many strange results. This is to be expected: the SLM was trained primarily on English material, so its Chinese handling is relatively weak. Perhaps fine-tuning could improve its Chinese ability; that is worth studying further.

To learn more about .NET 8 and AI, get started with the new quickstart tutorials and the other AI samples they include.