A Practical Guide to Deploying Large Language Models: Ollama Simplifies the Process, OpenLLM Deploys Flexibly, LocalAI Optimizes Locally, and Dify Empowers Application Development
1. Deploying a local model with Ollama
Ollama is an open-source framework designed to make it easy to deploy and run Large Language Models (LLMs) on a local machine. The official website is https://ollama.com.
The following is an overview of its main features and functions:
- Simplified Deployment: Ollama aims to simplify the process of deploying large language models in Docker containers, making it easy for non-expert users to manage and run these complex models.
- Lightweight and Scalable: As a lightweight framework, Ollama maintains a small resource footprint while being scalable, allowing users to adjust configurations as needed to accommodate projects of different sizes and hardware conditions.
- API Support: Provides a clean API that makes it easy for developers to create, run, and manage large language model instances, lowering the technical barrier to interacting with models (see the example after this list).
- Pre-built model library: Includes a library of pre-trained large language models that users can use directly in their own applications, with no need to train from scratch or hunt for model sources.
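As a quick illustration of the API mentioned above, the REST endpoint can be called with curl once a model has been pulled (a minimal sketch, assuming Ollama runs locally on the default port 11434 and llama3.1 is available):
# Ask the local Ollama REST API for a single, non-streamed completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'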
1.1 One-Click Installation
The one-click install script may fail with a certificate error such as:
curl: (77) error setting certificate verify locations: CAfile: /data/usr/local/anaconda/ssl/: none
Reason: the CAfile path is wrong, i.e. no certificate file can be found at that path.
- Solution:
- Locate your CA certificate file. If you do not have one, download a CA bundle (for example cacert.pem from the curl website) and save it to a directory of your choice.
- Set the environment variable:
export CURL_CA_BUNDLE=/path/to/cacert.pem
# Replace "/path/to/cacert.pem" with the actual path to your certificate file, e.g.:
export CURL_CA_BUNDLE=/www/anaconda3/anaconda3/ssl/
- Run the install script:
curl -fsSL https://ollama.com/install.sh | sh
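Once the script finishes, a quick sanity check confirms that the binary is on the PATH and that the server responds (assuming the default port 11434):
# Print the installed Ollama version
ollama --version
# The REST API should answer on the default port
curl http://localhost:11434/api/version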
1.2 Manual installation
ollama Chinese:/getting-started/linux/
- Download the Ollama binary: Ollama is distributed as a self-contained binary. Download it to a directory in your PATH:
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
- Add Ollama as a startup service (recommended): Create a user for Ollama:
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
- Create a service file at /etc/systemd/system/ollama.service:
# vim /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
- Then start the service:
sudo systemctl daemon-reload
sudo systemctl enable ollama
- Launch Ollama
Use systemd to start Ollama:
sudo systemctl start ollama
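To confirm that the unit started correctly and is enabled at boot (standard systemd checks; the output will vary by system):
# Check the unit state and whether it starts on boot
sudo systemctl status ollama --no-pager
systemctl is-enabled ollama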
- Update Ollama and view logs
# To update, download the binary again:
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
# To view the journal of Ollama running as a startup service, run:
journalctl -u ollama
- Shut down the Ollama service
# Shut down the ollama service
service ollama stop
1.3 Offline installation of Ollama on an intranet Linux server
- Check the server's CPU architecture
## Command to view the CPU architecture of a Linux system; this server is x86_64
lscpu
- Download the Ollama package that matches the CPU architecture and copy it to the offline server
Download address: https://github.com/ollama/ollama/releases/
# x86_64 CPU: download ollama-linux-amd64
# aarch64 / arm64 CPU: download ollama-linux-arm64
# Download it on an internet-connected machine in the same way:
wget https://ollama.com/download/ollama-linux-amd64
Copy the file to the offline server as /usr/bin/ollama (i.e. rename the downloaded ollama-linux-amd64 with mv); the remaining steps are the same as in the manual installation above (see the consolidated sketch below).
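Putting the offline steps together, a minimal sketch of the whole offline installation could look like this (it simply replays the manual installation above with the locally copied binary; adjust paths to your environment):
# On an internet-connected machine: fetch the binary that matches the CPU
wget https://ollama.com/download/ollama-linux-amd64

# Copy ollama-linux-amd64 to the offline server (scp, USB drive, etc.), then:
sudo mv ollama-linux-amd64 /usr/bin/ollama
sudo chmod +x /usr/bin/ollama

# Create the ollama user and the /etc/systemd/system/ollama.service unit
# as in the manual installation, then enable and start the service:
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama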
1.4 Modifying storage paths
Ollama models are stored by default:
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\<username>\.ollama\models
If Ollama is running as a systemd service, the environment variables should be set via systemctl:
- Edit the systemd service by calling systemctl edit ollama.service. This opens an editor.
- For each environment variable, add an Environment line under the [Service] section.
Alternatively, add the two lines below directly to /etc/systemd/system/ollama.service:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:7861"
Environment="OLLAMA_MODELS=/www/algorithm/LLM_model/models"
- Save and exit.
- Reload systemd and restart Ollama:
systemctl daemon-reload
systemctl restart ollama
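To verify that the new variables were actually picked up by systemd (a quick check using standard systemctl options; the paths are the ones set above):
# Show the environment that systemd passes to the ollama service
systemctl show ollama --property=Environment
# Models pulled from now on should appear under the new OLLAMA_MODELS path
ls /www/algorithm/LLM_model/models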
Reference link: https://github.com/ollama/ollama/blob/main/docs/
- Use systemd to start Ollama:
sudo systemctl start ollama
- Terminate the service
Terminating the service frees the GPU memory held by the model Ollama has loaded. While stopped, Ollama is unreachable, so deploy and run operations fail with:
Error: could not connect to ollama app, is it running?
The service has to be started again before models can be deployed and run.
systemctl stop ollama
- Start again after termination (once started, ollama can deploy and run large models again)
systemctl start ollama
1.5 Starting an LLM
- Download model
ollama pull llama3.1
ollama pull qwen2
- Running the big model
ollama run llama3.1
ollama run qwen2
- To check whether the large models are recognized, run ollama list; if successful, the models are listed:
ollama list
NAME            ID              SIZE    MODIFIED
qwen2:latest    e0d4e1163c58    4.4 GB  3 hours ago
- Use the ollama ps command to view the models currently loaded into memory:
ollama ps
NAME            ID              SIZE    PROCESSOR    UNTIL
qwen2:latest    e0d4e1163c58    5.7 GB  100% GPU     3 minutes from now
- Check GPU usage with nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:00:08.0 Off | 0 |
| N/A 35C P0 56W / 300W | 5404MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3062036 C ...unners/cuda_v11/ollama_llama_server 5402MiB |
+---------------------------------------------------------------------------------------+
- Once started, we can verify that it is available:
curl http://10.80.2.195:7861/api/chat -d '{
"model": "llama3.1",
"messages": [
{ "role": "user", "content": "why is the sky blue?" }
]
}'
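The same endpoint also accepts "stream": false if you prefer one complete JSON response instead of a stream of chunks (per the Ollama API; the host and port follow the OLLAMA_HOST configured above):
# Non-streaming variant of the chat request
curl http://10.80.2.195:7861/api/chat -d '{
  "model": "llama3.1",
  "stream": false,
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'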
1.6 More configuration options
Environment variables that can be set for Ollama:
- OLLAMA_HOST: defines which network interfaces Ollama listens on. Setting OLLAMA_HOST=0.0.0.0 lets Ollama listen on all available interfaces, allowing access from external networks.
- OLLAMA_MODELS: specifies the storage path for model files. Setting OLLAMA_MODELS=F:\OllamaCache, for example, stores models on another drive to avoid running out of space on the system drive.
- OLLAMA_KEEP_ALIVE: controls how long a model stays loaded in memory. Setting OLLAMA_KEEP_ALIVE=24h keeps the model in memory for 24 hours, improving response speed.
- OLLAMA_PORT: changes the default port. For example, OLLAMA_PORT=8080 would move the service from the default 11434 to 8080.
- OLLAMA_NUM_PARALLEL: determines how many user requests Ollama handles concurrently. Setting OLLAMA_NUM_PARALLEL=4 allows Ollama to handle four concurrent requests.
- OLLAMA_MAX_LOADED_MODELS: limits the number of models Ollama keeps loaded at the same time. Setting OLLAMA_MAX_LOADED_MODELS=4 helps keep system resource allocation under control.
In practice, Environment="OLLAMA_PORT=9380" did not work; specify the host and port together instead:
Environment="OLLAMA_HOST=0.0.0.0:7861"
A consolidated example of setting several of these variables at once is shown below.
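Several of these variables can be set together with a systemd drop-in, which is equivalent to what systemctl edit ollama.service produces (the values below are illustrative; adjust ports, paths, and counts to your hardware):
# Create a drop-in override for the ollama unit with illustrative values
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:7861"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
EOF
# Apply the changes
sudo systemctl daemon-reload
sudo systemctl restart ollama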
- Specify GPUs
If the machine has multiple GPUs, how do you run Ollama on specific ones? On Linux, set the environment variable CUDA_VISIBLE_DEVICES in the service file to select the GPUs, then restart the Ollama service. (Tested: the device index starts from 0, not 1.)
vim /etc/systemd/system/ollama.service
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
1.7 Ollama Common Commands
- Restart ollama
systemctl daemon-reload
systemctl restart ollama
- Restart the Ollama service manually (Ubuntu/Debian)
sudo apt update
sudo apt install lsof
# stop ollama: find the process listening on port 11434 and kill it
lsof -i :11434
kill <PID>
# start the server again
ollama serve
- Confirm the service port status:
netstat -tulpn | grep 11434
- Configure the service
The HOST needs to be configured so that the service is accessible from outside.
Open the configuration file:
vim /etc/systemd/system/ollama.service
Modify the Environment variable as appropriate:
server environment:
Environment="OLLAMA_HOST=0.0.0.0:11434"
virtual machine environment:
Environment="OLLAMA_HOST=<server intranet IP address>:11434"
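After restarting with the new OLLAMA_HOST, a quick connectivity check from another machine on the network confirms that the service is reachable (replace the IP with your server's address; /api/tags simply lists the locally available models):
# From a different machine: list the models exposed by the remote Ollama
curl http://<server-intranet-ip>:11434/api/tags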
1.8 Uninstalling Ollama
If you decide that you no longer want to use Ollama, you can remove it completely from your system by following these steps:
(1) Stop and disable the service:
sudo systemctl stop ollama
sudo systemctl disable ollama
(2) Delete the service file and the Ollama binary:
sudo rm /etc/systemd/system/ollama.service
sudo rm $(which ollama)
(3) Clean up Ollama users and groups:
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama
By following these steps, you will not only be able to successfully install and configure Ollama on the Linux platform, but also have the flexibility to update and uninstall it.
2. OpenLLM deployment
OpenLLM, open-sourced in June 2023, is a framework for deploying large language models. The project currently has 9.6K stars on GitHub. Its original pitch was convenience for individual users: switching between different large language models with a single line of code, or close to it. OpenLLM is an open platform for operating large language models (LLMs) in production, making it easy to fine-tune, serve, deploy, and monitor any LLM.
- Installation
pip install openllm # or pip3 install openllm
openllm hello
- Support Models
- Llama-3.1
- Llama-3
- Phi-3
- Mistral
- Gemma-2
- Qwen-2
- Gemma
- Llama-2
- Mixtral
- Fill in Settings > Model Provider > OpenLLM in Dify:
  - Model Name: the name of the model you are serving
  - Server URL: http://<Machine_IP>:3333 (replace <Machine_IP> with the IP address of your machine)
- Click "Save" to use the model in your application. (A quick connectivity check is sketched below.)
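Assuming a recent OpenLLM version, which exposes OpenAI-compatible routes, a quick connectivity check against the server URL configured above might look like this (the /v1/models path is an assumption based on that compatibility, not taken from the original text):
# Hypothetical check: list the models served by OpenLLM via its OpenAI-compatible API
curl http://<Machine_IP>:3333/v1/models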
- OpenLLM provides a built-in Python client that allows you to interact with the model. In a separate terminal window or Jupyter notebook, create a client (this is the client API of earlier OpenLLM releases) and start interacting with the model:
import openllm
client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Explain to me the difference between "further" and "farther"')
- You can query a model from a terminal using the openllm query command:
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'
Use the openllm models command to view a list of models and their variants supported by OpenLLM.
3. LocalAI deployment
LocalAI is a local inference framework that provides a RESTful API compatible with the OpenAI API specification. It lets you run LLMs (and other models) locally or on your own servers using consumer-grade hardware, and it supports multiple model families in the ggml format. No GPU is required. Dify supports accessing the large language model inference and embedding capabilities of a locally deployed LocalAI instance.
- GitHub: https://github.com/mudler/LocalAI/tree/master
- Official Handbook:/docs/getting-started/models/
- Official Rapid Deployment Manual Case:/docs/getting-started/models/
- First clone the LocalAI code repository and change into the example directory
git clone https://github.com/go-skynet/LocalAI
cd LocalAI/examples/langchain-chroma
- Download the demo LLM and Embedding models (for reference)
wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert
wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
- Reference Article:Say goodbye to the Hugging Face model download problem
- Configure the .env file
mv .env.example .env
NOTE: Make sure that the value of the THREADS variable in .env does not exceed the number of CPU cores on your machine.
- Start LocalAI
# start with docker-compose
docker-compose up -d --build
#tail the logs & wait until the build completes
docker logs -f langchain-chroma-api-1
7:16AM INF Starting LocalAI using 4 threads, with models path: /models
7:16AM INF LocalAI version: v1.24.1 (9cc8d9086580bd2a96f5c96a6b873242879c70bc)
┌───────────────────────────────────────────────────┐
│                   Fiber v2.48.0                   │
│               http://127.0.0.1:8080               │
│       (bound on host 0.0.0.0 and port 8080)       │
│                                                   │
│ Handlers ............ 55  Processes ........... 1 │
│ Prefork ....... Disabled  PID ................ 14 │
└───────────────────────────────────────────────────┘
This exposes the local endpoint http://127.0.0.1:8080, which serves as the LocalAI API endpoint.
Two models are served:
- LLM model: ggml-gpt4all-j
  - External access name: gpt-3.5-turbo (this name can be customized; it is configured in models/gpt-3.5-turbo.yaml)
- Embedding model: all-MiniLM-L6-v2
  - External access name: text-embedding-ada-002 (this name can be customized; it is configured in the corresponding file under models/)
A quick curl check of the chat endpoint is sketched below.
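Since LocalAI exposes an OpenAI-compatible REST API, the chat endpoint can be checked directly with curl, using the external access name configured above (a minimal sketch against the local endpoint):
# Call LocalAI's OpenAI-compatible chat completions endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "How are you?"}],
    "temperature": 0.9
  }'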
If Dify is deployed with Docker, pay attention to the network configuration and make sure the Dify container can reach the LocalAI endpoint. The Dify container cannot access localhost; use the host's IP address instead.
- Once the LocalAI API service is deployed, use the models in Dify
Fill in Settings > Model Provider > LocalAI:
- Model 1: ggml-gpt4all-j
  - Model type: text generation
  - Model name: gpt-3.5-turbo
  - Server URL: http://127.0.0.1:8080
  - If Dify is deployed with Docker, fill in the host domain, e.g. http://your-LocalAI-endpoint-domain:8080; a LAN IP address also works, for example http://192.168.1.100:8080
- Model 2: all-MiniLM-L6-v2
  - Model type: Embeddings
  - Model name: text-embedding-ada-002
  - Server URL: http://127.0.0.1:8080
  - If Dify is deployed with Docker, fill in the host domain, e.g. http://your-LocalAI-endpoint-domain:8080; a LAN IP address also works, for example http://192.168.1.100:8080
For more information on LocalAI, see: https://github.com/go-skynet/LocalAI
4. Configure LLM + Dify (Ollama)
- Confirm the service port status:
netstat -tulnp | grep ollama
#netstat -tulpn | grep 11434
- If you hit the error "Error: could not connect to ollama app, is it running?"
Reference link: https://stackoverflow.com/questions/78437376/run-ollama-run-llama3-in-colab-raise-err-error-could-not-connect-to-ollama
The /etc/systemd/system/ollama.service file contains:
[Service]
ExecStart=/usr/local/bin/ollama serve
Environment="OLLAMA_HOST=0.0.0.0:7861"
Environment="OLLAMA_KEEP_ALIVE=-1"
- Runtime commands
export OLLAMA_HOST=0.0.0.0:7861
ollama list
ollama run llama3.1
# You can also add the export permanently to your shell profile:
vim ~/.bashrc
source ~/.bashrc
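For the permanent variant, the line appended to ~/.bashrc is simply the same export, so that every new shell talks to Ollama on the custom port:
# Added to ~/.bashrc
export OLLAMA_HOST=0.0.0.0:7861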
Fill in Settings > Model Provider > Ollama:
- Model Name: llama3.1
- Base URL: http://<your-ollama-endpoint-domain>:11434
  - The address of a reachable Ollama service is required here.
  - If Dify is deployed with Docker, it is recommended to fill in the LAN IP address, e.g. http://10.80.2.195:11434, or the Docker host IP address, e.g. http://172.17.0.1:11434.
  - If Dify is deployed locally from source, you can fill in http://localhost:11434.
- Model type: chat
- Model context length: 4096
  - The maximum context length of the model; if unsure, use the default value of 4096.
- Maximum token limit: 4096
  - The maximum number of tokens the model returns; unless the model specifies otherwise, this can match the model context length.
- Vision support: Yes
  - Check this when the model supports image understanding (multimodal), e.g. llava.
- Click "Save"; once the model is validated, it can be used in your application.
- Embedding models are added the same way as LLMs: just change the model type to Text Embedding.
- If you deploy Dify and Ollama with Docker, you may encounter the following error.
httpconnectionpool(host=127.0.0.1, port=11434): max retries exceeded with url:/cpi/chat (Caused by NewConnectionError('< object at 0x7f8562812c20>: fail to establish a new connection:[Errno 111] Connection refused'))
httpconnectionpool(host=localhost, port=11434): max retries exceeded with url:/cpi/chat (Caused by NewConnectionError('< object at 0x7f8562812c20>: fail to establish a new connection:[Errno 111] Connection refused'))
This error is due to the Docker container being unable to access the Ollama service. localhost usually refers to the container itself, not the host or another container. To resolve this issue, you need to expose the Ollama service to the network.
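A common fix, assuming Ollama runs on the host and Dify runs in Docker, is to bind Ollama to all interfaces and then point Dify at the host address rather than at localhost (the exact Base URL depends on your setup, as described in the configuration above):
# Make Ollama listen on all interfaces (set this in the systemd unit as shown
# earlier, or for a foreground run:)
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
# In Dify, use the host's LAN IP as the Base URL, e.g. http://192.168.1.100:11434,
# or http://host.docker.internal:11434 on Docker Desktop, instead of http://localhost:11434.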
4.1 Multi-model comparison
To add more models, simply repeat the single-model configuration steps above for each additional model.
- Note that after adding a new model configuration, you need to refresh the Dify web page; a browser refresh is enough, and the newly added model will be loaded.
- After a call, you can see the resource consumption of each model.
More LLM platform references:
- RAG + AI Workflow + Agent: How to Choose an LLM Framework, a Comprehensive Comparison of MaxKB, Dify, FastGPT, RagFlow, Anything-LLM, and More!
- Winning the Future with Intelligence: Selected Domestic Large Model + Agent Application Cases, and Recommended Open-Source Mainstream Agent Frameworks
- Official website: /zh
- GitHub address: https://github.com/langgenius/dify/tree/main
- Ollama Chinese website: /
- Ollama installation tutorial: /getting-started/linux/
- Ollama Linux Deployment and Applications LLama 3
For more quality content, follow the WeChat official account and CSDN profile "Ting, Artificial Intelligence", which provide related resources and quality articles, free to read.