A Practical Guide to Deploying Large Language Models: Ollama Simplifies the Process, OpenLLM Deploys Flexibly, LocalAI Optimizes Locally, and Dify Empowers Application Development
1. Deploying a local model with Ollama
Ollama is an open-source framework designed to make it easy to deploy and run Large Language Models (LLMs) on a local machine. The official website is https://ollama.com.
The following is an overview of its main features and functions:
- Simplified Deployment: Ollama aims to simplify the process of deploying large language models in Docker containers, making it easy for non-expert users to manage and run these complex models.
- Lightweight and Scalable: As a lightweight framework, Ollama maintains a small resource footprint while being scalable, allowing users to adjust configurations as needed to accommodate projects of different sizes and hardware conditions.
- API Support: Provides a clean API that makes it easy for developers to create, run, and manage large language model instances, lowering the technical barrier to interacting with models (see the example after this list).
- Pre-built model library: Includes a library of pre-trained large language models that users can use directly in their own applications, with no need to train from scratch or hunt for model sources.
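As a quick illustration of the API mentioned above, the REST endpoint can be called with curl once a model has been pulled (a minimal sketch, assuming Ollama runs locally on the default port 11434 and llama3.1 is available):
# Ask the local Ollama REST API for a single, non-streamed completion
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.1",
  "prompt": "Why is the sky blue?",
  "stream": false
}'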
1.1 One-Click Installation
The one-click install script may fail with a certificate error such as:
curl: (77) error setting certificate verify locations: CAfile: /data/usr/local/anaconda/ssl/: none
Reason: the CAfile path is wrong, i.e. no certificate file can be found at that path.
- Solution:
- Locate your CA certificate file. If you do not have one, download a CA bundle (for example cacert.pem from the curl website) and save it to a directory of your choice.
- Set the environment variable:
export CURL_CA_BUNDLE=/path/to/cacert.pem
# Replace "/path/to/cacert.pem" with the actual path to your certificate file, e.g.:
export CURL_CA_BUNDLE=/www/anaconda3/anaconda3/ssl/
- Run the install script:
curl -fsSL https://ollama.com/install.sh | sh
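Once the script finishes, a quick sanity check confirms that the binary is on the PATH and that the server responds (assuming the default port 11434):
# Print the installed Ollama version
ollama --version
# The REST API should answer on the default port
curl http://localhost:11434/api/version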
1.2 Manual installation
ollama Chinese:/getting-started/linux/
- Download the Ollama binary: Ollama is distributed as a self-contained binary. Download it to a directory in your PATH:
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
- Add Ollama as a startup service (recommended): Create a user for Ollama:
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
- Create a service file at /etc/systemd/system/ollama.service:
# vim /etc/systemd/system/ollama.service
[Unit]
Description=Ollama Service
After=network-online.target

[Service]
ExecStart=/usr/bin/ollama serve
User=ollama
Group=ollama
Restart=always
RestartSec=3

[Install]
WantedBy=default.target
- Then start the service:
sudo systemctl daemon-reload
sudo systemctl enable ollama
- Launch Ollama
Use systemd to start Ollama:
sudo systemctl start ollama
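To confirm that the unit started correctly and is enabled at boot (standard systemd checks; the output will vary by system):
# Check the unit state and whether it starts on boot
sudo systemctl status ollama --no-pager
systemctl is-enabled ollama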
- Update Ollama and view logs
# To update, download the binary again:
sudo curl -L https://ollama.com/download/ollama-linux-amd64 -o /usr/bin/ollama
sudo chmod +x /usr/bin/ollama
# To view the journal of Ollama running as a startup service, run:
journalctl -u ollama
- Shut down the Ollama service
# Shut down the ollama service
service ollama stop
1.3 Offline installation of Ollama on an intranet Linux server
- Check the server's CPU architecture
## Command to view the CPU architecture of a Linux system; this server is x86_64
lscpu
- Download the Ollama package that matches the CPU architecture and copy it to the offline server
Download address: https://github.com/ollama/ollama/releases/
# x86_64 CPU: download ollama-linux-amd64
# aarch64 / arm64 CPU: download ollama-linux-arm64
# Download it on an internet-connected machine in the same way:
wget https://ollama.com/download/ollama-linux-amd64
Copy the file to the offline server as /usr/bin/ollama (i.e. rename the downloaded ollama-linux-amd64 with mv); the remaining steps are the same as in the manual installation above (see the consolidated sketch below).
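Putting the offline steps together, a minimal sketch of the whole offline installation could look like this (it simply replays the manual installation above with the locally copied binary; adjust paths to your environment):
# On an internet-connected machine: fetch the binary that matches the CPU
wget https://ollama.com/download/ollama-linux-amd64

# Copy ollama-linux-amd64 to the offline server (scp, USB drive, etc.), then:
sudo mv ollama-linux-amd64 /usr/bin/ollama
sudo chmod +x /usr/bin/ollama

# Create the ollama user and the /etc/systemd/system/ollama.service unit
# as in the manual installation, then enable and start the service:
sudo useradd -r -s /bin/false -m -d /usr/share/ollama ollama
sudo systemctl daemon-reload
sudo systemctl enable ollama
sudo systemctl start ollama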
1.4 Modifying storage paths
Ollama models are stored by default:
- macOS: ~/.ollama/models
- Linux: /usr/share/ollama/.ollama/models
- Windows: C:\Users\<username>\.ollama\models
If Ollama is running as a systemd service, the environment variables should be set via systemctl:
- Edit the systemd service by calling systemctl edit ollama.service. This opens an editor.
- For each environment variable, add an Environment line under the [Service] section.
Alternatively, add the two lines below directly to /etc/systemd/system/ollama.service:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:7861"
Environment="OLLAMA_MODELS=/www/algorithm/LLM_model/models"
- Save and exit.
- Reload systemd and restart Ollama:
systemctl daemon-reload
systemctl restart ollama
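To verify that the new variables were actually picked up by systemd (a quick check using standard systemctl options; the paths are the ones set above):
# Show the environment that systemd passes to the ollama service
systemctl show ollama --property=Environment
# Models pulled from now on should appear under the new OLLAMA_MODELS path
ls /www/algorithm/LLM_model/models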
Reference link: https://github.com/ollama/ollama/blob/main/docs/
- Use systemd to start Ollama:
sudo systemctl start ollama
- Terminate the service
Terminating the service frees the GPU memory held by the model Ollama has loaded. While stopped, Ollama is unreachable, so deploy and run operations fail with:
Error: could not connect to ollama app, is it running?
The service has to be started again before models can be deployed and run.
systemctl stop ollama
- Start again after termination (once started, ollama can deploy and run large models again)
systemctl start ollama
1.5 Starting an LLM
- Download model
ollama pull llama3.1
ollama pull qwen2
- Running the big model
ollama run llama3.1
ollama run qwen2
- To check whether the large models are recognized, run ollama list; if successful, the models are listed:
ollama list
NAME            ID              SIZE    MODIFIED
qwen2:latest    e0d4e1163c58    4.4 GB  3 hours ago
- Use the ollama ps command to view the models currently loaded into memory:
ollama ps
NAME            ID              SIZE    PROCESSOR    UNTIL
qwen2:latest    e0d4e1163c58    5.7 GB  100% GPU     3 minutes from now
- Check GPU usage with nvidia-smi:
+---------------------------------------------------------------------------------------+
| NVIDIA-SMI 535.86.10 Driver Version: 535.86.10 CUDA Version: 12.2 |
|-----------------------------------------+----------------------+----------------------+
| GPU Name Persistence-M | Bus-Id | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap | Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|=========================================+======================+======================|
| 0 Tesla V100-SXM2-32GB On | 00000000:00:08.0 Off | 0 |
| N/A 35C P0 56W / 300W | 5404MiB / 32768MiB | 0% Default |
| | | N/A |
+-----------------------------------------+----------------------+----------------------+
+---------------------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=======================================================================================|
| 0 N/A N/A 3062036 C ...unners/cuda_v11/ollama_llama_server 5402MiB |
+---------------------------------------------------------------------------------------+
- Once started, we can verify that it is available:
curl http://10.80.2.195:7861/api/chat -d '{
"model": "llama3.1",
"messages": [
{ "role": "user", "content": "why is the sky blue?" }
]
}'
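The same endpoint also accepts "stream": false if you prefer one complete JSON response instead of a stream of chunks (per the Ollama API; the host and port follow the OLLAMA_HOST configured above):
# Non-streaming variant of the chat request
curl http://10.80.2.195:7861/api/chat -d '{
  "model": "llama3.1",
  "stream": false,
  "messages": [
    { "role": "user", "content": "why is the sky blue?" }
  ]
}'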
1.6 More configuration options
Environment variables that can be set for Ollama:
- OLLAMA_HOST: defines which network interfaces Ollama listens on. Setting OLLAMA_HOST=0.0.0.0 lets Ollama listen on all available interfaces, allowing access from external networks.
- OLLAMA_MODELS: specifies the storage path for model files. Setting OLLAMA_MODELS=F:\OllamaCache, for example, stores models on another drive to avoid running out of space on the system drive.
- OLLAMA_KEEP_ALIVE: controls how long a model stays loaded in memory. Setting OLLAMA_KEEP_ALIVE=24h keeps the model in memory for 24 hours, improving response speed.
- OLLAMA_PORT: changes the default port. For example, OLLAMA_PORT=8080 would move the service from the default 11434 to 8080.
- OLLAMA_NUM_PARALLEL: determines how many user requests Ollama handles concurrently. Setting OLLAMA_NUM_PARALLEL=4 allows Ollama to handle four concurrent requests.
- OLLAMA_MAX_LOADED_MODELS: limits the number of models Ollama keeps loaded at the same time. Setting OLLAMA_MAX_LOADED_MODELS=4 helps keep system resource allocation under control.
In practice, Environment="OLLAMA_PORT=9380" did not work; specify the host and port together instead:
Environment="OLLAMA_HOST=0.0.0.0:7861"
A consolidated example of setting several of these variables at once is shown below.
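Several of these variables can be set together with a systemd drop-in, which is equivalent to what systemctl edit ollama.service produces (the values below are illustrative; adjust ports, paths, and counts to your hardware):
# Create a drop-in override for the ollama unit with illustrative values
sudo mkdir -p /etc/systemd/system/ollama.service.d
sudo tee /etc/systemd/system/ollama.service.d/override.conf >/dev/null <<'EOF'
[Service]
Environment="OLLAMA_HOST=0.0.0.0:7861"
Environment="OLLAMA_KEEP_ALIVE=24h"
Environment="OLLAMA_NUM_PARALLEL=4"
Environment="OLLAMA_MAX_LOADED_MODELS=4"
EOF
# Apply the changes
sudo systemctl daemon-reload
sudo systemctl restart ollama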
- Specify GPUs
If the machine has multiple GPUs, how do you run Ollama on specific ones? On Linux, set the environment variable CUDA_VISIBLE_DEVICES in the service file to select the GPUs, then restart the Ollama service. (Tested: the device index starts from 0, not 1.)
vim /etc/systemd/system/ollama.service
[Service]
Environment="CUDA_VISIBLE_DEVICES=0,1"
1.7 Ollama Common Commands
- Restart ollama
systemctl daemon-reload
systemctl restart ollama
- Restart the Ollama service manually (Ubuntu/Debian)
sudo apt update
sudo apt install lsof
# stop ollama: find the process listening on port 11434 and kill it
lsof -i :11434
kill <PID>
# start the server again
ollama serve
- Confirm the service port status:
netstat -tulpn | grep 11434
- Configure the service
The HOST needs to be configured so that the service is accessible from outside.
Open the configuration file:
vim /etc/systemd/system/ollama.service
Modify the Environment variable as appropriate:
server environment:
Environment="OLLAMA_HOST=0.0.0.0:11434"
virtual machine environment:
Environment="OLLAMA_HOST=<server intranet IP address>:11434"
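After restarting with the new OLLAMA_HOST, a quick connectivity check from another machine on the network confirms that the service is reachable (replace the IP with your server's address; /api/tags simply lists the locally available models):
# From a different machine: list the models exposed by the remote Ollama
curl http://<server-intranet-ip>:11434/api/tags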
1.8 Uninstalling Ollama
If you decide that you no longer want to use Ollama, you can remove it completely from your system by following these steps:
(1) Stop and disable the service:
sudo systemctl stop ollama
sudo systemctl disable ollama
(2) Delete the service file and the Ollama binary:
sudo rm /etc/systemd/system/ollama.service
sudo rm $(which ollama)
(3) Clean up Ollama users and groups:
sudo rm -r /usr/share/ollama
sudo userdel ollama
sudo groupdel ollama
By following these steps, you will not only be able to successfully install and configure Ollama on the Linux platform, but also have the flexibility to update and uninstall it.
2. OpenLLM deployment
OpenLLM, open-sourced in June 2023, is a framework for deploying large language models. The project currently has 9.6K stars on GitHub. Its original pitch was convenience for individual users: switching between different large language models with a single line of code, or close to it. OpenLLM is an open platform for operating large language models (LLMs) in production, making it easy to fine-tune, serve, deploy, and monitor any LLM.
- Installation
pip install openllm # or pip3 install openllm
openllm hello
- Support Models
- Llama-3.1
- Llama-3
- Phi-3
- Mistral
- Gemma-2
- Qwen-2
- Gemma
- Llama-2
- Mixtral
- Fill in Settings > Model Provider > OpenLLM in Dify:
  - Model Name: the name of the model you are serving
  - Server URL: http://<Machine_IP>:3333 (replace <Machine_IP> with the IP address of your machine)
- Click "Save" to use the model in your application. (A quick connectivity check is sketched below.)
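Assuming a recent OpenLLM version, which exposes OpenAI-compatible routes, a quick connectivity check against the server URL configured above might look like this (the /v1/models path is an assumption based on that compatibility, not taken from the original text):
# Hypothetical check: list the models served by OpenLLM via its OpenAI-compatible API
curl http://<Machine_IP>:3333/v1/models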
- OpenLLM provides a built-in Python client that allows you to interact with the model. In a separate terminal window or Jupyter notebook, create a client (this is the client API of earlier OpenLLM releases) and start interacting with the model:
import openllm
client = openllm.client.HTTPClient('http://localhost:3000')
client.query('Explain to me the difference between "further" and "farther"')
- You can query a model from a terminal using the openllm query command:
export OPENLLM_ENDPOINT=http://localhost:3000
openllm query 'Explain to me the difference between "further" and "farther"'
Use the openllm models command to view a list of models and their variants supported by OpenLLM.
3. LocalAI deployment
LocalAI is a local inference framework that provides a RESTful API compatible with the OpenAI API specification. It lets you run LLMs (and other models) locally or on your own servers using consumer-grade hardware, and it supports multiple model families in the ggml format. No GPU is required. Dify supports accessing the large language model inference and embedding capabilities of a locally deployed LocalAI instance.
- GitHub: https://github.com/mudler/LocalAI/tree/master
- Official Handbook:/docs/getting-started/models/
- Official Rapid Deployment Manual Case:/docs/getting-started/models/
- First clone the LocalAI code repository and change into the example directory
git clone https://github.com/go-skynet/LocalAI
cd LocalAI/examples/langchain-chroma
- Download the demo LLM and Embedding models (for reference)
wget https://huggingface.co/skeskinen/ggml/resolve/main/all-MiniLM-L6-v2/ggml-model-q4_0.bin -O models/bert
wget https://gpt4all.io/models/ggml-gpt4all-j.bin -O models/ggml-gpt4all-j
- Reference Article:Say goodbye to the Hugging Face model download problem
- Configure the .env file
mv .env.example .env
NOTE: Make sure that the value of the THREADS variable in .env does not exceed the number of CPU cores on your machine.
- Start LocalAI
# start with docker-compose
docker-compose up -d --build
#tail the logs & wait until the build completes
docker logs -f langchain-chroma-api-1
7:16AM INF Starting LocalAI using 4 threads, with models path: /models
7:16AM INF LocalAI version: v1.24.1 (9cc8d9086580bd2a96f5c96a6b873242879c70bc)
┌───────────────────────────────────────────────────┐
│                   Fiber v2.48.0                   │
│               http://127.0.0.1:8080               │
│       (bound on host 0.0.0.0 and port 8080)       │
│                                                   │
│ Handlers ............ 55  Processes ........... 1 │
│ Prefork ....... Disabled  PID ................ 14 │
└───────────────────────────────────────────────────┘
This exposes the local endpoint http://127.0.0.1:8080, which serves as the LocalAI API endpoint.
Two models are served:
- LLM model: ggml-gpt4all-j
  - External access name: gpt-3.5-turbo (this name can be customized; it is configured in models/gpt-3.5-turbo.yaml)
- Embedding model: all-MiniLM-L6-v2
  - External access name: text-embedding-ada-002 (this name can be customized; it is configured in the corresponding file under models/)
A quick curl check of the chat endpoint is sketched below.
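Since LocalAI exposes an OpenAI-compatible REST API, the chat endpoint can be checked directly with curl, using the external access name configured above (a minimal sketch against the local endpoint):
# Call LocalAI's OpenAI-compatible chat completions endpoint
curl http://127.0.0.1:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gpt-3.5-turbo",
    "messages": [{"role": "user", "content": "How are you?"}],
    "temperature": 0.9
  }'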
If Dify is deployed with Docker, pay attention to the network configuration and make sure the Dify container can reach the LocalAI endpoint. The Dify container cannot access localhost; use the host's IP address instead.
- Once the LocalAI API service is deployed, use the models in Dify
Fill in Settings > Model Provider > LocalAI:
- Model 1: ggml-gpt4all-j
  - Model type: text generation
  - Model name: gpt-3.5-turbo
  - Server URL: http://127.0.0.1:8080
  - If Dify is deployed with Docker, fill in the host domain, e.g. http://your-LocalAI-endpoint-domain:8080; a LAN IP address also works, for example http://192.168.1.100:8080
- Model 2: all-MiniLM-L6-v2
  - Model type: Embeddings
  - Model name: text-embedding-ada-002
  - Server URL: http://127.0.0.1:8080
  - If Dify is deployed with Docker, fill in the host domain, e.g. http://your-LocalAI-endpoint-domain:8080; a LAN IP address also works, for example http://192.168.1.100:8080
For more information on LocalAI, see: https://github.com/go-skynet/LocalAI
4. Configure LLM + Dify (Ollama)
- Confirm the service port status:
netstat -tulnp | grep ollama
#netstat -tulpn | grep 11434
- If you hit the error "Error: could not connect to ollama app, is it running?"
Reference link: https://stackoverflow.com/questions/78437376/run-ollama-run-llama3-in-colab-raise-err-error-could-not-connect-to-ollama
The /etc/systemd/system/ollama.service file contains:
[Service]
ExecStart=/usr/local/bin/ollama serve
Environment="OLLAMA_HOST=0.0.0.0:7861"
Environment="OLLAMA_KEEP_ALIVE=-1"
- Runtime commands
export OLLAMA_HOST=0.0.0.0:7861
ollama list
ollama run llama3.1
# You can also add the export permanently to your shell profile:
vim ~/.bashrc
source ~/.bashrc
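For the permanent variant, the line appended to ~/.bashrc is simply the same export, so that every new shell talks to Ollama on the custom port:
# Added to ~/.bashrc
export OLLAMA_HOST=0.0.0.0:7861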
Fill in Settings > Model Provider > Ollama:
- Model Name: llama3.1
- Base URL: http://<your-ollama-endpoint-domain>:11434
  - The address of a reachable Ollama service is required here.
  - If Dify is deployed with Docker, it is recommended to fill in the LAN IP address, e.g. http://10.80.2.195:11434, or the Docker host IP address, e.g. http://172.17.0.1:11434.
  - If Dify is deployed locally from source, you can fill in http://localhost:11434.
- Model type: chat
- Model context length: 4096
  - The maximum context length of the model; if unsure, use the default value of 4096.
- Maximum token limit: 4096
  - The maximum number of tokens the model returns; unless the model specifies otherwise, this can match the model context length.
- Vision support: Yes
  - Check this when the model supports image understanding (multimodal), e.g. llava.
- Click "Save"; once the model is validated, it can be used in your application.
- Embedding models are added the same way as LLMs: just change the model type to Text Embedding.
- If you deploy Dify and Ollama with Docker, you may encounter the following error.
httpconnectionpool(host=127.0.0.1, port=11434): max retries exceeded with url:/cpi/chat (Caused by NewConnectionError('< object at 0x7f8562812c20>: fail to establish a new connection:[Errno 111] Connection refused'))
httpconnectionpool(host=localhost, port=11434): max retries exceeded with url:/cpi/chat (Caused by NewConnectionError('< object at 0x7f8562812c20>: fail to establish a new connection:[Errno 111] Connection refused'))
This error is due to the Docker container being unable to access the Ollama service. localhost usually refers to the container itself, not the host or another container. To resolve this issue, you need to expose the Ollama service to the network.
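A common fix, assuming Ollama runs on the host and Dify runs in Docker, is to bind Ollama to all interfaces and then point Dify at the host address rather than at localhost (the exact Base URL depends on your setup, as described in the configuration above):
# Make Ollama listen on all interfaces (set this in the systemd unit as shown
# earlier, or for a foreground run:)
export OLLAMA_HOST=0.0.0.0:11434
ollama serve
# In Dify, use the host's LAN IP as the Base URL, e.g. http://192.168.1.100:11434,
# or http://host.docker.internal:11434 on Docker Desktop, instead of http://localhost:11434.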
4.1 Multi-model comparison
To add more models, simply repeat the single-model configuration steps above for each additional model.
- Note that after adding a new model configuration, you need to refresh the Dify web page; a browser refresh is enough, and the newly added model will be loaded.
- After a call, you can see the resource consumption of each model.
More LLM platform references:
- RAG + AI Workflow + Agent: How to Choose an LLM Framework, a Comprehensive Comparison of MaxKB, Dify, FastGPT, RagFlow, Anything-LLM, and More!
- Winning the Future with Intelligence: Selected Domestic Large Model + Agent Application Cases, and Recommended Open-Source Mainstream Agent Frameworks
- Official website: /zh
- GitHub address: https://github.com/langgenius/dify/tree/main
- Ollama Chinese website: /
- Ollama installation tutorial: /getting-started/linux/
- Ollama Linux Deployment and Applications LLama 3
For more quality content, follow the WeChat official account and CSDN profile "Ting, Artificial Intelligence", which provide related resources and quality articles, free to read.