This article is shared from the Huawei Cloud Community article "[Developer Space Practice] Installing Docker on a Cloud Host and Building a Custom Image for Model Training on the ModelArts Platform", by Developer Space Bee.
1.1 Case Introduction
During AI development and operation, projects typically carry complex environment dependencies that need to be tuned and then frozen. To cope with fragile development environments and frequent switching between workstreams, ModelArts' AI development best practices freeze the runtime environment into container images: this both manages dependencies and makes switching between working environments convenient. With the cloud-based container resources provided by ModelArts, AI development and model experiment iteration can proceed faster and more efficiently.
This case guides developers through installing Docker on a cloud host, building a custom model image, and using that image for model training on the ModelArts platform.
1.2 Get Free Cloud Hosting
If you don't have a cloud host yet, you can click the link to claim your exclusive free cloud host.
If you have already claimed a cloud host, you can start the experiment directly.
1.3 Experimental Procedures
Description:
- Install Docker on the cloud host;
- Build a model training image and upload it to SWR;
- Create a training script on the cloud host, then open the OBS service in a browser and upload the training script;
- Create a training job on the ModelArts platform to complete model training.
1.4 Experimental resources
This experiment costs about 1 yuan in total; resources are billed on a pay-per-use basis. You can click the link and follow the steps to join the Huawei Cloud Cloud Creation Program and apply for cloud resource vouchers for a free experience. After the experiment, please release the resources promptly to avoid incurring extra costs.
1.5 Installing Docker on Cloud Hosts
After entering the cloud host, click the "Terminal" button on the left menu to open the command line tool.
Install Docker by entering the following commands on the command line (using Docker's standard convenience install script):
curl -fsSL https://get.docker.com -o get-docker.sh
sh get-docker.sh
Enter the command "sudo docker info" to confirm that Docker is installed.
Next, configure image acceleration: enter the command "vim /etc/docker/daemon.json" to edit the Docker configuration file, press the "i" key to enter insert mode, and enter the following content (fill in your own image accelerator address between the quotes).
{
"registry-mirrors": [ "" ]
}
After you finish typing, press "Esc" to exit insert mode and type ":wq" to save the file. Then enter "sudo systemctl restart docker" to restart Docker, and enter "sudo docker info" again to check; if the configured mirror appears in the output, the image acceleration configuration is complete.
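A quick way to catch syntax mistakes in the daemon configuration before restarting Docker is to parse it as JSON. A minimal Python sketch (the file content is inlined here, and the mirror address is a placeholder, not a real accelerator URL):

```python
import json

# Sample daemon configuration content -- the mirror address below is a
# placeholder; substitute your own accelerator address.
daemon_conf = '{ "registry-mirrors": [ "https://example-mirror.invalid" ] }'

parsed = json.loads(daemon_conf)  # raises ValueError if the JSON is malformed
assert isinstance(parsed.get("registry-mirrors"), list)
print("daemon config OK:", parsed["registry-mirrors"])
```

If the file is malformed, `json.loads` raises an error that points at the offending position, which is easier to diagnose than a failed `systemctl restart docker`.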
1.6 Preparing the necessary files for making the image
After confirming that Docker is installed, use the command "mkdir -p context" to create a folder, then use "cd context/" to enter the context folder.
Use the "vim pip.conf" command to create and edit the pip source file: press the "i" key to enter insert mode, enter the following content, and after confirming there are no errors, press "Esc" to return to command mode and use the ":wq" command to save and exit.
[global]
index-url = https://repo.huaweicloud.com/repository/pypi/simple
trusted-host = repo.huaweicloud.com
timeout = 120
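The pip source file uses INI syntax, so it can be sanity-checked with Python's standard configparser before the image build depends on it. A small sketch (the file content is inlined for illustration, and the Huawei open-source mirror address is an assumption):

```python
import configparser

# pip source file content inlined for illustration; on the cloud host this
# is the file created with vim above.
pip_conf = """\
[global]
index-url = https://repo.huaweicloud.com/repository/pypi/simple
trusted-host = repo.huaweicloud.com
timeout = 120
"""

cfg = configparser.ConfigParser()
cfg.read_string(pip_conf)  # raises configparser.Error on bad INI syntax
print("index-url:", cfg.get("global", "index-url"))
```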
Use the wget command to download the three "torch*.whl" files:
wget https://download.pytorch.org/whl/cu111/torch-1.8.1%2Bcu111-cp37-cp37m-linux_x86_64.whl
wget https://download.pytorch.org/whl/torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
wget https://download.pytorch.org/whl/cu111/torchvision-0.9.1%2Bcu111-cp37-cp37m-linux_x86_64.whl
Use the command "wget https://repo.anaconda.com/miniconda/Miniconda3-py37_4.12.0-Linux-x86_64.sh" to download the Miniconda3 py37_4.12.0 installation file.
Place the pip source file, the torch*.whl files, and the Miniconda3 installation file in the context folder; after completing the above operations, the contents of the context folder are as follows:
context
├── Miniconda3-py37_4.12.0-Linux-x86_64.sh
├── pip.conf
├── torch-1.8.1+cu111-cp37-cp37m-linux_x86_64.whl
├── torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
└── torchvision-0.9.1+cu111-cp37-cp37m-linux_x86_64.whl
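Before building, it can help to verify that every file the Dockerfile will COPY is actually present in the build context, since a missing file fails the build only partway through. A minimal Python sketch (run from the parent of the context folder or pass the path; the pip source file is assumed to be named pip.conf):

```python
from pathlib import Path

# Files the Dockerfile created in the next step will COPY into the image.
REQUIRED = [
    "Miniconda3-py37_4.12.0-Linux-x86_64.sh",
    "pip.conf",
    "torch-1.8.1+cu111-cp37-cp37m-linux_x86_64.whl",
    "torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl",
    "torchvision-0.9.1+cu111-cp37-cp37m-linux_x86_64.whl",
]

def missing_files(context_dir="."):
    """Return the required files not yet present in the build context."""
    root = Path(context_dir)
    return [name for name in REQUIRED if not (root / name).is_file()]

if __name__ == "__main__":
    print("missing:", missing_files() or "none")
```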
Use "vim Dockerfile" to create and edit the Dockerfile, and fill in the following content:
# Container image build hosts need to be connected to the public network
# Base container image, https://github.com/NVIDIA/nvidia-docker/wiki/CUDA
# https://docs.docker.com/develop/develop-images/multistage-build/#use-multi-stage-builds
# require Docker Engine >= 17.05
FROM nvidia/cuda:11.1.1-runtime-ubuntu18.04 AS builder
# The default user of the base container image is already root
# USER root
# Use the pypi configuration provided by Huawei's open source mirror site
RUN mkdir -p /root/.pip/
COPY pip.conf /root/.pip/
# Copy the files to be installed to the /tmp directory in the base container image.
COPY Miniconda3-py37_4.12.0-Linux-x86_64.sh /tmp
COPY torch-1.8.1+cu111-cp37-cp37m-linux_x86_64.whl /tmp
COPY torchvision-0.9.1+cu111-cp37-cp37m-linux_x86_64.whl /tmp
COPY torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl /tmp
# /projects/conda/en/latest/user-guide/install/#installing-on-linux
# Install Miniconda3 into the /home/ma-user/miniconda3 directory of the base container image
RUN bash /tmp/Miniconda3-py37_4.12.0-Linux-x86_64.sh -b -p /home/ma-user/miniconda3
# Install torch*.whl using Miniconda3's default python environment (i.e. /home/ma-user/miniconda3/bin/pip)
RUN cd /tmp && \
/home/ma-user/miniconda3/bin/pip install --no-cache-dir \
/tmp/torch-1.8.1+cu111-cp37-cp37m-linux_x86_64.whl \
/tmp/torchvision-0.9.1+cu111-cp37-cp37m-linux_x86_64.whl \
/tmp/torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
# Build the final container image
FROM nvidia/cuda:11.1.1-runtime-ubuntu18.04
# Install vim and curl tools (still using Huawei open source mirror)
RUN cp -a /etc/apt/sources.list /etc/apt/sources.list.bak && \
    sed -i "s@http://.*archive.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    sed -i "s@http://.*security.ubuntu.com@http://repo.huaweicloud.com@g" /etc/apt/sources.list && \
    apt-get update && \
    apt-get install -y vim curl && \
    apt-get clean && \
    mv /etc/apt/sources.list.bak /etc/apt/sources.list
# Add ma-user user (uid = 1000, gid = 100)
# Note that the base container image already has a group with gid = 100, so the ma-user user can be used directly
RUN useradd -m -d /home/ma-user -s /bin/bash -g 100 -u 1000 ma-user
# Copy the /home/ma-user/miniconda3 directory from the above builder stage to the directory of the same name in the current container image
COPY --chown=ma-user:100 --from=builder /home/ma-user/miniconda3 /home/ma-user/miniconda3
# Set the container image preconfigured environment variables
# Be sure to set PYTHONUNBUFFERED=1 to prevent log loss
ENV PATH=$PATH:/home/ma-user/miniconda3/bin \
PYTHONUNBUFFERED=1
# Set the default user and working directory of the container image
USER ma-user
WORKDIR /home/ma-user
After finishing editing, press the "Esc" key to exit insert mode and enter ":wq" to save the file. After completing the operation, use the "ll" command to view the contents of the context folder, which are as follows:
context
├── Dockerfile
├── Miniconda3-py37_4.12.0-Linux-x86_64.sh
├── pip.conf
├── torch-1.8.1+cu111-cp37-cp37m-linux_x86_64.whl
├── torchaudio-0.8.1-cp37-cp37m-linux_x86_64.whl
└── torchvision-0.9.1+cu111-cp37-cp37m-linux_x86_64.whl
1.7 Build the Image and Upload It to SWR
In the context folder, enter the command "sudo docker build . -t pytorch:1.8.1-cuda11.1" to build the image.
This process takes 5 to 8 minutes; while the build runs, you can proceed with the following steps.
Open Firefox, go to the Huawei Cloud home page, select "Products" > "Containers" > "Container Image Service SWR" in turn to enter the service page, and then click "Console" to enter the SWR console page.
Click on the "Create Organization" button in the upper right corner, enter an organization name and click "OK". This step can be skipped if there is already an organization available.
Then go back to the overview page, click the "Login Command" button at the top, and copy the docker command that pops up. Add "sudo" in front of this command and enter it in the terminal command line of the cloud host; if "Login Succeeded" is displayed, the login was successful.
Use the following command to tag the newly built image:
sudo docker tag pytorch:1.8.1-cuda11.1 swr.{region parameter}.myhuaweicloud.com/{organization name}/pytorch:1.8.1-cuda11.1
The region parameter is obtained from the login command, and the organization name should be replaced with the one created in the step above. Taking the Huawei Cloud CN South-Guangzhou region as an example:
sudo docker tag pytorch:1.8.1-cuda11.1 swr.cn-south-1.myhuaweicloud.com/ai-test/pytorch:1.8.1-cuda11.1
Use the following command to upload the image to SWR:
sudo docker push swr.{region parameter}.myhuaweicloud.com/{organization name}/pytorch:1.8.1-cuda11.1
The region parameter is obtained from the login command, and the organization name should be replaced with the one created in the step above. Taking the Huawei Cloud CN South-Guangzhou region as an example:
sudo docker push swr.cn-south-1.myhuaweicloud.com/ai-test/pytorch:1.8.1-cuda11.1
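The full SWR image reference used in the tag and push commands follows a fixed pattern, which can be composed programmatically when scripting these steps. A sketch (the region id cn-south-1 for Guangzhou and the organization name ai-test match this article's example; read the real endpoint from your own login command):

```python
def swr_image_ref(region, organization, image, tag):
    """Compose an SWR image reference: swr.<region>.myhuaweicloud.com/<org>/<image>:<tag>."""
    return f"swr.{region}.myhuaweicloud.com/{organization}/{image}:{tag}"

# Example using the Guangzhou region id and this article's organization name.
ref = swr_image_ref("cn-south-1", "ai-test", "pytorch", "1.8.1-cuda11.1")
print(ref)  # swr.cn-south-1.myhuaweicloud.com/ai-test/pytorch:1.8.1-cuda11.1
```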
After you finish uploading images, you can view the uploaded custom images on the My Images page of the Container Image Service console.
1.8 Create an OBS Bucket and Folders and Upload the Training Script
On the Huawei Cloud home page, select "Products" > "Storage" > "Object Storage Service OBS" in turn, enter the service page, and click "Buy" to enter the resource pack purchase page.
Purchase according to the following specifications:
Once the purchase is complete, click on "Bucket List" in the menu on the left side of the console page.
Then click on the "Create Bucket" button in the upper right corner to enter the creation page.
Create it according to the following specifications:
Once you have created a bucket, you can see it on the bucket list page, and click on the bucket name link to go to the bucket details page.
Click the "New Folder" button to create a folder named "pytorch", and under it create two subfolders named "demo-code" and "log"; the demo-code folder holds the training scripts, and the log folder holds the training logs.
Back in the command-line window, in the context folder, use the vim command to create and edit the training script: press the "i" key to enter insert mode, enter the following content, confirm there are no errors, press "Esc" to return to command mode, and use the ":wq" command to save and exit.
import torch
import torch.nn as nn

x = torch.rand(5, 3)
print(x)

available_dev = torch.device("cuda") if torch.cuda.is_available() else torch.device("cpu")
y = torch.rand(5, 3).to(available_dev)
print(y)
After creating a new script, go back to the OBS console page, go to the demo-code folder, and upload the script to the demo-code folder.
Click "Add File", select the file you just created in the /developer/context directory, and click "Open" to complete the upload.
1.9 Creating Training Jobs on ModelArts
On the Huawei Cloud home page, select "Products" > "Artificial Intelligence" > "AI Development Platform ModelArts" in turn to enter the service page, and click the "Console" button to enter the console page.
The first time you use the ModelArts platform, you will be prompted with insufficient permissions and need to authorize the OBS service and SWR service.
Click the "here" hyperlink in the insufficient-permissions prompt to enter the authorization page, select "New Delegation", choose "Ordinary User" for the permission configuration, and then click the "Create" button. When you return to the ModelArts console page, the insufficient-permissions prompt disappears and the platform can be used normally.
Select "Model Training" > "Training Jobs" in the left navigation bar to enter the "Training Jobs" list.
Fill in the parameter information according to the following table:
After the training job is created, the background will automatically complete actions such as container image download, code directory download, and execution of startup commands.
Training operations typically take a while to run.
After the training is completed, open the OBS service, find the training logs under the log folder of the bucket created earlier, and download them. Search the downloaded log file for the keyword "tensor"; if the printed tensors appear, the training was successful.
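Scanning the downloaded log for the keyword can also be done with a short script instead of a text editor. A minimal sketch (the log content is inlined here as an example of the lines the training script prints; a real log would be read from the downloaded file):

```python
def lines_with_keyword(text, keyword="tensor"):
    """Return the log lines that contain the keyword (case-insensitive)."""
    return [line for line in text.splitlines() if keyword.lower() in line.lower()]

# Example log content illustrating a successful run.
sample_log = "starting job\ntensor([[0.1, 0.2, 0.3]])\ndone\n"
hits = lines_with_keyword(sample_log)
print(hits)
```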
At this point, the experiment is complete.