
Some new features in MindSpore-2.4 version


Technical background

In a previous blog post, we described the installation of MindSpore-2.4-gpu and some of the problems that may arise. Here we can try out some of the features of the new MindSpore version now that the installation is complete. After installing, if you use VSCode as your IDE, you can press Ctrl+Shift+P and search for Python: Select Interpreter to switch the Python interpreter to the new MindSpore environment we need.

Device management and resource monitoring

The mindspore-2.4 version adds the mindspore.hal interfaces, which can be used for device management, device monitoring, stream handling, and so on. For example, a common task is querying the number of devices:

import mindspore as ms
ms.set_context(device_target="GPU")
device_target = ms.get_context("device_target")
print(ms.hal.device_count(device_target))
# 2

This output indicates that we have two GPU cards in our environment. You can also print the names of the two cards:

import mindspore as ms
ms.set_context(device_target="GPU")
device_target = ms.get_context("device_target")
print(ms.hal.get_device_name(0, device_target))
print(ms.hal.get_device_name(1, device_target))
# Quadro RTX 4000
# Quadro RTX 4000

and the availability status of the device:

import mindspore as ms
ms.set_context(device_target="GPU")
device_target = ms.get_context("device_target")
print(ms.hal.is_available(device_target))
# True

You can also query whether the device has been initialized:

import mindspore as ms
ms.set_context(device_target="GPU")
device_target = ms.get_context("device_target")
print(ms.hal.is_initialized(device_target))
A = ms.Tensor([0.], ms.float32)
A2 = (A + A).asnumpy()
print(ms.hal.is_initialized(device_target))
# False
# True

This also indicates that MindSpore only transfers the Tensor data to the compute backend when a computation is actually executed. Besides device management, the new version of MindSpore also supports some memory-monitoring features, which are very useful for performance management:

import mindspore as ms
import numpy as np
ms.set_context(device_target="GPU")
A = ms.Tensor(np.ones(1000), ms.float32)
A2 = (A + A).asnumpy()
print(ms.hal.max_memory_allocated())
# 8192

The output here is the peak device memory, in bytes, occupied by Tensors so far. Note that this is not calculated directly from the floating-point storage alone: MindSpore generates some additional data structures while constructing the graph, and these also occupy a certain amount of device memory, but the trend of memory growth is reported accurately. Besides printing individual metrics, you can also output an overall summary:

import mindspore as ms
import numpy as np
ms.set_context(device_target="GPU")
A = ms.Tensor(np.ones(1000), ms.float32)
A2 = (A + A).asnumpy()
print(ms.hal.memory_summary())

The output is:

|=============================================|
|               Memory summary                |
|=============================================|
| Metric               | Data                 |
|---------------------------------------------|
| Reserved memory      |   1024 MB            |
|---------------------------------------------|
| Allocated memory     |   4096 B             |
|---------------------------------------------|
| Idle memory          |   1023 MB            |
|---------------------------------------------|
| Eager free memory    |      0 B             |
|---------------------------------------------|
| Max reserved memory  |   1024 MB            |
|---------------------------------------------|
| Max allocated memory |   8192 B             |
|=============================================|
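The metrics in this summary can also be read one at a time. Below is a minimal sketch, assuming that ms.hal additionally exposes memory_allocated() and memory_reserved() counterparts to the rows shown above; if your installed version does not provide them, memory_summary() already contains the same information:

import mindspore as ms
import numpy as np
ms.set_context(device_target="GPU")

A = ms.Tensor(np.ones(1000), ms.float32)
A2 = (A + A).asnumpy()

# assumed per-metric queries mirroring the rows of memory_summary()
print(ms.hal.memory_allocated())      # currently allocated bytes
print(ms.hal.memory_reserved())       # bytes reserved by the memory pool
print(ms.hal.max_memory_allocated())  # peak allocated bytes, as shown earlier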

ForiLoop

In fact, ForiLoop is simply a built-in for-loop operator, similar to fori_loop in JAX:

import mindspore as ms
import numpy as np
from mindspore import ops
ms.set_context(device_target="GPU")

@ms.jit
def f(_, x):
    return x + x

A = ms.Tensor(np.ones(10), ms.float32)
N = 3
AN = ops.ForiLoop()(0, N, f, A).asnumpy()
print(AN)
# [8. 8. 8. 8. 8. 8. 8. 8. 8. 8.]
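As with fori_loop in JAX, the first argument passed to the loop function is the loop index, which the example above simply ignores. Here is a minimal sketch that uses the index, assuming ops.ForiLoop calls the loop function as loop_func(i, x); the names add_index and x0 are just illustrative:

import mindspore as ms
import numpy as np
from mindspore import ops
ms.set_context(device_target="GPU")

def add_index(i, x):
    # i is the current loop index, x is the carried value
    return x + i

x0 = ms.Tensor(np.zeros(10), ms.float32)
out = ops.ForiLoop()(0, 5, add_index, x0).asnumpy()
print(out)
# expected under the assumptions above: every element equals 0+1+2+3+4 = 10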

With this new loop operator, we can also perform end-to-end automatic differentiation through the entire loop body:

import mindspore as ms
import numpy as np
from mindspore import ops, grad
ms.set_context(device_target="GPU", mode=ms.GRAPH_MODE)

@ms.jit
def f(_, x):
    return x + x

@ms.jit
def s(x, N):
    return ops.ForiLoop()(0, N, f, x)

A = ms.Tensor(np.ones(10), ms.float32)
N = 3
AN = grad(s, grad_position=(0, ))(A, N).asnumpy()
print(AN)
# [8. 8. 8. 8. 8. 8. 8. 8. 8. 8.]

Stream computation

CUDA Streams are an essential feature of high-performance CUDA programming. Their performance gain comes from separating data transfer from floating-point computation: the two can be placed on different streams, so data can be transferred while computation is in progress. Compared with a single-stream transmit-compute-wait-transmit-compute pattern, this is clearly faster. Some deep learning frameworks have supported stream scheduling for quite a while, and MindSpore is now keeping pace. To see which scenarios stream computation is suited for, let's first look at the following example:

import mindspore as ms
import numpy as np
np.random.seed(0)
from mindspore import numpy as msnp
ms.set_context(device_target="GPU", mode=ms.GRAPH_MODE)

@ms.jit
def U(x, mu=1.0, k=1.0):
    return msnp.sum(0.5 * k * (x-mu) ** 2)

x = ms.Tensor(np.random.random(1000000000), ms.float32)
energy = U(x)
print(energy)

Executing it in the local environment will report an error:

Traceback (most recent call last):
  File "/home/dechin/projects/gitee/dechin/tests/test_ms.py", line 13, in <module>
    energy = U(x)
  File "/home/dechin/anaconda3/envs/mindspore-master/lib/python3.9/site-packages/mindspore/common/", line 960, in staging_specialize
    out = _MindsporeFunctionExecutor(func, hash_obj, dyn_args, process_obj, jit_config)(*args, **kwargs)
  File "/home/dechin/anaconda3/envs/mindspore-master/lib/python3.9/site-packages/mindspore/common/", line 188, in wrapper
    results = fn(*arg, **kwargs)
  File "/home/dechin/anaconda3/envs/mindspore-master/lib/python3.9/site-packages/mindspore/common/", line 588, in __call__
    output = self._graph_executor(tuple(new_inputs), phase)
RuntimeError: 
----------------------------------------------------
- Memory not enough:
----------------------------------------------------
Device(id:0) memory isn't enough and alloc failed, kernel name: 0_Default/Sub-op0, alloc size: 4000000000B.

----------------------------------------------------
- C++ Call Stack: (For framework developers)
----------------------------------------------------
mindspore/ccsrc/runtime/graph_scheduler/graph_scheduler.cc:1066 Run

This indicates that the device ran out of memory. In such cases, it is often necessary to split the data manually and then traverse the chunks with a loop body:

import time
import mindspore as ms
import numpy as np
from mindspore import numpy as msnp
ms.set_context(device_target="GPU", mode=ms.GRAPH_MODE)

@ms.jit
def U(x, mu=1.0, k=1.0):
    return msnp.sum(0.5 * k * (x-mu) ** 2)

def f(x, N=1000, size=1000000):
    ene = 0.
    start_time = time.time()
    for i in range(N):
        x_tensor = ms.Tensor(x[i*size:(i+1)*size], ms.float32)
        ene += U(x_tensor)
    end_time = time.time()
    print("The calculation time cost is: {:.3f} s".format(end_time - start_time))
    return ene.asnumpy()

x = np.random.random(1000000000)
energy = f(x)
print (energy)
# The calculation time cost is: 11.732 s
# 0.0

At least no memory errors are reported anymore, because only the relevant chunk is copied to device memory for each calculation. The next step is to use stream computation, that is, a version of the function that computes while it copies:

def f_stream(x, N=1000, size=1000000):
    ene = 0.
    s1 = ms.hal.Stream()
    s2 = ms.hal.Stream()
    start_time = time.time()
    for i in range(N):
        if i % 2 == 0:
            with ms.hal.StreamCtx(s1):
                x_tensor = ms.Tensor(x[i*size:(i+1)*size], ms.float32)
                ene += U(x_tensor)
        else:
            with ms.hal.StreamCtx(s2):
                x_tensor = ms.Tensor(x[i*size:(i+1)*size], ms.float32)
                ene += U(x_tensor)
    ms.hal.synchronize()
    end_time = time.time()
    print("The calculation with stream time cost is: {:.3f} s".format(end_time - start_time))
    return ene.asnumpy()
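
For completeness, here is a minimal sketch of driving the stream version, assuming f_stream is defined in the same script as the loop version above so that U and the imports are already in scope (energy_stream is just an illustrative name):

x = np.random.random(1000000000)
energy_stream = f_stream(x)
print(energy_stream)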

Because compiling the program also affects performance, the runs with and without streams need to be timed separately. After several tests, the runtime without streams is approximately:

The calculation time cost is: 10.925 s
41666410.0

And the runtime with streams is approximately:

The calculation with stream time cost is: 9.929 s
41666410.0

Intuitively, stream computation in MindSpore can bring some speedup, but the gain is somewhat weaker than what writing CUDA Streams directly would provide, which may have something to do with the compilation logic. Still, now that a tool like streams can be called directly from MindSpore, the framework keeps pace with many competitors of the same type.

Summary

Following the previous article on installing the MindSpore-2.4-gpu version, this article introduces some of the new features in MindSpore 2.4, such as using the hal interfaces to manage devices and streams, which in turn enables stream computation. In addition, similar to the fori_loop method in Jax, the latest version of MindSpore also supports the ForiLoop loop body, which makes loop execution more efficient and is also one of the powerful tools for end-to-end automatic differentiation.

Copyright statement

This article was first linked to:/dechinphy/p/

Author ID: DechinPhy

More original articles:/dechinphy/

Buy the blogger a coffee:/dechinphy/gallery/image/