OpenGL RHI Optimization

Preface

As Vulkan gains popularity, OpenGL is slowly being phased out. Lighter API calls save a lot of performance, especially on mobile platforms, where lower CPU overhead also means lower power consumption. That sounds perfect, but Vulkan drivers on mobile platforms still have many compatibility issues, and the mainstream practice is to enable Vulkan only on whitelisted devices, so for the time being we continue to focus on OpenGL. This article collects experience the author has accumulated while optimizing OpenGL. The engine in use is UE4, so the discussion is framed around UE4, but most of the optimizations apply generally.

Optimization

Among the many APIs, the more time-consuming operations are:

Setting the texture
Setting the buffer
Setting the uniform, uniform buffer
Setting the program
Updating textures
Updating buffers
Compiling shaders

Other APIs also have overhead, but it is either less noticeable or can mostly be avoided (for example, setting the render target). They can be optimized case by case, and general state caching handles most of them well.
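The state caching mentioned above can be sketched roughly as follows. This is a minimal illustration, not UE4's actual cache: the class name `TextureBindingCache`, the unit limit of 16, and the call counter are all assumptions made for the example, and the place where a real RHI would call `glBindTexture` is only marked with a comment.

```cpp
#include <array>
#include <cassert>
#include <cstdint>

// Hypothetical texture-binding cache: remembers what is bound to each unit
// and skips the driver call when the binding would not change.
class TextureBindingCache {
public:
    // Returns true if a real bind was issued, false if it was filtered out.
    bool Bind(std::uint32_t unit, std::uint32_t texture) {
        assert(unit < kMaxUnits);
        if (bound_[unit] == texture) {
            return false;        // redundant bind: no driver call needed
        }
        bound_[unit] = texture;
        ++driverCalls_;          // a real RHI would call glBindTexture here
        return true;
    }

    std::uint32_t DriverCalls() const { return driverCalls_; }

private:
    static constexpr std::uint32_t kMaxUnits = 16;   // assumed limit for the sketch
    std::array<std::uint32_t, kMaxUnits> bound_{};   // 0 means nothing is bound
    std::uint32_t driverCalls_ = 0;
};
```

The same pattern applies to other bind-style state (programs, buffers, blend state): compare against the cached value first and only call into the driver on a change.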

Mainstream mobile GPUs now use a TBDR architecture, and each vendor has its own strategy for reducing overdraw, such as Qualcomm's LRZ, ARM's FPK, and PowerVR's HSR. Draw calls can therefore be sorted by render state. On older devices, where these features are poorly implemented, sorting by distance still removes more overdraw. Next, we go through the high-overhead APIs listed above one by one.
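The two sorting strategies can be compared with a small sketch. The `DrawItem` fields and the helper names below are hypothetical and chosen only for illustration; an engine would normally sort on a packed key rather than a comparison lambda.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

struct DrawItem {
    std::uint16_t program;   // hypothetical program id
    std::uint16_t material;  // hypothetical texture/material set id
    float         distance;  // distance from the camera
};

// State-first order: draws sharing a program (then material) become adjacent;
// ties are broken front to back.
inline void SortByState(std::vector<DrawItem>& draws) {
    std::sort(draws.begin(), draws.end(), [](const DrawItem& a, const DrawItem& b) {
        if (a.program != b.program) return a.program < b.program;
        if (a.material != b.material) return a.material < b.material;
        return a.distance < b.distance;
    });
}

// Distance-first order: front to back, for devices with weak hidden surface removal.
inline void SortByDistance(std::vector<DrawItem>& draws) {
    std::sort(draws.begin(), draws.end(), [](const DrawItem& a, const DrawItem& b) {
        if (a.distance != b.distance) return a.distance < b.distance;
        if (a.program != b.program) return a.program < b.program;
        return a.material < b.material;
    });
}

// Counts how often the program changes while walking the list in order.
inline int CountProgramSwitches(const std::vector<DrawItem>& draws) {
    int switches = 0;
    for (std::size_t i = 1; i < draws.size(); ++i) {
        if (draws[i].program != draws[i - 1].program) ++switches;
    }
    return switches;
}
```

Counting program switches under each order on real scene data is one way to decide which strategy a given device class should use.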

Setting the texture

Pack texture channels where possible, e.g. store normals in two channels.
Merge textures into an atlas.
Merge textures with Texture2DArray.
Bind commonly used textures to fixed slots, such as shadow maps, reflection textures, and cluster shading related buffers.

SHADER_PARAMETER_TEXTURE_EX(Texture2D, DirectionalLightShadowTexture, 3)

After each draw call is set up, UE binds None to the unused texture slots. This works around certain driver issues, but it is too conservative and can be optimized away.
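For the two-channel normal packing mentioned in the list above, the third component has to be rebuilt when the texture is sampled. The article does not show how it decodes normals, so the following is only a sketch of the usual reconstruction, written in C++ for illustration (in practice this runs in the shader). It assumes a unit-length tangent-space normal with non-negative z; `Vec3` and `DecodeNormalXY` are hypothetical names.

```cpp
#include <algorithm>
#include <cmath>

struct Vec3 { float x, y, z; };

// Rebuilds a unit tangent-space normal from the two stored channels.
// r and g are raw texture values in [0, 1]; z is assumed to be non-negative.
inline Vec3 DecodeNormalXY(float r, float g) {
    const float x = r * 2.0f - 1.0f;   // remap [0, 1] to [-1, 1]
    const float y = g * 2.0f - 1.0f;
    // Clamp before the square root so rounding error cannot produce NaN.
    const float z = std::sqrt(std::max(0.0f, 1.0f - x * x - y * y));
    return {x, y, z};
}
```

The two freed channels can then carry other data, such as roughness or ambient occlusion, which removes one texture bind per draw.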

Setting the buffer

Put related buffers together where possible, such as normals and tangents.
Manage buffers as one large buffer plus offsets; this is explained in detail in the buffer update section later.
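Putting normal and tangent into the same buffer might look like the sketch below. The vertex format (four signed bytes per attribute) and the names `TangentFrameVertex` and `AttributeLayout` are assumptions for the example, not UE4's actual vertex layout.

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical interleaved tangent frame: normal and tangent share one buffer,
// so a single buffer bind serves both attributes.
struct TangentFrameVertex {
    std::int8_t normal[4];   // xyz plus padding, signed normalized
    std::int8_t tangent[4];  // xyz plus handedness sign in w
};

struct AttributeLayout {
    std::size_t offset;  // byte offset of the attribute inside one vertex
    std::size_t stride;  // byte distance between consecutive vertices
};

constexpr AttributeLayout kNormalLayout  = {offsetof(TangentFrameVertex, normal),
                                            sizeof(TangentFrameVertex)};
constexpr AttributeLayout kTangentLayout = {offsetof(TangentFrameVertex, tangent),
                                            sizeof(TangentFrameVertex)};
```

With the buffer bound once, the two layouts would supply the stride and pointer-offset arguments of two `glVertexAttribPointer` calls, instead of requiring a separate buffer bind per attribute.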

Setting the uniform, uniform buffer

Prior to 4.21, ES31 used real uniform buffers underneath. Since 4.21 an emulated uniform buffer is available: it keeps the uniform buffer interface for setting and updating data, but uses plain uniforms underneath. According to the official notes, this saves a lot of memory and improves performance.

In our tests, however, the overhead was still very high, because the number of uniforms being set grows considerably. Is there a better way? There is: to save both memory and CPU time, use a hybrid approach in which uniforms and uniform buffers coexist. Data such as View, DirectionalLight, and Shadow is updated once per frame or once every few frames and exists in small numbers, so it suits uniform buffers. Data such as Primitive exists in very large numbers and does not.

In addition, UE's emulated uniform buffer implementation does not pack the data tightly. The members a shader actually uses can be packed together at compile time, with their offsets recorded so that the runtime copies each one to its packed location.

View, pre-optimization (emulated uniform buffer):

#define View_IndirectLightingCacheShowFlag (pc0_h[11].x)
#define View_ReflectionEnvironmentRoughnessMixingScaleBiasAndLargestWeight (pc0_h[10].xyz)
#define View_HighResolutionReflectionCubemapMaxMip (pc0_h[9].x)
#define View_ReflectionCubemapMaxMip (pc0_h[8].x)
#define View_SkyLightColor (pc0_h[7].xyzw)
#define View_NormalCurvatureToRoughnessScaleBias (pc0_h[6].xyz)
#define View_IndirectLightingColorScale (pc0_h[5].xyz)
#define View_CullingSign (pc0_h[4].x)
#define View_PreExposure (pc0_h[3].x)
#define View_ViewSizeAndInvSize (pc0_h[2].xyzw)
#define View_ViewRectMin (pc0_h[1].xyzw)
#define View_PreViewTranslation (pc0_h[0].xyz)
uniform highp vec4 pc0_h[12];

View, post-optimization (uniform buffer):

layout(std140) uniform pb0
{
    vec4 Padding0[76];
    highp vec3 View_PreViewTranslation;
    float PaddingF1228_0;
    vec4 Padding1228[63];
    vec4 View_ViewRectMin;
    highp vec4 View_ViewSizeAndInvSize;
    vec4 Padding2272[4];
    float PaddingB2272_0;
    highp float View_PreExposure;
    float PaddingF2344_0;
    float PaddingF2344_1;
    vec4 Padding2344[6];
    float PaddingB2344_0;
    float PaddingB2344_1;
    float PaddingB2344_2;
    highp float View_CullingSign;
    vec4 Padding2464[13];
    highp vec3 View_IndirectLightingColorScale;
    float PaddingF2684_0;
    vec4 Padding2684[54];
    highp float View_IndirectLightingCacheShowFlag;
} View;

Primitive, pre-optimization:

#define Primitive_LightingChannelMask (pc2_u[0].x)
#define Primitive_UseSingleSampleShadowFromStationaryLights (pc2_h[1].x)
#define Primitive_InvNonUniformScaleAndDeterminantSign (pc2_h[0].xyzw)
uniform uvec4 pc2_u[1];
uniform highp vec4 pc2_h[3];

Primitive, post-optimization:

#define Primitive_PrimaryPrecomputedShadowMaskValue (pc2_h[1].z)
#define Primitive_LightingChannelMask (floatBitsToUint(pc2_h[1].y))
#define Primitive_UseSingleSampleShadowFromStationaryLights (pc2_h[1].x)
#define Primitive_InvNonUniformScaleAndDeterminantSign (pc2_h[0].xyzw)
uniform highp vec4 pc2_h[2];

You can see that View now uses a uniform buffer, while Primitive still uses plain uniforms, but its storage has shrunk from four vec4s to two.
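The compile-time packing described in this section can be sketched as a two-step process: build a copy table once per shader, then gather the used members at runtime. The types and functions below (`UsedMember`, `CopyRange`, `BuildCopyTable`, `GatherPacked`) are hypothetical and not UE4 code. The sketch assumes offsets and sizes are measured in floats and that a member may not straddle a vec4 boundary.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// One member of the full uniform struct that the shader actually reads.
struct UsedMember {
    std::size_t sourceOffset;  // offset in floats inside the full CPU-side struct
    std::size_t floatCount;    // 1 to 4 floats
};

// Copy instruction recorded at shader compile time.
struct CopyRange {
    std::size_t sourceOffset;  // in floats
    std::size_t packedOffset;  // in floats, inside the packed vec4 array
    std::size_t floatCount;
};

// Packs the used members tightly, never letting one straddle a vec4 boundary.
inline std::vector<CopyRange> BuildCopyTable(const std::vector<UsedMember>& used,
                                             std::size_t& packedVec4Count) {
    std::vector<CopyRange> table;
    std::size_t cursor = 0;  // next free float slot
    for (const UsedMember& m : used) {
        assert(m.floatCount >= 1 && m.floatCount <= 4);
        if (cursor % 4 + m.floatCount > 4) {
            cursor = (cursor / 4 + 1) * 4;  // skip to the next vec4
        }
        table.push_back({m.sourceOffset, cursor, m.floatCount});
        cursor += m.floatCount;
    }
    packedVec4Count = (cursor + 3) / 4;
    return table;
}

// Runtime step: gather only the used members into the packed array.
inline void GatherPacked(const float* source, const std::vector<CopyRange>& table,
                         float* packed) {
    for (const CopyRange& r : table) {
        std::memcpy(packed + r.packedOffset, source + r.sourceOffset,
                    r.floatCount * sizeof(float));
    }
}
```

Feeding it one four-float member followed by three scalars yields two vec4s, which matches the shape of the post-optimization Primitive layout shown above.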

Setting the Program

Minimize the number of programs. For example, some simple macro branches can be replaced with the ?: operator, and others can be turned into uniforms. The latter needs to be evaluated case by case, because it may cause register spilling and reduce shader efficiency.
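The reason this helps is that independent on/off macros multiply: each one doubles the number of program variants. A tiny sketch of the arithmetic, with `VariantCount` as a hypothetical helper that assumes fewer than 64 macros:

```cpp
#include <cstdint>

// Number of program variants produced by `macroCount` independent on/off macros.
// Assumes macroCount < 64.
constexpr std::uint64_t VariantCount(unsigned macroCount) {
    return std::uint64_t{1} << macroCount;
}
```

For example, a material with five such macros has `VariantCount(5)` = 32 possible programs. Moving two of them to a ?: expression or a uniform leaves `VariantCount(3)` = 8.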

Updating Textures

If texture streaming is enabled and there are many textures, texture updates take a lot of time. The following optimizations are worth trying:

UE uses a PBO for texture updates. This is unnecessary on mobile platforms and adds the extra overhead of uploading to the PBO.
When the RHI thread is enabled, there is also an extra copy of the texture data from the render thread to the RHI thread, which can be optimized away.
OpenGL supports multiple contexts, so texture uploads can run on a separate thread.
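The separate upload thread from the last item can be sketched as a simple job queue. This is only an outline under assumptions: `UploadThread` is a hypothetical class, and the GL-specific parts (making a shared context current on the worker, calling `glTexSubImage2D` inside each job, and fencing before the render context samples the texture) are left out and only noted in comments.

```cpp
#include <condition_variable>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>

// Hypothetical upload worker. A real RHI would make a shared GL context
// current at the top of Run() and perform the texture upload inside each job.
class UploadThread {
public:
    UploadThread() : worker_([this] { Run(); }) {}

    ~UploadThread() {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            quit_ = true;
        }
        wake_.notify_one();
        worker_.join();  // the worker drains remaining jobs before exiting
    }

    void Enqueue(std::function<void()> job) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            jobs_.push(std::move(job));
        }
        wake_.notify_one();
    }

private:
    void Run() {
        for (;;) {
            std::function<void()> job;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wake_.wait(lock, [this] { return quit_ || !jobs_.empty(); });
                if (jobs_.empty()) return;  // only reached once quit_ is set
                job = std::move(jobs_.front());
                jobs_.pop();
            }
            job();  // runs outside the lock so Enqueue never waits on an upload
        }
    }

    std::mutex mutex_;
    std::condition_variable wake_;
    std::queue<std::function<void()>> jobs_;
    bool quit_ = false;
    std::thread worker_;  // declared last so the members above exist before Run starts
};
```

The render or RHI thread then only pays for `Enqueue`, while the driver-side upload cost moves to the worker.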

Updating buffers

If there are many buffers and they are updated frequently, slightly older devices (Snapdragon 888 and below) very easily hit expensive buffer updates and stutter. We covered this in a previous article.

That article is fairly old by now, and we have since moved to a new implementation: every buffer except UAVs accesses memory through the large buffer + offset scheme, which cuts RHI overhead by 10% to 20%.

For index buffers, glDrawRangeElements and glDrawElements accept a start offset.
For texture buffers, glTexBufferRangeEXT supports an offset; this is mainly used for instance data in ISM and HISM.
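The large buffer + offset scheme needs something that hands out offsets. The article does not show its allocator, so the following is a minimal linear sub-allocator given purely as an assumption of how such a scheme could look; `BigBufferAllocator` and its interface are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical linear sub-allocator over one large GL buffer object.
// It only hands out byte offsets; the caller uploads and draws at those offsets.
class BigBufferAllocator {
public:
    static constexpr std::size_t kInvalid = static_cast<std::size_t>(-1);

    explicit BigBufferAllocator(std::size_t capacityBytes) : capacity_(capacityBytes) {}

    // Returns the byte offset of the new range, or kInvalid when it does not fit.
    // `alignment` must be a power of two.
    std::size_t Allocate(std::size_t sizeBytes, std::size_t alignment) {
        assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
        const std::size_t start = (head_ + alignment - 1) & ~(alignment - 1);
        if (start > capacity_ || sizeBytes > capacity_ - start) return kInvalid;
        head_ = start + sizeBytes;
        return start;
    }

    // Releases everything at once, e.g. when the GPU no longer reads the data.
    void Reset() { head_ = 0; }

private:
    std::size_t capacity_;
    std::size_t head_ = 0;
};
```

The returned offset is what would be passed as the index offset to glDrawElements or as the offset to glTexBufferRangeEXT. For texture buffers the alignment should respect the device's GL_TEXTURE_BUFFER_OFFSET_ALIGNMENT value.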

Shader Compilation

Shader compilation is very time-consuming. The common practice today is to collect PSOs in advance and warm them up, but full coverage is hard to achieve, and compiling directly on the RHI thread causes stutter. In that case, GL's multi-context mechanism can be reused to compile asynchronously. This introduces flicker, however, so a balance has to be struck.
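The flicker trade-off comes from skipping draws whose program is not ready yet. A single-threaded model of that bookkeeping might look like this. `AsyncProgramCache` and its methods are hypothetical; a real version would need synchronization between the RHI thread and the compile thread, and would queue an actual compile job where the comment indicates.

```cpp
#include <cstdint>
#include <unordered_map>

enum class ProgramState { Missing, Compiling, Ready };

// Hypothetical program cache. A background compile thread with its own shared
// GL context would call MarkReady once linking has finished.
class AsyncProgramCache {
public:
    // Called before a draw. Returns true if the draw can be issued now;
    // false means the draw is skipped this frame, which is the source of flicker.
    bool TryUse(std::uint64_t key) {
        ProgramState& state = states_[key];   // new keys start as Missing
        if (state == ProgramState::Missing) {
            state = ProgramState::Compiling;  // real code would queue the compile job here
            ++queuedCompiles_;
        }
        return state == ProgramState::Ready;
    }

    void MarkReady(std::uint64_t key) { states_[key] = ProgramState::Ready; }

    int QueuedCompiles() const { return queuedCompiles_; }

private:
    std::unordered_map<std::uint64_t, ProgramState> states_;
    int queuedCompiles_ = 0;
};
```

One way to strike the balance is to skip only draws that are visually tolerable to lose for a few frames and to fall back to a blocking compile for the rest.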

Summary

The above lists some high-overhead OpenGL functions and targeted optimizations for them; other APIs can be optimized through caching and similar mechanisms. Once the ideas above are applied, GL performance should improve noticeably, with lower power consumption.