__m128i Index = _mm_setzero_si128();
// _mm_blendv_epi8 takes three arguments; the mask comes from comparing each neighbor with 255
Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(1)), _mm_cmpeq_epi8(P0, _mm_set1_epi8(255)));
Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(2)), _mm_cmpeq_epi8(P1, _mm_set1_epi8(255)));
Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(4)), _mm_cmpeq_epi8(P2, _mm_set1_epi8(255)));
Index = _mm_blendv_epi8(Index, _mm_add_epi8(Index, _mm_set1_epi8(8)), _mm_cmpeq_epi8(P3, _mm_set1_epi8(255)));
_mm_storeu_si128((__m128i*)(LinePD + X), _mm_shuffle_epi8(Lut128, Index));
Taking another look at this quirk of the C language, we can also rewrite the C code above as:
int Index = 0;
Index += (1 & P0);
Index += (2 & P1);
Index += (4 & P2);
Index += (8 & P3);
LinePD[X] = Lut[Index];
In other words, a bitwise AND can replace the comparison directly, so the corresponding SSE code becomes:
__m128i Index = _mm_setzero_si128();
Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(1), P0));
Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(2), P1));
Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(4), P2));
Index = _mm_add_epi8(Index, _mm_and_si128(_mm_set1_epi8(8), P3));
// You can implement a lookup table directly with shuffle
_mm_storeu_si128((__m128i*)(LinePD + X), _mm_shuffle_epi8(Lut128, Index));
This version is also about 30% faster than the original SSE code.
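As a quick sanity check of the bitwise rewrite, the sketch below compares a branching form with the masked form for all 16 combinations of four neighbor values. It assumes every P value is exactly 0 or 255, and the branching form is only my guess at the shape of the comparison-based code; the function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Branching form (assumed shape of the comparison-based scalar code). */
static int index_branch(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3)
{
    int index = 0;
    if (p0 == 255) index += 1;
    if (p1 == 255) index += 2;
    if (p2 == 255) index += 4;
    if (p3 == 255) index += 8;
    return index;
}

/* Branch-free form: 255 has every bit set, so (w & p) is w when p == 255
   and 0 when p == 0. This only holds for inputs that are 0 or 255. */
static int index_mask(uint8_t p0, uint8_t p1, uint8_t p2, uint8_t p3)
{
    return (1 & p0) + (2 & p1) + (4 & p2) + (8 & p3);
}

/* Returns 1 if both forms agree on all 16 input combinations. */
static int forms_agree(void)
{
    for (int bits = 0; bits < 16; bits++) {
        uint8_t p0 = (bits & 1) ? 255 : 0;
        uint8_t p1 = (bits & 2) ? 255 : 0;
        uint8_t p2 = (bits & 4) ? 255 : 0;
        uint8_t p3 = (bits & 8) ? 255 : 0;
        int a = index_branch(p0, p1, p2, p3);
        int b = index_mask(p0, p1, p2, p3);
        if (a != b || b != bits)
            return 0;
    }
    return 1;
}
```

If the image could contain values other than 0 and 255, the masked form would no longer be equivalent and the comparison would have to stay.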
A 512-entry table is a different matter. The index can exceed 255, so the accumulated index no longer fits in a byte and needs at least a ushort. However, the sum over the first 8 positions (weights 1 through 128) never exceeds 255, so those 8 positions can still be accumulated in bytes, 16 pixels at a time. Only for the last position (weight 256) do we widen to 16 bits; two 16-bit instructions, each handling 8 values, then produce the table indices for all 16 pixels at once.
With a 512-entry table, however, the lookup itself can no longer be done in SSE, because _mm_shuffle_epi8 can only index a 16-byte table. The index values have to be extracted one by one and the lookup performed in ordinary C. If the CPU supports AVX2, the table can be looked up directly with AVX2, but that requires a lot of modification work.
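The two steps described above (byte accumulation for the first 8 positions, then widening for the ninth, followed by a scalar lookup) can be sketched as follows using only SSE2 intrinsics. The function name lookup512_16px and the flat 9*16 input layout, where row k holds the k-th neighbor of 16 consecutive pixels and every value is 0 or 255, are assumptions made for illustration and not the layout of the actual code.

```c
#include <assert.h>
#include <emmintrin.h> /* SSE2 */
#include <stdint.h>

/* p:   9 rows of 16 bytes; row k is neighbor k (weight 1 << k), values 0 or 255.
   lut: 512-entry table. dst: 16 output bytes. */
static void lookup512_16px(const uint8_t *p, const uint8_t lut[512], uint8_t dst[16])
{
    /* Weights 1..128 sum to at most 255, so bytes are enough here. */
    __m128i low = _mm_setzero_si128();
    for (int k = 0; k < 8; k++) {
        __m128i pk = _mm_loadu_si128((const __m128i *)(p + 16 * k));
        low = _mm_add_epi8(low, _mm_and_si128(_mm_set1_epi8((char)(1 << k)), pk));
    }

    /* Widen to 16 bits and add the weight-256 position.
       Interleaving p8 with itself turns 255 into 0xFFFF and 0 into 0. */
    __m128i zero = _mm_setzero_si128();
    __m128i p8 = _mm_loadu_si128((const __m128i *)(p + 16 * 8));
    __m128i w256 = _mm_set1_epi16(256);
    __m128i idx_lo = _mm_add_epi16(_mm_unpacklo_epi8(low, zero),
                                   _mm_and_si128(w256, _mm_unpacklo_epi8(p8, p8)));
    __m128i idx_hi = _mm_add_epi16(_mm_unpackhi_epi8(low, zero),
                                   _mm_and_si128(w256, _mm_unpackhi_epi8(p8, p8)));

    /* No 512-entry shuffle exists in SSE: extract the indices and look up in C. */
    uint16_t idx[16];
    _mm_storeu_si128((__m128i *)&idx[0], idx_lo);
    _mm_storeu_si128((__m128i *)&idx[8], idx_hi);
    for (int i = 0; i < 16; i++)
        dst[i] = lut[idx[i]];
}

/* Self-check against known indices; returns 1 on success. */
static int check_lookup512(void)
{
    uint8_t p[9 * 16], lut[512], dst[16];
    int bits[16];
    for (int i = 0; i < 512; i++)
        lut[i] = (uint8_t)(i ^ ((i >> 8) * 0x5A)); /* differs between i and i + 256 */
    for (int i = 0; i < 16; i++) {
        bits[i] = (511 + i * 37) & 511;             /* pixel 0 gets index 511 */
        for (int k = 0; k < 9; k++)
            p[16 * k + i] = ((bits[i] >> k) & 1) ? 255 : 0;
    }
    lookup512_16px(p, lut, dst);
    for (int i = 0; i < 16; i++)
        if (dst[i] != lut[bits[i]])
            return 0;
    return 1;
}
```

Only the final loop is scalar; everything up to the index extraction still handles 16 pixels per iteration.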
In our tests with a 512-entry table, the optimized SSE code takes less than 2 ms to process a 3000*2000 binary image, which is still quite fast.
Searching the MATLAB code shows that, besides bweuler and bwarea, bwmorph also makes heavy use of bwlookup, always with 512-entry tables, that is, with a 3*3 neighborhood. The bwmorph help file lists many related operations:
The corresponding lookup tables can be found in the files whose names begin with lut in the MATLAB\R2023b\toolbox\images\images\+images\+internal directory. For example, the table for clean has the following contents:
These tables are designed ahead of time and can then be applied as many times as needed to obtain a particular effect.
In general, the result of such a 3*3 neighborhood operator stops changing after a certain number of iterations.
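To illustrate both points (a table prepared in advance, and iteration until nothing changes), here is a minimal scalar sketch. The rule, the bit ordering (bit k = row-major position in the 3*3 window, center = bit 4), the fixed 8*8 image size, and all function names are assumptions of mine; in particular this is not MATLAB's index ordering nor one of its lut files.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

enum { W = 8, H = 8 };

/* Hypothetical rule: a foreground pixel survives only if at least two of
   its 8 neighbors are foreground. Built once, reused for every pass. */
static void build_lut(uint8_t lut[512])
{
    for (int idx = 0; idx < 512; idx++) {
        int center = (idx >> 4) & 1;
        int neighbors = 0;
        for (int k = 0; k < 9; k++)
            if (k != 4)
                neighbors += (idx >> k) & 1;
        lut[idx] = (center && neighbors >= 2) ? 255 : 0;
    }
}

/* One pass over the image; pixels outside the image count as background.
   Returns the number of pixels whose value changed. */
static int apply_lut(const uint8_t *src, uint8_t *dst, const uint8_t lut[512])
{
    int changed = 0;
    for (int y = 0; y < H; y++) {
        for (int x = 0; x < W; x++) {
            int idx = 0, k = 0;
            for (int dy = -1; dy <= 1; dy++) {
                for (int dx = -1; dx <= 1; dx++, k++) {
                    int yy = y + dy, xx = x + dx;
                    if (yy >= 0 && yy < H && xx >= 0 && xx < W && src[yy * W + xx])
                        idx |= 1 << k;
                }
            }
            dst[y * W + x] = lut[idx];
            changed += dst[y * W + x] != src[y * W + x];
        }
    }
    return changed;
}

/* Repeats the pass until the image is stable; returns the number of passes
   that changed something. This rule only removes pixels, so it terminates. */
static int iterate_until_stable(uint8_t *img, const uint8_t lut[512])
{
    uint8_t tmp[W * H];
    int passes = 0;
    while (apply_lut(img, tmp, lut) > 0) {
        memcpy(img, tmp, sizeof tmp);
        passes++;
    }
    return passes;
}

/* Demo: a 2x2 block plus a 5-pixel horizontal line. */
static int demo(int *remaining)
{
    uint8_t lut[512], img[W * H];
    build_lut(lut);
    memset(img, 0, sizeof img);
    img[1 * W + 1] = img[1 * W + 2] = img[2 * W + 1] = img[2 * W + 2] = 255;
    for (int x = 1; x <= 5; x++)
        img[5 * W + x] = 255;
    int passes = iterate_until_stable(img, lut);
    *remaining = 0;
    for (int i = 0; i < W * H; i++)
        *remaining += img[i] != 0;
    return passes;
}
```

In this demo the line is eaten away from both ends over three passes while the 2x2 block never changes, so the fourth pass reports no change and the loop stops with four foreground pixels left.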
bwmorph also contains some common operators: remove extracts the boundary of a binary image, and majority smooths out noise (although its effect still differs from the majority operation described on my blog).
[Figure panels: original image, remove, majority, fatten, skeleton, thin]
Personally, I find this method very helpful for handling more complex neighborhood information, especially cases that are awkward to write out one by one as conditionals. If the table can be prepared in advance, efficiency improves greatly.
This algorithm is also integrated in my SSE Demo, under Binary (Binary Processing) --> Processing (Post-Processing) --> Morph (Morphology), for those who are interested.