SIMD with C Intrinsics and Inline Assembler – Lab5 SPO600

Welcome to my blog. This post will be about using SIMD with C. SIMD is an acronym for Single Instruction Multiple Data. SIMD is a form of vectorization where the computer performs the same operation on multiple data points simultaneously.

What I did for this lab

In this lab, I look at three different approaches to SIMD, using the same code as lab 4: a simple program that changes the volume of a sound file. I am working on an AArch64 machine, so the inline assembler and intrinsics are for the AArch64 architecture.

Part 1:

For part 1, I let the GCC compiler do the vectorization for me using the compiler flag -O3. To confirm that the vectorization was successful, I also enabled the flag -fopt-info-vec-all, which reports which loops were vectorized and which were not.
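For reference, here is a minimal sketch of the kind of loop the compiler is being asked to vectorize. The function name scale_samples and the float-based scaling are assumptions made for illustration, not the exact lab 4 code. The build command in the comment uses the two flags mentioned above.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for the lab 4 scaling loop: multiply every
 * 16-bit sample by a volume factor between 0.0 and 1.0.
 *
 * Build with vectorization and a vectorizer report, for example:
 *   gcc -O3 -fopt-info-vec-all -c scale.c
 */
void scale_samples(int16_t *data, size_t n, float volume)
{
    for (size_t x = 0; x < n; x++) {
        /* Each iteration only touches data[x], so no iteration depends
         * on another. That independence is what makes the loop a
         * candidate for auto-vectorization. */
        data[x] = (int16_t)(data[x] * volume);
    }
}
```

Calling scale_samples on a buffer with a volume of 0.75f turns a sample of 1000 into 750.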

I was tasked with vectorizing the last loop, which executes the following line of code.

ttl = (ttl + data[x]) % 1000;

This line sums all of the samples. The total is used to confirm that the algorithm for scaling the samples is correct and does not give a different result than the original.

Written this way, the sum cannot be auto-vectorized. Each iteration applies the modulus to the running total, so it needs the exact value produced by the previous iteration. To allow auto-vectorization, I removed that dependence by applying the modulus to each sample instead, which leaves a plain running sum, and changed that line of code to the following.

ttl += data[x] % 1000;
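To see what the change does, here is a small sketch that puts both forms of the line side by side. The function names and the int type of ttl are assumptions made for illustration; only the two loop bodies come from the lab code.

```c
#include <stddef.h>
#include <stdint.h>

/* Original form: the modulus is applied to the running total, so every
 * iteration needs the exact result of the one before it. */
int sum_running_mod(const int16_t *data, size_t n)
{
    int ttl = 0;
    for (size_t x = 0; x < n; x++)
        ttl = (ttl + data[x]) % 1000;
    return ttl;
}

/* Rewritten form: the modulus is applied to each sample, and the total
 * is a plain sum that the compiler can split across vector lanes. */
int sum_per_sample_mod(const int16_t *data, size_t n)
{
    int ttl = 0;
    for (size_t x = 0; x < n; x++)
        ttl += data[x] % 1000;
    return ttl;
}
```

For the two samples {999, 999}, the first function returns 998 and the second returns 1998, so totals from the two forms should not be compared with each other directly.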

Part 2:

For this part of the lab, I first practiced writing inline assembler on another straightforward program. All this program did was apply the modulus operator to some variables and print the result, and my job was to replace the modulus operation with inline assembler.

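#include <stdio.h>

int main() {
    int a = 3;
    int b = 19;
    int c;
    // c = b / a (unsigned divide); %w selects the 32-bit register names
    __asm__("udiv %w0, %w1, %w2" : "=r"(c) : "r"(b), "r"(a) );
    // c = b - (c * a), which is the remainder of b / a
    __asm__("msub %w0, %w1, %w2, %w3" : "=r"(c) : "r"(c), "r"(a), "r"(b) );
    printf("%d\n", c);
    return 0;
}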

Now back to the sound scaling program. My professor gave me a version of the sound scaling program that contained the inline assembler. Its comments also included questions about the code, each marked with a Q:. My task for this part of the lab was to answer the questions.

The following are the questions and answers to part 2.

Question 1 Code:
register int16_t* cursor asm("r20"); // input cursor
register int16_t vol_int asm("r22"); // volume as int16_t

Question 1:
These variables will be used in our assembler code, so we’re going to hand-allocate which register they are placed in. What is an alternate approach?

Answer:
Don’t hand-allocate the registers. Instead, declare ordinary variables and pass them to the asm statement as operands, so the compiler decides which registers they are assigned.
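As a rough sketch of that alternate approach, the function below passes the cursor and volume to the assembler as named operands and leaves the register choice to the compiler. The function name scale_block, the operand names, and the scalar fallback for non-AArch64 machines are my own assumptions for illustration; they are not part of the professor's code.

```c
#include <stdint.h>

/* Scale eight 16-bit samples in place by a fixed-point volume factor.
 * No register is hand-allocated: the "r" constraints let the compiler
 * pick registers for cur and vol. */
void scale_block(int16_t *cursor, int16_t vol_int)
{
#if defined(__aarch64__)
    __asm__ volatile(
        "dup     v1.8h, %w[vol]        \n\t" /* copy volume into all 8 lanes */
        "ldr     q0, [%[cur]]          \n\t" /* load 8 samples */
        "sqdmulh v0.8h, v0.8h, v1.8h   \n\t" /* fixed-point multiply */
        "str     q0, [%[cur]]          \n\t" /* store 8 results */
        :                                    /* no register outputs */
        : [cur] "r"(cursor), [vol] "r"(vol_int)
        : "v0", "v1", "memory");             /* vector registers and memory we touch */
#else
    /* Assumed scalar equivalent so the sketch also builds elsewhere.
     * Relies on arithmetic right shift of negative values (GCC, Clang). */
    for (int i = 0; i < 8; i++) {
        int32_t product = (int32_t)cursor[i] * (int32_t)vol_int;
        cursor[i] = (product == 0x40000000) ? INT16_MAX
                                            : (int16_t)(product >> 15);
    }
#endif
}
```

Unlike the lab version, this sketch does not advance the cursor inside the assembler, so the caller would add 8 to the pointer after each call.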

Question 2 Code:
vol_int = (int16_t) (0.75 * 32767.0);

Question 2:
Should we use 32767 or 32768 in this line of code? Why?

Answer:
We should use 32767, since it is the largest value an int16_t can hold: 15 bits store the magnitude and one bit stores the sign. 32768 does not fit in an int16_t, so a volume factor of 1.0 would overflow.
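The arithmetic can be checked with a small sketch. The helper name volume_to_fixed is hypothetical; its body mirrors the line from the lab code.

```c
#include <stdint.h>

/* Convert a volume factor in the range 0.0 to 1.0 into a signed 16-bit
 * fixed-point value, scaling by INT16_MAX (32767). */
int16_t volume_to_fixed(double volume)
{
    return (int16_t)(volume * 32767.0);
}
```

A volume of 0.75 becomes 24575, and a volume of 1.0 becomes 32767, which is exactly INT16_MAX. With 32768.0 as the scale, a volume of 1.0 would produce 32768, one more than an int16_t can represent.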

Question 3 Code:
asm ("dup v1.8h,%w0"::"r"(vol_int));

Question 3:
What does it mean to “duplicate” values in the next line?

Answer:
It means the value of vol_int is copied into every lane of vector register v1. After this instruction, each of the eight 16-bit lanes of v1 contains the value of vol_int.

Question 4 Code:
asm (
"ldr q0, [%[cursor]], #0 \n\t"
"sqdmulh v0.8h, v0.8h, v1.8h \n\t"
"str q0, [%[cursor]],#16 \n\t"
: [cursor]"+r"(cursor)
: "r"(cursor)
: "memory"
);

Question 4:
Why is #16 included in the str line but not in the ldr line?

Answer:
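The ldr uses a post-index of #0, so it loads from the cursor without moving it; we still need the same address to store the results. The str then uses a post-index of #16, which advances the cursor by 16 bytes (eight 16-bit samples) after the store.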

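Question 5 Code:
asm("…"
: [cursor]"+r"(cursor)
: "r"(cursor)
: "memory"
);

Question 5:
What do these next three lines do?

Answer:
The line after the first colon lists the output operands; here cursor is a read-write ("+r") operand, because the assembler both uses and updates it. The line after the second colon lists the input operands. The line after the third colon lists the clobbers. The word memory in the clobber list tells the compiler that this inline assembler reads or writes memory that is not listed in the operands, so it must not keep memory values cached in registers across the asm statement.

Question 6: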
Are the results of this program usable? Are they correct?

Answer:
No. Comparing against the original program, the inline assembler version produces 930, while the original version of the code produced 94. I believe this is because the inline assembler version uses a fixed-point representation of 0.75 for its calculations.
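Here is a sketch of how fixed-point scaling can drift from a floating-point result on a single sample. The function sqdmulh_lane is my own scalar model of one lane of the sqdmulh instruction, and scale_float is a hypothetical stand-in for the original program's scaling; neither comes from the lab files.

```c
#include <stdint.h>

/* Scalar model of one lane of SQDMULH: (2 * a * b) >> 16, saturated.
 * Relies on arithmetic right shift of negative values (GCC, Clang). */
int16_t sqdmulh_lane(int16_t a, int16_t b)
{
    int32_t product = (int32_t)a * (int32_t)b;
    if (product == 0x40000000)       /* only when a == b == -32768 */
        return INT16_MAX;            /* doubling would overflow: saturate */
    return (int16_t)(product >> 15); /* same as (2 * product) >> 16 */
}

/* Assumed floating-point scaling for comparison. */
int16_t scale_float(int16_t sample, float volume)
{
    return (int16_t)(sample * volume);
}
```

With a volume of 0.75 stored as 24575, a sample of 1000 scales to 749 in fixed point but to 750 in floating point. Small per-sample differences like this add up across the whole buffer and change the final total.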

Performance Analysis of Part 2

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

As the results above show, the inline assembler version is slightly slower in real and user time but consistently faster in sys time. Each difference is only a few hundredths of a second.

Part 3:

In this part of the lab, I started with the completed code for the sound scaling program that uses intrinsics. As in part 2 of this lab, it contained comments with questions about the code for me to answer.

The following are the questions and answers to part 3.

Question 1 Code:
vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));

Question 1:
What do these intrinsic functions do?

Answer:
The intrinsic function “vst1q_s16” stores a single vector into memory. In this case, we are storing the results of the multiplication.

The intrinsic function “vqdmulhq_s16” stands for “vector saturating doubling multiply high.” It multiplies the corresponding lanes of the two vectors passed as parameters, doubles each product, and keeps the high half of each result, saturating on overflow.

The intrinsic function “vld1q_s16” will load a single vector from memory.

The intrinsic function “vdupq_n_s16” loads all lanes of a vector with the same value.
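Put together, the four intrinsics form a loop along these lines. This is only a sketch: the function name scale_q15, the handling of leftover samples, and the scalar path for non-AArch64 machines are assumptions of mine, not part of the lab code.

```c
#include <stddef.h>
#include <stdint.h>

#if defined(__aarch64__)
#include <arm_neon.h>
#endif

/* Scale n samples in place by a fixed-point volume factor. */
void scale_q15(int16_t *data, size_t n, int16_t vol_int)
{
    int16_t *cursor = data;
    int16_t *limit = data + n;

#if defined(__aarch64__)
    /* Vector path: load 8 samples, multiply by the duplicated volume,
     * store the 8 results, then move on to the next 8. */
    while (limit - cursor >= 8) {
        vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));
        cursor += 8;
    }
#endif

    /* Scalar path: leftover samples, or every sample on other machines.
     * Relies on arithmetic right shift of negative values (GCC, Clang). */
    while (cursor < limit) {
        int32_t product = (int32_t)*cursor * (int32_t)vol_int;
        *cursor = (product == 0x40000000) ? INT16_MAX
                                          : (int16_t)(product >> 15);
        cursor++;
    }
}
```

The scalar loop at the end means the sample count does not have to be a multiple of eight.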

Question 2 Code:
cursor += 8;

Question 2:
Why is the increment 8 instead of 16 or some other value?

Answer:
Since the samples are int16_t, a 128-bit vector holds eight lanes. The cursor is an int16_t pointer, so adding 8 moves it forward by eight samples (16 bytes), to the next set of eight values that we will calculate.
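The relationship between 8 elements and 16 bytes can be checked with ordinary pointer arithmetic. The helper name bytes_advanced is hypothetical and exists only for this sketch.

```c
#include <stddef.h>
#include <stdint.h>

/* Return how many bytes an int16_t pointer moves when it is advanced
 * by the given number of elements. */
ptrdiff_t bytes_advanced(int16_t *cursor, ptrdiff_t elements)
{
    int16_t *next = cursor + elements;      /* counts in int16_t units */
    return (char *)next - (char *)cursor;   /* counts in bytes */
}
```

For an eight-element buffer, bytes_advanced(buffer, 8) gives 16, which matches the #16 post-index used in the inline assembler version.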

Question 3 Code:
cursor += 8;

Question 3:
Why is this line not needed in the inline assembler version of this program?

Answer:
The cursor is incremented inside the inline assembler, by the post-index #16 on the str instruction, so we don’t need an extra increment step.

Question 4:
Are the results usable? Are they accurate?

Answer:
As in part 2, the results are different from the original, but they are the same as the inline assembler results.

Performance Analysis of Part 3

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

INTRINSICS CODE
real 0m4.768s
user 0m4.589s
sys 0m0.170s

The results are all very similar, and I have run the test multiple times. The auto-vectorized code seems to be the quickest most of the time, and the intrinsics and inline assembler versions are about the same.

CODE DOWNLOAD

Download lab files