Analysis of FLAC – Stage Two of Three – SPO600 Project

Hello and welcome to my blog!

In this blog post, I will be writing about my final project for my software portability and optimization class SPO 600. In this post, I will review the work I have done with the FLAC project. If you didn’t read the first post in this series, you could find it here. A quick review of the last post, for my SPO 600 class, I got tasked with finding and optimizing an open-source project. Knowing I wanted to work with audio, I found the open-source project called FLAC (Free Lossless Audio Codec). On further investigation, I made a plan on how to optimize the FLAC library. In this blog post, I will go over the implementation and results of my optimization.

Execution of my plan:

In the previous blog post, I laid out a strategy for the completion of my optimization. It turned out to be quite helpful. I was able to follow it, and I completed the optimization just as planned. Below is the strategy I made and some notes on how I completed each task.

  1. Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
    Through reading the FLAC code, I was able to determine that the pre-processor directive used in this project really on global variables defined by the configure script. After reading the “configure.ac” file, online AutoTools documentation and the “configure.ac” file from the OPUS project, which is also created by Xiph, the same people who created the FLAC project. I was able to determine how to check if the user has an aarch64 CPU and if the intrinsic library arm-neon is available. After confirming that I am on an aarch64 machine, I define “FLAC__CPU_AARCH64,” and after confirming arm-neon intrinsic are possible, I define “FLAC__HAS_NEONINTRIN.”
  2. Test the pre-processor directives with some code that will cause a fault, so I know it is working.
    Since the FLAC project has some optimizations for other platforms, I needed to follow the same pattern as the previous people. To test the preprocessor directives, I had to add the new architecture to the function selection logic. To add the new architecture, I added code to “src/libFLAC/cpu.c”, “src/libFLAC/include/private/cpu.h”, “src/libFLAC/include/private/lpc.h” and “src/libFLAC/stream_encoder.c.” Once I had the architecture function selection logic done, I was ready to test. I did not end up needing to cause any faults to confirm that the preprocessor directives were working. Instead, I used a printf statement and a copy of the vanilla C code version of the auto-correlation function. In running the program, I was able to see the printf messages, and I also used perf to confirm I was using the new function.
  3. Examine the codebase to know where precisely I need to put the pre-processor directives. And check if I need to mess with the build instructions.
    I ended up doing this in step in steps one and two since the FLAC project didn’t use pre-defined variables for the pre-processor directives. Instead, the FLAC project uses variables defined by the configure script. And I had to examine the codebase and implement the changes to run a test.
  4. Configure the makefile to build the new file that I am adding.
    I added one file to the project called lpc_intrin_neon.c. So for the compiler to build it, I put it into the list of source files inside “src/libFLAC/Makefile.am.”
  5. I am going to focus on the “FLAC__lpc_compute_autocorrelation” function, and I am going to translate it into aarch64 intrinsic’s. I will use the existing c and x86 intrinsic code to help me with the translation.
    Success! It took some time, but I was able to translate the x86 code into aarch64. The way I did this was by using the Intel and arm-neon online documentation. I also got help by googling a specific x86 intrinsic and asking what arm-neon instruction does the same thing or similar. For a few intrinsics, there was no direct replacement. Specifically, there is no shuffle in arm-neon, so I had to read up on how shuffle worked on x86 and execute that using multiple arm-neon instruction. I ended up creating inline functions for the shuffles to make writing the code more manageable and cleaner.
  6. Testing my optimization, I will re-run the test that I performed on the original code with my optimized version and see if I have improved the performance on the aarch64 platform.
    I tested with two aarch64 machines. The first machine has a faster single-thread performance with 8 threads. The second machine has 24 threads but slower single-thread performance. On the first machine, initially, the autocorrelation function took 26.11 percent of the runtime. After the optimizations, the autocorrelation function took 12.64 percent of the time. On the second machine, initially, the autocorrelation function took 52.41 percent of the runtime. After the optimizations, the autocorrelation function took 14.78 percent of the time. I also tested the optimizations on an x86 machine to confirm that the changes did not affect that architecture.
  7. As a stretch goal, depending on how hard it is to write the Aarch64 intrinsics, I would like to translate the full “ipc.c” file with aarch64 intrinsics.
    I didn’t end up translating the full “ipc.c” file, but I did translate all versions of the autocorrelation function. There are four versions of the autocorrelation function. Depending on how much lag it will call the correct version, either lag 4, lag 8, lag 12 or lag 16.

Full Results:

The following results are not averaged, but I did run theses test multiple times with similar results. The numbers below are from a few of the many tests I performed.

Aaarch64 Machine 1:

TOTAL RUNTIME OF THE TEST BEFORE OPTIMIZATION:

real    0m51.784s
user 0m49.356s
sys 0m2.349s

TOTAL RUNTIME OF THE TEST AFTER OPTIMIZATION:

real    0m43.503s
user 0m40.950s
sys 0m2.470s

PERF REPORT BEFORE OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 208K of event 'cycles:u'
 Event count (approx.): 98509947650
 #
 Overhead  Command   Shared Object           Symbol
 ……..  ……..  ………………….  ………………………………………………………………………………..
 #
     26.11%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_autocorrelation
     25.54%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__fixed_compute_best_predictor_wide
     11.35%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__bitwriter_write_rice_signed_block
      9.45%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Transform
      5.95%  lt-flac   lt-flac                 [.] format_input
      5.60%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      3.42%  lt-flac   libFLAC.so.8.3.0        [.] precompute_partition_info_sums_
      2.34%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Accumulate
      2.21%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__crc16

PERF REPORT AFTER OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 175K of event 'cycles:u'
 Event count (approx.): 81871492155
 #
 Overhead  Command  Shared Object       Symbol
 ……..  …….  ………………  ………………………………………………………………………………..
 #
     30.58%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     13.36%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     12.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_neon_lag_12
     11.71%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
      7.16%  lt-flac  lt-flac             [.] format_input
      5.18%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      4.16%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
      3.00%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate
      2.62%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__crc16

Aaarch64 Machine 2:

TOTAL RUNTIME OF THE TEST BEFORE OPTIMIZATION:

real    3m43.841s
user    3m33.558s
sys     0m8.791s

TOTAL RUNTIME OF THE TEST AFTER OPTIMIZATION:

real    2m3.675s
user    1m54.260s
sys     0m8.588s

PERF REPORT BEFORE OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 901K of event 'cycles:uppp'
 Event count (approx.): 213328075836
 #
 Overhead  Command  Shared Object     Symbol
 ……..  …….  …………….  ………………………………………………………………………………..
 #
     52.41%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_autocorrelation
     11.36%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_best_predictor_wide
      6.62%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__bitwriter_write_rice_signed_block
      5.80%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Transform
      4.35%  lt-flac  lt-flac           [.] format_input
      4.05%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      2.69%  lt-flac  libFLAC.so.8.3.0  [.] precompute_partition_info_sums_
      2.52%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Accumulate
      2.10%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_residual

PERF REPORT AFTER OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 620K of event 'cycles:uppp'
 Event count (approx.): 144725968757
 #
 Overhead  Command  Shared Object     Symbol
 ……..  …….  …………….  ………………………………………………………………………………..
 #
     15.03%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_best_predictor_wide
     14.78%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_autocorrelation_intrin_neon_lag_12
     10.06%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__bitwriter_write_rice_signed_block
     10.03%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      9.14%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_window_data
      9.09%  lt-flac  libFLAC.so.8.3.0  [.] precompute_partition_info_sums_
      8.07%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Transform
      6.14%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_residual
      5.68%  lt-flac  lt-flac           [.] format_input

Code Changes:

On GitHub, I created a pull request inside my fork of FLAC HERE. With this pull request, you will be able to see exactly all the code changes I made to the FLAC project.

Analysis of FLAC – Stage One of Three – SPO600 Project

Hello and welcome to my blog!

In this blog post, I will be writing about my final project for SPO600. The goal of this project is to optimize an open-source library. To complete the project, I had to choose one of the following tasks: alter build options, code changes to permit better optimization by the compiler, algorithm improvements or In-line assembler.

What I did to find an Open-source project:

For a long time, I have been working with audio as a hobby. I actively produced music, mix live bands, calibrate PA speakers and play drums. So, after completing the lab about changing the volume of sound, I knew I wanted to work with audio. So, I started at an open-source project that I knew about that works with audio called Audacity. Audacity is an open-source DAW (Digital Audio Workstation.) I looked around the Audacity project source for a bit then decided to dig into its dependencies. I got the idea of looking at the dependencies from my professor. He suggested to everyone that a library would be a great place to find an opportunity to optimize some code. In looking at the Audacity dependencies, I found the library called FLAC by Xiph. FLAC is an acronym that means Free Lossless Audio Codec. Similar to what I did with Audacity, I started looking at the source code. The way I navigated the source code was by using the search on GitHub and the Find command on bash. I was looking for architecture-specific code, so my search’s were like “x86” or “aarch64”. Between searching for keywords and browsing the folders, I found a file that was called “cpu.c” in “src/libFLAC,” and inside, I was able to determine that the FLAC project has not yet been optimized for Aarch64. The way I discovered that the FLAC project has yet to be optimized for Aarch64 was by looking at the compiler preprocessing directives. Inside that file, I could see that this project has optimizations for the x86, IA32 and PPC architectures, but not for Aarch64. After learning that the FLAC project has not been optimized for Aarch64, I submitted an Issue on GitHub. Here is a link to the issue I created: Issue #156. Inside the issue, I inquired to the maintainers of the FLAC project if they were open to me, adding some optimizations for Aarch64. One of the maintainers of the FLAC project responded and is open to me, adding Aarch64 support to the FLAC project. Now that I got approval to work on this repo, I could now start on my benchmarking.

Benchmarking the FLAC project:

Step one to benchmarking the FLAC project was building the projects on both x86 and Aarch64. I downloaded the source code from the FLAC GitHub, so the first step to building the library was to run the “autogen.sh” script. Which created my “./configure” script. I then ran the configure script with the “-pg” command to allow gprof, so my configuration command was “./configure CFLAGS=”-g -pg -O2″ CXXFLAGS=”-g -pg -O2″.” I was then able to use “make -j” to build the code. Unfortunately, I could not get gporf working with the main FLAC binary. It would only send garbage data to the “gprof.out” file. Though I did get gprof working with some of the tests that were included with the source code, so I know the “-pg” worked. I ended up switching to perf for my profiling. I ran Make clean and then re-ran Configure so I could use perf. “./configure CFLAGS=”-g -O2″ CXXFLAGS=”-g -O2″ .” After doing that, I grabbed my test data and ran perf record. The command I used to test with is “src/flac/flac input.wav” this command runs FLAC and passes a wave file to it. For my sizable test data, I took one of the live multitrack recordings that I had on my computer and exported the mix to a stereo wave file. The test wave file is 1 hour and 31 minutes long and is 1.57 gigabytes.

Using Perf:

Perf did a fantastic job of helping me find what I want to optimize for this project. The part of FLAC that I tested was the encoding, specifically “.wav” encoding to “.flac.” I ran the same test file on the two different architectures, and the performance difference was noticeable instantly. On the x86 machine, it will run the encoding in 16.439s. And the Aarch64 machine took 4minutes and 39.585s. (These times are from one of the tests I did. I ran the test multiple times with similar results.) I then took a look at the perf report.

Perf Report:

In analyzing the perf reports, I was able to narrow down where I should target my optimizations. You can see from the following snippets of the perf report that on the Aarch64 architecture, the function called “FLAC__lpc_compute_autocorrelation” is taking about 49.64% of the run time. Versus, on the x86 report it uses an intrinsic version of that function called “FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new” which significantly improves the performance. On the x86 machine, this function only took 7.25% of the run time. The vanilla c code in which the Aarch64 machine is running is located inside the file “ipc.c.” The Intrinsic code the x86 machine is running is located inside the file “ipc_intin_sse.c.” These files are located in the “src/libFLAC” folder.

aarch64
# Samples: 1M of event 'cycles:uppp'
# Event count (approx.): 255896905943
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ............................................................................................
    49.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation
     8.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     7.08%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
     5.65%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     5.19%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_window_data
     5.14%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
     4.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform


x86
# Samples: 61K of event 'cycles:u'
# Event count (approx.): 52656457012
#
# Overhead  Command  Shared Object       Symbol                                                                                      
# ........  .......  ..................  ............................................................................................
#
    20.02%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide_intrin_ssse3
    16.78%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
    16.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
     7.43%  lt-flac  lt-flac             [.] format_input
     7.25%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new
     7.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
     5.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_residual
     4.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate

Strategy:

In reviewing the benchmarks and the code, I have narrowed down the strategy I will be doing to complete the project. Here is the list of steps I came up with to optimize the encoding of FLAC for aarch64.

  1. Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
  2. Test the pre-processor directives with some code that will cause a fault, so I know it is working.
  3. Examine the codebase to know where precisely I need to put the pre-processor directives. And check if I need to mess with the build instructions.
  4. Configure the makefile to build the new file that I am adding.
  5. I am going to focus on the “FLAC__lpc_compute_autocorrelation” function, and I am going to translate it into aarch64 intrinsic’s. I will use the existing c and x86 intrinsic code to help me with the translation.
  6. Testing my optimization, I will re-run the test that I performed on the original code with my optimized version and see if I have improved the performance on the aarch64 platform.
  7. As a stretch goal, depending on how hard it is to write the Aarch64 intrinsics, I would like to translate the full “ipc.c” file with aarch64 intrinsics.

Results To This Point:

At this point, I have found a project to work on, started learning the project structure, did research on how to use the package, I successfully built the package on both x86 and Aarch64 and performed benchmarks on the project using gprof, perf and the bash time command on both platforms. For my optimization, I have narrowed down exactly what I am going to work on for the remainder of this project and created a plan for how I will accomplish those changes.

SIMD with C Intrinsics and Inline Assembler – Lab5 SPO600

Welcome to my blog. This post will be about using SIMD with C. SIMD is an acronym for Single Instruction Multiple Data. SIMD is a form of vectorization where the computer performs the same operation on multiple data points simultaneously.

What I did for this lab

In this lab, I will be looking at three different approaches to SIMD. I will be using the same code as lab 4. The code is a simple example of changing the volume of a sound file. I will also be using an AARCH 64 machine, so the inline assembler and intrinsics are for the AARCH 64 architecture.

Part 1:

For part 1, I will be letting the GCC compiler do the vectorization for me using the compiler flag -O3. To confirm that the vectorization was successful. I enabled the compiler flag that will give me information about the vectorization of the code ‘-fopt-info-vec-all.’

I was tasked with vectorizing the last loop which performed the following line of code.

ttl = (ttl + data[x]) % 1000;

This piece of code sums up all the data, it is used to confirm that the algorithm for scaling the samples is correct and is not giving a different result than the original.

The way that the above code calculates the sum will not work for vectorization. This is because it relies on the previous value for the total. So, in order to allow auto-vectorization, I removed that dependance and changed that line of code to the following.

ttl += data[x] % 1000;

Part 2:

For this part of the lab, I first practiced writing inline assembler on another straightforward program. All this program performed was a modulus operator on some variables and printed the results, and my job was to replace the modulus operation with inline assembler.

int main() {
    int a = 3;
    int b = 19;
    int c;
    int d;
    __asm__("udiv %0, %1, %2" : "=r"(c) : "r"(b), "r"(a) );
    __asm__("msub %0, %1, %2, %3" : "=r"(c) : "r"(c), "r"(a), "r"(b)  );
    printf("%d\n", c);
} 

Now back to the sound scaling program. My professor gave me a version of the sound scaling program that contained the inline assembler. It also included questions marked with a Q: about the code in comments. My task for this part of the lab is to answer the questions.

The following are the questions and answers to part 2.

Question 1 Code:
register int16_t* cursor asm(“r20”); // input cursor
register int16_t vol_int asm(“r22”); // volume as int16_t

Question 1:
These variables will be used in our assembler code, so we’re going to hand-allocate which register they are placed in what is an alternate approach?

Answer:
Don’t include hand-allocation and let the compiler control what register they are assigned.

Question 2 Code:
vol_int = (int16_t) (0.75 * 32767.0);

Question 2:
Should we use 32767 or 32768 in this line of code? Why?

Answer:
We should use 32767 since it the larges number we can use since it uses only 15 bits to store if 32768 was the largest number, we would not be able to save the sign bit.

Question 3 Code:
asm (“dup v1.8h,%w0”::”r”(vol_int));

Question 3:
What does it mean to “duplicate” values in the next line?

Answer:
It means we are duplicating the value of vol_int across the vector one register. So, each position in vector one will contain the value of vol_int.

Question 4 Code:
asm (
“ldr q0, [%[cursor]], #0 \n\t”
“sqdmulh v0.8h, v0.8h, v1.8h \n\t”
“str q0, [%[cursor]],#16 \n\t”
: [cursor]”+r”(cursor)
: “r”(cursor)
: “memory”
);

Question 4:
Why is #16 included in the str line but not in the ldr line?

Answer:
We did not want to increment the cursor at ldr since we still need the current cursor position to store the values in the str command.

Question 4 Code:
asm(“…”
: [cursor]”+r”(cursor)
: “r”(cursor)
: “memory”
);

Question 4:
What do these next three lines do?

Answer:
The first line of code after colon one is the output operand. After that, the code after colon two is the input operand. The code after colon three is the clobber. The word memory in the clobber tells the compiler that this inline assembler effects global memory.

Question 5:
Are the results of this program usable? Are they correct?

Answer:
No, if I compare the results to the original program, I am getting 930 from the inline assembler code, and I was getting 94 from the original version of the code. I believe this is due to the fact that we are using a fixed-point representation of 0.75 for calculations.

Performance Analysis of Part 2

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

As you can see from the results above It would appear that the inline assembler is slightly slower in the real and user time categories but is consistently faster in the sys time category.

Part 3:

In this part of the lab, I start with the completed code for the sound scaling program that used intrinsic, and similar to part 2 of this lab, it contained comments with questions about the code for me to answer.

The following are the questions and answers to part 3.

Question 1 Code:
vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));

Question 1:
What do these intrinsic functions do?

Answer:
The intrinsic function “vst1q_s16” stores a single vector into memory. In this case, we are storing the results of the multiplication.

The intrinsic function “vqdmulhq_s16” stands for “vector saturating doubling multiply high.” In this case, we are multiplying two vector lanes. We pass the two vectors as parameters.

The intrinsic function “vld1q_s16” will load a single vector from memory.

The intrinsic function “vdupq_n_s16” loads all lanes of a vector with the same value.

Question 2 Code:
cursor += 8;

Question 2:
Why is the increment 8 instead of 16 or some other value?

Answer:
Since we are using an int_16t for our data, we have eight vector lanes. So we are incrementing the cursor to the next set of eight values that we will calculate.

Question 3 Code:
cursor += 8;

Question 3:
Why is this line not needed in the inline assembler version of this program?

Answer:
The incrementing of the cursor gets done inside of the inline assembler, so we don’t need to have an extra increment step.

Question 4:
Are the results usable? Are they accurate?

Answer:
Similar to part 2, the results are different than the original. but it is the same as the inline assembler.

Performance Analysis of Part 3

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

INTRINSICS CODE
real 0m4.768s
user 0m4.589s
sys 0m0.170s

The results are all very similar, I have run the test multiple times. I believe the auto-vectorization code seems to be the quickest most of the time and the intrinsic and inline assembler are about the same.

CODE DOWNLOAD

Download lab files