Analysis of FLAC – Stage Three of Three – SPO600 Project

Hello and welcome to my blog!

For my final post this series, about my SPO 600 project, I will discuss the last part of the project, the pull request to the upstream project. And, I will be reviewing what I have learned while working on this project.

PULL REQUEST:

Here is my the link to the pull request I created: https://github.com/xiph/flac/pull/183

The FLAC project has an extensive testing system that gets run by the continuous integration system of Travis CI. It took 2 hours, 28 minutes and 15 seconds for Travis to finish the tests on my pull request. Travis builds and runs the FLAC program in 16 different ways for the test. There are four different build settings for that are tested, and each build setting gets compiled twice once by the GCC compiler and once by the CLANG compiler. Travis will run these eight tests twice once on a Linux machine and then once on a Mac OSX machine. To speed up the tests, Travis will run multiple test scenarios at a time. The full run-time of all the test scenarios is 6 hours and 57 seconds.

My changes passed all the tests Travis ran. That means the changes I made didn’t affect the regular operation of the program.

It has been a little over a week since I made the PR. It is currently not merged, and I have not heard from the project owner. The day I made the pull request, I got asked if the CPU feature detection that I did was at run-time or compile time? Initially, I thought the question was referring to the detection of the CPU architecture, which gets performed at compile time. After some discussion, I understood that the original question was referring to choosing the correct version of a function based on variables at run-time. Which I did implement, the selection of which autocorrelation function to run gets decided by the variable “encoder->protected_->max_lpc_order.” As far as I can tell, this question was by another member of the community and not a project maintainer, but I could be wrong.

Since that question, there has not been any more activity on the pull request.

PROJECT REVIEW:

This course and project have been a great learning experience for me. I have learned a lot about low-level programming through the lectures and the project. Especially about compilation, for this project, I learned a lot about the GNU auto-tools build system. I now know how to read and edit a configure script. With the practice I got from working with this project, I am now comfortable with the whole build process using a configure script and make.

Through this course and project, I also learned about using assembler and intrinsics in the C language. I was able to gain practical experience working with x86 and aarch64 intrinsics during this project. Specifically, I learned how to use the intrinsic documentation to find the correct intrinsic I need and to understand how to use each intrinsics.

This project was the first major C/C++ project I worked on. I learned a lot about the build process of a C/C++ project. For example, how to detect the hardware details and the available software on a computer.

I also gained a lot of practice with profilers. I used profilers in the past for previous course projects, but this project gave me a practical way I could practice using profilers. For example, by finding hot spots in a program and by helping measure the performance changes in the optimization, I do to the program.

Conclusion

I have enjoyed doing this project, and I am glad I could mix my interests in audio and computer programming. This project has been a great learning experience and I would be interested in doing this type of work in the future.

Analysis of FLAC – Stage Two of Three – SPO600 Project

Hello and welcome to my blog!

In this blog post, I will be writing about my final project for my software portability and optimization class SPO 600. In this post, I will review the work I have done with the FLAC project. If you didn’t read the first post in this series, you could find it here. A quick review of the last post, for my SPO 600 class, I got tasked with finding and optimizing an open-source project. Knowing I wanted to work with audio, I found the open-source project called FLAC (Free Lossless Audio Codec). On further investigation, I made a plan on how to optimize the FLAC library. In this blog post, I will go over the implementation and results of my optimization.

Execution of my plan:

In the previous blog post, I laid out a strategy for the completion of my optimization. It turned out to be quite helpful. I was able to follow it, and I completed the optimization just as planned. Below is the strategy I made and some notes on how I completed each task.

  1. Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
    Through reading the FLAC code, I was able to determine that the pre-processor directive used in this project really on global variables defined by the configure script. After reading the “configure.ac” file, online AutoTools documentation and the “configure.ac” file from the OPUS project, which is also created by Xiph, the same people who created the FLAC project. I was able to determine how to check if the user has an aarch64 CPU and if the intrinsic library arm-neon is available. After confirming that I am on an aarch64 machine, I define “FLAC__CPU_AARCH64,” and after confirming arm-neon intrinsic are possible, I define “FLAC__HAS_NEONINTRIN.”
  2. Test the pre-processor directives with some code that will cause a fault, so I know it is working.
    Since the FLAC project has some optimizations for other platforms, I needed to follow the same pattern as the previous people. To test the preprocessor directives, I had to add the new architecture to the function selection logic. To add the new architecture, I added code to “src/libFLAC/cpu.c”, “src/libFLAC/include/private/cpu.h”, “src/libFLAC/include/private/lpc.h” and “src/libFLAC/stream_encoder.c.” Once I had the architecture function selection logic done, I was ready to test. I did not end up needing to cause any faults to confirm that the preprocessor directives were working. Instead, I used a printf statement and a copy of the vanilla C code version of the auto-correlation function. In running the program, I was able to see the printf messages, and I also used perf to confirm I was using the new function.
  3. Examine the codebase to know where precisely I need to put the pre-processor directives. And check if I need to mess with the build instructions.
    I ended up doing this in step in steps one and two since the FLAC project didn’t use pre-defined variables for the pre-processor directives. Instead, the FLAC project uses variables defined by the configure script. And I had to examine the codebase and implement the changes to run a test.
  4. Configure the makefile to build the new file that I am adding.
    I added one file to the project called lpc_intrin_neon.c. So for the compiler to build it, I put it into the list of source files inside “src/libFLAC/Makefile.am.”
  5. I am going to focus on the “FLAC__lpc_compute_autocorrelation” function, and I am going to translate it into aarch64 intrinsic’s. I will use the existing c and x86 intrinsic code to help me with the translation.
    Success! It took some time, but I was able to translate the x86 code into aarch64. The way I did this was by using the Intel and arm-neon online documentation. I also got help by googling a specific x86 intrinsic and asking what arm-neon instruction does the same thing or similar. For a few intrinsics, there was no direct replacement. Specifically, there is no shuffle in arm-neon, so I had to read up on how shuffle worked on x86 and execute that using multiple arm-neon instruction. I ended up creating inline functions for the shuffles to make writing the code more manageable and cleaner.
  6. Testing my optimization, I will re-run the test that I performed on the original code with my optimized version and see if I have improved the performance on the aarch64 platform.
    I tested with two aarch64 machines. The first machine has a faster single-thread performance with 8 threads. The second machine has 24 threads but slower single-thread performance. On the first machine, initially, the autocorrelation function took 26.11 percent of the runtime. After the optimizations, the autocorrelation function took 12.64 percent of the time. On the second machine, initially, the autocorrelation function took 52.41 percent of the runtime. After the optimizations, the autocorrelation function took 14.78 percent of the time. I also tested the optimizations on an x86 machine to confirm that the changes did not affect that architecture.
  7. As a stretch goal, depending on how hard it is to write the Aarch64 intrinsics, I would like to translate the full “ipc.c” file with aarch64 intrinsics.
    I didn’t end up translating the full “ipc.c” file, but I did translate all versions of the autocorrelation function. There are four versions of the autocorrelation function. Depending on how much lag it will call the correct version, either lag 4, lag 8, lag 12 or lag 16.

Full Results:

The following results are not averaged, but I did run theses test multiple times with similar results. The numbers below are from a few of the many tests I performed.

Aaarch64 Machine 1:

TOTAL RUNTIME OF THE TEST BEFORE OPTIMIZATION:

real    0m51.784s
user 0m49.356s
sys 0m2.349s

TOTAL RUNTIME OF THE TEST AFTER OPTIMIZATION:

real    0m43.503s
user 0m40.950s
sys 0m2.470s

PERF REPORT BEFORE OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 208K of event 'cycles:u'
 Event count (approx.): 98509947650
 #
 Overhead  Command   Shared Object           Symbol
 ……..  ……..  ………………….  ………………………………………………………………………………..
 #
     26.11%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_autocorrelation
     25.54%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__fixed_compute_best_predictor_wide
     11.35%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__bitwriter_write_rice_signed_block
      9.45%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Transform
      5.95%  lt-flac   lt-flac                 [.] format_input
      5.60%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      3.42%  lt-flac   libFLAC.so.8.3.0        [.] precompute_partition_info_sums_
      2.34%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Accumulate
      2.21%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__crc16

PERF REPORT AFTER OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 175K of event 'cycles:u'
 Event count (approx.): 81871492155
 #
 Overhead  Command  Shared Object       Symbol
 ……..  …….  ………………  ………………………………………………………………………………..
 #
     30.58%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     13.36%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     12.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_neon_lag_12
     11.71%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
      7.16%  lt-flac  lt-flac             [.] format_input
      5.18%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      4.16%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
      3.00%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate
      2.62%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__crc16

Aaarch64 Machine 2:

TOTAL RUNTIME OF THE TEST BEFORE OPTIMIZATION:

real    3m43.841s
user    3m33.558s
sys     0m8.791s

TOTAL RUNTIME OF THE TEST AFTER OPTIMIZATION:

real    2m3.675s
user    1m54.260s
sys     0m8.588s

PERF REPORT BEFORE OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 901K of event 'cycles:uppp'
 Event count (approx.): 213328075836
 #
 Overhead  Command  Shared Object     Symbol
 ……..  …….  …………….  ………………………………………………………………………………..
 #
     52.41%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_autocorrelation
     11.36%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_best_predictor_wide
      6.62%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__bitwriter_write_rice_signed_block
      5.80%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Transform
      4.35%  lt-flac  lt-flac           [.] format_input
      4.05%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      2.69%  lt-flac  libFLAC.so.8.3.0  [.] precompute_partition_info_sums_
      2.52%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Accumulate
      2.10%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_residual

PERF REPORT AFTER OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 620K of event 'cycles:uppp'
 Event count (approx.): 144725968757
 #
 Overhead  Command  Shared Object     Symbol
 ……..  …….  …………….  ………………………………………………………………………………..
 #
     15.03%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_best_predictor_wide
     14.78%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_autocorrelation_intrin_neon_lag_12
     10.06%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__bitwriter_write_rice_signed_block
     10.03%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      9.14%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_window_data
      9.09%  lt-flac  libFLAC.so.8.3.0  [.] precompute_partition_info_sums_
      8.07%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Transform
      6.14%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_residual
      5.68%  lt-flac  lt-flac           [.] format_input

Code Changes:

On GitHub, I created a pull request inside my fork of FLAC HERE. With this pull request, you will be able to see exactly all the code changes I made to the FLAC project.

Analysis of FLAC – Stage One of Three – SPO600 Project

Hello and welcome to my blog!

In this blog post, I will be writing about my final project for SPO600. The goal of this project is to optimize an open-source library. To complete the project, I had to choose one of the following tasks: alter build options, code changes to permit better optimization by the compiler, algorithm improvements or In-line assembler.

What I did to find an Open-source project:

For a long time, I have been working with audio as a hobby. I actively produced music, mix live bands, calibrate PA speakers and play drums. So, after completing the lab about changing the volume of sound, I knew I wanted to work with audio. So, I started at an open-source project that I knew about that works with audio called Audacity. Audacity is an open-source DAW (Digital Audio Workstation.) I looked around the Audacity project source for a bit then decided to dig into its dependencies. I got the idea of looking at the dependencies from my professor. He suggested to everyone that a library would be a great place to find an opportunity to optimize some code. In looking at the Audacity dependencies, I found the library called FLAC by Xiph. FLAC is an acronym that means Free Lossless Audio Codec. Similar to what I did with Audacity, I started looking at the source code. The way I navigated the source code was by using the search on GitHub and the Find command on bash. I was looking for architecture-specific code, so my search’s were like “x86” or “aarch64”. Between searching for keywords and browsing the folders, I found a file that was called “cpu.c” in “src/libFLAC,” and inside, I was able to determine that the FLAC project has not yet been optimized for Aarch64. The way I discovered that the FLAC project has yet to be optimized for Aarch64 was by looking at the compiler preprocessing directives. Inside that file, I could see that this project has optimizations for the x86, IA32 and PPC architectures, but not for Aarch64. After learning that the FLAC project has not been optimized for Aarch64, I submitted an Issue on GitHub. Here is a link to the issue I created: Issue #156. Inside the issue, I inquired to the maintainers of the FLAC project if they were open to me, adding some optimizations for Aarch64. One of the maintainers of the FLAC project responded and is open to me, adding Aarch64 support to the FLAC project. Now that I got approval to work on this repo, I could now start on my benchmarking.

Benchmarking the FLAC project:

Step one to benchmarking the FLAC project was building the projects on both x86 and Aarch64. I downloaded the source code from the FLAC GitHub, so the first step to building the library was to run the “autogen.sh” script. Which created my “./configure” script. I then ran the configure script with the “-pg” command to allow gprof, so my configuration command was “./configure CFLAGS=”-g -pg -O2″ CXXFLAGS=”-g -pg -O2″.” I was then able to use “make -j” to build the code. Unfortunately, I could not get gporf working with the main FLAC binary. It would only send garbage data to the “gprof.out” file. Though I did get gprof working with some of the tests that were included with the source code, so I know the “-pg” worked. I ended up switching to perf for my profiling. I ran Make clean and then re-ran Configure so I could use perf. “./configure CFLAGS=”-g -O2″ CXXFLAGS=”-g -O2″ .” After doing that, I grabbed my test data and ran perf record. The command I used to test with is “src/flac/flac input.wav” this command runs FLAC and passes a wave file to it. For my sizable test data, I took one of the live multitrack recordings that I had on my computer and exported the mix to a stereo wave file. The test wave file is 1 hour and 31 minutes long and is 1.57 gigabytes.

Using Perf:

Perf did a fantastic job of helping me find what I want to optimize for this project. The part of FLAC that I tested was the encoding, specifically “.wav” encoding to “.flac.” I ran the same test file on the two different architectures, and the performance difference was noticeable instantly. On the x86 machine, it will run the encoding in 16.439s. And the Aarch64 machine took 4minutes and 39.585s. (These times are from one of the tests I did. I ran the test multiple times with similar results.) I then took a look at the perf report.

Perf Report:

In analyzing the perf reports, I was able to narrow down where I should target my optimizations. You can see from the following snippets of the perf report that on the Aarch64 architecture, the function called “FLAC__lpc_compute_autocorrelation” is taking about 49.64% of the run time. Versus, on the x86 report it uses an intrinsic version of that function called “FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new” which significantly improves the performance. On the x86 machine, this function only took 7.25% of the run time. The vanilla c code in which the Aarch64 machine is running is located inside the file “ipc.c.” The Intrinsic code the x86 machine is running is located inside the file “ipc_intin_sse.c.” These files are located in the “src/libFLAC” folder.

aarch64
# Samples: 1M of event 'cycles:uppp'
# Event count (approx.): 255896905943
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ............................................................................................
    49.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation
     8.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     7.08%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
     5.65%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     5.19%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_window_data
     5.14%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
     4.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform


x86
# Samples: 61K of event 'cycles:u'
# Event count (approx.): 52656457012
#
# Overhead  Command  Shared Object       Symbol                                                                                      
# ........  .......  ..................  ............................................................................................
#
    20.02%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide_intrin_ssse3
    16.78%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
    16.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
     7.43%  lt-flac  lt-flac             [.] format_input
     7.25%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new
     7.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
     5.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_residual
     4.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate

Strategy:

In reviewing the benchmarks and the code, I have narrowed down the strategy I will be doing to complete the project. Here is the list of steps I came up with to optimize the encoding of FLAC for aarch64.

  1. Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
  2. Test the pre-processor directives with some code that will cause a fault, so I know it is working.
  3. Examine the codebase to know where precisely I need to put the pre-processor directives. And check if I need to mess with the build instructions.
  4. Configure the makefile to build the new file that I am adding.
  5. I am going to focus on the “FLAC__lpc_compute_autocorrelation” function, and I am going to translate it into aarch64 intrinsic’s. I will use the existing c and x86 intrinsic code to help me with the translation.
  6. Testing my optimization, I will re-run the test that I performed on the original code with my optimized version and see if I have improved the performance on the aarch64 platform.
  7. As a stretch goal, depending on how hard it is to write the Aarch64 intrinsics, I would like to translate the full “ipc.c” file with aarch64 intrinsics.

Results To This Point:

At this point, I have found a project to work on, started learning the project structure, did research on how to use the package, I successfully built the package on both x86 and Aarch64 and performed benchmarks on the project using gprof, perf and the bash time command on both platforms. For my optimization, I have narrowed down exactly what I am going to work on for the remainder of this project and created a plan for how I will accomplish those changes.

SIMD with C Intrinsics and Inline Assembler – Lab5 SPO600

Welcome to my blog. This post will be about using SIMD with C. SIMD is an acronym for Single Instruction Multiple Data. SIMD is a form of vectorization where the computer performs the same operation on multiple data points simultaneously.

What I did for this lab

In this lab, I will be looking at three different approaches to SIMD. I will be using the same code as lab 4. The code is a simple example of changing the volume of a sound file. I will also be using an AARCH 64 machine, so the inline assembler and intrinsics are for the AARCH 64 architecture.

Part 1:

For part 1, I will be letting the GCC compiler do the vectorization for me using the compiler flag -O3. To confirm that the vectorization was successful. I enabled the compiler flag that will give me information about the vectorization of the code ‘-fopt-info-vec-all.’

I was tasked with vectorizing the last loop which performed the following line of code.

ttl = (ttl + data[x]) % 1000;

This piece of code sums up all the data, it is used to confirm that the algorithm for scaling the samples is correct and is not giving a different result than the original.

The way that the above code calculates the sum will not work for vectorization. This is because it relies on the previous value for the total. So, in order to allow auto-vectorization, I removed that dependance and changed that line of code to the following.

ttl += data[x] % 1000;

Part 2:

For this part of the lab, I first practiced writing inline assembler on another straightforward program. All this program performed was a modulus operator on some variables and printed the results, and my job was to replace the modulus operation with inline assembler.

int main() {
    int a = 3;
    int b = 19;
    int c;
    int d;
    __asm__("udiv %0, %1, %2" : "=r"(c) : "r"(b), "r"(a) );
    __asm__("msub %0, %1, %2, %3" : "=r"(c) : "r"(c), "r"(a), "r"(b)  );
    printf("%d\n", c);
} 

Now back to the sound scaling program. My professor gave me a version of the sound scaling program that contained the inline assembler. It also included questions marked with a Q: about the code in comments. My task for this part of the lab is to answer the questions.

The following are the questions and answers to part 2.

Question 1 Code:
register int16_t* cursor asm(“r20”); // input cursor
register int16_t vol_int asm(“r22”); // volume as int16_t

Question 1:
These variables will be used in our assembler code, so we’re going to hand-allocate which register they are placed in what is an alternate approach?

Answer:
Don’t include hand-allocation and let the compiler control what register they are assigned.

Question 2 Code:
vol_int = (int16_t) (0.75 * 32767.0);

Question 2:
Should we use 32767 or 32768 in this line of code? Why?

Answer:
We should use 32767 since it the larges number we can use since it uses only 15 bits to store if 32768 was the largest number, we would not be able to save the sign bit.

Question 3 Code:
asm (“dup v1.8h,%w0”::”r”(vol_int));

Question 3:
What does it mean to “duplicate” values in the next line?

Answer:
It means we are duplicating the value of vol_int across the vector one register. So, each position in vector one will contain the value of vol_int.

Question 4 Code:
asm (
“ldr q0, [%[cursor]], #0 \n\t”
“sqdmulh v0.8h, v0.8h, v1.8h \n\t”
“str q0, [%[cursor]],#16 \n\t”
: [cursor]”+r”(cursor)
: “r”(cursor)
: “memory”
);

Question 4:
Why is #16 included in the str line but not in the ldr line?

Answer:
We did not want to increment the cursor at ldr since we still need the current cursor position to store the values in the str command.

Question 4 Code:
asm(“…”
: [cursor]”+r”(cursor)
: “r”(cursor)
: “memory”
);

Question 4:
What do these next three lines do?

Answer:
The first line of code after colon one is the output operand. After that, the code after colon two is the input operand. The code after colon three is the clobber. The word memory in the clobber tells the compiler that this inline assembler effects global memory.

Question 5:
Are the results of this program usable? Are they correct?

Answer:
No, if I compare the results to the original program, I am getting 930 from the inline assembler code, and I was getting 94 from the original version of the code. I believe this is due to the fact that we are using a fixed-point representation of 0.75 for calculations.

Performance Analysis of Part 2

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

As you can see from the results above It would appear that the inline assembler is slightly slower in the real and user time categories but is consistently faster in the sys time category.

Part 3:

In this part of the lab, I start with the completed code for the sound scaling program that used intrinsic, and similar to part 2 of this lab, it contained comments with questions about the code for me to answer.

The following are the questions and answers to part 3.

Question 1 Code:
vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));

Question 1:
What do these intrinsic functions do?

Answer:
The intrinsic function “vst1q_s16” stores a single vector into memory. In this case, we are storing the results of the multiplication.

The intrinsic function “vqdmulhq_s16” stands for “vector saturating doubling multiply high.” In this case, we are multiplying two vector lanes. We pass the two vectors as parameters.

The intrinsic function “vld1q_s16” will load a single vector from memory.

The intrinsic function “vdupq_n_s16” loads all lanes of a vector with the same value.

Question 2 Code:
cursor += 8;

Question 2:
Why is the increment 8 instead of 16 or some other value?

Answer:
Since we are using an int_16t for our data, we have eight vector lanes. So we are incrementing the cursor to the next set of eight values that we will calculate.

Question 3 Code:
cursor += 8;

Question 3:
Why is this line not needed in the inline assembler version of this program?

Answer:
The incrementing of the cursor gets done inside of the inline assembler, so we don’t need to have an extra increment step.

Question 4:
Are the results usable? Are they accurate?

Answer:
Similar to part 2, the results are different than the original. but it is the same as the inline assembler.

Performance Analysis of Part 3

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

INTRINSICS CODE
real 0m4.768s
user 0m4.589s
sys 0m0.170s

The results are all very similar, I have run the test multiple times. I believe the auto-vectorization code seems to be the quickest most of the time and the intrinsic and inline assembler are about the same.

CODE DOWNLOAD

Download lab files

Manipulating large data – Lab4 SPO 600

Welcome, in class, we learned about a few different ways you can represent data in a computer. We learned about integers, fixed-point, floating-point, graphics, and sound. This lab focuses on sound data, which is one of the larger forms of data.

How we record sound?

The way we record sound is by sampling it. The typical sample rate is 44.1 or 48 thousand samples per second per channel and each sample size is typically 16 or 24 bit.

What I did for this lab

Task 1

My first task was to benchmark a program that manipulated a random dataset that was simulating a 16 bit sound file. The way we made the data was by generating five million signed integers between positive 32767 and negative 32768.

Here is the code for task 1. – Task1

Task 1 – Results

Result: 94
real 0m0.058s
user 0m0.053s
sys 0m0.005s

Task 1 – Analysis

Using the profiler Perf I can see that the total time running the scaling code is around 18.62% of the program runtime. About 81.38% of the time was generating random data.

Task 2

My second task was to change the formula for scaling the samples to a pre-calculated lookup table. The lookup table would contain all possible sample values multiplied by the volume factor. To get the results, you use the value as the index to look up each sample in that table to get the scaled values.

Here is the code for task 2. – Task 2

Task 2 – Results

Result: 94
real 0m0.646s
user 0m0.633s
sys 0m0.011s

Task 2 – Analysis

Using the profiler Perf I can see that the total time running the scaling code is around 34.27% of the program runtime. About 65.73% of the time was generating random data.

Task 3

My third task was to change the formula for scaling the samples to use fixed point integer math. The reason for doing this is that on most machines they can calculate integer math faster than floating-point. The way I accomplished this was by bit shifting.

The formula I used was: (((246 * 0.75) * SAMPLE) >> 8)

Here is the code for task 3. – Task 3

Task 3 – Results

Result: 873
real 0m0.522s
user 0m0.501s
sys 0m0.020s

Task 3 -Analysis

Using the profiler Perf I can see that the total time running the scaling code is around 18.24% of the program runtime. About 81.76% of the time was generating random data.

Final Analysis

I tested each test multiple times. The results were all similar, with a small variant to speed and the same output every time. The computer doing other tasks is the likely cause for the slight variant in test speeds.

I also tried with a larger sample size than 5 million and the result was the same.

The final results are fixed point integers give us the fastest result, floating-point math came in second, and the lookup table was the slowest.

Now there was a downfall to the fixed point integers. It did not give the same result as the floating-point and lookup table. The result was different since it does not calculate the decimals. The fixed point integers delete the decimal without rounding. By not rounding the numbers, the result was slightly different.

Writing In Assembler x86 and aarch64 – Lab3 SP0600

Hello,

In this post, I am talking about how I am furthering my understanding of computers, so I can understand how I can properly optimize software. I am writing about learning how to write assembler code on the x86 and aarch64 platform for my software optimization class lab.

To complete this lab I performed the following tasks :

  1. Build and run the three C versions of the program for x86_64.
    Take a look at the differences in the code.
  2. Use the objdump -d command to dump (print) the object code (machine code) and disassemble it into assembler for each of the binaries. Find the section and take a look at the code. Notice the total amount of code.
  3. Review, build and run the x86_64 assembly language programs. Take a look at the code using objdump -d objectfile and compare it to the source code. Notice the absence of other code (compared to the C binary, which had a lot of extra code).
  4. Build and run the three C versions of the program for aarch64. Verify that you can disassemble the object code in the ELF binary using objdump -d objectfile and take a look at the code
  5. Review, build and run the aarch64 assembly language programs. Take a look at the code using objdump -d objectfile and compare it to the source code.
  6. Make a loop from 0 to 9, on x86 and aarch64
  7. Extend the code to loop from 00-30, printing each value as a 2-digit decimal number, on x86 and aarch64

How I used a Makefile

Since this lab required testing, reviewing, creating and running many files I decided to load everything into a Makefile.

In doing this I learned that I can call Makefiles in other folders.
The way I did that was by adding a target to the main Makefile and typing in “cd /route/to/makefile && make all”

In the attached folders you can see the Makefile I created.

Task 1

The three c programs all perform the same task of printing “Hello World!”, but they do it in 3 different ways.

Program 1: Uses printf()
Program 2: Uses write()
Program 3: Uses syscall()

Task 2

After Reviewing the output of the objdump I can see that program 1 uses the least amount of code at 8 lines but it is using printf which has the most overhead of the three functions. Program 2 using write which should have less overhead uses 12 lines of code. And finally program 3 also uses 12 lines of code and since we are using a syscall we have very little overhead.

Task 3

Yes, Since we are now compiling straight from assembler we don’t have the overhead of the c language. This cut the program down in size drastically now the how objdump file is only 11 lines of code.

Task 4

Here is the total line count the three c programs took to run on aarch64. Pretty similar results.

Program 1: 10 lines
Program 2: 12 lines
Program 3: 12 lines

Task 5

Surprisingly, the results are identical to the x86 in term of line count. The aarch64 Hello world program used 11 lines of code the same as x86.

Something interesting I noticed about the compiled code is that it transformed all the numbers to hexadecimal.

Task 6

Here is my loops 0-9 on x86 and aarch64.

/* x86 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $start,%r15         /* loop index */
   
loop:
    /* ... body of the loop ... do something useful here ... */
    mov     $len,%rdx
    
    mov     $48,%r14
    add     %r15,%r14
    
    movb    %r14b,msg+6
    mov     $msg,%rsi

    mov     $1,%rdi
    mov     $1,%rax
    syscall 

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall
.data 
msg: .ascii "Loop:  \n"
    len = . - msg
/* aarch64 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     x30,start           /* loop index */
    
loop:
    
    mov     x19,48
    mov     x26,max
    mov     x27,1
    adr     x28,msg

    add     x19,x30,x19

    strb     w19,[x28,6]
    ldr      x1,=msg
        
    mov     x0,1
    mov     x2,len
    mov     x8, 64
    svc     0

    add     x30,x27,x30             /* increment index */
    cmp     x26,x30                 /* see if we're done */
    b.ne    loop                   /* loop if we're not */

    mov     x8,93                   /* syscall sys_exit */
    svc     0

    .data

    msg: .ascii "Loop:      \n"
    len = . - msg

Task 7

Here is my loops 0-30 with the leading zero’s removed on x86 and aarch64.

/* x86 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 31                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $start,%r15         /* loop index */
    
loop:
    /* ... body of the loop ... do something useful here ... */
    
   
    mov     $48,%r13
    mov     $48,%r14
    mov     $0,%rdx

    mov     %r15,%rax
    mov     $10,%r12
    div     %r12

    
    add     %rax,%r13
    add     %rdx,%r14
    
    cmp     $48,%r13                   /*Compare*/
    
    je continue     
    
    movb    %r13b,msg+6

continue:

    movb    %r14b,msg+7
    mov     $msg,%rsi /*send message to reg rsi*/
        
    mov     $1,%rdi
    mov     $1,%rax
    mov     $len,%rdx

    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

    .data

    msg: .ascii "Loop:   \n"
    len = . - msg
/* aarch64 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     x30,start           /* loop index */
    
loop:
    
    mov     x19,48
    mov     x20,48
    mov     x24,10  
    mov     x25,48
    mov     x26,max
    mov     x27,1
    adr     x28,msg

    
    udiv    x21,x30,x24
    msub    x22,x21,x24,x30

    add     x19,x21,x19
    add     x20,x22,x20

    cmp     x25,x19
    b.eq    continue     
    
    strb     w19,[x28,6]

continue:

    strb     w20,[x28,7]
    ldr      x1,=msg
        
    mov     x0,1
    mov     x2,len
    mov     x8, 64
    svc     0

    add     x30,x27,x30             /* increment index */
    cmp     x26,x30                 /* see if we're done */
    b.ne    loop                   /* loop if we're not */

    mov     x8,93                   /* syscall sys_exit */
    svc     0

    .data

    msg: .ascii "Loop:      \n"
    len = . - msg

Download my files

Code Review – HelloWorld.c Lab02

Hello,

In the following post, I will be looking at seven different ways that you can compile a simple hello world program in C.

I will be using the following four flags for the GCC compiler.


-g               # enable debugging information
-O0              # do not optimize
-fno-builtin     # do not use builtin function optimizations
-static          # includes the header files in the executable

After I compile using gcc, I will look at things like the filesize and the decompiled assembler to gather my results.

Okay, so test one is to compile the following program with -g -O0 -fno-builtin. This test will be the control for this experiment. We will be basing or conclusions for the other tests of this one.


$ gcc -g -O0 -fno-builtin -o test test.c

#include 
int main(){
    printf("Hello World!\n");
}


$ objdump -fsd --source test

The command “objdump” will allow me to view the compiled code. You will probably want to pipe this command into less or send to an output file.

The results of this command will be extensive, so you will want to use a search to find “<main>.”


// Total File Size 24656 bytes
#include <stdio.h>
int main(){
  401126:	55                   	push   %rbp
  401127:	48 89 e5             	mov    %rsp,%rbp
    printf("Hello World!\n");
  40112a:	bf 10 20 40 00       	mov    $0x402010,%edi
  40112f:	b8 00 00 00 00       	mov    $0x0,%eax
  401134:	e8 f7 fe ff ff       	callq  401030 <printf@plt>
  401139:	b8 00 00 00 00       	mov    $0x0,%eax
  40113e:	5d                   	pop    %rbp
  40113f:	c3                   	retq   

Alright, now we have our control results we can now start some tests.
Here is a link to an assembler quick start guide if you need.

TEST 1

Add the compiler option -static.


$ gcc -static -g -O0 -fno-builtin -o test1 test.c

// Code
#include <stdio.h>
int main(){
    printf("Hello World!\n");
}

$ objdump -fsd --source test1 >> test1.text

// Total File Size 1720896 bytes
#include 
int main(){
  401bb5:	55                   	push   %rbp
  401bb6:	48 89 e5             	mov    %rsp,%rbp
    printf("Hello World!\n");
  401bb9:	bf 10 00 48 00       	mov    $0x480010,%edi
  401bbe:	b8 00 00 00 00       	mov    $0x0,%eax
  401bc3:	e8 f8 72 00 00       	callq  408ec0 <_IO_printf>
  401bc8:	b8 00 00 00 00       	mov    $0x0,%eax
  401bcd:	5d                   	pop    %rbp
  401bce:	c3                   	retq   
  401bcf:	90                   	nop 

If we review the results the assembler, the “printf” gets called from inside the file versus linking to another file. And, if we look at the file size, it is now a lot larger. Due to the -static, it has made it, so it included the header with the assembled code.

TEST 2

Remove the compiler option -fno-builtin.


$ gcc -g -O0 -o test2 test.c

// Code
#include <stdio.h>
int main(){
    printf("Hello World!\n");
}

$ objdump -fsd --source test2 >> test2.text

// Total File Size 24648 bytes
#include <stdio.h>
int main(){
  401126:	55                   	push   %rbp
  401127:	48 89 e5             	mov    %rsp,%rbp
    printf("Hello World!\n");
  40112a:	bf 10 20 40 00       	mov    $0x402010,%edi
  40112f:	e8 fc fe ff ff       	callq  401030 <puts@plt>
  401134:	b8 00 00 00 00       	mov    $0x0,%eax
  401139:	5d                   	pop    %rbp
  40113a:	c3                   	retq   
  40113b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

If we review the results of the assembler, we can see that we are no longer using “printf” the compiler has replaced it with the “puts” function.

TEST 3

Remove the compiler option -g.


$ gcc -O0 -fno-builtin -o ../excabutables/test3 test.c

// Code
#include <stdio.h>
int main(){
    printf("Hello World!\n");
}

$ objdump -fsd --source test3 >> test3.text

// Total File Size 22272 bytes
  401126:	55                   	push   %rbp
  401127:	48 89 e5             	mov    %rsp,%rbp
  40112a:	bf 10 20 40 00       	mov    $0x402010,%edi
  40112f:	b8 00 00 00 00       	mov    $0x0,%eax
  401134:	e8 f7 fe ff ff       	callq  401030 <printf@plt>
  401139:	b8 00 00 00 00       	mov    $0x0,%eax
  40113e:	5d                   	pop    %rbp
  40113f:	c3                   	retq

If we review the results of the assembler, we can see that we are no longer able to see the debugger information like the header include, or code.

TEST 4

Add additional arguments to the printf() function in your program.


$ gcc -g -O0 -fno-builtin -o test4 testMultiArgs.c

// Code
#include <stdio.h>
int main(){
    printf("Hello World!!!, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d \n", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
}

$ objdump -fsd --source test4 >> test4.text


// Total File Size 24688 bytes
#include <stdio.h>
int main(){
  401126:	55                   	push   %rbp
  401127:	48 89 e5             	mov    %rsp,%rbp
    printf("Hello World!!!, %d, %d, %d, %d, %d, %d, %d, %d, %d, %d \n", 1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
  40112a:	48 83 ec 08          	sub    $0x8,%rsp
  40112e:	6a 0a                	pushq  $0xa
  401130:	6a 09                	pushq  $0x9
  401132:	6a 08                	pushq  $0x8
  401134:	6a 07                	pushq  $0x7
  401136:	6a 06                	pushq  $0x6
  401138:	41 b9 05 00 00 00    	mov    $0x5,%r9d
  40113e:	41 b8 04 00 00 00    	mov    $0x4,%r8d
  401144:	b9 03 00 00 00       	mov    $0x3,%ecx
  401149:	ba 02 00 00 00       	mov    $0x2,%edx
  40114e:	be 01 00 00 00       	mov    $0x1,%esi
  401153:	bf 10 20 40 00       	mov    $0x402010,%edi
  401158:	b8 00 00 00 00       	mov    $0x0,%eax
  40115d:	e8 ce fe ff ff       	callq  401030 <printf@plt>
  401162:	48 83 c4 30          	add    $0x30,%rsp
  401166:	b8 00 00 00 00       	mov    $0x0,%eax
  40116b:	c9                   	leaveq 
  40116c:	c3                   	retq   
  40116d:	0f 1f 00             	nopl   (%rax)

If we review the results, we can see all the steps needed to perform a “printf” with ten arguments.

TEST 5

Move the printf() call to a separate function named output()


$ gcc -g -O0 -fno-builtin -o test5 testFunctionCall.c

// Code
#include <stdio.h>
void output(){
    printf("hello World");
}
int main(){
    output();
}

$ objdump -fsd --source test5 >> test5.text


// Total File Size 24792 bytes
int main(){
  40113c:	55                   	push   %rbp
  40113d:	48 89 e5             	mov    %rsp,%rbp
    output();
  401140:	b8 00 00 00 00       	mov    $0x0,%eax
  401145:	e8 dc ff ff ff       	callq  401126 
  40114a:	b8 00 00 00 00       	mov    $0x0,%eax
  40114f:	5d                   	pop    %rbp
  401150:	c3                   	retq   
  401151:	66 2e 0f 1f 84 00 00 	nopw   %cs:0x0(%rax,%rax,1)
  401158:	00 00 00 
  40115b:	0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)

#include <stdio.h>
void output(){
  401126:	55                   	push   %rbp
  401127:	48 89 e5             	mov    %rsp,%rbp
    printf("hello World");
  40112a:	bf 10 20 40 00       	mov    $0x402010,%edi
  40112f:	b8 00 00 00 00       	mov    $0x0,%eax
  401134:	e8 f7 fe ff ff       	callq  401030 
}
  401139:	90                   	nop
  40113a:	5d                   	pop    %rbp
  40113b:	c3                   	retq

If we review the results, we can see that we are now using the output function for the “printf.”

TEST 6

Remove -O0 and add -O3 to the gcc options.


$ gcc -g -fno-builtin -O3 -o test6 test.c

// Code
#include <stdio.h>
int main(){
    printf("Hello World!\n");
}

$ objdump -fsd --source test6 >> test6.text

// Total File Size 24896 bytes
#include <stdio.h>
int main(){
  401040:	48 83 ec 08          	sub    $0x8,%rsp
    printf("Hello World!\n");
  401044:	bf 10 20 40 00       	mov    $0x402010,%edi
  401049:	31 c0                	xor    %eax,%eax
  40104b:	e8 e0 ff ff ff       	callq  401030 <printf@plt>
  401050:	31 c0                	xor    %eax,%eax
  401052:	48 83 c4 08          	add    $0x8,%rsp
  401056:	c3                   	retq 

If we review the results, we can see that the -O3 gcc option has swapped a bunch of operations with more efficient versions.

Download my files

SPO 600 – Lab 1

Hello,

This post is the start to a new blogging series I will be doing, and it will revolve around the work I am doing for my SPO 600 class at school. SPO 600 is the course code, and the full name is Software Portability and Optimization. The course has a public wiki page if you want to read more available here.

Lab 1 – Code Review

For this lab, I needed to find two opensource community to review how they operate. I found VSCODE and OPENCV.

VSCODE
Visual Studio Code is an open-source project run by Microsoft under the MIT license.

Contributing to VScode
VScode has the following wiki page that explains how you can help here. On the wiki is where you can learn how to file bugs or request features. It seems to be simple you list a few items that they would like you to include, but the formatting of the issue is left up to you. From what I found, the pull request has a link to the issue in them and not any more writing.

Following an Issue request
This seems to be a very active project pull request (or PR) seem to get merged or closed quickly. One thing I notice is that there is not much community PR’s what appears to happen is someone submits an issue then it gets worked on by an employee at Microsoft.

I followed Issue#80352 to see an example of the process, from the moment KamasamaK posted the issue it took about a day for sbatten who I think is a reviewer to reply. Sbatten then added someone who would be able to fix the issue. In the issue, you can see jrieken start working on a solution and then another person mjbvz joined and helped solve the problem and post the PR fixing it.

The total time to fix the issue was about five days.

OPENCV
Opencv is an open-source computer vision library for c++. I took the computer vision course at my school and learned how to use this library. It uses the BSD licence.

Contributing to OPENCV
Similar to VScode, Opencv uses a wiki page to explain how you can contribute to the project. Here is a link to the wiki. On that page, you will find how the developers would like you to send in your help. This project seems to run more on pull request then issues. From what I read on the wiki the developers seem okay with people directly sending pull request versus sending an issue then a pull request when you finish the code, but it is likely a good idea to submit an issue before you start working on it too let the developers know.

Following an Issue request
I followed Issue#15439 to see what the process is to contribute to OpenCV. The first step was for the creation of the issue about the problem, and then you can see a reviewer for OpenCV came and suggested some ideas. In this case, the person who created the issue persuaded fixing the problem and created a pull request with the required changes PR#15440. In this pull request, you can see the conversation between the person creating the pull request and the developer approving it. It took a few days of back and forth fixing the code before the pull request got merged.

The total time to fix the issue was about three days.