Analysis of FLAC – Stage One of Three – SPO600 Project

Hello and welcome to my blog!

In this blog post, I will be writing about my final project for SPO600. The goal of this project is to optimize an open-source library. To complete the project, I had to choose one of the following tasks: alter build options, code changes to permit better optimization by the compiler, algorithm improvements or In-line assembler.

What I did to find an Open-source project:

For a long time, I have been working with audio as a hobby. I actively produced music, mix live bands, calibrate PA speakers and play drums. So, after completing the lab about changing the volume of sound, I knew I wanted to work with audio. So, I started at an open-source project that I knew about that works with audio called Audacity. Audacity is an open-source DAW (Digital Audio Workstation.) I looked around the Audacity project source for a bit then decided to dig into its dependencies. I got the idea of looking at the dependencies from my professor. He suggested to everyone that a library would be a great place to find an opportunity to optimize some code. In looking at the Audacity dependencies, I found the library called FLAC by Xiph. FLAC is an acronym that means Free Lossless Audio Codec. Similar to what I did with Audacity, I started looking at the source code. The way I navigated the source code was by using the search on GitHub and the Find command on bash. I was looking for architecture-specific code, so my search’s were like “x86” or “aarch64”. Between searching for keywords and browsing the folders, I found a file that was called “cpu.c” in “src/libFLAC,” and inside, I was able to determine that the FLAC project has not yet been optimized for Aarch64. The way I discovered that the FLAC project has yet to be optimized for Aarch64 was by looking at the compiler preprocessing directives. Inside that file, I could see that this project has optimizations for the x86, IA32 and PPC architectures, but not for Aarch64. After learning that the FLAC project has not been optimized for Aarch64, I submitted an Issue on GitHub. Here is a link to the issue I created: Issue #156. Inside the issue, I inquired to the maintainers of the FLAC project if they were open to me, adding some optimizations for Aarch64. One of the maintainers of the FLAC project responded and is open to me, adding Aarch64 support to the FLAC project. Now that I got approval to work on this repo, I could now start on my benchmarking.

Benchmarking the FLAC project:

Step one to benchmarking the FLAC project was building the projects on both x86 and Aarch64. I downloaded the source code from the FLAC GitHub, so the first step to building the library was to run the “autogen.sh” script. Which created my “./configure” script. I then ran the configure script with the “-pg” command to allow gprof, so my configuration command was “./configure CFLAGS=”-g -pg -O2″ CXXFLAGS=”-g -pg -O2″.” I was then able to use “make -j” to build the code. Unfortunately, I could not get gporf working with the main FLAC binary. It would only send garbage data to the “gprof.out” file. Though I did get gprof working with some of the tests that were included with the source code, so I know the “-pg” worked. I ended up switching to perf for my profiling. I ran Make clean and then re-ran Configure so I could use perf. “./configure CFLAGS=”-g -O2″ CXXFLAGS=”-g -O2″ .” After doing that, I grabbed my test data and ran perf record. The command I used to test with is “src/flac/flac input.wav” this command runs FLAC and passes a wave file to it. For my sizable test data, I took one of the live multitrack recordings that I had on my computer and exported the mix to a stereo wave file. The test wave file is 1 hour and 31 minutes long and is 1.57 gigabytes.

Using Perf:

Perf did a fantastic job of helping me find what I want to optimize for this project. The part of FLAC that I tested was the encoding, specifically “.wav” encoding to “.flac.” I ran the same test file on the two different architectures, and the performance difference was noticeable instantly. On the x86 machine, it will run the encoding in 16.439s. And the Aarch64 machine took 4minutes and 39.585s. (These times are from one of the tests I did. I ran the test multiple times with similar results.) I then took a look at the perf report.

Perf Report:

In analyzing the perf reports, I was able to narrow down where I should target my optimizations. You can see from the following snippets of the perf report that on the Aarch64 architecture, the function called “FLAC__lpc_compute_autocorrelation” is taking about 49.64% of the run time. Versus, on the x86 report it uses an intrinsic version of that function called “FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new” which significantly improves the performance. On the x86 machine, this function only took 7.25% of the run time. The vanilla c code in which the Aarch64 machine is running is located inside the file “ipc.c.” The Intrinsic code the x86 machine is running is located inside the file “ipc_intin_sse.c.” These files are located in the “src/libFLAC” folder.

aarch64
# Samples: 1M of event 'cycles:uppp'
# Event count (approx.): 255896905943
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ............................................................................................
    49.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation
     8.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     7.08%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
     5.65%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     5.19%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_window_data
     5.14%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
     4.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform


x86
# Samples: 61K of event 'cycles:u'
# Event count (approx.): 52656457012
#
# Overhead  Command  Shared Object       Symbol                                                                                      
# ........  .......  ..................  ............................................................................................
#
    20.02%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide_intrin_ssse3
    16.78%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
    16.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
     7.43%  lt-flac  lt-flac             [.] format_input
     7.25%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new
     7.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
     5.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_residual
     4.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate

Strategy:

In reviewing the benchmarks and the code, I have narrowed down the strategy I will be doing to complete the project. Here is the list of steps I came up with to optimize the encoding of FLAC for aarch64.

  1. Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
  2. Test the pre-processor directives with some code that will cause a fault, so I know it is working.
  3. Examine the codebase to know where precisely I need to put the pre-processor directives. And check if I need to mess with the build instructions.
  4. Configure the makefile to build the new file that I am adding.
  5. I am going to focus on the “FLAC__lpc_compute_autocorrelation” function, and I am going to translate it into aarch64 intrinsic’s. I will use the existing c and x86 intrinsic code to help me with the translation.
  6. Testing my optimization, I will re-run the test that I performed on the original code with my optimized version and see if I have improved the performance on the aarch64 platform.
  7. As a stretch goal, depending on how hard it is to write the Aarch64 intrinsics, I would like to translate the full “ipc.c” file with aarch64 intrinsics.

Results To This Point:

At this point, I have found a project to work on, started learning the project structure, did research on how to use the package, I successfully built the package on both x86 and Aarch64 and performed benchmarks on the project using gprof, perf and the bash time command on both platforms. For my optimization, I have narrowed down exactly what I am going to work on for the remainder of this project and created a plan for how I will accomplish those changes.

Manipulating large data – Lab4 SPO 600

Welcome, in class, we learned about a few different ways you can represent data in a computer. We learned about integers, fixed-point, floating-point, graphics, and sound. This lab focuses on sound data, which is one of the larger forms of data.

How we record sound?

The way we record sound is by sampling it. The typical sample rate is 44.1 or 48 thousand samples per second per channel and each sample size is typically 16 or 24 bit.

What I did for this lab

Task 1

My first task was to benchmark a program that manipulated a random dataset that was simulating a 16 bit sound file. The way we made the data was by generating five million signed integers between positive 32767 and negative 32768.

Here is the code for task 1. – Task1

Task 1 – Results

Result: 94
real 0m0.058s
user 0m0.053s
sys 0m0.005s

Task 1 – Analysis

Using the profiler Perf I can see that the total time running the scaling code is around 18.62% of the program runtime. About 81.38% of the time was generating random data.

Task 2

My second task was to change the formula for scaling the samples to a pre-calculated lookup table. The lookup table would contain all possible sample values multiplied by the volume factor. To get the results, you use the value as the index to look up each sample in that table to get the scaled values.

Here is the code for task 2. – Task 2

Task 2 – Results

Result: 94
real 0m0.646s
user 0m0.633s
sys 0m0.011s

Task 2 – Analysis

Using the profiler Perf I can see that the total time running the scaling code is around 34.27% of the program runtime. About 65.73% of the time was generating random data.

Task 3

My third task was to change the formula for scaling the samples to use fixed point integer math. The reason for doing this is that on most machines they can calculate integer math faster than floating-point. The way I accomplished this was by bit shifting.

The formula I used was: (((246 * 0.75) * SAMPLE) >> 8)

Here is the code for task 3. – Task 3

Task 3 – Results

Result: 873
real 0m0.522s
user 0m0.501s
sys 0m0.020s

Task 3 -Analysis

Using the profiler Perf I can see that the total time running the scaling code is around 18.24% of the program runtime. About 81.76% of the time was generating random data.

Final Analysis

I tested each test multiple times. The results were all similar, with a small variant to speed and the same output every time. The computer doing other tasks is the likely cause for the slight variant in test speeds.

I also tried with a larger sample size than 5 million and the result was the same.

The final results are fixed point integers give us the fastest result, floating-point math came in second, and the lookup table was the slowest.

Now there was a downfall to the fixed point integers. It did not give the same result as the floating-point and lookup table. The result was different since it does not calculate the decimals. The fixed point integers delete the decimal without rounding. By not rounding the numbers, the result was slightly different.

Writing In Assembler x86 and aarch64 – Lab3 SP0600

Hello,

In this post, I am talking about how I am furthering my understanding of computers, so I can understand how I can properly optimize software. I am writing about learning how to write assembler code on the x86 and aarch64 platform for my software optimization class lab.

To complete this lab I performed the following tasks :

  1. Build and run the three C versions of the program for x86_64.
    Take a look at the differences in the code.
  2. Use the objdump -d command to dump (print) the object code (machine code) and disassemble it into assembler for each of the binaries. Find the section and take a look at the code. Notice the total amount of code.
  3. Review, build and run the x86_64 assembly language programs. Take a look at the code using objdump -d objectfile and compare it to the source code. Notice the absence of other code (compared to the C binary, which had a lot of extra code).
  4. Build and run the three C versions of the program for aarch64. Verify that you can disassemble the object code in the ELF binary using objdump -d objectfile and take a look at the code
  5. Review, build and run the aarch64 assembly language programs. Take a look at the code using objdump -d objectfile and compare it to the source code.
  6. Make a loop from 0 to 9, on x86 and aarch64
  7. Extend the code to loop from 00-30, printing each value as a 2-digit decimal number, on x86 and aarch64

How I used a Makefile

Since this lab required testing, reviewing, creating and running many files I decided to load everything into a Makefile.

In doing this I learned that I can call Makefiles in other folders.
The way I did that was by adding a target to the main Makefile and typing in “cd /route/to/makefile && make all”

In the attached folders you can see the Makefile I created.

Task 1

The three c programs all perform the same task of printing “Hello World!”, but they do it in 3 different ways.

Program 1: Uses printf()
Program 2: Uses write()
Program 3: Uses syscall()

Task 2

After Reviewing the output of the objdump I can see that program 1 uses the least amount of code at 8 lines but it is using printf which has the most overhead of the three functions. Program 2 using write which should have less overhead uses 12 lines of code. And finally program 3 also uses 12 lines of code and since we are using a syscall we have very little overhead.

Task 3

Yes, Since we are now compiling straight from assembler we don’t have the overhead of the c language. This cut the program down in size drastically now the how objdump file is only 11 lines of code.

Task 4

Here is the total line count the three c programs took to run on aarch64. Pretty similar results.

Program 1: 10 lines
Program 2: 12 lines
Program 3: 12 lines

Task 5

Surprisingly, the results are identical to the x86 in term of line count. The aarch64 Hello world program used 11 lines of code the same as x86.

Something interesting I noticed about the compiled code is that it transformed all the numbers to hexadecimal.

Task 6

Here is my loops 0-9 on x86 and aarch64.

/* x86 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $start,%r15         /* loop index */
   
loop:
    /* ... body of the loop ... do something useful here ... */
    mov     $len,%rdx
    
    mov     $48,%r14
    add     %r15,%r14
    
    movb    %r14b,msg+6
    mov     $msg,%rsi

    mov     $1,%rdi
    mov     $1,%rax
    syscall 

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall
.data 
msg: .ascii "Loop:  \n"
    len = . - msg
/* aarch64 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 10                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     x30,start           /* loop index */
    
loop:
    
    mov     x19,48
    mov     x26,max
    mov     x27,1
    adr     x28,msg

    add     x19,x30,x19

    strb     w19,[x28,6]
    ldr      x1,=msg
        
    mov     x0,1
    mov     x2,len
    mov     x8, 64
    svc     0

    add     x30,x27,x30             /* increment index */
    cmp     x26,x30                 /* see if we're done */
    b.ne    loop                   /* loop if we're not */

    mov     x8,93                   /* syscall sys_exit */
    svc     0

    .data

    msg: .ascii "Loop:      \n"
    len = . - msg

Task 7

Here is my loops 0-30 with the leading zero’s removed on x86 and aarch64.

/* x86 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 31                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     $start,%r15         /* loop index */
    
loop:
    /* ... body of the loop ... do something useful here ... */
    
   
    mov     $48,%r13
    mov     $48,%r14
    mov     $0,%rdx

    mov     %r15,%rax
    mov     $10,%r12
    div     %r12

    
    add     %rax,%r13
    add     %rdx,%r14
    
    cmp     $48,%r13                   /*Compare*/
    
    je continue     
    
    movb    %r13b,msg+6

continue:

    movb    %r14b,msg+7
    mov     $msg,%rsi /*send message to reg rsi*/
        
    mov     $1,%rdi
    mov     $1,%rax
    mov     $len,%rdx

    syscall

    inc     %r15                /* increment index */
    cmp     $max,%r15           /* see if we're done */
    jne     loop                /* loop if we're not */

    mov     $0,%rdi             /* exit status */
    mov     $60,%rax            /* syscall sys_exit */
    syscall

    .data

    msg: .ascii "Loop:   \n"
    len = . - msg
/* aarch64 */
.text
.globl    _start

start = 0                       /* starting value for the loop index; note that this is a symbol (constant), not a variable */
max = 30                        /* loop exits when the index hits this number (loop condition is i<max) */

_start:
    mov     x30,start           /* loop index */
    
loop:
    
    mov     x19,48
    mov     x20,48
    mov     x24,10  
    mov     x25,48
    mov     x26,max
    mov     x27,1
    adr     x28,msg

    
    udiv    x21,x30,x24
    msub    x22,x21,x24,x30

    add     x19,x21,x19
    add     x20,x22,x20

    cmp     x25,x19
    b.eq    continue     
    
    strb     w19,[x28,6]

continue:

    strb     w20,[x28,7]
    ldr      x1,=msg
        
    mov     x0,1
    mov     x2,len
    mov     x8, 64
    svc     0

    add     x30,x27,x30             /* increment index */
    cmp     x26,x30                 /* see if we're done */
    b.ne    loop                   /* loop if we're not */

    mov     x8,93                   /* syscall sys_exit */
    svc     0

    .data

    msg: .ascii "Loop:      \n"
    len = . - msg

Download my files