React Practice With GraphQL


Hello, and Welcome to my blog!

Recently I have been working through some tutorials from Thinkster.io. In this blog post, I will be reviewing my experience going through the React and Redux tutorial.

I completed the tutorial using the GraphQL API I made. The tutorial uses a REST API, so I had to make lots of changes. The changes made completing the tutorial more challenging, but they also made it more interesting. I am glad I did it.

I used Apollo Client for React as the GraphQL client to connect to my API. Overall, I liked Apollo, but since I was also using Redux, it made things a bit difficult at times. Apollo and Redux both do state management, so for future GraphQL applications I would probably not use them both at the same time. But for the practice, I am glad I used Redux again.

It was fun using React again. My last React project ended in December. Completing this practice project was a great learning experience for me. I became more comfortable with making functional components. I now know when to use functional components over the traditional class component. In my last React project, I didn’t re-use components much. In this project, I got practice re-using a bunch of components.

I also changed up the router: the tutorial used React Router, and I switched to Reach Router. The main reason for switching was some incompatibility with Apollo, if I remember correctly. Overall, I found it works better than the traditional React Router, and I liked it a lot.

Another cool thing I learned while completing this tutorial was how to make custom Redux middleware. I created a custom middleware to handle setting and clearing the JWT from the browser. It worked amazingly well, and it was easy to do. I will link the GitHub below if you're interested; the code is in the file src/middleware.js.

To finish the project, I deployed the GraphQL server on Heroku and used an Amazon S3 bucket to host the web app. Here is the link to the site: http://react.practice.site.s3-website.ca-central-1.amazonaws.com/ 

To conclude, I had a great time doing this project, and I learned a lot. To run the project, I am using a MongoDB database hosted on MongoDB Atlas, a GraphQL server hosted on Heroku, and the React application hosted in an Amazon S3 bucket.

Here are the links to the two GitHub repositories for the project.
https://github.com/coreyjjames/REACT-PRACTICE-APP-GRAPHQL
https://github.com/coreyjjames/GRAPHQL-REALWORLD-API-TYPESCRIPT

GraphQL RealWorld API – TypeScript, JWT, MongoDB, Express and Node.js

Hello, Welcome to my blog!

Following up on my most recent post, "RealWorld API – TypeScript, JWT, MongoDB, Express and Node.js", I have modified the REST API I made into a GraphQL API.

What is GraphQL?

GraphQL is a query language for an API. GraphQL gives front-end applications control over exactly which data they receive per request. All the data is available to you when you create a request, which means you can mix and match data per request, drastically cutting down on the number of calls the front-end application has to make to the API.

My experience using GraphQL

I enjoyed using GraphQL, and I think it allows more flexibility than a REST API. There are many different flavours of GraphQL; I used Apollo Server Express. It was a bit of a steep learning curve, but once I built a few queries, I saw how easy it was to use.

Here is a list of some things that I learned about GraphQL. Examples can be found by viewing my code on GitHub.

  • GraphQL is not a database; it replaces your API routing layer. It can be used with any database.
  • GraphQL uses a single POST route. On that POST route, you make your queries using the query language inside the HTTP body.
  • Unless specified, GraphQL will look for queries under the Query type.
  • The keyword mutation is used to tell GraphQL to change from looking at the default Query type to the Mutation type.
  • It is best practice to put all data-fetching queries under the Query type and all data-writing queries under the Mutation type.
  • Postman works great for testing your queries.
  • You need to define your schema using GraphQL's schema language (GQL).
  • With Apollo, authentication and error handling can be set up when you create the Apollo server.

Conclusion

I will definitely use GraphQL for future projects. I like how flexible it is and that it is available for many different programming languages.

The code for this project can be found on my GitHub.
https://github.com/coreyjjames/GRAPHQL-REALWORLD-API-TYPESCRIPT

RealWorld API – TypeScript, JWT, MongoDB, Express and Node.js

Hello, Welcome to my blog!

Recently, I subscribed to a tutorial website called Thinkster.io. Thinkster has tutorials dedicated to making production-ready applications, covering many different frameworks. I decided to go through their Node.js tutorial.

I also wanted to get more comfortable with TypeScript, so I changed the tutorial up some by implementing it in TypeScript. The tutorial is originally written in plain JavaScript, so I had the extra challenge of translating it to TypeScript as I went through it. Going through this tutorial was a great experience, and I learned some great techniques for building an API with Node.js and TypeScript.

Programming with TypeScript was enjoyable. It cleaned up the code a lot, which is always a great thing. Since I have programmed in C, C++ and C# in the past, I also found working with TypeScript very familiar.

I posted my code for this project on my GitHub. https://github.com/coreyjjames/REALWORLD-API-TYPESCRIPT

The tutorial also covered Postman, an application that allows you to test your API. I liked it a lot. I have lightly used Postman in the past, but this time I used it extensively. I love how you can write test cases in Postman; it was handy at the end of the project for catching any bugs that were missed during the main development period.

In the next couple of weeks, I would like to improve this API some more by using a query language called GraphQL.

I’ll post another blog post when I finish!

If anyone has some improvements I can make to the code, let me know. 🙂

Rule Of Five – C++

Hello and Welcome to my blog!

To keep my skills sharp while I am looking for work, I have been reviewing what I have learned at college.

Today I reviewed the Rule of Five in C++. The rule of five is a programming guideline stating that when a class implements any of the following special member functions, it should implement all of them. A rule of three exists as well; the rule of five is an expanded version of it that adds the move operations.

The Five functions in the rule of five are…

1 – Copy Assignment Operator
2 – Copy Constructor
3 – Destructor
4 – Move Assignment Operator
5 – Move Constructor

I decided to implement the rule of five in two ways: first as a templated class and second as a non-templated class.

This code was good practice, and I am planning on releasing more blogs like this one soon. All the code is below, thanks for reading my blog.

Templated Rule Of Five

template < class T >
  class RuleOfFiveTemplated {
    size_t size = 0;
    T * data = nullptr;
    public:
      // Default Constructor
      RuleOfFiveTemplated() {
        std::cout << (void * ) this << ": RuleOfFiveTemplated constructor()\n";
      }

    // Constructor Overload
    RuleOfFiveTemplated(size_t size): size(size), data(new T[size]) {
      cout << (void * ) this << ": RuleOfFiveTemplated (" << size << ") constructor\n";
    }

    // One - Copy Assignment Operator
    RuleOfFiveTemplated & operator = (const RuleOfFiveTemplated & rhs) {
      cout << (void * ) this << ": RuleOfFiveTemplated assignment operator, size = " << size << ", rhs.size = " << rhs.size << endl;
      if (this != & rhs) {
        delete[] data;
        data = nullptr;
        size = 0;

        if (rhs.data) {
          size = rhs.size;
          try {
            data = new T[size];
            memcpy(data, rhs.data, size * sizeof(T));
          } catch (const std::bad_alloc & err) {
            cout << (void * ) this << ": RuleOfFiveTemplated Failed to copy value inside data: " << err.what() << endl;
          }
        }
      } else {
        cout << (void * ) this << ": RuleOfFiveTemplated copy assignment operator called on itself" << endl;
      }
      return *this;
    }

    // Two - Copy Constructor
    RuleOfFiveTemplated(const RuleOfFiveTemplated & rhs) {
      cout << (void * ) this << ": RuleOfFiveTemplated copy constructor, size = " << size << ", rhs.size = " << rhs.size << endl;
      data = nullptr;
      * this = rhs;
    }

    // Three - Move Assignment Operator
    RuleOfFiveTemplated & operator = (RuleOfFiveTemplated && rhs) noexcept {
      cout << (void * ) this << ": RuleOfFiveTemplated move assignment operator, size = " << size << ", rhs.size = " << rhs.size << endl;
      if (this != & rhs) {
        delete[] data;

        size = rhs.size;
        data = rhs.data;

        rhs.size = 0;
        rhs.data = nullptr;
      } else {
        cout << (void * ) this << ": RuleOfFiveTemplated move assignment operator called on itself\n";
      }
      return *this;
    }

    // Four - Move Constructor
    RuleOfFiveTemplated(RuleOfFiveTemplated && rhs) noexcept {
        cout << (void * ) this << ": RuleOfFiveTemplated move constructor, size = " << size << ", rhs.size = " << rhs.size << endl;
        data = nullptr;
        * this = std::move(rhs);
      }

      // Five - Destructor
      ~RuleOfFiveTemplated() {
        cout << (void * ) this << ": RuleOfFiveTemplated destructor, size=" << size << "\n";
        delete[] data;
      }

    // Print
    void print() {
      cout << (void * ) this << ": size=" << size << " (" << size * sizeof(T) << " BYTES)\n";
    }
  };

Non-Templated Rule Of Five

class RuleOfFive {
	size_t size = 0;
	double* data = nullptr;
public:
	RuleOfFive() {
		cout << (void*)this << ": RuleOfFive default constructor" << endl;
	}

	RuleOfFive(size_t size) : size(size), data(new double[size]) {
		cout << (void*)this << ": RuleOfFive constructor overload." << endl;
	}

	// One - Copy assignment operator
	RuleOfFive& operator=(const RuleOfFive& rhs) {
		cout << (void*)this << ": RuleOfFive copy assignment operator, size = " << size << ", rhs.size = " << rhs.size << endl;
		if (this != &rhs) {
			delete[] data;
			data = nullptr;
			size = 0;

			if (rhs.data) {
				size = rhs.size;
				try
				{
					data = new double[size];
					memcpy(data, rhs.data, size * sizeof(double));
				}
				catch (const std::bad_alloc&)
				{
					cout << (void*)this << ": RuleOfFive Failed to copy value inside data" << endl;
				}
			}
		}
		else {
			cout << (void*)this << ": RuleOfFive copy assignment operator called on itself" << endl;
		}

		return *this;
	}

	// Two - Copy Constructor
	RuleOfFive(const RuleOfFive& rhs) {
		cout << (void*)this << ": RuleOfFive copy constructor, size = " << size << ", rhs.size = " << rhs.size << endl;
		data = nullptr;
		*this = rhs;
	}

	// Three - Move Assignment operator
	RuleOfFive& operator=(RuleOfFive&& rhs) noexcept {
		cout << (void*)this << ": RuleOfFive move assignment operator, size = " << size << ", rhs.size = " << rhs.size << endl;
		if (this != &rhs) {
			delete[] data;

			size = rhs.size;
			data = rhs.data;

			rhs.size = 0;
			rhs.data = nullptr;
		}
		else {
			cout << (void*)this << ": RuleOfFive move assignment operator called on itself" << endl;
		}

		return *this;
	}

	// Four - Move constructor
	RuleOfFive(RuleOfFive&& rhs) noexcept {
		cout << (void*)this << ": RuleOfFive move constructor, size = " << size << ", rhs.size = " << rhs.size << endl;
		data = nullptr;
		*this = std::move(rhs);
	}

	// Five - Destructor
	~RuleOfFive() {
		cout << (void*)this << ": RuleOfFive destructor" << endl;
		delete[] data;
	}

	// Print
	void print()
	{
		cout << (void*)this << ": size=" << size << " (" << size * sizeof(double) << " BYTES)" << endl;
	}
};

Main Function

#include <iostream>
#include <cstring> 
using namespace std;

int main(int argc, char** argv)
{
	cout << "Rule Of Five" << endl << endl;

	// Test Default Constructors
	cout << endl << "Testing Default Constructor" << endl << "=========================================" << endl;

	RuleOfFive x; 
	cout << "x created : ";
	x.print();
	
	RuleOfFiveTemplated<double> tx;
	cout << "tx created : ";
	tx.print();

	// Test Constructor Overloads
	cout << endl << "Testing Constructor Overloads" << endl << "=========================================" << endl;

	RuleOfFive y(1000);
	cout << "y created : ";
	y.print();

	RuleOfFiveTemplated<double> ty(1000);
	cout << "ty created : ";
	ty.print();

	// Test Copy Assignment operator
	cout << endl << "Test Rule One - Copy Assignment operator" << endl << "=========================================" << endl;

	x = y;
	cout << "x : lhs : ";
	x.print();
	cout << "y : rhs : ";
	y.print();

	tx = ty;
	cout << "tx : lhs : ";
	tx.print();
	cout << "ty : rhs : ";
	ty.print();

	// Test Copy Constructor
	cout << endl << "Test Rule Two - Copy Constructor" << endl << "=========================================" << endl;

	RuleOfFive z(x);
	cout << "z created : lhs : ";
	z.print();
	cout << "x : rhs : ";
	x.print();

	RuleOfFiveTemplated<double> tz(tx);
	cout << "tz created : lhs : ";
	tz.print();
	cout << "tx : rhs : ";
	tx.print();

	// Test Move Assignment operator
	cout << endl << "Test Rule Three - Move Assignment operator" << endl << "=========================================" << endl;

	y = std::move(x);
	cout << "y : lhs : ";
	y.print();
	cout << "x : rhs : ";
	x.print();

	ty = std::move(tx);
	cout << "ty : lhs : ";
	ty.print();
	cout << "tx : rhs : ";
	tx.print();

	// Test Move Constructor
	cout << endl << "Test Rule Four - Move Constructor" << endl << "=========================================" << endl;

	RuleOfFive a(std::move(y));
	cout << "a created : lhs : ";
	a.print();
	cout << "y : rhs : ";
	y.print();

	RuleOfFiveTemplated<double> ta(std::move(ty));
	cout << "ta created : lhs : ";
	ta.print();
	cout << "ty : rhs : ";
	ty.print();

	// Test Destructors
	cout << endl << "Test Rule Five - Destructors" << endl << "=========================================" << endl;
}

	

Reversing Words Order in C++ and Using CMake

I saw this challenge in a job posting and thought I would give it a try. I also set up CMake for the first time to build this project, and it worked great. I was able to build the program on my Windows machine from the command line, as I wanted. I think I like Linux Make a bit better, though; it's a bit simpler. I don't have a Linux machine set up at the moment, so I am glad I got CMake working.

Challenge: Reversing Words Order

This is a pretty simple challenge: write a program that can take a phrase like "Keyboard and Mouse" and output "Mouse and Keyboard".

My Code:

#include <iostream>
using namespace std;

char *reverseWordOrder(char *, int);

int main()
{
    const int size = 19;
    char wordsToReverse[size] = "Keyboard and Mouse";
    cout << reverseWordOrder(wordsToReverse, size) << endl;
    return 0;
}

char *reverseWordOrder(char *words, int size)
{
    char *reversedWords = new char[size];
    int cursor = 0;

    for (int i = size - 1; i >= 0; i--) // start at the terminating '\0', not one past the end of the buffer
    {
        if (words[i] == ' ' || i == 0)
        {
            int j = (i == 0) ? i : i + 1;

            while (words[j] != ' ' && words[j] != '\0')
            {
                reversedWords[cursor] = words[j];
                cursor++;
                j++;
            }

            if (cursor < size - 1)
            {
                reversedWords[cursor] = ' ';
                cursor++;
            }
            reversedWords[cursor] = '\0';
        }
    }

    return reversedWords;
} 

Conclusion:

I looked up a solution online after I finished mine. Performance-wise, I believe mine would be faster. I chose to navigate backwards through the phrase, grab each word, and add it to a new string.

Example: (Note: This example is not exactly what the code is doing)
Input: Keyboard and Mouse
Step 1: Get the last word.
Output: Mouse
Step 2: Get the next word.
Output: Mouse and
Step 3: Get the next word.
Output: Mouse and Keyboard

In the solution I found online, they reversed each word individually and then reversed the whole phrase. Reversing the words and then the phrase is a smart way of completing the task, but it is more expensive: that algorithm visits each letter around four times, while mine visits each letter about two times. A short sketch of that approach follows the example below.

Example: (Note: This example is not exactly what the code is doing)
Input: Keyboard and Mouse
Step 1: Reverse each word.
Output: draobyeK dna esuoM
Step 2: Reverse the whole phrase.
Output: Mouse and Keyboard
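
For comparison, here is a minimal sketch of that double-reversal approach (my own illustration, not the code from the linked solution), written with std::string and std::reverse:

#include <algorithm>
#include <iostream>
#include <string>

// Reverse each word in place, then reverse the whole phrase.
std::string reverseWordOrderByDoubleReversal(std::string phrase)
{
    // Step 1: "Keyboard and Mouse" -> "draobyeK dna esuoM"
    std::size_t start = 0;
    while (start < phrase.size())
    {
        std::size_t end = phrase.find(' ', start);
        if (end == std::string::npos)
            end = phrase.size();
        std::reverse(phrase.begin() + start, phrase.begin() + end);
        start = end + 1;
    }

    // Step 2: "draobyeK dna esuoM" -> "Mouse and Keyboard"
    std::reverse(phrase.begin(), phrase.end());
    return phrase;
}

int main()
{
    std::cout << reverseWordOrderByDoubleReversal("Keyboard and Mouse") << std::endl;
    return 0;
}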

Analysis of FLAC – Stage Three of Three – SPO600 Project

Hello and welcome to my blog!

For my final post in this series about my SPO 600 project, I will discuss the last part of the project: the pull request to the upstream project. I will also review what I have learned while working on this project.

PULL REQUEST:

Here is the link to the pull request I created: https://github.com/xiph/flac/pull/183

The FLAC project has an extensive testing system that gets run by the Travis CI continuous integration service. It took 2 hours, 28 minutes and 15 seconds for Travis to finish the tests on my pull request. Travis builds and runs the FLAC program in 16 different ways for the test. There are four different build settings that are tested, and each build setting gets compiled twice: once by the GCC compiler and once by the Clang compiler. Travis runs these eight tests twice: once on a Linux machine and once on a Mac OSX machine. To speed up the tests, Travis runs multiple test scenarios at a time. The full run-time of all the test scenarios is 6 hours and 57 seconds.

My changes passed all the tests Travis ran. That means the changes I made didn’t affect the regular operation of the program.

It has been a little over a week since I made the PR. It is currently not merged, and I have not heard from the project owner. The day I made the pull request, I was asked whether the CPU feature detection I did happens at run time or compile time. Initially, I thought the question was referring to the detection of the CPU architecture, which gets performed at compile time. After some discussion, I understood that the original question was about choosing the correct version of a function based on variables at run time. I did implement that: the selection of which autocorrelation function to run is decided by the variable "encoder->protected_->max_lpc_order." As far as I can tell, this question came from another member of the community and not a project maintainer, but I could be wrong.
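
To illustrate what I meant by run-time selection, here is a small self-contained sketch of the pattern (the function names and signature here are stand-ins, not the real FLAC code): the encoder picks an autocorrelation routine based on max_lpc_order when it is set up, so the choice happens at run time even though the architecture-specific code is selected at build time.

#include <cstdio>

// Signature shared by all the autocorrelation variants (simplified).
typedef void (*autocorrelation_fn)(const double data[], unsigned data_len,
                                   unsigned lag, double autoc[]);

// Stand-ins for the plain C function and the NEON lag variants.
static void autoc_c(const double[], unsigned, unsigned, double[])          { std::puts("C version"); }
static void autoc_neon_lag_8(const double[], unsigned, unsigned, double[])  { std::puts("NEON lag 8"); }
static void autoc_neon_lag_12(const double[], unsigned, unsigned, double[]) { std::puts("NEON lag 12"); }
static void autoc_neon_lag_16(const double[], unsigned, unsigned, double[]) { std::puts("NEON lag 16"); }

// Pick a variant based on the encoder's maximum LPC order.
static autocorrelation_fn select_autocorrelation(unsigned max_lpc_order)
{
    if (max_lpc_order <= 8)  return autoc_neon_lag_8;
    if (max_lpc_order <= 12) return autoc_neon_lag_12;
    if (max_lpc_order <= 16) return autoc_neon_lag_16;
    return autoc_c; // anything larger falls back to the portable C version
}

int main()
{
    double data[4] = {0.0, 1.0, 2.0, 3.0};
    double autoc[4] = {0.0};

    autocorrelation_fn fn = select_autocorrelation(12); // decided at run time
    fn(data, 4, 4, autoc);
    return 0;
}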

Since that question, there has not been any more activity on the pull request.

PROJECT REVIEW:

This course and project have been a great learning experience for me. I have learned a lot about low-level programming through the lectures and the project, especially about compilation. For this project, I learned a lot about the GNU Autotools build system. I now know how to read and edit a configure script, and with the practice I got from working with this project, I am now comfortable with the whole build process using a configure script and make.

Through this course and project, I also learned about using assembler and intrinsics in the C language. I gained practical experience working with x86 and aarch64 intrinsics during this project. Specifically, I learned how to use the intrinsics documentation to find the correct intrinsic I need and to understand how to use each intrinsic.

This project was the first major C/C++ project I worked on. I learned a lot about the build process of a C/C++ project. For example, how to detect the hardware details and the available software on a computer.

I also gained a lot of practice with profilers. I used profilers in the past for previous course projects, but this project gave me a practical way to use them: finding hot spots in a program and helping measure the performance changes from the optimizations I made to the program.

Conclusion

I have enjoyed doing this project, and I am glad I could mix my interests in audio and computer programming. This project has been a great learning experience and I would be interested in doing this type of work in the future.

Analysis of FLAC – Stage Two of Three – SPO600 Project

Hello and welcome to my blog!

In this blog post, I will be writing about my final project for my software portability and optimization class, SPO 600. In this post, I will review the work I have done with the FLAC project. If you didn't read the first post in this series, you can find it here. A quick review of the last post: for my SPO 600 class, I was tasked with finding and optimizing an open-source project. Knowing I wanted to work with audio, I found the open-source project called FLAC (Free Lossless Audio Codec). After further investigation, I made a plan for how to optimize the FLAC library. In this blog post, I will go over the implementation and results of my optimization.

Execution of my plan:

In the previous blog post, I laid out a strategy for the completion of my optimization. It turned out to be quite helpful. I was able to follow it, and I completed the optimization just as planned. Below is the strategy I made and some notes on how I completed each task.

  1. Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
    By reading the FLAC code, I was able to determine that the pre-processor directives used in this project rely on variables defined by the configure script. I read the "configure.ac" file, online Autotools documentation, and the "configure.ac" file from the OPUS project, which is also created by Xiph, the same people who created FLAC. From that, I was able to determine how to check whether the user has an aarch64 CPU and whether the arm-neon intrinsics are available. After confirming that I am on an aarch64 machine, I define "FLAC__CPU_AARCH64," and after confirming the arm-neon intrinsics are available, I define "FLAC__HAS_NEONINTRIN." (A sketch of this conditional-compilation pattern follows the list below.)
  2. Test the pre-processor directives with some code that will cause a fault, so I know it is working.
    Since the FLAC project has some optimizations for other platforms, I needed to follow the same pattern as the previous contributors. To test the preprocessor directives, I had to add the new architecture to the function selection logic. To do that, I added code to "src/libFLAC/cpu.c", "src/libFLAC/include/private/cpu.h", "src/libFLAC/include/private/lpc.h" and "src/libFLAC/stream_encoder.c." Once I had the architecture function selection logic done, I was ready to test. I did not end up needing to cause any faults to confirm that the preprocessor directives were working. Instead, I used a printf statement and a copy of the vanilla C version of the autocorrelation function. When running the program, I was able to see the printf messages, and I also used perf to confirm I was using the new function.
  3. Examine the codebase to know where precisely I need to put the pre-processor directives. And check if I need to mess with the build instructions.
    I ended up doing this as part of steps one and two, since the FLAC project doesn't use pre-defined variables for the pre-processor directives; it uses variables defined by the configure script, so I had to examine the codebase and implement the changes just to run a test.
  4. Configure the makefile to build the new file that I am adding.
    I added one file to the project called lpc_intrin_neon.c. So for the compiler to build it, I put it into the list of source files inside “src/libFLAC/Makefile.am.”
  5. I am going to focus on the "FLAC__lpc_compute_autocorrelation" function and translate it into aarch64 intrinsics. I will use the existing C and x86 intrinsic code to help me with the translation.
    Success! It took some time, but I was able to translate the x86 code into aarch64. I did this by using the Intel and arm-neon online documentation. I also got help by googling a specific x86 intrinsic and asking which arm-neon instruction does the same or a similar thing. For a few intrinsics, there was no direct replacement. Specifically, there is no shuffle in arm-neon, so I had to read up on how shuffle works on x86 and reproduce it using multiple arm-neon instructions. I ended up creating inline functions for the shuffles to make writing the code more manageable and cleaner.
  6. Testing my optimization, I will re-run the test that I performed on the original code with my optimized version and see if I have improved the performance on the aarch64 platform.
    I tested with two aarch64 machines. The first machine has faster single-thread performance and 8 threads. The second machine has 24 threads but slower single-thread performance. On the first machine, the autocorrelation function initially took 26.11 percent of the runtime; after the optimizations, it took 12.64 percent. On the second machine, the autocorrelation function initially took 52.41 percent of the runtime; after the optimizations, it took 14.78 percent. I also tested the optimizations on an x86 machine to confirm that the changes did not affect that architecture.
  7. As a stretch goal, depending on how hard it is to write the Aarch64 intrinsics, I would like to translate the full "lpc.c" file with aarch64 intrinsics.
    I didn't end up translating the full "lpc.c" file, but I did translate all versions of the autocorrelation function. There are four versions of the autocorrelation function, and depending on the lag, the correct version is called: lag 4, lag 8, lag 12 or lag 16.
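
As a rough illustration of the conditional-compilation pattern from step one: the macro names FLAC__CPU_AARCH64 and FLAC__HAS_NEONINTRIN come from my configure changes, but the rest of this snippet (including the HAVE_NEON_AUTOC placeholder) is just a sketch, not the real FLAC sources.

#if defined(FLAC__CPU_AARCH64) && defined(FLAC__HAS_NEONINTRIN)
#include <arm_neon.h>   // only pulled in when the configure script says NEON is available
#define HAVE_NEON_AUTOC 1
#else
#define HAVE_NEON_AUTOC 0
#endif

#include <cstdio>

int main()
{
    if (HAVE_NEON_AUTOC)
        std::printf("Built with the NEON autocorrelation path\n");
    else
        std::printf("Built with the portable C path\n");
    return 0;
}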

Full Results:

The following results are not averaged, but I did run these tests multiple times with similar results. The numbers below are from a few of the many tests I performed.

Aarch64 Machine 1:

TOTAL RUNTIME OF THE TEST BEFORE OPTIMIZATION:

real    0m51.784s
user 0m49.356s
sys 0m2.349s

TOTAL RUNTIME OF THE TEST AFTER OPTIMIZATION:

real    0m43.503s
user 0m40.950s
sys 0m2.470s

PERF REPORT BEFORE OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 208K of event 'cycles:u'
 Event count (approx.): 98509947650
 #
 Overhead  Command   Shared Object           Symbol
 ........  ........  ....................  ..............................................................................................
 #
     26.11%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_autocorrelation
     25.54%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__fixed_compute_best_predictor_wide
     11.35%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__bitwriter_write_rice_signed_block
      9.45%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Transform
      5.95%  lt-flac   lt-flac                 [.] format_input
      5.60%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      3.42%  lt-flac   libFLAC.so.8.3.0        [.] precompute_partition_info_sums_
      2.34%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__MD5Accumulate
      2.21%  lt-flac   libFLAC.so.8.3.0        [.] FLAC__crc16

PERF REPORT AFTER OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 175K of event 'cycles:u'
 Event count (approx.): 81871492155
 #
 Overhead  Command  Shared Object       Symbol
 ........  .......  ..................  ..............................................................................................
 #
     30.58%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     13.36%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     12.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_neon_lag_12
     11.71%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
      7.16%  lt-flac  lt-flac             [.] format_input
      5.18%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      4.16%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
      3.00%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate
      2.62%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__crc16

Aarch64 Machine 2:

TOTAL RUNTIME OF THE TEST BEFORE OPTIMIZATION:

real    3m43.841s
user    3m33.558s
sys     0m8.791s

TOTAL RUNTIME OF THE TEST AFTER OPTIMIZATION:

real    2m3.675s
user    1m54.260s
sys     0m8.588s

PERF REPORT BEFORE OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 901K of event 'cycles:uppp'
 Event count (approx.): 213328075836
 #
 Overhead  Command  Shared Object     Symbol
 ........  .......  ................  ..............................................................................................
 #
     52.41%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_autocorrelation
     11.36%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_best_predictor_wide
      6.62%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__bitwriter_write_rice_signed_block
      5.80%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Transform
      4.35%  lt-flac  lt-flac           [.] format_input
      4.05%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      2.69%  lt-flac  libFLAC.so.8.3.0  [.] precompute_partition_info_sums_
      2.52%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Accumulate
      2.10%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_residual

PERF REPORT AFTER OPTIMIZATION (First 20 Lines):

To display the perf.data header info, please use --header/--header-only options.
 #
 #
 Total Lost Samples: 0
 #
 Samples: 620K of event 'cycles:uppp'
 Event count (approx.): 144725968757
 #
 Overhead  Command  Shared Object     Symbol
 ........  .......  ................  ..............................................................................................
 #
     15.03%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_best_predictor_wide
     14.78%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_autocorrelation_intrin_neon_lag_12
     10.06%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__bitwriter_write_rice_signed_block
     10.03%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
      9.14%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__lpc_window_data
      9.09%  lt-flac  libFLAC.so.8.3.0  [.] precompute_partition_info_sums_
      8.07%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__MD5Transform
      6.14%  lt-flac  libFLAC.so.8.3.0  [.] FLAC__fixed_compute_residual
      5.68%  lt-flac  lt-flac           [.] format_input

Code Changes:

On GitHub, I created a pull request inside my fork of FLAC HERE. With this pull request, you will be able to see all the code changes I made to the FLAC project.

Analysis of FLAC – Stage One of Three – SPO600 Project

Hello and welcome to my blog!

In this blog post, I will be writing about my final project for SPO600. The goal of this project is to optimize an open-source library. To complete the project, I had to choose one of the following approaches: altering build options, changing code to permit better optimization by the compiler, improving algorithms, or writing in-line assembler.

What I did to find an Open-source project:

For a long time, I have been working with audio as a hobby. I actively produce music, mix live bands, calibrate PA speakers and play drums. So, after completing the lab about changing the volume of sound, I knew I wanted to work with audio. I started with an open-source project I knew about that works with audio called Audacity, an open-source DAW (Digital Audio Workstation). I looked around the Audacity source for a bit and then decided to dig into its dependencies. I got the idea of looking at the dependencies from my professor, who suggested that a library would be a great place to find an opportunity to optimize some code.

In looking at the Audacity dependencies, I found the library called FLAC by Xiph. FLAC is an acronym for Free Lossless Audio Codec. Similar to what I did with Audacity, I started looking at the source code. I navigated the source code using the search on GitHub and the find command in bash, looking for architecture-specific code, so my searches were for terms like "x86" or "aarch64". Between searching for keywords and browsing the folders, I found a file called "cpu.c" in "src/libFLAC," and from the compiler preprocessor directives inside it, I was able to determine that the FLAC project has not yet been optimized for Aarch64. Inside that file, I could see that the project has optimizations for the x86, IA32 and PPC architectures, but not for Aarch64.

After learning that the FLAC project has not been optimized for Aarch64, I submitted an issue on GitHub. Here is a link to the issue I created: Issue #156. In the issue, I asked the maintainers of the FLAC project whether they were open to me adding some optimizations for Aarch64. One of the maintainers responded and is open to me adding Aarch64 support to the project. With approval to work on this repo, I could start on my benchmarking.

Benchmarking the FLAC project:

Step one of benchmarking the FLAC project was building it on both x86 and Aarch64. I downloaded the source code from the FLAC GitHub, so the first step to building the library was to run the "autogen.sh" script, which created the "./configure" script. I then ran the configure script with the "-pg" flag to enable gprof, so my configuration command was ./configure CFLAGS="-g -pg -O2" CXXFLAGS="-g -pg -O2". I was then able to use "make -j" to build the code. Unfortunately, I could not get gprof working with the main FLAC binary; it would only send garbage data to the "gprof.out" file. I did get gprof working with some of the tests included with the source code, though, so I know the "-pg" flag worked.

I ended up switching to perf for my profiling. I ran make clean and then re-ran configure without "-pg" so I could use perf: ./configure CFLAGS="-g -O2" CXXFLAGS="-g -O2". After doing that, I grabbed my test data and ran perf record. The command I used to test with was "src/flac/flac input.wav", which runs FLAC and passes a wave file to it. For my test data, I took one of the live multitrack recordings I had on my computer and exported the mix to a stereo wave file. The test wave file is 1 hour and 31 minutes long and is 1.57 gigabytes.

Using Perf:

Perf did a fantastic job of helping me find what I want to optimize for this project. The part of FLAC that I tested was the encoding, specifically ".wav" encoding to ".flac." I ran the same test file on the two different architectures, and the performance difference was noticeable instantly. On the x86 machine, the encoding ran in 16.439s, and the Aarch64 machine took 4 minutes and 39.585 seconds. (These times are from one of the tests I did; I ran the test multiple times with similar results.) I then took a look at the perf report.

Perf Report:

In analyzing the perf reports, I was able to narrow down where I should target my optimizations. You can see from the following snippets of the perf report that on the Aarch64 architecture, the function called "FLAC__lpc_compute_autocorrelation" takes about 49.64% of the run time. On the x86 machine, by contrast, an intrinsic version of that function called "FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new" is used, which significantly improves the performance; on the x86 machine, this function only took 7.25% of the run time. The vanilla C code that the Aarch64 machine runs is located inside the file "lpc.c." The intrinsic code the x86 machine runs is located inside the file "lpc_intrin_sse.c." These files are located in the "src/libFLAC" folder.

aarch64
# Samples: 1M of event 'cycles:uppp'
# Event count (approx.): 255896905943
#
# Overhead  Command  Shared Object       Symbol
# ........  .......  ..................  ............................................................................................
    49.64%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation
     8.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide
     7.08%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide
     5.65%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
     5.19%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_window_data
     5.14%  lt-flac  libFLAC.so.8.3.0    [.] precompute_partition_info_sums_
     4.54%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform


x86
# Samples: 61K of event 'cycles:u'
# Event count (approx.): 52656457012
#
# Overhead  Command  Shared Object       Symbol                                                                                      
# ........  .......  ..................  ............................................................................................
#
    20.02%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_best_predictor_wide_intrin_ssse3
    16.78%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__bitwriter_write_rice_signed_block
    16.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Transform
     7.43%  lt-flac  lt-flac             [.] format_input
     7.25%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_autocorrelation_intrin_sse_lag_12_new
     7.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__lpc_compute_residual_from_qlp_coefficients_wide_intrin_avx2
     5.44%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__fixed_compute_residual
     4.17%  lt-flac  libFLAC.so.8.3.0    [.] FLAC__MD5Accumulate

Strategy:

In reviewing the benchmarks and the code, I have narrowed down the strategy I will follow to complete the project. Here is the list of steps I came up with to optimize the encoding of FLAC for aarch64.

  1. Research the required pre-processor directives that I will need to run the Aarch64 code inside the FLAC library conditionally.
  2. Test the pre-processor directives with some code that will cause a fault, so I know it is working.
  3. Examine the codebase to know where precisely I need to put the pre-processor directives. And check if I need to mess with the build instructions.
  4. Configure the makefile to build the new file that I am adding.
  5. I am going to focus on the "FLAC__lpc_compute_autocorrelation" function and translate it into aarch64 intrinsics. I will use the existing C and x86 intrinsic code to help me with the translation.
  6. Testing my optimization, I will re-run the test that I performed on the original code with my optimized version and see if I have improved the performance on the aarch64 platform.
  7. As a stretch goal, depending on how hard it is to write the Aarch64 intrinsics, I would like to translate the full "lpc.c" file with aarch64 intrinsics.

Results To This Point:

At this point, I have found a project to work on, started learning the project structure, researched how to use the package, successfully built the package on both x86 and Aarch64, and performed benchmarks on the project using gprof, perf and the bash time command on both platforms. For my optimization, I have narrowed down exactly what I am going to work on for the remainder of this project and created a plan for how I will accomplish those changes.

SIMD with C Intrinsics and Inline Assembler – Lab5 SPO600

Welcome to my blog. This post will be about using SIMD with C. SIMD is an acronym for Single Instruction Multiple Data. SIMD is a form of vectorization where the computer performs the same operation on multiple data points simultaneously.

What I did for this lab

In this lab, I will be looking at three different approaches to SIMD. I will be using the same code as in lab 4, a simple example of changing the volume of a sound file. I will also be using an AArch64 machine, so the inline assembler and intrinsics are for the AArch64 architecture.

Part 1:

For part 1, I let the GCC compiler do the vectorization for me using the compiler flag -O3. To confirm that the vectorization was successful, I enabled the compiler flag that reports information about the vectorization of the code, '-fopt-info-vec-all.'

I was tasked with vectorizing the last loop which performed the following line of code.

ttl = (ttl + data[x]) % 1000;

This piece of code sums up all the data; it is used to confirm that the algorithm for scaling the samples is correct and is not giving a different result than the original.

The way the above code calculates the sum will not work for vectorization because it relies on the previous value of the total. So, in order to allow auto-vectorization, I removed that dependence and changed the line of code to the following.

ttl += data[x] % 1000;
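
To show the difference in context, here is a small self-contained sketch (not the lab's exact program) contrasting the two loops. In the first form, every iteration depends on the previous value of ttl through the % 1000, which blocks auto-vectorization; in the second form, the % is applied per element and the running sum is simple enough for -O3 to vectorize.

#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    std::vector<std::int16_t> data(50000, 3); // stand-in for the scaled sound samples

    // Original form: each iteration needs the previous ttl, so it cannot be vectorized.
    int ttl = 0;
    for (std::size_t x = 0; x < data.size(); x++)
        ttl = (ttl + data[x]) % 1000;

    // Modified form: a plain running sum, which the compiler can vectorize at -O3.
    int ttl2 = 0;
    for (std::size_t x = 0; x < data.size(); x++)
        ttl2 += data[x] % 1000;

    std::printf("%d %d\n", ttl, ttl2);
    return 0;
}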

Part 2:

For this part of the lab, I first practiced writing inline assembler in another straightforward program. All this program did was perform a modulus operation on some variables and print the result; my job was to replace the modulus operation with inline assembler.

#include <stdio.h>

int main() {
    int a = 3;
    int b = 19;
    int c;
    // c = b / a (integer division)
    __asm__("udiv %0, %1, %2" : "=r"(c) : "r"(b), "r"(a) );
    // c = b - (c * a), which leaves the remainder of b / a in c
    __asm__("msub %0, %1, %2, %3" : "=r"(c) : "r"(c), "r"(a), "r"(b)  );
    printf("%d\n", c); // prints 1, i.e. 19 % 3
    return 0;
}

Now back to the sound scaling program. My professor gave me a version of the sound scaling program that contained the inline assembler. It also included questions about the code, marked with "Q:", in comments. My task for this part of the lab was to answer the questions.

The following are the questions and answers to part 2.

Question 1 Code:
register int16_t* cursor asm("r20"); // input cursor
register int16_t vol_int asm("r22"); // volume as int16_t

Question 1:
These variables will be used in our assembler code, so we're going to hand-allocate which register they are placed in. What is an alternate approach?

Answer:
Don't hand-allocate the registers; let the compiler decide which registers they are assigned to.

Question 2 Code:
vol_int = (int16_t) (0.75 * 32767.0);

Question 2:
Should we use 32767 or 32768 in this line of code? Why?

Answer:
We should use 32767 since it is the largest value we can use: a signed 16-bit integer only has 15 bits to store the magnitude. If 32768 were the largest value, we would not be able to keep the sign bit.

Question 3 Code:
asm ("dup v1.8h,%w0"::"r"(vol_int));

Question 3:
What does it mean to “duplicate” values in the next line?

Answer:
It means we are duplicating the value of vol_int across the vector register v1. So, each lane in vector one will contain the value of vol_int.

Question 4 Code:
asm (
"ldr q0, [%[cursor]], #0 \n\t"
"sqdmulh v0.8h, v0.8h, v1.8h \n\t"
"str q0, [%[cursor]],#16 \n\t"
: [cursor]"+r"(cursor)
: "r"(cursor)
: "memory"
);

Question 4:
Why is #16 included in the str line but not in the ldr line?

Answer:
We did not want to increment the cursor at ldr since we still need the current cursor position to store the values in the str command.

Question 4 Code:
asm("..."
: [cursor]"+r"(cursor)
: "r"(cursor)
: "memory"
);

Question 4:
What do these next three lines do?

Answer:
The code after the first colon is the output operand list. The code after the second colon is the input operand list. The code after the third colon is the clobber list. The word "memory" in the clobber list tells the compiler that this inline assembler affects global memory.

Question 5:
Are the results of this program usable? Are they correct?

Answer:
No. If I compare the results to the original program, I get 930 from the inline assembler code, whereas I was getting 94 from the original version of the code. I believe this is due to the fact that we are using a fixed-point representation of 0.75 for the calculations.

Performance Analysis of Part 2

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

As you can see from the results above, the inline assembler appears to be slightly slower in the real and user time categories but consistently faster in the sys time category.

Part 3:

In this part of the lab, I started with the completed code for the sound scaling program that used intrinsics. Similar to part 2 of this lab, it contained comments with questions about the code for me to answer.

The following are the questions and answers to part 3.

Question 1 Code:
vst1q_s16(cursor, vqdmulhq_s16(vld1q_s16(cursor), vdupq_n_s16(vol_int)));

Question 1:
What do these intrinsic functions do?

Answer:
The intrinsic function "vst1q_s16" stores a single vector into memory. In this case, we are storing the results of the multiplication.

The intrinsic function "vqdmulhq_s16" stands for "vector saturating doubling multiply returning high half." In this case, we are multiplying the two vectors lane by lane; we pass the two vectors as parameters.

The intrinsic function "vld1q_s16" loads a single vector from memory.

The intrinsic function "vdupq_n_s16" loads all lanes of a vector with the same value.

Question 2 Code:
cursor += 8;

Question 2:
Why is the increment 8 instead of 16 or some other value?

Answer:
Since we are using int16_t for our data, we have eight vector lanes, so we increment the cursor to the next set of eight values that we will process.

Question 3 Code:
cursor += 8;

Question 3:
Why is this line not needed in the inline assembler version of this program?

Answer:
The incrementing of the cursor gets done inside of the inline assembler, so we don’t need to have an extra increment step.

Question 4:
Are the results usable? Are they accurate?

Answer:
Similar to part 2, the results are different from the original, but they are the same as the inline assembler version.

Performance Analysis of Part 3

With a sample size of 50 Million, I got the following results:

AUTO-VECTORIZATION CODE
real 0m4.754s
user 0m4.585s
sys 0m0.160s

INLINE ASSEMBLER CODE
real 0m4.780s
user 0m4.618s
sys 0m0.150s

INTRINSICS CODE
real 0m4.768s
user 0m4.589s
sys 0m0.170s

The results are all very similar; I have run the tests multiple times. The auto-vectorized code seems to be the quickest most of the time, and the intrinsics and inline assembler versions are about the same.

CODE DOWNLOAD

Download lab files

Manipulating large data – Lab4 SPO 600

Welcome! In class, we learned about a few different ways you can represent data in a computer: integers, fixed-point, floating-point, graphics, and sound. This lab focuses on sound data, which is one of the larger forms of data.

How do we record sound?

The way we record sound is by sampling it. The typical sample rate is 44.1 or 48 thousand samples per second per channel, and each sample is typically 16 or 24 bits. For example, at 48,000 samples per second, two channels, and 16 bits per sample, that works out to about 192 kilobytes for every second of audio.

What I did for this lab

Task 1

My first task was to benchmark a program that manipulated a random dataset simulating a 16-bit sound file. We made the data by generating five million signed integers between -32768 and 32767.

Here is the code for task 1. – Task1

Task 1 – Results

Result: 94
real 0m0.058s
user 0m0.053s
sys 0m0.005s

Task 1 – Analysis

Using the profiler perf, I can see that the scaling code accounts for around 18.62% of the program runtime. About 81.38% of the time was spent generating random data.

Task 2

My second task was to change the formula for scaling the samples to use a pre-calculated lookup table. The lookup table contains every possible sample value multiplied by the volume factor. To scale a sample, you use its value as the index into that table and read out the scaled value.
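
Here is a minimal sketch of that lookup-table idea (my own illustration, not the exact lab code): the table is built once with every possible int16_t value scaled by the volume factor, and scaling a sample then becomes a single array index.

#include <cstdint>
#include <cstdio>
#include <vector>

int main()
{
    const float volume = 0.75f;

    // One entry for every possible 16-bit sample value (-32768 .. 32767).
    std::vector<std::int16_t> table(65536);
    for (int value = -32768; value <= 32767; value++)
        table[(std::uint16_t)value] = (std::int16_t)(value * volume);

    // Scaling is now a lookup instead of a floating-point multiply.
    std::int16_t sample = 1000;
    std::int16_t scaled = table[(std::uint16_t)sample];
    std::printf("%d -> %d\n", sample, scaled);
    return 0;
}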

Here is the code for task 2. – Task 2

Task 2 – Results

Result: 94
real 0m0.646s
user 0m0.633s
sys 0m0.011s

Task 2 – Analysis

Using the profiler perf, I can see that the scaling code accounts for around 34.27% of the program runtime. About 65.73% of the time was spent generating random data.

Task 3

My third task was to change the formula for scaling the samples to use fixed-point integer math. The reason for doing this is that most machines can do integer math faster than floating-point math. I accomplished this with bit shifting.

The formula I used was: (((246 * 0.75) * SAMPLE) >> 8)
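
Here is a minimal sketch of that fixed-point idea using the formula above (again my own illustration, not the exact lab code). The scale factor is precomputed once as an integer (246 * 0.75 truncates to 184, so the effective factor is 184/256), and each sample then costs an integer multiply and a shift instead of a floating-point multiply.

#include <cstdint>
#include <cstdio>

int main()
{
    const std::int32_t scale = (std::int32_t)(246 * 0.75); // precomputed once: 184

    std::int16_t sample = 1000;
    std::int16_t scaled = (std::int16_t)((scale * sample) >> 8); // roughly 0.72 * sample
    std::printf("%d -> %d\n", sample, scaled);
    return 0;
}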

Here is the code for task 3. – Task 3

Task 3 – Results

Result: 873
real 0m0.522s
user 0m0.501s
sys 0m0.020s

Task 3 – Analysis

Using the profiler perf, I can see that the scaling code accounts for around 18.24% of the program runtime. About 81.76% of the time was spent generating random data.

Final Analysis

I ran each test multiple times. The results were all similar, with small variations in speed and the same output every time. The computer doing other tasks is the likely cause of the slight variations in test speeds.

I also tried with a larger sample size than 5 million and the result was the same.

The final result: fixed-point integer math gave the fastest result, floating-point math came in second, and the lookup table was the slowest.

There was a downside to the fixed-point integers, though: they did not give the same result as the floating-point and lookup-table versions. The result was different because integer math drops the decimals; the fixed-point version truncates instead of rounding, so the result was slightly different.