During another one of my assembly-related escapades, I came across a following problem : using the finite difference formula, create a function that would calculate the values of a given function’s derivative, given the beginning and end point to be inspected, as well as the precision, e.g. how many samples (individual values) to take in that area. The function being inspected was a simple quadratic function, given by the general formula . Of course, it has a very trivial derivative of , but the computer doesn’t necessarily know that.

The finite difference formula can be summarized as follows : if we need to calculate values of a function’s derivative on the area of , then the value of the derivative can be approximated by the following formula :

where , e.g. the size of a single „step” according to the selected precision – the bigger the precision, the „smaller” the step is and thus the function is calculated with a greater granularity.

Writing some sequential code for this formula is obviously trivial. Parallelization of the computations can be a more difficult task, though. In my approach, I chose to use „masks” (couldn’t come up with a better name for these, unfortunately : they kind of resemble bitmasks, that’s why the name came to me, I guess) which represent the increments (or decrements) of their respective arguments. There are four such „masks” :

- the „dx mask”, which contains the value of multiplied by 8, since there are 8 values calculated in one iteration of the loop
- the „2dx mask”, simply containing multiplied by 2 – this is the value by which the calculated values are divided by
- the „plus mask”, which contains the values of multiplied by 1 to 8. This mask is used to calculate the value of $latex x + \Delta x$ in the loop, which is later used as an argument for the function.
- the „minus mask”, which contains the values of multiplied by -1 to 6. Analogously to the „plus mask”, it is used to calculate the value of $latex x – \Delta x$.

Each of those masks takes up a whole YMM register and is added to the current quadratic function arguments, producing new arguments for which the value of the derivative is calculated.

I do realize that this all may be quite confusing, so let me illustrate this on a diagram. It attempts to show (though poorly) what is done in one iteration of the loop.

YMM5 represents the „current position” within the function, e.g. the current argument. After adding the „plus” and „minus masks” to these values, every respective element of the resulting YMM register contains the argument to the function which corresponds to the argument required by the formula of the theorem. Therefore, after calculating the value of that function, subtracting the respective values from each other and dividing these results by , we obtain the „dval”, which is the final value of the derivative of the function at the given point. That value is written to memory. The „current position” is incremented by and the loop is executed again (unless the required number of iterations has been reached).

If there are any leftover values, they are calculated by normal sequential code, operating on one element in one iteration. The sequential loop will make at most 7 iterations, so it’s not a bottleneck by any means.

I also came across a very interesting issue related to memory performance and its correlation with the alignment of the address. All the SSE instructions required the destination (or source, in the case of using instructions operating on packed data) memory operand’s address to be 16-byte aligned. If it wasn’t aligned, a general protection fault was reported while executing that instruction. The one of the few SSE instructions that didn’t require the address to be 16-byte aligned was **movups**, but using non-16-byte aligned addresses resulted in a dramatic performance decrease. I found this not to be true anymore in the case of AVX. The Intel Basic Architecture manual says that

*Most VEX-encoded SIMD numeric and data processing instruction semantics withmemory operand have relaxed memory alignment requirements than instructions encoded using SIMD prefixes.*

(section 13.1.3)

and

*With the exception of explicitly aligned 16 or 32 byte SIMD load/store instructions, most VEX-encoded, arithmetic and data processing instructions operate in a flexible environment regarding memory address alignment, i.e. VEX-encoded instruction with 32-byte or 16-byte load semantics will support unaligned load operation by default. Memory arguments for most instructions with VEX prefix operate normally without causing #GP(0) on any byte-granularity alignment (unlike Legacy SSE instructions).*

(section 13.3)

I was curious to find out what kind of penalty an unaligned memory access actually introduces. In order to do that, I compared the performance of the function when the destination array was allocated via standard malloc (which is guaranteed to return an 8-byte aligned address when using glibc, I don’t know about any other libc implementations) and mmap (which returns addresses located on a page boundary, which is 4KB in case of x86). The results showed that there is a 5% increase in performance when using the address returned by mmap. I thought that this might be improved even more by changing the instruction from **movups** to **movaps**, self-modifying the instruction after checking the address’ alignment. The results were not satisfactory, though : differences in performance were negligible, around 0.041%, which most certainly can’t be considered an improvement considering all the extra code that must be added in order to allow the function to self-modify itself. So, the good news are : **you can always use movups for memory transfers in AVX**, since it doesn’t introduce any kind of performance penalty when the pointer is aligned, and doesn’t fail if it’s unaligned. Just keep the addresses aligned yourself and you’ll be fine.

Here’s the source of the implementation. The assembly is written using the AMD64 ABI, was compiled under gcc 4.6.3 and nasm 2.09.10. The program actually using the function is written like a benchmark : it performs 10000 executions of the function with precision 5000007, randomizing the parameters beforehand, and sums the number of ticks returned by rdtsc, writing the average number of ticks from all the iterations to stdout on exit.

Regarding my last joystick-related project : had to put it on hold until further notice. This is the kind of thing that bites you on the ass once you leave it for some time and then want to come back to it, since coming back to it takes a lot of time. I’d like to continue it, obviously, but it doesn’t seem likely nowadays.