Performance issues on Intel CPUs

Alexander Peshkoff wrote:

One of our users found that Firebird 2.x (64-bit mode) runs some very simple tests 2 times slower than Firebird 1.5. I checked it on my box and could not reproduce the problem. It later turned out that my box is AMD, while theirs is Intel.

Therefore I decided to compare different versions on different CPUs. Performance tests showed that Firebird 1.5 runs on both at more or less the same speed (which makes it a very convenient baseline). But Firebird 2.5 really does run almost 2 times slower on Intel than 1.5 (and therefore 2 times slower than on AMD). One more thing required explanation: why do we not have this performance issue under Windows?

Runtime patterns for 1.5 and 2.5 differ very much, so it makes less and less sense to compare 1.5 and newer versions directly. They just run differently. But comparing 2.5 on AMD and Intel is quite possible, and that is what I did. I must say that even with this very big difference, it was not easy to find the source of the problem - the delays were distributed across a lot of functions, and standard profilers are not precise enough to find the exact line of code that causes the problem.

Finally I found that the line:

memcpy(tempData,
       beforeInsertNode.data + newPrefix - beforeInsertNode.prefix,
       newLength);

in btr.cpp:insert_node() takes 6 times longer on Intel than on AMD. Therefore I decided to compare memcpy() behavior using a simple test case, with the following results:

                Intel           AMD             Another Intel / Win64
tst_direct      0.524s          0.996s          1.000
tst_jpq         1.544s          1.656s          - miss -
tst_memcpy      3.208s          1.727s          1.703

This table confirmed my initial assumption very well - Linux has memcpy optimized for AMD CPUs, but awful for Intel, at least when working with unaligned data. Note that I deliberately forced bad alignment in the test case, because that is typical when we perform the insert_node() operation (and a lot of other operations with indices).
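The original test sources are not reproduced here, so the following is only a minimal sketch of what such a comparison might look like, not the actual test case. The names `copy_chunks` and `time_copy`, the buffer size, and the fill pattern are all assumptions made for illustration. It copies a buffer in constant-size 8-byte chunks starting at a chosen byte offset from an 8-byte-aligned base, so offset 0 gives aligned accesses and offsets 1 to 7 give misaligned ones.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <time.h>

/* Hypothetical sizes, chosen only for this sketch. */
#define BUF_BYTES ((size_t)4096)
#define CHUNK     ((size_t)8)

/* uint64_t storage keeps both bases 8-byte aligned; the two spare
   words leave room for a starting offset of up to 7 bytes. */
static uint64_t src_words[BUF_BYTES / 8 + 2];
static uint64_t dst_words[BUF_BYTES / 8 + 2];

/* Copy the buffer in constant-size 8-byte chunks, `passes` times,
   starting `offset` bytes past the aligned base. Returns a checksum
   of copied bytes so the result of the copies is observable. */
uint64_t copy_chunks(size_t offset, long passes)
{
    unsigned char *src, *dst;
    uint64_t sum = 0;

    assert(offset < 8);
    src = (unsigned char *)src_words + offset;
    dst = (unsigned char *)dst_words + offset;

    /* Fill the source with a simple deterministic pattern. */
    for (size_t i = 0; i < BUF_BYTES; i++)
        src[i] = (unsigned char)(i * 31 + 7);

    for (long p = 0; p < passes; p++) {
        for (size_t i = 0; i + CHUNK <= BUF_BYTES; i += CHUNK)
            memcpy(dst + i, src + i, CHUNK);  /* constant length */
        sum += dst[(size_t)p % BUF_BYTES];
    }
    return sum;
}

/* CPU time in seconds spent in copy_chunks() for the given offset. */
double time_copy(size_t offset, long passes)
{
    clock_t start = clock();
    volatile uint64_t sink = copy_chunks(offset, passes);
    (void)sink;
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```

Comparing `time_copy(0, n)` with `time_copy(3, n)` for a large `n` gives a rough picture of the cost of misalignment on a given machine. The numbers depend heavily on the CPU, the compiler, and the optimization flags, and an optimizing compiler may merge or vectorize the copies, so it is worth inspecting the generated assembly before drawing conclusions.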

To make sure the problem is really in glibc, I decided to look at the generated code. Stop: there is NO call to memcpy! Because the length passed to memcpy is a constant, the compiler implemented it inline:

movq    (%r9,%rax), %rax
movq    %rax, (%r10,%rcx)

Moving 8 bytes on a 64-bit machine - what could be easier and faster, yes? No, not on Intel!

With these two instructions present the binary runs in 3.292s. Without them (I just removed these 2 lines from the .s file) - 0.828s. And keep in mind that the whole loop body contains a few thousand instructions.

The asm-optimized version of memcpy() in glibc also uses movq instructions. Moreover, this comment from memcpy.S makes me think that nobody cared about Intel performance:

/*
Optimized memcpy for x86-64.

Copyright (C) 2007 Free Software Foundation, Inc.
Contributed by Evandro Menezes, 2007.

Really - why should AMD guys care about Intel performance?

Now I plan to try the newest gcc/glibc versions, and if nothing has changed - contact the authors.

I suppose that maybe someone else would like to try it. The sources, binaries, and scripts to build and run the test are all available.

In follow-up Alexander Peshkoff wrote:

Well, it looks like no other Intel CPU has a movq as problematic as family 15. And it appears that my Phenom (AMD Phenom(tm) 9650 Quad-Core Processor) gives the best results for non-aligned memory access. So we were lucky that the two most different CPUs were chosen for initial testing (it was real luck for me, otherwise it could have taken even more time to find the reason for FB's slowness).

The question is - should we try to take measures to better support FB on Intel family 15?

Roman Rokytskyy answers:

If somebody comes and pays an appropriate amount of money (for development, testing and keeping the code in the tree (unless the solution is universal for all platforms)), why not? :)

Otherwise, it's up to you, Alex :) If you have time to code it, do it, if not - we can simply add text to release notes or to the site.

In follow-up Alexander Peshkoff wrote:

It seems that Intel CPU family 6 is no better than family 15.

It looks like Intel would do better to declare their CPUs RISC-like (with strict alignment requirements) than to have such bad performance when a misaligned operand is used.
