Saturday, September 3, 2011

Trying Delphi XE2, some performance tests....

Given the following program, I've measured its speed with each combination of Win32/Win64 and Debug/Release:

TEST 1)


var
  i: integer;
  x, y: int64;

begin
  x := 0;
  y := 0;
  for i := 0 to MaxLongInt -1 do
  begin
    Inc(x, y);
    Inc(y);
  end;
end;

And the results, expressed in milliseconds:

        Debug   Release
Win32   6.703   5.395
Win64   4.005   1.380

Clearly, Win64 + Release is almost four times faster than the optimized 32-bit version, while the gap between the non-optimized Win32 and Win64 builds is much smaller.
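
For anyone who wants to repeat the measurement, here is a minimal sketch of a timing harness around TEST 1. It assumes TStopwatch from the System.Diagnostics unit; that is an illustrative choice, not necessarily how the numbers above were taken, and the program name is made up.

```pascal
program TimeTest1;

{$APPTYPE CONSOLE}

uses
  System.SysUtils, System.Diagnostics;

var
  i: Integer;
  x, y: Int64;
  sw: TStopwatch; // assumption: TStopwatch is used as the timer
begin
  x := 0;
  y := 0;
  sw := TStopwatch.StartNew;
  for i := 0 to MaxLongInt - 1 do
  begin
    Inc(x, y);
    Inc(y);
  end;
  sw.Stop;
  // Printing x and y makes sure the loop results are actually used.
  Writeln('x = ', x, ', y = ', y);
  Writeln('Elapsed: ', sw.ElapsedMilliseconds, ' ms');
end.
```

Build it once per platform and configuration (Win32/Win64, Debug/Release) and compare the elapsed times.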
TEST 2)
Another test, this time involving multiplications:
var
  i: integer;
  x, y: int64;

begin
  x := 0;
  y := 0;
  for i := 0 to MaxLongInt -1 do
  begin
    x := x * y;
    Inc(y, i);
  end;
end;

And again the results:

        Debug   Release
Win32   14368   14461
Win64   5718    2398

The optimized Win64 version is about six times faster than the optimized Win32 one (14461 vs 2398), and there is no difference between the optimized and non-optimized Win32 builds. The conclusion is that Delphi does not optimize 64-bit integer arithmetic when compiling for 32 bits.
But they seem to be doing a good job with the 64-bit optimizations; let's see if I'm able to compare with other compilers.

2 comments:

  1. In Win32, all CPU registers are 32 bits long (EVEN on a 64-bit processor, because the CPU is switched to a 32-bit mode!!!), so Delphi has to use the EDX:EAX register pair to simulate 64-bit integer operations. You use Int64 multiplications, which can't be done directly on 32-bit CPU hardware, so Delphi calls the _llmul function to multiply two 64-bit values. This is VERY VERY VERY time consuming. Plus, storing the result into variable x (a memory operation!) requires storing EAX and EDX separately. In Win32, the code for the line "x := x * y;" looks something like:

    PUSH x.low      ; lower 32 bits of x. These four instructions put x and y on the stack as _llmul input parameters.
    PUSH x.high     ; upper 32 bits of x. All four are memory writes that pollute the CPU cache, so quite slow!
    PUSH y.low
    PUSH y.high
    CALL _llmul     ; call the function that multiplies two 64-bit integers

    MOV x.low,EAX   ; store the result into x, two instructions
    MOV x.high,EDX



    In Win64, every 64-bit integer operation is done in ONE 64-bit register (like RAX), so storing the result into memory takes only one instruction.
    The same portion of code would look like:

    for x := x * y:
    MOV RAX,x       ; notice: no memory writes are needed to multiply two 64-bit integers!
    IMUL RAX,y      ; multiply RAX by y, using the CPU's built-in hardware for 64-bit multiplication
    MOV x,RAX       ; store the result into x


    As you can see, on Win32 the multiplication of two 64-bit integers requires lots of EXTRA work! It can't be done any other way!!!! Like using a car engine to lift a plane!..:)
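
    To make the extra work concrete, here is a rough sketch of how the low 64 bits of a product can be assembled from 32-bit halves, which is the kind of thing a 32-bit helper routine has to do. MulViaHalves is a made-up name for illustration; this is not the actual source of _llmul.

    ```pascal
    program Mul64Sketch;

    {$IFDEF FPC}{$MODE DELPHI}{$ENDIF}
    {$APPTYPE CONSOLE}
    {$Q-}{$R-} // wrap-around arithmetic is intended below

    // Hypothetical helper: low 64 bits of a * b, built from 32-bit halves.
    function MulViaHalves(a, b: UInt64): UInt64;
    var
      aLo, aHi, bLo, bHi, cross: UInt64;
    begin
      aLo := a and $FFFFFFFF;
      aHi := a shr 32;
      bLo := b and $FFFFFFFF;
      bHi := b shr 32;
      // The cross terms are shifted into the upper half of the result,
      // so only their low 32 bits matter.
      cross := (aHi * bLo + aLo * bHi) and $FFFFFFFF;
      // aHi * bHi would land entirely above bit 63, so it is dropped.
      Result := aLo * bLo + (cross shl 32);
    end;

    var
      a, b: UInt64;
    begin
      a := $0123456789ABCDEF;
      b := $0000000FEDCBA987;
      // Compare against the compiler's own 64-bit multiplication.
      if MulViaHalves(a, b) = a * b then
        Writeln('match')
      else
        Writeln('MISMATCH');
    end.
    ```

    Every product inside the helper has operands that fit in 32 bits, so a 32-bit CPU can do each one with a single multiply instruction; that makes three multiplies plus shifts and adds, where a 64-bit CPU needs one IMUL.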

    Delphi optimizes code very well on both platforms. The difference is there because a 64-bit processor can handle larger integers!

    Hope this helps you a bit.

    Two thumbs up for your test!!! Try changing all Int64 variables to the Integer type. To prevent overflow, the loop variable i should go up to at most 2^16-1 (65535). Run the test. I believe the difference should be quite a bit less than 7 times. You might mail me your findings.
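
    A rough sketch of that variant of TEST 2, in case it helps. The Repeats constant and the outer loop are my own additions (assumptions, not part of the original test) so that the total number of iterations stays close to the original; TStopwatch from System.Diagnostics is also just an assumed way of timing it.

    ```pascal
    program Test2Int32;

    {$APPTYPE CONSOLE}

    uses
      System.SysUtils, System.Diagnostics;

    const
      Repeats = 32768; // hypothetical: 32768 * 65536 iterations, about 2^31 in total

    var
      r, i: Integer;
      x, y: Integer; // 32-bit variables instead of Int64
      sw: TStopwatch;
    begin
      x := 0;
      y := 0;
      sw := TStopwatch.StartNew;
      for r := 1 to Repeats do
      begin
        x := 0;
        y := 0;
        for i := 0 to 65535 do
        begin
          x := x * y;
          Inc(y, i); // y ends at 2147450880, just below High(Integer)
        end;
      end;
      sw.Stop;
      Writeln('x = ', x, ', y = ', y);
      Writeln('Elapsed: ', sw.ElapsedMilliseconds, ' ms');
    end.
    ```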

    Igor

    musketir@hotmail.com

    1. Thanks for your feedback, Igor, but nothing new to me... (check other posts)

      The point is that there is NO difference between the optimized (RELEASE) and non-optimized (DEBUG) versions in 32 bits when using 64-bit variables, nothing else.
