Loop with a function call faster than an empty loop

I linked some assembly with some C to test the cost of a function call, using the following assembly and C source (fasm and gcc respectively).

Assembly:

format ELF

public no_call as "_no_call"
public normal_call as "_normal_call"

section '.text' executable

iter equ 100000000

no_call:
mov ecx, iter
@@:
push ecx
pop ecx
dec ecx
cmp ecx, 0
jne @b
ret

normal_function:
ret

normal_call:
mov ecx, iter
@@:
push ecx
call normal_function
pop ecx
dec ecx
cmp ecx, 0
jne @b
ret

C source:

#include <stdio.h>
#include <time.h>

extern int no_call();
extern int normal_call();

int main()
{
clock_t ct1, ct2;

ct1 = clock();
no_call();
ct2 = clock();
printf("\n\ n%d\n", ct2-ct1);

ct1 = clock();
normal_call();
ct2 = clock();
printf("%d\n", ct2-ct1);

return 0;
}

The results I got were surprising. First of all, the speed depends on the order in which I link. If I link as gcc intern.o extern.o, the typical output is


162
181

But linking in the reverse order, gcc extern.o intern.o, my output is more like:

162
130

That these are different at all was very surprising, but that is not the question I am asking. (relevant question here)

The question I am asking is how, in the second run, the loop with the function call could be faster than the loop without one, i.e. how the cost of calling a function could apparently be negative.

Edit:
Just to mention some things tried in the comments:

>In the compiled bytecode, the function calls were not optimized away.
>Adjusting the alignment of the functions and loops to everything from 4- to 64-byte boundaries did not speed up no_call, although some alignments did slow down normal_call
>Giving the CPU/OS a chance to warm up by calling the functions multiple times rather than just once had no noticeable effect on the measured times, nor did changing the order of the calls or running them separately
>Longer run times do not affect the ratio; for example, running for 1000 times as long, my run times were 162.168 and 131.578 seconds

In addition, after modifying the assembly code to adjust byte alignment, I tested giving the set of functions an extra offset and came to some stranger conclusions. Here is the updated code:

format ELF

public no_call as "_no_call"
public normal_call as "_normal_call"

section '.text' executable

iter equ 100000000

offset equ 23; this is the number I am changing
times offset nop

times 16 nop
no_call:
mov ecx, iter
no_call.loop_start:
push ecx
pop ecx
dec ecx
cmp ecx, 0
jne no_call.loop_start
ret

times 55 nop
normal_function:
ret


times 58 nop
normal_call:
mov ecx, iter
normal_call.loop_start:
push ecx
call normal_function
pop ecx
dec ecx
cmp ecx, 0
jne normal_call.loop_start
ret

I had to manually (and non-portably) force 64-byte alignment, since FASM does not support more than 4-byte alignment for the executable section, at least on my machine. Offsetting the program by offset bytes, here is what I found.

if (20 <= offset mod 128 <= 31) then we get an output of (approximately):

162
131

else

162 (+/- 10)
162 (+/- 10)

Not sure what to make of it, but this is what I have discovered so far.

Edit 2:

Another thing I noticed is that if you remove push ecx and pop ecx from both functions, the output becomes

30
125

which indicates that the push/pop is the most expensive part of the loop. The stack alignment is the same both times, so that is not the reason for the discrepancy. My best guess is that the hardware is somehow optimized to expect a call after a push, or something similar, but I do not know of anything like that.


Update: Skylake store/reload latency is as low as 3c, but only if the timing is right. Consecutive loads involved in a store-forwarding dependency chain that are naturally spaced out by 3 or more cycles experience the faster latency (e.g. with 4 imul eax, eax in the loop, mov [rdi], eax / mov eax, [rdi] only takes the per-iteration cycle count from 12 to 15 cycles), but when the loads are allowed to execute more densely than that, some kind of contention is suffered and you get about 4.5 cycles per iteration. The non-integer average throughput is also a big clue that something unusual is going on.

I saw the same effect with 32B vectors (best case 6.0c, 6.2 to 6.9c back-to-back), but 128b vectors were always around 5.0c. See details on Agner Fog's forum.
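To make the scalar store/reload experiment described above concrete, here is a minimal NASM sketch of the kind of loop being described. It is my own reconstruction, not the answer's actual test code: it is written 32-bit to match the test program later in the answer, so edi stands in for rdi, and the label names and iteration count are mine.

section .bss
buf:    resd 1                 ; scratch dword for the store/reload

section .text
global _start
_start:
    mov  edi, buf              ; pointer to the scratch location
    mov  eax, 1
    mov  ecx, 100000000
.loop:
    imul eax, eax              ; 4 dependent multiplies (3c each) space the
    imul eax, eax              ; store/reload pairs ~12c apart, so the
    imul eax, eax              ; store-forwarding can show its best-case
    imul eax, eax              ; latency (~3c on Skylake per the text above)
    mov  [edi], eax            ; store ...
    mov  eax, [edi]            ; ... and reload via store-forwarding
    dec  ecx
    jnz  .loop

    mov  eax, 1                ; __NR_exit, 32-bit ABI
    xor  ebx, ebx
    int  0x80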

Update 2: Adding a redundant assignment speeds up code when compiled without optimization, and a 2013 blog post indicates that all Sandybridge-family CPUs have this effect.

Skylake's back-to-back (worst case) store-forwarding latency is 1 cycle better than on previous uarches, but the variability when the load can't execute right away is similar.

With the right (mis)alignment, the extra call in the loop can actually help Skylake observe lower store-forwarding latency from push to pop. I was able to reproduce this with perf counters (Linux perf stat -r4), using YASM. (I've heard it's less convenient to use perf counters on Windows, and I don't have a Windows dev machine anyway. Fortunately the OS isn't really relevant to the answer; anyone should be able to reproduce my perf-counter results with VTune or something on Windows.)

After an align 128 at the spot specified in the question, I saw the faster times at offset = 0..10, 37, 63-74, 101, and 127. L1I cache lines are 64B, and the uop cache cares about 32B boundaries. It seems that alignment relative to a 64B boundary is all that matters.

The no-call loop is always a steady 5 cycles per iteration, but the call loop can get down to 4c per iteration from its usual almost-exactly-5 cycles. I saw slower-than-usual performance at offset = 38 (5.68 +- 8.3% cycles per iteration). There are minor glitches at other points, like 5.17c +- 3.3%, according to perf stat -r4 (4 runs, averaged).

It seems to be some kind of interaction between the front end not queueing up as many uops ahead of the back end, which lets the back end see lower latency for the store-forwarding from push to pop.

IDK whether repeatedly reusing the same address for store-forwarding makes it slower (with multiple store-address uops already executed ahead of the corresponding store-data uops), or what.

Test code: a bash shell loop to build and profile the asm with each different offset:

(set -x; for off in {0..127}; do
  asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=$off &&
  ocperf.py stat -etask-clock,context-switches,cpu-migrations,page-faults:u,cycles,instructions,uops_issued.any,uops_executed.thread,idq.mite_uops,dsb2mite_switches.penalty_cycles -r4 ./call-tight-loop;
done) |& tee -a call-tight-loop.call.offset-log

(set -x) in a subshell is a convenient way to log the commands along with their output when redirecting to a log file.

asm-link is a script that runs yasm -felf32 -Worphan-labels -gdwarf2 call-tight-loop.asm "$@" && ld -melf_i386 -o call-tight-loop call-tight-loop.o, and then runs objdump -drwC -Mintel on the result.

NASM / YASM Linux test program (assembles into a complete static binary that runs the loop and then exits, so you can profile the whole program.) A direct port of the OP's FASM source, with no optimization of the asm.

CPU p6 ; YASM directive. For NASM, %use smartalign.
section .text
iter equ 100000000

%ifndef OFFSET
%define OFFSET 0
%endif

align 128
;;offset equ 23; this is the number I am changing
times OFFSET nop

times 16 nop
no_call:
mov ecx, iter
.loop:
push ecx
pop ecx
dec ecx
cmp ecx, 0
jne .loop
ret

times 55 nop
normal_function:
ret

times 58 nop
normal_call:
mov ecx, iter
.loop:
push ecx
call normal_function
pop ecx
dec ecx
cmp ecx, 0
jne .loop
ret

%ifndef FUNC
%define FUNC no_call
%endif

align 64
global _start
_start:
call FUNC

mov eax,1; __NR_exit from /usr/include/asm/unistd_32.h
xor ebx,ebx
int 0x80; sys_exit(0), 32-bit ABI

Sample output from a fast run of the call loop:

+ asm-link -m32 -d call-tight-loop.asm -DFUNC=normal_call -DOFFSET=3
...

080480d8 <normal_function>:
80480d8: c3 ret
...

08048113 <normal_call>:
8048113: b9 00 e1 f5 05 mov ecx,0x5f5e100
08048118 <normal_call.loop>:
8048118: 51 push ecx
8048119: e8 ba ff ff ff call 80480d8
804811e: 59 pop ecx
804811f: 49 dec ecx
8048120: 83 f9 00 cmp ecx,0x0
8048123: 75 f3 jne 8048118
8048125: c3 ret

...

Performance counter stats for './call-tight-loop' (4 runs):

100.646932 task-clock (msec) # 0.998 CPUs utilized (+- 0.97% )
0 context-switches # 0.002 K/sec (+-100.00% )
0 cpu-migrations # 0.000 K/sec
1 page-faults:u # 0.010 K/sec
414,143,323 cycles # 4.115 GHz ( +- 0.56% )
700,193,469 instructions # 1.69 insn per cycle (+- 0.00% )
700,293,232 uops_issued_any # 6957.919 M/sec (+- 0.00% )
1,000,299,201 uops_executed_thread # 9938.695 M/sec (+- 0.00% )
83,212,779 idq_mite_uops # 826.779 M/sec (+- 17.02% )
5,792 dsb2mite_switches_penalty_cycles # 0.058 M/sec (+- 33.07% )

0.100805233 seconds time elapsed (+- 0.96% )

Old answer, from before I noticed the variable store-forwarding latency:

You push/pop your loop counter, so everything except the call and ret instructions (and the cmp/jcc) is part of the critical-path loop-carried dependency chain involving the loop counter.

You'd expect that pop would have to wait for the updates to the stack pointer by call/ret, but the stack engine handles those updates with zero latency. (Intel since Pentium-M, AMD since K10, according to Agner Fog's microarch pdf, so I'm assuming your CPU has one, even though you didn't say anything about what CPU microarchitecture you ran your tests on.)

The extra call/ret still has to execute, but out-of-order execution can keep the critical-path instructions running at their max throughput. Since that path includes the latency of a store->load forwarding from push/pop plus 1 cycle for dec, this is not high throughput on any CPU, and it's a surprise that the front end could ever be a bottleneck with any alignment.

push->pop latency is 5 cycles on Skylake according to Agner Fog, so on that uarch your loop can run at best one iteration per 6 cycles.
That's plenty of time for out-of-order execution to run the call and ret instructions. Agner lists a max throughput for call of one per 3 cycles, and for ret of one per 1 cycle. Or on AMD Bulldozer, 2 and 2. His tables don't list anything about the throughput of a call/ret pair, so IDK whether those can overlap or not. On AMD Bulldozer, store/reload latency with mov is 8 cycles; I assume it's about the same with push/pop.
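As a rough sketch of that reasoning, here is my own annotation of the OP's loop body with the latency numbers quoted above for Skylake; it assembles standalone the same way as the test program above, and the exit sequence is only there to make it a complete program.

section .text
global _start
_start:
    mov  ecx, 100000000
.loop:
    push ecx                   ; store ecx ...
    pop  ecx                   ; ... reload via store-forwarding: ~5c on Skylake
    dec  ecx                   ; +1c  -> loop-carried chain of ~6c per iteration
    cmp  ecx, 0                ; reads the chain but only produces flags
    jne  .loop                 ; predicted; not part of the latency chain
    ; the call/ret in the normal_call version also execute every iteration,
    ; but out-of-order execution can overlap them with this ~6-cycle chain

    mov  eax, 1                ; __NR_exit, 32-bit ABI
    xor  ebx, ebx
    int  0x80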

It seems that different alignments for the top of the loop (i.e. no_call.loop_start:) are causing front-end bottlenecks. The call version has 3 branches per iteration: the call, the ret, and the loop branch. Note that the ret's branch target is the instruction right after the call. Each of these potentially disrupts the front end. Since you're seeing an actual slow-down in practice, we must be seeing more than 1 cycle of delay per branch. Or, for the no_call version, a single fetch/decode bubble worse than about 6 cycles, leading to an actual wasted cycle in issuing uops into the out-of-order part of the core. That's weird.

It's too complicated to guess at the actual microarchitectural details for every possible uarch, so let us know what CPU you tested on.

I will mention that push/pop inside a loop on Skylake stops it from issuing from the Loop Stream Detector, so it has to be re-fetched from the uop cache every time. Intel's optimization manual says that for Sandybridge, an unmatched push/pop inside a loop stops it from using the LSD. That implies it can use the LSD for loops with balanced push/pop. In my testing that's not the case on Skylake (using the lsd.uops performance counter), but I haven't seen any mention of whether that was a change, or whether SnB was actually like that too.

Also, unconditional branches always end a uop-cache line. It's possible that with normal_function: in the same naturally-aligned 32B chunk of machine code as the call and jne, the block of code might not fit in the uop cache. (Only 3 uop-cache lines can cache decoded uops for a single 32B chunk of x86 code.) But that wouldn't explain the possibility of problems for the no_call loop, so you're probably not running on an Intel SnB-family microarchitecture.

(Update: yes, the loop sometimes runs mostly from legacy decode (idq.mite_uops), but usually not exclusively. dsb2mite_switches.penalty_cycles is usually ~8k and probably only happens on timer interrupts. The runs where the call loop runs faster seem to be correlated with lower idq.mite_uops, but it's still 34M +- 63% for the offset = 37 case, where the 100M iterations took 401M cycles.)

This really is one of those "don't do that" cases: inline tiny functions instead of calling them from inside very tight loops.

You might see different results if you push/pop a register other than your loop counter. That would separate the push/pop from the loop counter, so there would be 2 independent dependency chains. It should speed up both the call and no_call versions, but maybe not equally. It could just make a front-end bottleneck more obvious.

You should see a huge speedup if you push edx but pop eax, so the push/pop instructions don't form a loop-carried dependency chain. Then the extra call/ret would definitely be a bottleneck.
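A minimal sketch of that suggested variant (my own illustration, not code from the answer; only the push/pop registers differ from the OP's loop):

section .text
global _start
_start:
    xor  edx, edx              ; value to push; not part of any loop-carried chain
    mov  ecx, 100000000
.loop:
    push edx                   ; store one register ...
    pop  eax                   ; ... reload into a different one, so the
                               ; store-forwarding no longer feeds the counter
    dec  ecx
    cmp  ecx, 0                ; kept in the OP's style; see the side note below
    jne  .loop

    mov  eax, 1                ; __NR_exit, 32-bit ABI
    xor  ebx, ebx
    int  0x80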

Side note: dec ecx already sets ZF the way you want, so you could have just used dec ecx / jnz. Also, cmp ecx,0 is less efficient than test ecx,ecx (larger code size, and can't macro-fuse on as many CPUs). Anyway, that's totally irrelevant to the question about the relative performance of your two loops. (Your lack of an ALIGN directive between the functions means that changing the first one would have changed the alignment of the loop branch in the second, but you already explored different alignments.)
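And a sketch of the no_call loop with that side note applied (again my own illustration): the cmp disappears entirely because dec already sets ZF.

section .text
global _start
_start:
    mov  ecx, 100000000
.loop:
    push ecx
    pop  ecx
    dec  ecx                   ; sets ZF when ecx hits zero
    jnz  .loop                 ; so no separate cmp ecx,0 is needed

    mov  eax, 1                ; __NR_exit, 32-bit ABI
    xor  ebx, ebx
    int  0x80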
