I was doing an integration task with the FPU before, and now I am struggling with SSE.
My main problem is that with the FPU stack there is the fsin instruction, which operates on the number at the top of the stack (st0).
Now I want to calculate the sine of all four numbers in XMM0, or calculate it somewhere else and load the results into XMM0. I am using AT&T syntax.
I think the second idea is actually possible, but I don't know how :)
Does anyone know how to do it?
Three choices:
1. Use an existing library to calculate the sine of an SSE vector.
2. Write your own vector sine function using SSE.
3. Store the vector to memory, use fsin to calculate the sine of each element, and then load the result. Assuming your stack is 16-byte aligned and has 16 bytes of free space at (%rsp), something like the following:
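    movaps %xmm0, (%rsp)        # spill the four floats to the stack
    mov    $3, %rcx             # start at the last element
0:  flds   (%rsp,%rcx,4)        # push element rcx onto the x87 stack
    fsin                        # st(0) = sin(st(0))
    fstps  (%rsp,%rcx,4)        # store the result back and pop
    sub    $1, %rcx
    jns    0b                   # repeat until rcx goes negative
    movaps (%rsp), %xmm0        # reload the four results into xmm0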
(1) is almost certainly the best-performing choice, and it is also the easiest. If you have extensive experience writing vector code and know a priori that the arguments fall within a certain range, you may be able to get better performance with (2). Using fsin will work, but it is ugly, slow, and, if it matters, not particularly accurate.
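To give a rough idea of what (2) can look like when the arguments are known to lie in a limited range, here is a minimal sketch in C with SSE intrinsics rather than hand-written AT&T assembly. The function name sin4_small, the assumed range |x| <= pi/2, and the choice of a plain degree-9 Taylor polynomial are all assumptions made for illustration. A serious implementation would add range reduction and use minimax coefficients. The error of this sketch is a few 1e-6 near the ends of the range.

```c
#include <xmmintrin.h>

/* Hypothetical helper: sine of four packed floats.
 * Assumption: every element satisfies |x| <= pi/2; there is no range
 * reduction, so larger arguments give increasingly wrong results. */
static __m128 sin4_small(__m128 x)
{
    /* Taylor coefficients of sin(x) for x^3, x^5, x^7 and x^9. */
    const __m128 c3 = _mm_set1_ps(-1.0f / 6.0f);
    const __m128 c5 = _mm_set1_ps(1.0f / 120.0f);
    const __m128 c7 = _mm_set1_ps(-1.0f / 5040.0f);
    const __m128 c9 = _mm_set1_ps(1.0f / 362880.0f);

    __m128 x2 = _mm_mul_ps(x, x);

    /* Horner evaluation of p = c3 + x2*(c5 + x2*(c7 + x2*c9)). */
    __m128 p = _mm_add_ps(c7, _mm_mul_ps(x2, c9));
    p = _mm_add_ps(c5, _mm_mul_ps(x2, p));
    p = _mm_add_ps(c3, _mm_mul_ps(x2, p));

    /* sin(x) ~= x + x^3 * p, computed for all four lanes at once. */
    return _mm_add_ps(x, _mm_mul_ps(_mm_mul_ps(x, x2), p));
}
```

Each intrinsic here corresponds to a single SSE instruction (mulps or addps), so the compiler output for this function can serve as a starting point for an assembly version that keeps everything in XMM registers.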