On Thu, 2005-Mar-31 17:17:58 +1000, Bruce Evans wrote:
>>>On the i386 (and probably most other CPUs), you can place the FPU into
>>>an "unavailable" state. This means that any attempt to use it will
>>>trigger a trap. The kernel will then restore FPU state and return.
>>>On a normal system call, if the FPU hasn't been used, the kernel will
>>>see that it's still in an "unavailable" state and can avoid saving the
>>>state. (On an i386, "unavailable" state is achieved by either setting
>>>CR0_TS or CR0_EM). This means you avoid having to always restore FPU
>>>state at the expense of an additional trap if the process actually
>>>uses the FPU.
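To make sure we're talking about the same mechanism: here's a user-space
sketch of that trap path, with a plain flag standing in for CR0_TS and
invented names (fpu_owner, dna_trap) rather than the real kernel symbols.

```c
#include <assert.h>
#include <stddef.h>

struct thread { int id; };

static int cr0_ts = 1;                  /* simulated CR0_TS: FPU "unavailable" */
static struct thread *fpu_owner = NULL; /* thread whose state is in the FPU    */
static int fpu_restores = 0;
static int fpu_saves = 0;

/* DNA ("device not available") trap: fires on the first FPU use after
 * a context switch.  The kernel restores the thread's FPU state and
 * clears TS so further FPU instructions run untrapped. */
static void dna_trap(struct thread *cur)
{
	assert(cr0_ts);      /* the trap only fires while TS is set */
	fpu_owner = cur;     /* restore cur's state into the hardware */
	fpu_restores++;
	cr0_ts = 0;          /* clts */
}

/* A thread executes an FPU instruction. */
static void fpu_use(struct thread *cur)
{
	if (cr0_ts)
		dna_trap(cur);
}

/* Context switch away from cur: save FPU state only if it was touched. */
static void ctx_switch_out(struct thread *cur)
{
	if (!cr0_ts && fpu_owner == cur)
		fpu_saves++; /* state is live in hardware, must be saved */
	cr0_ts = 1;          /* mark unavailable for the next thread */
}
```

A thread that never touches the FPU between switches pays neither the
trap nor the save.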
>I remember that you (Peter) did extensive benchmarks of this.
That was a long time ago and I don't recall them being that extensive.
I suspect the results are in my archives at work - I can't quickly
find them here. From memory the tests were on 2.2 and just counted
the number of context switches, FP saves and restores.
> I still
>think fully lazy switching (c2) is the best general method.
I think it depends on the FP workload. It's a definite win if there
is exactly one FP thread - in this case the FPU state never needs to
be saved (and you could even optimise away the DNA trap by clearing
the TS and EM bits if the switched-to curthread is fputhread).
The worst case is two (or more) FP-intensive threads - in this case,
lazy switching is of no benefit. The DNA trap overheads mean that
the performance is worse than just saving/restoring the FP state
during a context switch.
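A back-of-the-envelope model of those two cases (round-robin scheduling,
counting events only; the thread counts and functions here are made up
for illustration, not measurements):

```c
#include <assert.h>

/* Count FPU saves, restores and DNA traps over `switches` round-robin
 * context switches among `nthreads` threads, of which the first `nfp`
 * touch the FPU every time they run. */
struct counts { int saves, restores, traps; };

/* Lazy switching: defer the restore to a DNA trap; save the previous
 * owner's state only when someone else needs the FPU. */
static struct counts lazy(int nthreads, int nfp, int switches)
{
	struct counts c = {0, 0, 0};
	int owner = -1; /* thread whose state is in the FPU */

	for (int i = 0; i < switches; i++) {
		int cur = i % nthreads;
		if (cur < nfp && owner != cur) { /* first touch: DNA trap */
			if (owner >= 0)
				c.saves++;       /* save previous owner */
			c.restores++;
			c.traps++;
			owner = cur;
		}
	}
	return c;
}

/* Eager switching: unconditional save/restore on every switch. */
static struct counts eager(int nthreads, int nfp, int switches)
{
	struct counts c = {0, 0, 0};

	(void)nthreads;
	(void)nfp;
	for (int i = 0; i < switches; i++) {
		c.saves++;
		c.restores++;
	}
	return c;
}
```

With one FP thread among several, lazy switching does one restore ever;
with two FP-intensive threads it does a save, a restore *and* a trap on
nearly every handoff, which is strictly worse than eager per switch.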
My guess is that the current generation workstation is closer to the
second case - current generation graphical bloatware uses a lot of
FP for rendering, not to mention that the idle task has a reasonable
chance of being an FP-intensive distributed computing task (setiathome
or similar). It's probably time to do some more measuring (I'm not
offering just now, I have lots of other things on my TODO list).
SMP adds a whole new can of worms. (I originally suspected that lazy
switching had been lost during the SMP transition). Given CPU (FPU)
affinity, you can just add "per CPU" to the above but I'm not sure
that changes my conclusion.
> Maybe FP state should be loaded in advance based on FPU affinity.
Pre-loading the FPU state is an advantage for FP-intensive threads -
if the thread will definitely use the FPU before the next context
switch, you save the cost of a DNA trap by pre-loading the FPU state.
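Pre-loading could be modelled as a per-thread hint checked at
switch-in time (the fp_hot flag is an invented heuristic, not an
existing pcb field):

```c
#include <assert.h>

struct pcb { int fp_hot; int traps; int restores; };

/* At context-switch-in: if the thread is predicted to use the FPU,
 * restore its state immediately (and clear TS) instead of waiting
 * for a DNA trap. */
static void switch_in(struct pcb *p)
{
	if (p->fp_hot)
		p->restores++; /* frstor/fxrstor now, then clts */
	/* otherwise leave the FPU unavailable; a trap will load it */
}

/* First FPU instruction after the switch. */
static void first_fpu_use(struct pcb *p)
{
	if (!p->fp_hot) {      /* still unavailable: take the DNA trap */
		p->traps++;
		p->restores++;
	}
}
```

The restore happens either way; the hint just decides whether a trap is
paid on top of it, so a wrong "hot" prediction costs a wasted restore
for a thread that never touches the FPU.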
> It might be
>good for CPU affinity to depend on FPU use (prefer not to switch
>threads away from a CPU if they own that CPU via its FPU).
FPU affinity is only an advantage if full lazy switching is implemented.
(And I thought we didn't even have CPU affinity working well). The
first step is probably gathering some data on whether lazy switching
is any benefit.
>BTW, David and I recently found a bug in the context switching in the
>fxsr case, at least on Athlon-XP's and AMD64's.
I gather this is not noticeable unless the application is doing its
own FPU save/restore. Is there a solution or work-around?