LinuxLists.cc - [rfc patch] i386: don't save eflags on task switch

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Fri, 3 Nov 2006, Chuck Ebbert wrote:
>
> There is no real need to save eflags in switch_to(). Instead,
> we can keep a constant value in the thread_struct and always
> restore that.

I don't really see the point. The "pushfl" isn't the expensive part, and
it gives sane and expected semantics.

The "popfl" is the expensive part, and that's the thing that can't really
even be removed.

Linus

2006-11-04 01:36:28

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Saturday 04 November 2006 01:46, Linus Torvalds wrote:
>
> On Fri, 3 Nov 2006, Chuck Ebbert wrote:
> >
> > There is no real need to save eflags in switch_to(). Instead,
> > we can keep a constant value in the and always
> > restore that.
>
> I don't really see the point. The "pushfl" isn't the expensive part, and
> it gives sane and expected semantics.
>
> The "popfl" is the expensive part, and that's the thing that can't really
> even be removed.

Well it could -- i made an attempt at it recently. But it would add
a little code to the SYSENTER entry code (basically a test / conditional
jump after the pushfl there) to clear any rogue flags on entry.
Not sure it is worth it. That is why i didn't put this patch in.

-Andi

2006-11-04 07:03:33

by Chuck Ebbert

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

In-Reply-To: <[email protected]>

On Fri, 3 Nov 2006 16:46:25 -0800, Linus Torvalds wrote:

> On Fri, 3 Nov 2006, Chuck Ebbert wrote:
> >
> > There is no real need to save eflags in switch_to(). Instead,
> > we can keep a constant value in the thread_struct and always
> > restore that.
>
> I don't really see the point. The "pushfl" isn't the expensive part, and
> it gives sane and expected semantics.
>
> The "popfl" is the expensive part, and that's the thing that can't really
> even be removed.

Well that wasn't the impression I got:

Date: Mon, 18 Sep 2006 12:12:51 -0400
From: Benjamin LaHaise <[email protected]>
Subject: Re: Sysenter crash with Nested Task Bit set

...

It's the pushfl that will be slow on any OoO CPU, as it has dependancies on
any previous instructions that modified the flags, which ends up bringing
all of the memory ordering dependancies into play. Doing a popfl to set the
flags to some known value is much less expensive.

And benchmarks seem to support that, even on K8:

lmbench context switch, 50 runs

before after
avg 1.09 1.05
stddev .25 .18

But P4 is the real problem case and I can't test that.

--
Chuck
"Even supernovas have their duller moments."

2006-11-04 19:09:43

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

Chuck Ebbert wrote:
> In-Reply-To: <[email protected]>
>
> On Fri, 3 Nov 2006 16:46:25 -0800, Linus Torvalds wrote:
>
>
>> On Fri, 3 Nov 2006, Chuck Ebbert wrote:
>>
>>> There is no real need to save eflags in switch_to(). Instead,
>>> we can keep a constant value in the thread_struct and always
>>> restore that.
>>>
>> I don't really see the point. The "pushfl" isn't the expensive part, and
>> it gives sane and expected semantics.
>>
>> The "popfl" is the expensive part, and that's the thing that can't really
>> even be removed.
>>
>
> Well that wasn't the impression I got:
>
> Date: Mon, 18 Sep 2006 12:12:51 -0400
> From: Benjamin LaHaise <[email protected]>
> Subject: Re: Sysenter crash with Nested Task Bit set
>
> ...
>
> It's the pushfl that will be slow on any OoO CPU, as it has dependancies on
> any previous instructions that modified the flags, which ends up bringing
> all of the memory ordering dependancies into play. Doing a popfl to set the
> flags to some known value is much less expensive.
>

That doesn't sound correct to me. The popf is far more expensive.
There is no popfl $IMM instruction, so setting flags never can avoid the
memory read and must make some more expensive assumptions about effects
on further instruction stream (TF, DF, all sign flags for conditional
jumps...).

Every processor I've ever measured it on, popf is slower. On P4, for
example, pushf is 6 cycles, and popf is 54. On Opteron, it is 2 / 12.
On Xeon, it is 7 / 91.

Zach

2006-11-04 19:40:34

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sat, 4 Nov 2006, Zachary Amsden wrote:
>
> Every processor I've ever measured it on, popf is slower. On P4, for example,
> pushf is 6 cycles, and popf is 54. On Opteron, it is 2 / 12. On Xeon, it is
> 7 / 91.

This meshes with my memory of doing the same.

A "pushf" at most needs to wait for previous ALU operations to serialize,
it's not fundamnetally a very expensive operation at all (ie you should
generally be able to do it totally pipelined, perhaps with a cycle of two
of latency to the previous thing that could set a flag).

In _practice_, I suspect popfl is always microcoded (because parts of the
flags are "slow" flags - all the stuff that isn't the pure arithmetic
flags - and tend to be maintained in separate registers), which almost
inevitably tends to mean that it isn't well pipelined, and that probably
explains why it tends to be bigger than one or two cycles in practice (ie
I'm surprised that your Opteron numbers are _that_ good - not because it's
a hard operation to do, but simply because I wouldn't have expected the
decode to be as efficient as all that).

In contrast, "popf" is actually a rather complicated instruction. It needs
to write some of the "slow" bits, especially NT, AC and IF, which have
fundamental effects on the instructions around them. So a popf should
really be at _least_ as slow as a "sti" or a "cli", both of which are much
simpler instructions, but both of which are actually complicated exactly
because they need to make sure they serialize any interrupt delivery
(which sounds like an easy thing to do, but isn't necessarily easy at
all).

On a P4, for example, with the trace cache, I would expect that popfl
would at a minimum force a trace to end, exactly because it can have
unexpected consequences for subsequent memory ops (ie they now need to
potentially check for alignment issues etc).

That said, it is quite possible that "popf" has microcode that notices
when none of the expensive flags change, and that it's thus sometimes much
faster (also, popfl is possibly faster in kernel mode than it is in user
mode, because things like AC don't matter in the kernel).

Anyway, I'm not at all surprised by Zach's numbers. popfl is really a lot
nastier than pushfl from a CPU standpoint. Neither are necessarily
"trivial", but writing bits with side effects is still a lot more work
than reading them.

I suspect that any lmbench differences that Chuck sees are either:

- incidental (ie instruction length changes resulted in slightly
different alignment and cacheline usage)

- due to noise that can be "consistent" over a boot, ie things like
memory allocation can cause things like L2 allocation that ends up
having cache way-capacity issues, and depending on how things _happen_
to be allocated, you can get consistently different results even for
otherwise identical kernels.

- some interesting side effect, like the fact that the saved flags value
was saved with interrupts enabled, and maybe a "popfl" with interrupts
enabled in the value is faster than one with them disabled. However, in
this case, I think that would be a bug - we're not supposed to enable
interrupts yet (we're holding the runqueue lock, I think).

There may be other side effects, and it would be interesting to see if
this can be explained. But "pushfl is slow" is not an explanation -
that could at most explain a couple of CPU cycles.

Hmm?

Linus

2006-11-05 03:56:15

by Benjamin LaHaise

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sat, Nov 04, 2006 at 11:09:42AM -0800, Zachary Amsden wrote:
> Every processor I've ever measured it on, popf is slower. On P4, for
> example, pushf is 6 cycles, and popf is 54. On Opteron, it is 2 / 12.
> On Xeon, it is 7 / 91.

pushf has to wait until all flag dependancies can be resolved. On the
P4 with >100 instructions in flight, that can take a long time. Popf
on the other hand has no dependancies on outstanding instructions as it
resets the machine state.

-ben
--
"Time is of no importance, Mr. President, only life is important."
Don't Email: <[email protected]>.

2006-11-05 04:13:47

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sat, 4 Nov 2006, Benjamin LaHaise wrote:

> On Sat, Nov 04, 2006 at 11:09:42AM -0800, Zachary Amsden wrote:
> > Every processor I've ever measured it on, popf is slower. On P4, for
> > example, pushf is 6 cycles, and popf is 54. On Opteron, it is 2 / 12.
> > On Xeon, it is 7 / 91.
>
> pushf has to wait until all flag dependancies can be resolved. On the
> P4 with >100 instructions in flight, that can take a long time.

That's like saying that "addc has to wait until all flag dependencies can
be resolved". Not so. It just needs to execute _after_, but it pipelines
perfectly fine. There's no need for any pipeline flush, nothing like that.
Yes, an "addc" needs the flags from previous instructions, but that
doesn't really make an addc any slower - it just adds a data dependency.

Same goes for pushf. Sure - it has a data dependency on the flags. The
flags are usually there one cycle (ONE CYCLE!) after the actual value has
been computed, so you should expect it to be fast.

> Popf on the other hand has no dependancies on outstanding instructions
> as it resets the machine state.

Right. Popf generally has to flush the pipeline, _exactly_ because it
actually changes machine state that normally isn't carried along, and that
may even be part of the trace-cache state on P4 for all I know.

So you got the logic totally wrong. pushf is fast, popf is slow. pushf can
pipeline, popf would quite commonly _flush_ the pipeline.

Dependencies are _cheap_. What's expensive is state changes.

Anyway, I'm right, you're wrong. Why even bother arguing about it? Why are
people arguing against REAL HARD NUMBERS that were posted by Zach?

Btw, for the P4 people out there who can't admit that they are wrong, just
run this small program.

#define pushfl(value) \
asm volatile("pushfl ; popl %0":"=r" (value));
#define popfl(value) \
asm volatile("push %0 ; popfl": :"r" (value));
#define rdtsc(value) \
asm volatile("rdtsc":"=A" (value))

int main(int argc, char ** argv)
{
unsigned long long a,b;
unsigned long tmp;

rdtsc(a);
rdtsc(b);
printf("%lld\n", b-a);
rdtsc(a);
pushfl(tmp);
rdtsc(b);
printf("%lld\n", b-a);
rdtsc(a);
popfl(tmp);
rdtsc(b);
printf("%lld\n", b-a);
return 0;
}

For me, when compiled with -O2, it results in

84
88
132

which basically says: a "rdtsc->rdtsc" is 84 cycles, putting a "pushfl" in
between is another _4_ cycles, and putting a "popfl" in between is about
another 48 cycles.

Now, those numbers aren't scientific, but they tell you something.

Now, tell me how popfl is faster again.

But dammit, back it up with real numbers and real logic this time.

Linus

2006-11-05 04:17:45

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

Benjamin LaHaise wrote:
> On Sat, Nov 04, 2006 at 11:09:42AM -0800, Zachary Amsden wrote:
>
>> Every processor I've ever measured it on, popf is slower. On P4, for
>> example, pushf is 6 cycles, and popf is 54. On Opteron, it is 2 / 12.
>> On Xeon, it is 7 / 91.
>>
>
> pushf has to wait until all flag dependancies can be resolved. On the
> P4 with >100 instructions in flight, that can take a long time. Popf
> on the other hand has no dependancies on outstanding instructions as it
> resets the machine state.
>

Yes, but as Linus points out popf is most likely microcoded, thus much
slower. Flag dependency is not unique to pushf, many much more common
instructions (adc, jcc, sbc, cmovcc, movs, stos, ...) have flag
dependencies, which can still be pipeline forwarded. I think the raw
cycle counts speak for themselves, despite the fact that I only measured
instruction latency, not throughput. Using a branch to eliminate a
pushf is thus probably not a win in most cases.

Zach

2006-11-05 05:41:36

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

> For me, when compiled with -O2, it results in
>
> 84
> 88
> 132
>
> which basically says: a "rdtsc->rdtsc" is 84 cycles, putting a "pushfl" in
> between is another _4_ cycles, and putting a "popfl" in between is about
> another 48 cycles.

This means we should definitely change restore_flags() to only STI,
never popf

-Andi

2006-11-05 08:01:58

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

Andi Kleen wrote:
>> For me, when compiled with -O2, it results in
>>
>> 84
>> 88
>> 132
>>
>> which basically says: a "rdtsc->rdtsc" is 84 cycles, putting a "pushfl" in
>> between is another _4_ cycles, and putting a "popfl" in between is about
>> another 48 cycles.
>>
>
> This means we should definitely change restore_flags() to only STI,
> never popf
>

sti is expensive as well; iirc just as expensive on most processors as
popf, although I don't have hard numbers to back this up on hand. An
_unlikely_ jcc + popf? is better than a sti, for sure, but a likely
jcc+popf could cost more than a jcc+sti, depending on model.

Zach

2006-11-05 16:12:46

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sun, 5 Nov 2006, Andi Kleen wrote:
>
> This means we should definitely change restore_flags() to only STI,
> never popf

Whaa? That would be wrong. We don't always sti, quite often the flags were
disabled anyway.

And changing restore-flags to a "conditional branch around sti" is likely
not much better - mispredicted branches on a P4 are potentially worse than
the popf cost.

Side note: for the netburst microarchitecture - aka P4 - in general,
something like 48 cycles is a _good_ thing. I measured a internal
micro-fault for marking a page table entry dirty at over 1500 cycles!
There's a reason Intel dropped Netburst in favour of Core 2 - which is
largely just an improved Pentium Pro uarch. Admittedly, the "just" is a
bit unfair, because there's a _lot_ of improvement, but still..

So you should never actually make any real code design decisions based on
a P4 result. The P4 is goign away, and it was odd.

Linus

2006-11-05 16:55:43

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sunday 05 November 2006 17:12, Linus Torvalds wrote:

> And changing restore-flags to a "conditional branch around sti"

Yes of course.

> is likely
> not much better

We'll see.

It used to be a bad idea because everything was inline, but these
days with out of line code one can be much more flexible.

> - mispredicted branches on a P4 are potentially worse than
> the popf cost.

They are far less than 48 cycles. The P4 is not _that_ bad in this
area.

> Side note: for the netburst microarchitecture - aka P4 - in general,
> something like 48 cycles is a _good_ thing. I measured a internal
> micro-fault for marking a page table entry dirty at over 1500 cycles!
> There's a reason Intel dropped Netburst in favour of Core 2 - which is
> largely just an improved Pentium Pro uarch. Admittedly, the "just" is a
> bit unfair, because there's a _lot_ of improvement, but still..
>
> So you should never actually make any real code design decisions based on
> a P4 result. The P4 is goign away, and it was odd.

There are millions and millions of P4s out there running
Linux and I don't think that will change any time soon (in fact Intel will
be still shipping many new ones for a long time) If there are cheap
valuable optimizations for P4 that don't impact others much I'm all for them.

-Andi

2006-11-05 17:01:24

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

> sti is expensive as well; iirc just as expensive on most processors as
> popf,

On K8 STI is 4 cycles.

> although I don't have hard numbers to back this up on hand. An
> _unlikely_ jcc + popf? is better than a sti, for sure, but a likely
> jcc+popf could cost more than a jcc+sti, depending on model.

99.9999% of all restore_flags just need STI.

-Andi

2006-11-05 17:20:36

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sun, 5 Nov 2006, Andi Kleen wrote:
>
> > - mispredicted branches on a P4 are potentially worse than
> > the popf cost.
>
> They are far less than 48 cycles. The P4 is not _that_ bad in this
> area.

You wanna bet? Use the newer P4 cores. A branch mispredict is over 20
cycles, and I bet the "sti" isn't cheap either.

In other words, I suspect the difference between "popfl" and "conditional
jump over sti" is basically zero - exactly because the sti isn't exactly a
no-op.

(Enabling interrupts is actually much more complex than you'd expect.
Interrupt delivery in a HT core is not simple in itself, and "sti" in many
ways is actually more complex than "popf", because it has the additional
"single-cycle interrupt shadow", ie the interrupt isn't actually enabled
after the sti, it's enabled after the _next_ instruction after the sti. So
from a uarch standpoint, "popf" is actually somewhat simpler.)

Anyway, what both you and Chuck seem to be missing is that trying to save
a couple of CPU cycles is NOT WORTH IT, if it makes the code harder and
more fragile. The "save eflags over context switch" that we do now is
_obvious_ code. That's worth a lot in itself. And the costs of context
switching isn't actually a couple of cycles - the real costs are all
elsewhere.

Linus

2006-11-05 17:26:56

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sun, 5 Nov 2006, Andi Kleen wrote:
>
> > sti is expensive as well; iirc just as expensive on most processors as
> > popf,
>
> On K8 STI is 4 cycles.

Your point being? On K8, popf was 12 cycles according to Zach. The
conditional branch is going to be almost totally unpredictable, unless you
inline it everywhere (it will be statically predictable in each individual
location), but if you inline it, you'll be expanding a two-byte
instruction sequence to something like 10 bytes with all the tests.

So you get a test, a unpredictable conditional jump, and the sti - and
you'll end up with the cost being pretty much the same as the popf: only
bigger and more complex.

That's a win, right?

> 99.9999% of all restore_flags just need STI.

Hell no. If you know it statically, you can already just do the
"spin_lock_irq()"->"spin_unlock_irq()", and then you have the
_unconditional_ sti.

The problem with that is that it's now a lot more fragile, ie you must
know what the calling context was. We do do this, because it _is_ faster
when it's unconditional, but you're optimizing all the wrong things.

Andi, one single bug is usually worth _months_ of peoples time and effort.
How many CPU cycles is that?

You need to learn that "stability and maintainability" is more important
than trying to get a single cycle or two. Really. I'll do cycle tweaking
too, but it needs to be something that is really obvious, or really
important.

Linus

2006-11-05 17:35:20

by Arjan van de Ven

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

> Andi, one single bug is usually worth _months_ of peoples time and effort.
> How many CPU cycles is that?

actually lockdep is pretty good at finding this type of bug IMMEDIATELY
even without the actual race triggering ;)

(but I otherwise agree that unless there is a
"sti-conditional-on-zero-flag" or something.... it's probably a void
optimization in the general uninlined case)

--
if you want to mail me at work (you don't), use arjan (at) linux.intel.com
Test the interaction between Linux and your BIOS via http://www.linuxfirmwarekit.org

2006-11-05 17:51:57

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

On Sun, 5 Nov 2006, Arjan van de Ven wrote:
>
> actually lockdep is pretty good at finding this type of bug IMMEDIATELY
> even without the actual race triggering ;)

Ehh. Last time this happened, lockdep didn't find _squat_.

This was when NT and AC leaked across context switches, because the
context switching had removed the "expensive" save/restore.

The thing is, complexity is in the unintended side effects, not in the
code itself. For example, let's say that we changed "restore_flags()" to
do

static inline void restore_flags(unsigned long x)
{
if (x & 0x200)
asm volatile("sti");
}

(I didn't check that IF is 0x200, but it's something like that) and it was
two cycles faster on average than just doing a "popf". The _complexity_
here is that now there might be some other x86-architecture-specific code
sequence that nobody even _realized_ actually depended on saving the other
flags too. Like the context switching thing did.

Is it likely? Maybe not. But that's the thing about complexity - you'd not
know, would you?

Do a few of these kinds of things, and _individually_ they are unlikely to
add new bugs, but once you've done ten or twenty of them, the likelihood
that _one_ of them added a subtle bug that it will take months or years to
find is suddenly not all that small any more.

This is why "robust" is so important. So _much_ more important than a
cycle or two. The fact is, saving and restoring all the eflags over a
context switch is just _more_robust_. If you do a pushfl/popfl, there's
simply not a lot you can screw up.

Linus

2006-11-05 18:53:12

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

>
> So you get a test, a unpredictable conditional jump, and the sti - and
> you'll end up with the cost being pretty much the same as the popf: only
> bigger and more complex.
>
> That's a win, right?

Previously we had a popf which for the CPU unpredictable
restoring of most of its state. I would assume the unpredicted jump
is cheaper than that.

I might be wrong. Benchmarks will tell when I try it.

But I think it might be worth a try at least.

>
> > 99.9999% of all restore_flags just need STI.
>
> Hell no. If you know it statically, you can already just do the
> "spin_lock_irq()"->"spin_unlock_irq()", and then you have the
> _unconditional_ sti.

I meant they don't care about any other flags. Of course you need
a test + conditional jump first to handle the nested lock case.

> Andi, one single bug is usually worth _months_ of peoples time and effort.
> How many CPU cycles is that?

I bet near all cases where restore_flags() are used to restore something
else than IF are actually bugs or performance issues of some sort

This doesn't include the context switch of course, but that one
doesn't use restore_flags().

(I think I have a few legitimate cases of save_flags though. They probably
should be using a different macro for clarity.)

-Andi

2006-11-05 20:11:30

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [rfc patch] i386: don't save eflags on task switch

Zachary Amsden wrote:
> Benjamin LaHaise wrote:
>> On Sat, Nov 04, 2006 at 11:09:42AM -0800, Zachary Amsden wrote:
>>
>>> Every processor I've ever measured it on, popf is slower. On P4, for
>>> example, pushf is 6 cycles, and popf is 54. On Opteron, it is 2 /
>>> 12. On Xeon, it is 7 / 91.
>>
>> pushf has to wait until all flag dependancies can be resolved. On the
>> P4 with >100 instructions in flight, that can take a long time. Popf
>> on the other hand has no dependancies on outstanding instructions as
>> it resets the machine state.
>
> Yes, but as Linus points out popf is most likely microcoded, thus much
> slower. Flag dependency is not unique to pushf, many much more common
> instructions (adc, jcc, sbc, cmovcc, movs, stos, ...) have flag
> dependencies, which can still be pipeline forwarded. I think the raw
> cycle counts speak for themselves, despite the fact that I only measured
> instruction latency, not throughput. Using a branch to eliminate a
> pushf is thus probably not a win in most cases.
>

The "sane" decomposition of popf into uops something like this:

- Memory read
- Mask bits that are immutable in the current mode
- Trap to microcode on changing any bit that alters the pipeline state

The "trap to microcode" can obviously be arbitrarily expensive. So when
timing popf, it's also important to know *which* bits change.

The simplest case, obviously, is when no flags change. I still *fully*
expect that to be more painful than pushf ever is.

-hpa

2006-11-05 22:48:13