2005-11-10 00:35:30

by Zachary Amsden

Subject: [PATCH 1/10] Cr4 is valid on some 486s

So some 486 processors do have CR4 register. Allow them to present it in
register dumps by using the old fault technique rather than testing processor
family.

Thanks to Maciej for noticing this.

Signed-off-by: Zachary Amsden <[email protected]>
Index: linux-2.6.14/arch/i386/kernel/process.c
===================================================================
--- linux-2.6.14.orig/arch/i386/kernel/process.c 2005-11-08 03:25:24.000000000 -0800
+++ linux-2.6.14/arch/i386/kernel/process.c 2005-11-08 03:26:03.000000000 -0800
@@ -314,9 +314,7 @@
cr0 = read_cr0();
cr2 = read_cr2();
cr3 = read_cr3();
- if (current_cpu_data.x86 > 4) {
- cr4 = read_cr4();
- }
+ cr4 = read_cr4_safe();
printk("CR0: %08lx CR2: %08lx CR3: %08lx CR4: %08lx\n", cr0, cr2, cr3, cr4);
show_trace(NULL, &regs->esp);
}
Index: linux-2.6.14/include/asm-i386/system.h
===================================================================
--- linux-2.6.14.orig/include/asm-i386/system.h 2005-11-08 03:25:29.000000000 -0800
+++ linux-2.6.14/include/asm-i386/system.h 2005-11-08 03:26:03.000000000 -0800
@@ -140,6 +140,19 @@
:"=r" (__dummy)); \
__dummy; \
})
+
+#define read_cr4_safe() ({ \
+ unsigned int __dummy; \
+ /* This could fault if %cr4 does not exist */ \
+ __asm__("1: movl %%cr4, %0 \n" \
+ "2: \n" \
+ ".section __ex_table,\"a\" \n" \
+ ".long 1b,2b \n" \
+ ".previous \n" \
+ : "=r" (__dummy): "0" (0)); \
+ __dummy; \
+})
+
#define write_cr4(x) \
__asm__ __volatile__("movl %0,%%cr4": :"r" (x));
#define stts() write_cr0(8 | read_cr0())


2005-11-11 10:37:00

by Pavel Machek

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Hi!

> So some 486 processors do have CR4 register. Allow them to present it in
> register dumps by using the old fault technique rather than testing processor
> family.

I thought Andi commented this as "way too risky", for little
good. Nested exceptions are evil.
Pavel
--
Thanks, Sharp!

2005-11-11 17:49:46

by H. Peter Anvin

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Pavel Machek wrote:
> Hi!
>
>
>>So some 486 processors do have CR4 register. Allow them to present it in
>>register dumps by using the old fault technique rather than testing processor
>>family.
>
>
> I thought Andi commented this as "way too risky", for little
> good. Nested exceptions are evil.
> Pavel

I think the 486's that have CR4 are the same that have CPUID, and thus
can be tested for by the presence of the ID flag.

-hpa
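
The ID-flag test mentioned above can be modelled in a few lines. This is a minimal sketch, not kernel code: a real probe toggles bit 21 of EFLAGS with pushfl/popfl, whereas here `struct model_cpu`, `model_write_eflags()` and `model_has_cpuid()` are hypothetical stand-ins that simulate a CPU whose flag bits are either writable or not.

```c
#include <stdint.h>

#define X86_EFLAGS_ID 0x00200000u  /* bit 21: can be toggled only if CPUID exists */

/* Hypothetical model of a CPU: only bits in writable_mask can change. */
struct model_cpu {
    uint32_t eflags;
    uint32_t writable_mask;
};

/* Stand-in for the pushfl/popfl sequence a real probe would use. */
static void model_write_eflags(struct model_cpu *cpu, uint32_t val)
{
    cpu->eflags = (cpu->eflags & ~cpu->writable_mask) |
                  (val & cpu->writable_mask);
}

/* Returns 1 if the ID bit can be toggled, which per the discussion
 * above is taken to mean that CPUID (and CR4) is present. */
static int model_has_cpuid(struct model_cpu *cpu)
{
    uint32_t before = cpu->eflags;
    uint32_t after;

    model_write_eflags(cpu, before ^ X86_EFLAGS_ID);
    after = cpu->eflags;
    model_write_eflags(cpu, before);    /* restore the original flags */

    return ((before ^ after) & X86_EFLAGS_ID) != 0;
}
```

The point of the approach is that the probe never executes an instruction that could fault, so no exception handling is involved.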

2005-11-11 18:00:22

by Maciej W. Rozycki

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Fri, 11 Nov 2005, H. Peter Anvin wrote:

> I think the 486's that have CR4 are the same that have CPUID, and thus can be
> tested for by the presence of the ID flag.

That's correct; for our purposes a 486 that would implement CR4 but not
CPUID would not be interesting anyway, as we don't use CR4 elsewhere but
for features discovered through CPUID. And I don't think there's ever
been an implementation that had CPUID but no CR4.

Maciej

2005-11-11 19:38:57

by Zachary Amsden

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Pavel Machek wrote:

>Hi!
>
>
>
>>So some 486 processors do have CR4 register. Allow them to present it in
>>register dumps by using the old fault technique rather than testing processor
>>family.
>>
>>
>
>I thought Andi commented this as "way too risky", for little
>good. Nested exceptions are evil.
>
>

I didn't see Andi's comment to that effect. I may have originally
argued that when I made CR4 reads depend on CPU family. But I think it
is useful to know if PSE is enabled, especially on 486s that do support it.

Agree nested exceptions are evil. But where is this called from
exception context?

1) softlockup_tick appears to be perfectly safe call site to handle
exceptions
2) sysrq-p is also a fine site.

I tested this by assembling a hacked safe_read_cr1() macro, and dumped
the contents of my nonexistent CR1 register in show_regs to prove the
fault handling correct (although the code already _looks_ correct, I
thought someone might ask the question you just did. :)

Zach
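
For readers unfamiliar with the fault technique, here is a minimal user-space sketch of what the `.long 1b,2b` pair in `read_cr4_safe()` is for. The names `struct ex_entry`, `find_fixup()` and `model_read_cr4_safe()` are assumptions made for illustration, and the linear search stands in for whatever lookup the kernel really performs.

```c
#include <stddef.h>
#include <stdint.h>

/* One ".long 1b,2b" pair: the instruction that may fault and the
 * address at which execution resumes if it does. */
struct ex_entry {
    uintptr_t insn;
    uintptr_t fixup;
};

/* Returns the fixup address, or 0 when the faulting address has no
 * entry (a real kernel would treat that as a genuine oops). */
static uintptr_t find_fixup(const struct ex_entry *tbl, size_t n,
                            uintptr_t fault_ip)
{
    for (size_t i = 0; i < n; i++)
        if (tbl[i].insn == fault_ip)
            return tbl[i].fixup;
    return 0;
}

/* Model of read_cr4_safe(): the output is preloaded with 0 (the
 * "0" (0) constraint), so skipping the faulting move leaves 0. */
static unsigned long model_read_cr4_safe(int cr4_exists, unsigned long cr4,
                                         const struct ex_entry *tbl, size_t n,
                                         uintptr_t mov_ip)
{
    unsigned long val = 0;

    if (cr4_exists)
        return cr4;                 /* the move executed normally */
    if (find_fixup(tbl, n, mov_ip))
        return val;                 /* fault taken, resumed after the move */
    return ~0UL;                    /* unhandled fault in this model */
}
```

In this model a CPU without CR4 simply reports 0, which is what the register dump would then print.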

2005-11-11 19:58:42

by Linus Torvalds

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Fri, 11 Nov 2005, Zachary Amsden wrote:
>
> Agree nested exceptions are evil. But where is this called from exception
> context?

We have really nice ways of handling these things, so we should just use
them.

For example, you can do

	static inline unsigned long read_cr4(void)
	{
		unsigned long cr4;
		alternative_input("xorl %0,%0",
			"movl %%cr4,%0",
			X86_FEATURE_CR4,
			"r" (cr4));
		return cr4;
	}

and then just add that feature-flag discovery early on in boot (it needs
to be pretty early, since the alternative instruction rewriting happens
early).

We have several "calculated" features already. Things like X86_FEATURE_P4
etc.

Linus
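
A rough user-space model of the boot-time rewriting described above, operating on a byte buffer rather than live kernel text. `struct alt_entry` and `apply_alts()` are simplified, hypothetical names (the kernel's own record layout differs), and feature bit 0 standing for "CR4 present" is an assumption of this sketch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define NOP 0x90

/* Hypothetical, simplified record describing one patch site. */
struct alt_entry {
    uint8_t *instr;             /* original instruction in the text */
    const uint8_t *replacement; /* bytes to install if the feature is set */
    unsigned feature;           /* feature bit that selects the replacement */
    uint8_t instrlen;
    uint8_t replacementlen;
};

/* Patch every entry whose feature bit is set and pad the tail with NOPs.
 * Returns the number of sites patched, or -1 if a replacement is too long. */
static int apply_alts(const struct alt_entry *tbl, size_t n, uint32_t features)
{
    int patched = 0;

    for (size_t i = 0; i < n; i++) {
        const struct alt_entry *a = &tbl[i];

        if (!(features & (1u << a->feature)))
            continue;
        if (a->replacementlen > a->instrlen)
            return -1;
        memcpy(a->instr, a->replacement, a->replacementlen);
        memset(a->instr + a->replacementlen, NOP,
               a->instrlen - a->replacementlen);
        patched++;
    }
    return patched;
}
```

With `xorl %eax,%eax` (31 C0) as the default and `movl %cr4,%eax` (0F 20 E0) as the replacement, the default site needs one byte of padding, because the replacement must fit in the space the original occupies.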

2005-11-11 20:14:05

by Zachary Amsden

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Linus Torvalds wrote:

>On Fri, 11 Nov 2005, Zachary Amsden wrote:
>
>
>>Agree nested exceptions are evil. But where is this called from exception
>>context?
>>
>>
>
>We have really nice ways of handling these things, so we should just use
>them.
>
>For example, you can do
>
> static inline unsigned long read_cr4(void)
> {
> unsigned long cr4;
> alternative_input("xorl %0,%0",
> "movl %%cr4,%0",
> X86_FEATURE_CR4,
> "r" (cr4));
> return cr4;
> }
>
>and then just add that feature-flag discovery early on in boot (it needs
>to be pretty early, since the alternative instruction rewriting happens
>early).
>
>We have several "calculated" features already. Things like X86_FEATURE_P4
>etc.
>
>

Yes, this is fine, but is it worth writing the feature discovery code?
I suppose it doesn't matter, as it gets jettisoned after init. I guess
it is just preference.

Considering run time code size, the alternative approach wins, has no
extra branches, and is just nicer. The faulting technique requires two
extra dwords of space that cannot be jettisoned. So obviously, I must
do it (the alternative approach).

Could we consider doing the same with LOCK prefix for SMP kernels booted
on UP? Evil grin.

Zach

2005-11-11 20:22:19

by Linus Torvalds

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Fri, 11 Nov 2005, Zachary Amsden wrote:
>
> Yes, this is fine, but is it worth writing the feature discovery code? I
> suppose it doesn't matter, as it gets jettisoned after init. I guess it is
> just preference.

Well, you could do the feature discovery by trying to take a fault early
at boot-time. That's how we verify that write-protect works, and how we
check that math exceptions come in the right way..

> Could we consider doing the same with LOCK prefix for SMP kernels booted on
> UP? Evil grin.

Not so evil - I think it's been discussed. Not with alternates (not worth
it), but it wouldn't be hard to do: just add a new section for "lock
address", and have each inline asm that does a lock prefix do basically

	1:
		lock ; xyzzy

		.section .lock.address
		.long 1b
		.previous

and then just walk the ".lock.address" thing and turn all locks into 0x90
(nop).

Linus
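
The table walk described above might look roughly like this when modelled on a byte buffer. `nop_out_locks()` and the use of offsets instead of absolute addresses are assumptions of this sketch, not an existing kernel interface.

```c
#include <stddef.h>
#include <stdint.h>

#define LOCK_PREFIX_BYTE 0xF0
#define NOP_BYTE         0x90

/* Replace the byte at each recorded offset with a NOP, but only if it
 * really is a lock prefix. Returns the number of bytes changed. */
static size_t nop_out_locks(uint8_t *text, size_t text_len,
                            const size_t *lock_offsets, size_t n)
{
    size_t changed = 0;

    for (size_t i = 0; i < n; i++) {
        size_t off = lock_offsets[i];

        if (off < text_len && text[off] == LOCK_PREFIX_BYTE) {
            text[off] = NOP_BYTE;
            changed++;
        }
    }
    return changed;
}
```

Only lock prefixes that were recorded in the table are touched, so a lock prefix that some code deliberately left out of the table stays intact.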

2005-11-13 07:43:15

by Dave Jones

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Fri, Nov 11, 2005 at 12:22:07PM -0800, Linus Torvalds wrote:
>
>
> On Fri, 11 Nov 2005, Zachary Amsden wrote:
> >
> > Yes, this is fine, but is it worth writing the feature discovery code? I
> > suppose it doesn't matter, as it gets jettisoned after init. I guess it is
> > just preference.
>
> Well, you could do the feature discovery by trying to take a fault early
> at boot-time. That's how we verify that write-protect works, and how we
> check that math exceptions come in the right way..
>
> > Could we consider doing the same with LOCK prefix for SMP kernels booted on
> > UP? Evil grin.
>
> Not so evil - I think it's been discussed. Not with alternates (not worth
> it), but it wouldn't be hard to do: just add a new section for "lock
> address", and have each inline asm that does a lock prefix do basically
>
> 1:
> lock ; xyzzy
>
> .section .lock.address
> .long 1b
> .previous
>
> and then just walk the ".lock.address" thing and turn all locks into 0x90
> (nop).

Looks like the Ubuntu people already did this...

http://www.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-2.6.git;a=commitdiff;h=048985336e32efe665cddd348e92e4a4a5351415;hp=1cb630c2b5aaad7cedaa78aa135e6cecf5ab91ac

Dave

2005-11-13 11:00:15

by Andi Kleen

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Dave Jones <[email protected]> writes:
>
> Looks like the Ubuntu people already did this...
>
> http://www.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-2.6.git;a=commitdiff;h=048985336e32efe665cddd348e92e4a4a5351415;hp=1cb630c2b5aaad7cedaa78aa135e6cecf5ab91ac

It's probably not needed. At least AMD K7/K8 has a SYSCFG MSR bit to
do this (or rather, it disables bus cycles for locks, which makes them
very cheap). Intel has one too, in a different MSR that looks similar.
With some luck they're even already set by the BIOS on UP systems. I
know they are on some AMD systems.

But overall the feature doesn't help longer term because single
threaded CPUs are on their way out.

-Andi

2005-11-13 16:55:59

by Alan

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Sul, 2005-11-13 at 11:59 +0100, Andi Kleen wrote:
> Dave Jones <[email protected]> writes:
> >
> > Looks like the Ubuntu people already did this...
> >
> > http://www.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-2.6.git;a=commitdiff;h=048985336e32efe665cddd348e92e4a4a5351415;hp=1cb630c2b5aaad7cedaa78aa135e6cecf5ab91ac
>
> It's probably not needed. At least AMD K7/K8 has a SYSCFG MSR bit to
> do this (or rather they disable bus cycles for locks that makes them
> very cheap) Intel has one too in a different MSR that looks similar.
> With some luck they're even already set by the BIOS on UP systems. I
> know they are on some AMD systems.

I'd hope the vendors are not doing that by default, because we have
kernel code that uses lock not against other processors but against other
bus masters. The ECC code is one example. Is there any good info on the AMD
one, so I can make the EDAC code put the processor back in x86-compatible
mode so that it behaves safely when scrubbing?

Alan

2005-11-13 17:10:54

by Eric W. Biederman

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Alan Cox <[email protected]> writes:

> On Sul, 2005-11-13 at 11:59 +0100, Andi Kleen wrote:
>> Dave Jones <[email protected]> writes:
>> >
>> > Looks like the Ubuntu people already did this...
>> >
>> >
> http://www.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-2.6.git;a=commitdiff;h=048985336e32efe665cddd348e92e4a4a5351415;hp=1cb630c2b5aaad7cedaa78aa135e6cecf5ab91ac
>>
>> It's probably not needed. At least AMD K7/K8 has a SYSCFG MSR bit to
>> do this (or rather they disable bus cycles for locks that makes them
>> very cheap) Intel has one too in a different MSR that looks similar.
>> With some luck they're even already set by the BIOS on UP systems. I
>> know they are on some AMD systems.
>
> I'd hope the vendors are not doing that by default because we have
> kernel code that uses lock against not other processors but other bus
> masters. The ECC code is one example. Is there any good info on the AMD
> one so I can make the EDAC code put the processor back in x86 compatible
> mode so that it behaves safely when scrubbing.

Check out AMD's BIOS and Kernel Developer's Guide for the K8. The
appropriate bits are documented, although the documentation is quite
terse.

Eric

2005-11-13 18:59:45

by Andi Kleen

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Sunday 13 November 2005 18:26, Alan Cox wrote:

> I'd hope the vendors are not doing that by default because we have
> kernel code that uses lock against not other processors but other bus
> masters. The ECC code is one example.

It's a bad hack anyway. It would probably be better to use an uncached WC write.
I would rather use that.

-Andi

2005-11-13 19:08:24

by Eric W. Biederman

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Andi Kleen <[email protected]> writes:

> On Sunday 13 November 2005 18:26, Alan Cox wrote:
>
>> I'd hope the vendors are not doing that by default because we have
>> kernel code that uses lock against not other processors but other bus
>> masters. The ECC code is one example.
>
> It's a bad hack anyways. Better would be probably to use a uncached WC write.
> I would rather use that.

For read modify write?

The point is to make the cache line dirty so that the
memory controller will write the data back.

The interesting sequence is:
	lock; addl $0, (%reg)

I'm not actually sure the lock is even necessary. Mostly this is
for brain-dead chipsets, chipsets you can't trust, or at least
chipsets that won't do a background scrub for you.

I don't think it is possible to do an uncached read modify write?

Eric

2005-11-13 19:10:57

by Alan

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Sul, 2005-11-13 at 20:00 +0100, Andi Kleen wrote:
> It's a bad hack anyways. Better would be probably to use a uncached WC write.
> I would rather use that.

I'm not clear that anything but lock operations have the required
guarantee of atomicity relative to bus masters which are not processors.
Especially so on intel.

2005-11-13 19:25:19

by Linus Torvalds

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Sun, 13 Nov 2005, Dave Jones wrote:
>
> Looks like the Ubuntu people already did this...

Yeah, that looks like a sane patch, although I dislike the #ifdef config
option thing (either it works or it doesn't).

It also does it the right way: using LOCK_PREFIX means that you catch
exactly the users that depend on SMP, and not _all_ "lock" prefixes (as
mentioned, some of the lock prefixes are there as memory fences and are
valid and needed even on UP). So me likee.

The only question being whether you'd actually want to nop out the
spinlock instructions _entirely_ (in addition to changing the nops on
things like semaphores). Without the lock, they're not that expensive, but
hey, it's still a useless (memory-modifying) instruction.

Linus

2005-11-13 19:37:06

by Linus Torvalds

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Sun, 13 Nov 2005, Alan Cox wrote:
>
> On Sul, 2005-11-13 at 20:00 +0100, Andi Kleen wrote:
> > It's a bad hack anyways. Better would be probably to use a uncached WC write.
> > I would rather use that.
>
> I'm not clear that anything but lock operations have the required
> guarantee of atomicity relative to bus masters which are not processors.
> Especially so on intel.

The thing is, we wouldn't ever remove _all_ lock prefixes. Only the ones
that already depend on SMP.

So the memory barriers etc that have lock prefixes even on UP would be
totally untouched.

Linus

2005-11-13 19:57:43

by H. Peter Anvin

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Alan Cox wrote:
> On Sul, 2005-11-13 at 11:59 +0100, Andi Kleen wrote:
>
>>Dave Jones <[email protected]> writes:
>>
>>>Looks like the Ubuntu people already did this...
>>>
>>>http://www.kernel.org/git/?p=linux/kernel/git/bcollins/ubuntu-2.6.git;a=commitdiff;h=048985336e32efe665cddd348e92e4a4a5351415;hp=1cb630c2b5aaad7cedaa78aa135e6cecf5ab91ac
>>
>>It's probably not needed. At least AMD K7/K8 has a SYSCFG MSR bit to
>>do this (or rather they disable bus cycles for locks that makes them
>>very cheap) Intel has one too in a different MSR that looks similar.
>>With some luck they're even already set by the BIOS on UP systems. I
>>know they are on some AMD systems.
>
> I'd hope the vendors are not doing that by default because we have
> kernel code that uses lock against not other processors but other bus
> masters. The ECC code is one example. Is there any good info on the AMD
> one so I can make the EDAC code put the processor back in x86 compatible
> mode so that it behaves safely when scrubbing.
>

I can't speak about AMD, but on Transmeta's CPUs operations against
cached memory are *always* atomic; the atomicity is guaranteed by the
cache hierarchy. The LOCK prefix does have effects against uncached memory.

-hpa

2005-11-13 20:30:08

by Linus Torvalds

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Sun, 13 Nov 2005, Linus Torvalds wrote:
>
> The only question being whether you'd actually want to nop out the
> spinlock instructions _entirely_ (in addition to changing the nops on
> things like semaphores). Without the lock, they're not that expensive, but
> hey, it's still a useless (memory-modifying) instruction.

Actually, that may turn out to be a dangerous idea.

Sad but true: There's a few tests like

#define assert_spin_locked(x) BUG_ON(!spin_is_locked(x))

and

#define __raw_spin_unlock_wait(lock) \
do { while (__raw_spin_is_locked(lock)) cpu_relax(); } while (0)

that would also need to be nopped out if we nop out the code that updates
the spinlock (right now they are just disabled entirely on UP, exactly
because tests like this don't work without the lock being instantiated).

But it would be wonderful if we could just nop out the whole call to the
spinlock (most of them are out-of-line). It would help I$ footprint, and
likely help improve dynamic scheduling around that call on many CPU's too.

So we can easily remove the lock prefix on the spinlock ops, but sadly we
can't do some other "obvious" optimizations.

We _could_ nop out the actual conditional on the lock result for a
spinlock, and turn

lock ; decb %0
js ...

into

nop ; decb %0
multi-byte-nop

which would help avoid some unnecessary branch prediction etc.

Linus

2005-11-13 21:01:15

by Alan

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Sul, 2005-11-13 at 11:36 -0800, Linus Torvalds wrote:
> The thing is, we wouldn't ever remove _all_ lock prefixes. Only the ones
> that already depend on SMP.
>
> So the memory barriers etc that have lock prefixes even on UP would be
> totally untouched.

That much makes sense. Having some magic MSR reloaded to turn lock
effects off is a bit more of a problem for ECC scrubbing however.

2005-11-14 07:46:39

by Arjan van de Ven

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Sun, 2005-11-13 at 21:32 +0000, Alan Cox wrote:
> On Sul, 2005-11-13 at 11:36 -0800, Linus Torvalds wrote:
> > The thing is, we wouldn't ever remove _all_ lock prefixes. Only the ones
> > that already depend on SMP.
> >
> > So the memory barriers etc that have lock prefixes even on UP would be
> > totally untouched.
>
> That much makes sense. Having some magic MSR reloaded to turn lock
> effects off is a bit more of a problem for ECC scrubbing however.

well... you can expect many bioses to have done the MSR hack for you
already... so if you can't cope with that you have to set the MSR to the
value you want it to have regardless.


2005-11-14 15:06:47

by Gerd Hoffmann

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Hi,

> We _could_ nop out the actual conditional on the lock result for a
> spinlock, and turn
>
> lock ; decb %0
> js ...
>
> into
>
> nop ; decb %0
> multi-byte-nop

Throwing another patch into the discussion ;)

Comes from some Xen guy. If I read the thing correctly, it builds an ELF
section containing a table with both SMP and UP versions of the code
path, then patches in the one needed at runtime. It allows patching in both
directions (UP->SMP, SMP->UP) at runtime, for hotplugging (virtual)
CPUs. I'm not an inline asm expert though ...

Comments on that one?

Gerd


Attachments:
smp-alts.patch (17.87 kB)

2005-11-14 19:26:39

by Linus Torvalds

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Mon, 14 Nov 2005, Gerd Knorr wrote:
>
> Throwing another patch into the discussion ;)

Ouch, this one is really ugly.

If you want to go this way, then you should instead add an X86_FEATURE_SMP
that gets cleared on UP and on SMP with just one core (and detect when CPU
hotplug ain't gonna happen ;), and then do

#ifdef CONFIG_SMP
#define smp_alternative(x,y) alternative(x,y,X86_FEATURE_SMP)
#else
#define smp_alternative(x,y) asm(x)
#endif

or something similar, instead of creating a totally new infrastructure to
do the thing that "alternative()" already does.

(Yeah, the above doesn't really work, since usually the SMP form is the
longer one, and "alternative()" wants the long complex one first. So maybe
the x86 feature needs to be "X86_FEATURE_UP" instead, since it's now a
"feature" to only have one core ;)

Linus

2005-11-14 19:46:09

by Zachary Amsden

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Linus Torvalds wrote:

>On Mon, 14 Nov 2005, Gerd Knorr wrote:
>
>
>>Throwing another patch into the discussion ;)
>>
>>
>
>Ouch, this one is really ugly.
>
>If you want to go this way, then you should instead add an X86_FEATURE_SMP
>that gets cleared on UP and on SMP with just one core (and detect when CPU
>hotplug ain't gonna happen ;), and then do
>
> #ifdef CONFIG_SMP
> #define smp_alternative(x,y) alternative(x,y,X86_FEATURE_SMP)
> #else
> #define smp_alternative(x,y) asm(x)
> #endif
>
>or something similar, instead of creating a totally new infrastructure to
>do the thing that "alternative()" already does.
>
>(Yeah, the above doesn't really work, since usually the SMP form is the
>longer one, and "alternative()" wants the long complex one first. So maybe
>the x86 feature needs to be "X86_FEATURE_UP" instead, since it's now a
>"feature" to only have one core ;)
>
>

It seems that SMP vs. UP lock / spinlock overhead is relevant even for
future, multi-core CPUs in a virtualization context, as the notion of
hotplug here is based on scheduling constraints of the virtualization
engine, and the kernel can quite readily end up with only one VCPU.

But it also seems that there are separate, competing mechanisms for
implementing this dynamic code change, which is undesirable. The notion
of boot-time dynamic code change for SMP is useful for native hardware.
Run-time dynamic code change is useful for virtual hardware, and
minimally useful for hardware CPU hotplug. Run-time dynamic code change
is also useful on virtual hardware if you consider live kernel
migrations across CPUs from different vendors, or with different
features. Again, this is minimally useful for hardware CPU hotplug.

But in essence, there should be one nice way to encapsulate this code
modification that lives for both run-time and boot-time code. The
boot-time modifiers can jettison the alternative tables, and the
run-time guys (which might include CPU hotplug) can keep those
alternatives around so they can be unapplied later. One can even
imagine more complex alternative features (if I have SSE2, use code X,
but if SSE3 is available use code Y, else fall back to code Z) being
useful at some point.

Both points combined are a basic argument for providing an alternative
choice function in apply_alternatives, which takes as input the
alternative specification, and returns a pointer to the chosen code.
This function can be driven by dynamic data (number of plugged CPUs), or
by static specifications (feature spec in the alternative section).

Zach
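
To make the "alternative choice function" idea concrete, here is a minimal sketch on a byte buffer. Everything in it (`struct smp_alt`, `alt_choose_fn`, `choose_by_cpus()`, `apply_alts_with()`) is hypothetical naming for illustration. It assumes both variants have the same length and that the table is kept after boot so a site can be patched in either direction.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define ALT_MAX_LEN 8

/* Hypothetical record that keeps both variants of one site. */
struct smp_alt {
    uint8_t *site;                  /* location in the text */
    uint8_t smp[ALT_MAX_LEN];       /* conservative SMP-safe bytes */
    uint8_t up[ALT_MAX_LEN];        /* cheaper UP-only bytes */
    uint8_t len;
};

/* Choice function: given one record, return the bytes to install. */
typedef const uint8_t *(*alt_choose_fn)(const struct smp_alt *a,
                                        const void *ctx);

/* Dynamic policy: pick by the number of CPUs currently plugged in. */
static const uint8_t *choose_by_cpus(const struct smp_alt *a, const void *ctx)
{
    unsigned int ncpus = *(const unsigned int *)ctx;

    return ncpus > 1 ? a->smp : a->up;
}

/* Install whatever the choice function selects at every site. */
static void apply_alts_with(const struct smp_alt *tbl, size_t n,
                            alt_choose_fn choose, const void *ctx)
{
    for (size_t i = 0; i < n; i++)
        memcpy(tbl[i].site, choose(&tbl[i], ctx), tbl[i].len);
}
```

A static policy driven by feature bits would be just another function of type `alt_choose_fn`, which is the attraction of routing both cases through one entry point.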

2005-11-14 19:53:29

by Arjan van de Ven

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Mon, 2005-11-14 at 11:46 -0800, Zachary Amsden wrote:

> It seems that SMP vs. UP lock / spinlock overhead is relevant even for
> future, multi-core CPUs in a virtualization context, as the notion of
> hotplug here is based on scheduling constraints of the virtualization
> engine, and the kernel can quite readily end up with only one VCPU.


this assumes that you don't just always want to assume and use SMP
primitives in a virtualized context. I sort of question that assumption;
sure these things have overhead, especially "lock", but if the solution
is more complexity and weird things to hide that half-percent or less of
performance difference... then do remember that such complexity is not
free either. Runtime tricks cost.

2005-11-14 20:34:09

by Zachary Amsden

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Arjan van de Ven wrote:

>On Mon, 2005-11-14 at 11:46 -0800, Zachary Amsden wrote:
>
>
>
>>It seems that SMP vs. UP lock / spinlock overhead is relevant even for
>>future, multi-core CPUs in a virtualization context, as the notion of
>>hotplug here is based on scheduling constraints of the virtualization
>>engine, and the kernel can quite readily end up with only one VCPU.
>>
>>
>
>
>this assumes that you don't just always want to assume and use SMP
>primitives in a virtualized context. I sort of question that assumption;
>sure these things have overhead, especially "lock", but if the solution
>is more complexity and weird things to hide that half-percent or less of
>performance difference... then do remember that such complexity is not
>free either. Runtime tricks cost.
>
>

Runtime tricks that increase complexity cost, yes. It's all a question
of measured gain vs. complexity. But a couple of percent gained on an
overall basis can be magnified enormously if you are looking at a
workload that stresses a particular path. I would expect some of those
gains to be non-trivial, especially if considering the optimizations you
could do on page table updates knowing you needn't worry about SMP
issues anymore. Even UP has (still?) some places where additional locks
are present here, and could benefit from having SMP alternatives.

Zach

2005-11-14 20:52:59

by Arjan van de Ven

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s


>
> Runtime tricks that increase complexity cost, yes. It's all a question
> of measured gain vs. complexity. But a couple of percent gained on an
> overall basis can be magnified enormously if you are looking at a
> workload that stresses a particular path.

A couple of percent sounds really, really high to me. If it's really
that, then I think Andi's conclusion is wrong with respect to that
locking cliff; if we spend a few percent of our performance on locks in
the uncontended case, we're way over the edge in my opinion.

> I would expect some of those
> gains to be non-trivial, especially if considering the optimizations you
> could do on page table updates knowing you needn't worry about SMP

Page table updates happen in the hypervisor in a Xen-like
paravirtualized setup, right? So that happens outside the kernel.


2005-11-15 14:12:22

by Gerd Hoffmann

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Linus Torvalds wrote:
>
> On Mon, 14 Nov 2005, Gerd Knorr wrote:
>> Throwing another patch into the discussion ;)
>
> Ouch, this one is really ugly.

I somehow expected that answer; it took me quite some time to figure
out what the patch does. It certainly needs at least a number of cleanups
before I'd consider it mergeable. The alternative() macro is much easier
to read.

> If you want to go this way, then you should instead add an X86_FEATURE_SMP
> that gets cleared on UP and on SMP with just one core (and detect when CPU
> hotplug ain't gonna happen ;), and then do

Well, the "no hotplug" probably is exactly the reason why the patch
doesn't use the existing alternatives mechanism, it's a boot-time
one-way ticket. The xenified linux kernel actually switches both ways
at runtime if you plug in/out a second virtual CPU.

> #ifdef CONFIG_SMP
> #define smp_alternative(x,y) alternative(x,y,X86_FEATURE_SMP)
> #else
> #define smp_alternative(x,y) asm(x)
> #endif

I don't like the idea very much. That covers only 50% of what the patch
does, you can patch SMP => UP but not the other way around. Doesn't
matter much on real hardware, but for virtual it is quite useful.

> or something similar, instead of creating a totally new infrastructure to
> do the thing that "alternative()" already does.

Yep, extending alternatives is probably better than duplicating the
code. Maybe having some alternative_smp() macro which places both code
versions into the .altinstr_replacement table? If that sounds ok I'll
try to come up with an experimental patch. If not: other ideas are welcome.

cheers,

Gerd

2005-11-15 16:01:33

by Gerd Hoffmann

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

> Yep, extending alternatives is probably better than duplicating the
> code. Maybe having some alternative_smp() macro which places both code
> versions into the .altinstr_replacement table? If that sounds ok I'll
> try to come up with a experimental patch.

i.e. something like this (as basic idea, patch is far away from doing
anything useful ...)?

Gerd


Attachments:
smp-alternatives.diff (3.00 kB)

2005-11-15 16:04:35

by Zachary Amsden

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Gerd Knorr wrote:

>> Yep, extending alternatives is probably better than duplicating the
>> code. Maybe having some alternative_smp() macro which places both
>> code versions into the .altinstr_replacement table? If that sounds
>> ok I'll try to come up with a experimental patch.
>
>
> i.e. something like this (as basic idea, patch is far away from doing
> anything useful ...)?


You still need to preserve the originals so that you can patch in both
directions. In the dynamic scenario, you need a multi-way set of
alternatives, with the most conservative of those compiled in inline.

Zach

2005-11-15 16:06:40

by Arjan van de Ven

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Tue, 2005-11-15 at 08:04 -0800, Zachary Amsden wrote:
> Gerd Knorr wrote:
>
> >> Yep, extending alternatives is probably better than duplicating the
> >> code. Maybe having some alternative_smp() macro which places both
> >> code versions into the .altinstr_replacement table? If that sounds
> >> ok I'll try to come up with a experimental patch.
> >
> >
> > i.e. something like this (as basic idea, patch is far away from doing
> > anything useful ...)?
>
>
> You still need to preserve the originals so that you can patch in both
> directions.

why do you insist on both directions? That still sounds like real
overkill to me.


2005-11-15 16:08:46

by Roland Dreier

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

> +#define alternative_smp(smpinstr, upinstr) asm(upinstr, ##input)

this wouldn't build with CONFIG_SMP=n -- you forgot the input param here.

also, given this:

> + BUG_ON(a->replacementlen > a->instrlen);

is there any way to at least catch it at compile time if the UP
alternative ends up longer than the SMP alternative?

- R.
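
One partial answer to the compile-time question, as a sketch only: if the two variants were available to the C compiler as byte arrays, a C11 `_Static_assert` could reject a UP variant that is longer than the SMP one. With inline asm strings it is the assembler, not the compiler, that knows the lengths, so this does not apply directly to the patch as posted. The array names and `pad_bytes()` are hypothetical.

```c
#include <stdint.h>

/* lock ; decb (%edx) */
static const uint8_t smp_variant[] = { 0xF0, 0xFE, 0x0A };
/* decb (%edx) */
static const uint8_t up_variant[]  = { 0xFE, 0x0A };

/* Fails the build, rather than hitting BUG_ON at patch time,
 * if the replacement would not fit in the original's space. */
_Static_assert(sizeof(up_variant) <= sizeof(smp_variant),
               "UP alternative must not be longer than the SMP one");

/* Number of NOP bytes needed to pad the UP variant to the site length. */
static unsigned int pad_bytes(void)
{
    return (unsigned int)(sizeof(smp_variant) - sizeof(up_variant));
}
```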

2005-11-15 16:11:34

by Dave Jones

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Tue, Nov 15, 2005 at 05:06:03PM +0100, Arjan van de Ven wrote:

> > You still need to preserve the originals so that you can patch in both
> > directions.
>
> why do you insist on both directions? That still sounds like real
> overkill to me.

cpu hotplug going from UP to SMP ? :)

Dave

2005-11-15 16:13:30

by Linus Torvalds

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Tue, 15 Nov 2005, Gerd Knorr wrote:
>
> i.e. something like this (as basic idea, patch is far away from doing anything
> useful ...)?

Can't work. The altinstructions are in init-code/data, and will be free'd
after boot. Which is as it should be. But it means that any setup that
expects to use them to switch back and forth is broken (not that your
patch does so now, but if that's what you are moving toward..)

Linus

2005-11-15 16:16:11

by H. Peter Anvin

Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Dave Jones wrote:
> On Tue, Nov 15, 2005 at 05:06:03PM +0100, Arjan van de Ven wrote:
>
> > > You still need to preserve the originals so that you can patch in both
> > > directions.
> >
> > why do you insist on both directions? That still sounds like real
> > overkill to me.
>
> cpu hotplug going from UP to SMP ? :)
>

If you have CPU hotplug enabled, you can run SMP code!

-hpa

2005-11-15 16:16:41

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Zachary Amsden wrote:
> You still need to preserve the originals so that you can patch in both
> directions. In the dynamic scenario, you need a multi-way set of
> alternatives, with the most conservative of those compiled in inline.

Sure, alternatives_smp() puts both versions into the
.altinstr_replacement section because of that ;)

The idea is to have SMP compiled in and let the normal
apply_alternatives() handle the SMP->UP patching case using the new
feature bit. apply_alternatives_smp() handles UP->SMP patching when you
plug in a new virtual CPU.

cheers,

Gerd

2005-11-15 16:17:18

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Tue, Nov 15, 2005 at 08:12:36AM -0800, Linus Torvalds wrote:

> On Tue, 15 Nov 2005, Gerd Knorr wrote:
> > i.e. something like this (as basic idea, patch is far away from doing anything
> > useful ...)?
>
> Can't work. The altinstructions are in init-code/data, and will be free'd
> after boot. Which is as it should be.

Hmmm, what about modules ?

Dave

2005-11-15 16:20:13

by Dave Jones

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Tue, Nov 15, 2005 at 08:14:29AM -0800, H. Peter Anvin wrote:
> Dave Jones wrote:
> >On Tue, Nov 15, 2005 at 05:06:03PM +0100, Arjan van de Ven wrote:
> >
> > > > > You still need to preserve the originals so that you can patch in
> > > > > both directions.
> > >
> > > why do you insist on both directions? That still sounds like real
> > > overkill to me.
> >
> >cpu hotplug going from UP to SMP ? :)
> >
>
> If you have CPU hotplug enabled, you can run SMP code!

Sure, but if you boot with 1 CPU, spinlocks get nop'd to emulate UP,
and on a 'installed a new cpu' hotplug event, they all come back.

Dave

2005-11-15 16:25:49

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Arjan van de Ven wrote:

>On Tue, 2005-11-15 at 08:04 -0800, Zachary Amsden wrote:
>
>
>>Gerd Knorr wrote:
>>
>>
>>
>>>>Yep, extending alternatives is probably better than duplicating the
>>>>code. Maybe having some alternative_smp() macro which places both
>>>>code versions into the .altinstr_replacement table? If that sounds
>>>>ok I'll try to come up with an experimental patch.
>>>>
>>>>
>>>i.e. something like this (as basic idea, patch is far away from doing
>>>anything useful ...)?
>>>
>>>
>>You still need to preserve the originals so that you can patch in both
>>directions.
>>
>>
>
>why do you insist on both directions? That still sounds like real
>overkill to me.
>
>

It's not overkill in the virtualization context, and there are
(struggling, but infinite possibilities) opportunities for native here
as well. Run-time SMP->UP->SMP can benefit hotplug (albeit slightly).
But once you have a basic, generic mechanism for run-time code
modularization, there is very little cost to adding other features.
Run-time PAE / non-PAE conversion is far more radical, but not outside
the realm of possibility - and useful (in both directions) for memory
hotplug. Run-time CPU vendor migration is possible, if you, say hotplug
an AMD chip into a previously Intel socket.

Sure, most of this is science fiction. But the possibilities are great
- it's another tool you can use towards modularizing functionality -
specifically, scattered functionality like CPU instructions, spinlocks,
and MMU operations that really do deserve to be inlined, and really can
benefit from taking advantage of faster hardware instruction sequences.
That the tool already exists in a limited form means that with natural
extensions, it could easily be refined to allow bi-directional or
multidirectional run-time choices.

Basically, it removes a lot of the barriers that force configuration
time choices on the running kernel, and you can start to look at even
deeply entrenched parts of the kernel as modular.

Zach

2005-11-15 16:25:51

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

On Tue, 2005-11-15 at 11:19 -0500, Dave Jones wrote:
> On Tue, Nov 15, 2005 at 08:14:29AM -0800, H. Peter Anvin wrote:
> > Dave Jones wrote:
> > >On Tue, Nov 15, 2005 at 05:06:03PM +0100, Arjan van de Ven wrote:
> > >
> > > > > > You still need to preserve the originals so that you can patch in
> > > > > > both directions.
> > > >
> > > > why do you insist on both directions? That still sounds like real
> > > > overkill to me.
> > >
> > >cpu hotplug going from UP to SMP ? :)
> > >
> >
> > If you have CPU hotplug enabled, you can run SMP code!
>
> Sure, but if you boot with 1 CPU, spinlocks get nop'd to emulate UP,
> and on a 'installed a new cpu' hotplug event, they all come back.

the good news is that all hotpluggable x86 cpus will have HT or dual core
support.. so you always work in pairs of 2


2005-11-15 16:27:56

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Dave Jones wrote:
> On Tue, Nov 15, 2005 at 08:12:36AM -0800, Linus Torvalds wrote:
>
> > On Tue, 15 Nov 2005, Gerd Knorr wrote:
> > > i.e. something like this (as basic idea, patch is far away from doing anything
> > > useful ...)?
> >
> > Can't work. The altinstructions are in init-code/data, and will be free'd
> > after boot. Which is as it should be.

Good point, so better place the ones we need at runtime into a separate
table ...

> Hmmm, what about modules ?

We'll need some new fields in struct module ...

Is already on the list in my head, but I didn't bother yet for that
proof-of-concept discussion patch ;)

Gerd

2005-11-15 16:30:15

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Dave Jones wrote:
> >
> > If you have CPU hotplug enabled, you can run SMP code!
>
> Sure, but if you boot with 1 CPU, spinlocks get nop'd to emulate UP,
> and on a 'installed a new cpu' hotplug event, they all come back.
>

The point is that you don't nop if you have hotplug enabled (which is not
the norm.)

-hpa

2005-11-15 16:34:21

by Zachary Amsden

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Arjan van de Ven wrote:

>the good news is that all hotpluggable x86 cpus will have HT or dual core
>support.. so you always work in pairs of 2
>
>

While there are good arguments for a pure SMP hardware world, there are
good arguments in the virtualization world for virtual uniprocessors.
No one can be sure which combination of HT / polycore / package
isolation we will end up with, but we can be sure that legacy systems
will be around for longer than anyone wants to think about. As this CR4
valid on 486s patch proves :)

2005-11-15 16:53:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s



On Tue, 15 Nov 2005, Zachary Amsden wrote:
>
> It's not overkill in the virtualization context, and there are (struggling,
> but infinite possibilities) opportunities for native here as well.

No, there are almost no opportunities for native.

Especially with SMP, doing on-line code switching is really really nasty.
You basically have to shut down all CPU's to make sure there are no races
with other CPU's executing the code while it's being rewritten.

I'd be very very nervous about it. It would have to be some major
performance feature for it to make sense over a simple "switch function
pointers around" approach.

Linus
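
A minimal sketch of the "switch function pointers around" approach
mentioned above, written as userspace C11 rather than kernel code. The
names counter_inc, counter_inc_smp, counter_inc_up and select_locking
are hypothetical. Callers go through one pointer, and switching between
the UP and SMP variants is a single pointer store instead of rewriting
instructions in place.

```c
#include <assert.h>
#include <stdatomic.h>

/* SMP variant: a real atomic read-modify-write (a lock-prefixed
 * instruction on x86). */
static void counter_inc_smp(atomic_int *c)
{
    atomic_fetch_add(c, 1);
}

/* UP variant: separate load and store, no locked operation. Only safe
 * while a single CPU can touch the counter. */
static void counter_inc_up(atomic_int *c)
{
    int v = atomic_load_explicit(c, memory_order_relaxed);
    atomic_store_explicit(c, v + 1, memory_order_relaxed);
}

/* The indirection: every caller increments through this pointer. */
static void (*counter_inc)(atomic_int *) = counter_inc_smp;

/* Hypothetical hook, to be called while no other CPU runs this code. */
static void select_locking(int online_cpus)
{
    counter_inc = (online_cpus > 1) ? counter_inc_smp : counter_inc_up;
}
```

The trade-off is an indirect call on every operation, which is why this
only competes with code patching when the patched sequence is not a
major performance win.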

2005-11-16 09:58:48

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [PATCH 1/10] Cr4 is valid on some 486s

Roland Dreier wrote:
> > +#define alternative_smp(smpinstr, upinstr) asm(upinstr, ##input)
>
> this wouldn't build with CONFIG_SMP=n -- you forgot the input param here.

Yep, and I've noticed meanwhile that it becomes quite messy if you try
to do that with asm instructions which have both input and output
parameters. One way around that would be to use named parameters in the
inline assembler. Problem with that is that only gcc >= 3.1 understands
those and at the moment the minimum required compiler for the kernel
still is gcc 2.95.3 according to Documentation/Changes ...

Is it an option to raise the required gcc version to 3.x, given that
even Debian/stable ships with gcc 3.3 these days?

cheers,

Gerd

2005-11-16 16:13:09

by Gerd Hoffmann

[permalink] [raw]
Subject: [RFC] SMP alternatives

Gerd Knorr wrote:

> i.e. something like this (as basic idea, patch is far away from doing
> anything useful ...)?

Adapting $subject to the actual topic, so other lkml readers can catch up ;)

Ok, here is a new version of the SMP alternatives patch. It features:

* it actually compiles and boots, so you can start playing with it ;)
* reuses the alternatives bits we have already as far as possible.
* separate table for the SMP alternatives, so we can keep them and
switch at runtime between SMP and UP (for virtual CPU hotplug).
* two new alternatives macros, one generic which can handle quite
complex stuff such as spinlocks, one for the "lock prefix" case.

TODO list:

* convert more code (bitops, ...).
* module support (using modules is fine, they run the safe SMP
version of the code, they just don't benefit from the optimizations
yet).
* integrate with xen bits and CPU hotplug, at the moment it's a
boot-time only thing.
* benchmark it.
* x86_64 version.
* drop the printk's placed into the code for debugging.
* probably more ...

How it works right now:

* The patch switches to UP unconditionally when doing the usual
alternatives stuff at boot time
* Just before booting the second CPU it switches to SMP.

How to test:

* boot with "maxcpus=1" to run the UP code.

Comments are welcome.

cheers,

Gerd


Attachments:
smp-alternatives-7.diff (11.77 kB)
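
The "lock prefix" case described in this message can be pictured with a
small userspace sketch. It assumes that the separate SMP table records
the offset of each lock prefix and that switching to UP overwrites the
prefix with a one-byte nop. Both are assumptions made for illustration,
not details taken from the attached diff, and struct lock_site,
switch_to_up and switch_to_smp are hypothetical names. The "text" here
is a plain byte buffer, not running code.

```c
#include <assert.h>
#include <stddef.h>

#define LOCK_PREFIX 0xf0 /* x86 lock prefix byte */
#define NOP         0x90 /* x86 one-byte nop */

/* Hypothetical table entry: offset of one lock prefix inside the text. */
struct lock_site {
    size_t offset;
};

/* SMP -> UP: overwrite every recorded lock prefix with a nop. */
static void switch_to_up(unsigned char *text,
                         const struct lock_site *sites, size_t n)
{
    for (size_t i = 0; i < n; i++)
        text[sites[i].offset] = NOP;
}

/* UP -> SMP: put the lock prefixes back before a second CPU starts. */
static void switch_to_smp(unsigned char *text,
                          const struct lock_site *sites, size_t n)
{
    for (size_t i = 0; i < n; i++)
        text[sites[i].offset] = LOCK_PREFIX;
}
```

Because the table is kept after boot, the same entries serve both
directions, which is the point of the separate table listed in the
feature list.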

2005-11-22 17:48:29

by Gerd Hoffmann

[permalink] [raw]
Subject: [patch] SMP alternatives

Gerd Knorr wrote:
> Gerd Knorr wrote:
>
>> i.e. something like this (as basic idea, patch is far away from doing
>> anything useful ...)?
>
> Adapting $subject to the actual topic, so other lkml readers can catch
> up ;)
>
> Ok, here is a new version of the SMP alternatives patch. It features:

Now, some days hacking & debugging and kernel crashing later I have
something more than just proof-of-concept ;)

Modules are supported now, fully modularized distro kernel works fine
with it. If you have a kernel with HOTPLUG_CPU compiled you can
shutdown the second CPU of your dual-processor system via sysfs (echo 0
> /sys/devices/system/cpu/cpu1/online) and watch the kernel switch over
to UP code without lock-prefixed instructions and simplified spinlocks,
then power up the second CPU again (echo 1 > /sys/...) and watch it
patching back in the SMP locking.

For testing & benchmarking purposes I've put also in two (temporary)
sysrq's to switch between UP and SMP bits without booting/shutting down
the second CPU. That one breaks non-i386 builds which are trivially
fixable by just dropping the drivers/char/sysrq.c changes ;)

enjoy,

Gerd


Attachments:
smp-alternatives-22.diff (41.94 kB)

2005-11-22 18:02:15

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Hi!

> For testing & benchmarking purposes I've put also in two (temporary)
> sysrq's to switch between UP and SMP bits without booting/shutting down
> the second CPU. That one breaks non-i386 builds which are trivially
> fixable by just dropping the drivers/char/sysrq.c changes ;)

> +/* Replace instructions with better alternatives for this CPU type.
> +
> + This runs before SMP is initialized to avoid SMP problems with
> + self modifying code. This implies that assymetric systems where
> + APs have less capabilities than the boot processor are not handled.
> + Tough. Make sure you disable such features by hand. */
> +void apply_alternatives(struct alt_instr *start, struct alt_instr *end,
> + __u8 *tstart, __u8 *tend)
> +{
> + unsigned char **noptable = intel_nops;
> + struct alt_instr *a;

Some alignment problems here. (Maybe it is okay as a source).

> +struct smp_alt_module {
> + /* what is this ??? */

:-))))))).

> + struct module *mod;
> + char *name;
> +
> + /* our SMP alternatives table */
> + struct alt_instr *astart;
> + struct alt_instr *aend;
> +
> + /* .text segment, needed to avoid patching init code ;) */
> + __u8 *tstart;
> + __u8 *tend;

You should be able to use u8 here.

> + if (0 == strcmp(".text", secstrings + s->sh_name))
> + text = s;
> + if (0 == strcmp(".altinstructions", secstrings + s->sh_name))
> + alt = s;
> + if (0 == strcmp(".smp_altinstructions", secstrings + s->sh_name))
> + smpalt = s;

Can we get if (!strcmp()) here?

Pavel

--
Thanks, Sharp!

2005-11-23 14:51:27

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Gerd Knorr <[email protected]> writes:

> Now, some days hacking & debugging and kernel crashing later I have
> something more than just proof-of-concept ;)
>
> Modules are supported now, fully modularized distro kernel works fine
> with it. If you have a kernel with HOTPLUG_CPU compiled you can
> shutdown the second CPU of your dual-processor system via sysfs (echo
> 0 > /sys/devices/system/cpu/cpu1/online) and watch the kernel switch
> over to UP code without lock-prefixed instructions and simplified
> spinlocks, then power up the second CPU again (echo 1 > /sys/...) and
> watch it patching back in the SMP locking.

This looks like total overkill to me. Who needs to optimize
CPU hotplug this way? If you really need this just do it
at boot time with the existing mechanisms. This would keep
it much simpler and simplicity is very important with
such code because otherwise the testing of all the corner
cases will kill you.

BTW the existing mechanism already works fine for modules too.

> + /* Paranoia */
> + asm volatile ("jmp 1f\n1:");
> + mb();

That would be totally obsolete 386 era paranoia. If anything then use
a CLFLUSH (but not available on all x86s)

-Andi

2005-11-23 15:12:17

by Vincent Hanquez

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Tue, Nov 22, 2005 at 06:48:07PM +0100, Gerd Knorr wrote:
> + smp = kmalloc(sizeof(*smp), GFP_KERNEL);
> + if (NULL == smp)
> + return; /* we'll run the (safe but slow) SMP code then ... */
> +
> + memset(smp,0,sizeof(*smp));

what about using kzalloc ?

> + if (ALT_UP == smp_alt_state)
> + goto out;

any chance to write it smp_alt_state == ALT_UP ?

IMHO, this way of writing an equality test is backward (like giving
the answer before asking the question). I do know of the (pseudo-)benefit
of writing it this way, but it's not worth it.

Plus, nowadays, gcc warns you about a single '=' in an if condition.

Cheers,
--
Vincent Hanquez

2005-11-23 15:29:10

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
> Gerd Knorr <[email protected]> writes:
>
>> Modules are supported now, fully modularized distro kernel works fine
>> with it. If you have a kernel with HOTPLUG_CPU compiled you can
>> shutdown the second CPU of your dual-processor system via sysfs (echo
>> 0 > /sys/devices/system/cpu/cpu1/online) and watch the kernel switch
>> over to UP code without lock-prefixed instructions and simplified
>> spinlocks, then power up the second CPU again (echo 1 > /sys/...) and
>> watch it patching back in the SMP locking.
>
> This looks like total overkill to me. Who needs to optimize
> CPU hotplug this way? If you really need this just do it
> at boot time with the existing mechanisms.

Sure, for real hardware doing that at boot time is perfectly fine.
In a virtual environment it's very useful to be able to plug in one more
virtual CPU on demand without rebooting though. The patch isn't very
useful alone, it's more one step on the road of getting the xen bits
merged mainline.

>> + /* Paranoia */
>> + asm volatile ("jmp 1f\n1:");
>> + mb();
>
> That would be totally obsolete 386 era paranoia. If anything then use
> a CLFLUSH (but not available on all x86s)

Ok, dropped. I've just copied that from the original, pretty ugly xen
patch.

cheers,

Gerd

2005-11-23 16:10:31

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Mer, 2005-11-23 at 12:17 -0700, Andi Kleen wrote:
> > + /* Paranoia */
> > + asm volatile ("jmp 1f\n1:");
> > + mb();
>
> That would be totally obsolete 386 era paranoia. If anything then use
> a CLFLUSH (but not available on all x86s)

If you are patching code that another x86 CPU is running you must halt
the other processors and ensure each one executes a serializing
instruction before it enters any patched code.

How many kilobytes of tables do you add to the kernel to do this
pointless stunt btw ?

Alan "CPU errata are fun" Cox

2005-11-23 16:39:17

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 04:42:13PM +0000, Alan Cox wrote:
> On Mer, 2005-11-23 at 12:17 -0700, Andi Kleen wrote:
> > > + /* Paranoia */
> > > + asm volatile ("jmp 1f\n1:");
> > > + mb();
> >
> > That would be totally obsolete 386 era paranoia. If anything then use
> > a CLFLUSH (but not available on all x86s)
>
> If you are patching code that another x86 CPU is running you must halt
> the other processors and ensure each one executes a serializing
> instruction before it enters any patched code.

Yes that is why the original alternative() mechanism always only
runs before the code is ever executed.

> How many kilobytes of tables do you add to the kernel to do this
> pointless stunt btw ?

I much prefer the MSR bit too. Unfortunately it doesn't exist
(or rather I bet it exists somewhere, just undocumented) on Intel
systems.

-Andi

2005-11-23 16:43:37

by Gerd Hoffmann

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Alan Cox wrote:
> On Mer, 2005-11-23 at 12:17 -0700, Andi Kleen wrote:
>>> + /* Paranoia */
>>> + asm volatile ("jmp 1f\n1:");
>>> + mb();
>> That would be totally obsolete 386 era paranoia. If anything then use
>> a CLFLUSH (but not available on all x86s)
>
> If you are patching code that another x86 CPU is running you must halt
> the other processors and ensure each one executes a serializing
> instruction before it enters any patched code.

Patching in/out SMP-locking with more than one active CPU would be a
pretty silly idea in the first place ;)

> How many kilobytes of tables do you add to the kernel to do this
> pointless stunt btw ?

16 .smp_altinstructions 0000ae0b c03b4000 003b4000 002b4000 2**2
CONTENTS, ALLOC, LOAD, READONLY, DATA
17 .smp_altinstr_replacement 00000f6e c03bee0b 003bee0b 002bee0b 2**0
CONTENTS, ALLOC, LOAD, CODE

cheers,

Gerd
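
To turn the section listing above into the kilobyte figure that was
asked for, the two size fields can simply be added up. The sketch below
does only that arithmetic. The constants are copied from the listing,
and the function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>

/* Sizes taken from the section listing quoted in the message above. */
#define SMP_ALTINSTRUCTIONS_BYTES      0xae0bUL /* .smp_altinstructions */
#define SMP_ALTINSTR_REPLACEMENT_BYTES 0x0f6eUL /* .smp_altinstr_replacement */

/* Total size of both SMP alternatives sections in bytes. */
static unsigned long smp_alt_total_bytes(void)
{
    return SMP_ALTINSTRUCTIONS_BYTES + SMP_ALTINSTR_REPLACEMENT_BYTES;
}

/* Print the total in bytes and in KiB. */
static void print_smp_alt_overhead(void)
{
    unsigned long total = smp_alt_total_bytes();

    printf("%lu bytes (about %.1f KiB)\n", total, total / 1024.0);
}
```

That is 44555 + 3950 = 48505 bytes, or roughly 47 KiB, for this
particular build.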

2005-11-23 16:49:19

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Mer, 2005-11-23 at 17:39 +0100, Andi Kleen wrote:
> I much prefer the MSR bit too. Unfortunately it doesn't exist
> (or rather I bet it exists somewhere, just undocumented) on Intel
> systems.

The MSR bits will break things like ECC scrubbing however. That can be
addressed although the test patch I have just refuses to load EDAC if
the BIOS writers didn't follow the BIOS guidelines.

Certainly it would be cleaner and easier to save the MSR, scrub and put
it back than do the fixup magic. Some drivers would need auditing as
they seem to use locked ops or xchg (implicit lock) to lock with a PCI
DMA master.

2005-11-23 16:52:36

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Gerd Knorr wrote:
>
> Patching in/out SMP-locking with more than one active CPU would be a
> pretty silly idea in the first place ;)
>

No, doing this crap with CPU hotplug is a silly idea. Patching on a
real UP system, and then throwing out the tables, makes sense. Keeping
two sets of tables for a minimal performance improvement in a very rare
configuration (CPU hotplug is the exception, not the rule) is just plain
stupid. You probably lose as much performance from the memory hogged up
in the tables as you gain from it, and on every system where you have
the tables at all you take the hit.

-hpa

2005-11-23 16:59:25

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 05:21:29PM +0000, Alan Cox wrote:
> On Mer, 2005-11-23 at 17:39 +0100, Andi Kleen wrote:
> > I much prefer the MSR bit too. Unfortunately it doesn't exist
> > (or rather I bet it exists somewhere, just undocumented) on Intel
> > systems.
>
> The MSR bits will break things like ECC scrubbing however. That can be

You mean it might break an insane hack in someone's ECC scrubbing
implementation. But last time I talked to people about this
they suggested just using an uncacheable mapping instead of
this horrible thing. Uncached is actually what you want there,
not relying on some undocumented lock bus cycle behaviour.

IMHO that would be much better and actually
have a chance of working over multiple generations.

> Certainly it would be cleaner and easier to save the MSR, scrub and put
> it back than do the fixup magic. Some drivers would need auditing as
> they seem to use locked ops or xchg (implicit lock) to lock with a PCI
> DMA master.

Which drivers? I don't think there is anything in tree. I went
over all the drivers early in the x86-64 port.

I'm sure I would have noticed because they very likely needed inline
assembly for this and this generally broke when moving to x86-64.

DRM did some tricks, but generally not with the hardware I believe,
only between user/kernel space.

-Andi

2005-11-23 17:03:09

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Alan Cox wrote:
>
> The MSR bits will break things like ECC scrubbing however. That can be
> addressed although the test patch I have just refuses to load EDAC if
> the BIOS writers didn't follow the BIOS guidelines.
>
> Certainly it would be cleaner and easier to save the MSR, scrub and put
> it back than do the fixup magic. Some drivers would need auditing as
> they seem to use locked ops or xchg (implicit lock) to lock with a PCI
> DMA master.

What I suggested to Intel at the Developer Days is to have a MSR (or,
better yet, a bit in the page table pointer %cr0) that disables "lock" in
_user_ space. Ie a lock would be a no-op when in CPL3, and only with
certain processes.

The kernel really isn't that critical. We always need the locks in SMP
(unlike user space, which never needs them if the process isn't threaded),
and in the kernel space we occasionally need it even with UP to protect
against devices. And we _can_ do these instruction rewrites, and they are
even pretty trivial for the non-hotplug case.

User space is actually a lot more important. People spend more time in
user space, and there the lock prefix is much more often totally useless
and cannot just be edited away once per boot.

Linus

2005-11-23 18:03:42

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:
> What I suggested to Intel at the Developer Days is to have a MSR (or,
> better yet, a bit in the page table pointer %cr0) that disables "lock" in
> _user_ space. Ie a lock would be a no-op when in CPL3, and only with
> certain processes.

You mean %cr3, right?

-hpa

2005-11-23 18:43:15

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>
> Linus Torvalds wrote:
> > What I suggested to Intel at the Developer Days is to have a MSR (or, better
> > yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
> > space. Ie a lock would be a no-op when in CPL3, and only with certain
> > processes.
>
> You mean %cr3, right?

Yes.

It _should_ be fairly easy to do something like that - just a simple
global flag that gets set and makes CPL3 ignore lock prefixes. Even timing
doesn't matter - if it takes a hundred cycles for the setting to take
effect, we don't care, since you can't write %cr3 from user space anyway,
and it will certainly take a hundred cycles (and a few serializing
instructions) until we get to CPL3.

I'd personally prefer it to be in %cr3, since we'd have to reload it on
task switching, and that's one of the registers we load anyway. And it
would make sense. But it could be in an MSR too.

Of course, if it's in one of the low 12 bits of %cr3, there would have to
be a "enable this bit" in %cr4 or something. Historically, you could write
any crap in the low bits, I think.

Linus

2005-11-23 18:46:48

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 10:42:40AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 23 Nov 2005, H. Peter Anvin wrote:
> >
> > Linus Torvalds wrote:
> > > What I suggested to Intel at the Developer Days is to have a MSR (or, better
> > > yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
> > > space. Ie a lock would be a no-op when in CPL3, and only with certain
> > > processes.
> >
> > You mean %cr3, right?
>
> Yes.
>
> It _should_ be fairly easy to do something like that - just a simple
> global flag that gets set and makes CPL3 ignore lock prefixes. Even timing
> doesn't matter - if it takes a hundred cycles for the setting to take
> effect, we don't care, since you can't write %cr3 from user space anyway,
> and it will certainly take a hundred cycles (and a few serializing
> instructions) until we get to CPL3.

Another bit for ring 0 would be actually useful too. Then the patching
patch here wouldn't be needed.

-Andi

2005-11-23 18:51:21

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:

>On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>
>
>>Linus Torvalds wrote:
>>
>>
>>>What I suggested to Intel at the Developer Days is to have a MSR (or, better
>>>yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
>>>space. Ie a lock would be a no-op when in CPL3, and only with certain
>>>processes.
>>>
>>>
>>You mean %cr3, right?
>>
>>
>
>Yes.
>
>It _should_ be fairly easy to do something like that - just a simple
>global flag that gets set and makes CPL3 ignore lock prefixes. Even timing
>doesn't matter - if it takes a hundred cycles for the setting to take
>effect, we don't care, since you can't write %cr3 from user space anyway,
>and it will certainly take a hundred cycles (and a few serializing
>instructions) until we get to CPL3.
>
>I'd personally prefer it to be in %cr3, since we'd have to reload it on
>task switching, and that's one of the registers we load anyway. And it
>would make sense. But it could be in an MSR too.
>
>Of course, if it's in one of the low 12 bits of %cr3, there would have to
>be a "enable this bit" in %cr4 or something. Historically, you could write
>any crap in the low bits, I think.
>
> Linus
>
The lock prefix '0F' is used for a lot of opcodes other than "lock". Go
check the instruction set reference. It's not
trivial what you are proposing. Intel has a pretty hacked up opcode map
with a lot of history. The bit should be in
CR4 and not CR3.

J

2005-11-23 19:03:49

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Jeff V. Merkey wrote:
>
> The lock prefix '0F' is used for a lot of opcodes other than "lock". Go check
> the instruction set reference.

No it's not.

0F is indeed the two-byte prefix. But lock is F0, and it's unique.

Sometimes Intel re-uses the prefixes for other things eg "rep nop", but I
don't think that has ever happened for the lock prefix.

Besides, the instructions look very different internally in the CPU after
decoding, and anyway you'd not want to ignore the lock prefix _early_ at
decode time anyway (many instructions turn into illegal instructions with
a lock prefix, as do reg-reg modrm bytes). So you'd dismiss the lock
prefix not at a byte level, but at a minimum just after the decode stage.

Linus
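
The byte values under discussion can be written down as a tiny
classifier. This is just an illustration of the distinction made above
(F0 is the lock prefix, 0F is the two-byte opcode escape, and "rep nop"
is F3 followed by 90). The helper names are invented for the example and
it is not a real instruction decoder.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define X86_LOCK_PREFIX     0xf0 /* lock prefix */
#define X86_TWO_BYTE_ESCAPE 0x0f /* escape to the two-byte opcode map */
#define X86_REP_PREFIX      0xf3 /* rep prefix; F3 90 is "rep nop" */
#define X86_NOP             0x90 /* one-byte nop */

static bool is_lock_prefix(unsigned char b)
{
    return b == X86_LOCK_PREFIX;
}

static bool is_two_byte_escape(unsigned char b)
{
    return b == X86_TWO_BYTE_ESCAPE;
}

/* "rep nop": the rep prefix re-used in front of the nop opcode. */
static bool is_rep_nop(const unsigned char *insn, size_t len)
{
    return len >= 2 && insn[0] == X86_REP_PREFIX && insn[1] == X86_NOP;
}
```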

2005-11-23 19:13:06

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:
>
> Of course, if it's in one of the low 12 bits of %cr3, there would have to
> be a "enable this bit" in %cr4 or something. Historically, you could write
> any crap in the low bits, I think.
>

No, most of them are RAZ, but there are at least a couple of bits which
have effect (e.g. caching of the page tables.)

However, with PAE there aren't really a whole lot of unused bits in CR3.

-hpa
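
To make the bit budget concrete, here is a small sketch of the 32-bit
%cr3 layout as commonly documented: without PAE the page directory base
occupies bits 31:12, with PWT at bit 3 and PCD at bit 4; with PAE the
page-directory-pointer table is only 32-byte aligned, so the base
occupies bits 31:5. The macro and function names are chosen for this
example, and how the remaining PAE low bits are treated is deliberately
left out.

```c
#include <assert.h>
#include <stdint.h>

#define CR3_PWT (1u << 3) /* page-level write-through (non-PAE layout) */
#define CR3_PCD (1u << 4) /* page-level cache disable (non-PAE layout) */

#define CR3_BASE_MASK_NONPAE 0xfffff000u /* bits 31:12 */
#define CR3_BASE_MASK_PAE    0xffffffe0u /* bits 31:5 */

/* Page directory base address held in a non-PAE %cr3 value. */
static uint32_t cr3_base_nonpae(uint32_t cr3)
{
    return cr3 & CR3_BASE_MASK_NONPAE;
}

/* Low bits of a non-PAE %cr3 that are neither base, PWT nor PCD. */
static uint32_t cr3_spare_bits_nonpae(void)
{
    return ~CR3_BASE_MASK_NONPAE & ~(CR3_PWT | CR3_PCD);
}

/* All bits below the base field when PAE is enabled. */
static uint32_t cr3_low_bits_pae(void)
{
    return ~CR3_BASE_MASK_PAE;
}
```

Twelve low bits shrink to five once PAE is on, which is the shortage
pointed out above.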

2005-11-23 19:42:05

by jmerkey

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 11:12:01AM -0800, H. Peter Anvin wrote:
> Linus Torvalds wrote:
> >
> >Of course, if it's in one of the low 12 bits of %cr3, there would have to
> >be a "enable this bit" in %cr4 or something. Historically, you could write
> >any crap in the low bits, I think.
> >
>
> No, most of them are RAZ, but there are at least a couple of bits which
> have effect (e.g. caching of the page tables.)
>
> However, with PAE there aren't really a whole lot of unused bits in CR3.
>
> -hpa

Changing CR3 will break compatibility with Windows and interfere with
Intel's Bread and Butter gravy Train with M$. CR4 was created to deal with
some of the legacy issues with backward compatibility with OS's that read
CR3. Messing with CR3 will break Windows.

They won't do anything that will upset the apple cart with M$. I dealt with
Intel folks for years when Linux was unknown. They look and act like boy
scouts -- don't be fooled -- they're totally an M$ shop, always have been,
always will be. Linux was and is an interesting brain fart on their radar.
Their interests in it are solely based on their internal "Rabbits and Dogs"
Andy Grove mentality. They say there are rabbits and dogs. Rabbits run out
front, dogs chase the rabbits. Intel does business with rabbits. Linux is a
dog -- it chases after innovators and replicates their work. The fact it's
free is the basis of their interest.

Those interests do not extend to anything that interferes with their M$
relationship. Push for CR4, they might agree, but be assured your request
will pass Ballmer's desk before it gets approved.

J

2005-11-23 19:43:09

by jmerkey

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 11:03:15AM -0800, Linus Torvalds wrote:
>
>
> On Wed, 23 Nov 2005, Jeff V. Merkey wrote:
> >
> > The lock prefix '0F' is used for a lot of opcodes other than "lock". Go check
> > the instruction set reference.
>
> No it's not.
>
> 0F is indeed the two-byte prefix. But lock is F0, and it's unique.
>
> Sometimes Intel re-uses the prefixes for other things eg "rep nop", but I
> don't think that has ever happened for the lock prefix.
>
> Besides, the instructions look very different internally in the CPU after
> decoding, and anyway you'd not want to ignore the lock prefix _early_ at
> decode time anyway (many instructions turn into illegal instructions with
> a lock prefix, as do reg-reg modrm bytes). So you'd dismiss the lock
> prefix not at a byte level, but at a minimum just after the decode stage.
>
> Linus

I always get numbers and words transposed. Thanks for the correction.

J

2005-11-23 21:12:13

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Mer, 2005-11-23 at 10:42 -0800, Linus Torvalds wrote:
> Of course, if it's in one of the low 12 bits of %cr3, there would have to
> be a "enable this bit" in %cr4 or something. Historically, you could write
> any crap in the low bits, I think.

There is a much much better way to do it than just user space and
without hitting cr3/cr4 - put "lock works" in the PAT. We'll have to
add PAT support, which we need to do anyway, but we would get a world
where on uniprocessor the lock prefix only works on address targets we
want it to - ie pci_alloc_consistent() pages.

Alan

2005-11-23 21:14:06

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 09:44:05PM +0000, Alan Cox wrote:
> On Mer, 2005-11-23 at 10:42 -0800, Linus Torvalds wrote:
> > Of course, if it's in one of the low 12 bits of %cr3, there would have to
> > be a "enable this bit" in %cr4 or something. Historically, you could write
> > any crap in the low bits, I think.
>
> There is a much much better way to do it than just user space and
> without hitting cr3/cr4 - put "lock works" in the PAT and while we'll
> have to add PAT support which we need to do anyway we would get a world
> where on uniprocessor lock prefix only works on address targets we want
> it to - ie pci_alloc_consistent() pages.

The idea was to turn LOCK on only if the process has any
shared writable mapping and num_online_cpus() > 0.

Might be a bit costly to rewrite all the page tables for that case
just to change the PAT index. A bit is nicer for that.

-Andi

2005-11-23 21:28:31

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Mer, 2005-11-23 at 17:59 +0100, Andi Kleen wrote:
> You mean it might break an insane hack in someone's ECC scrubbing
> implementation. But last time I talked to people about this
> they suggested just using an uncacheable mapping instead of
> this horrible thing. Uncached is actually what you want there,
> not relying on some undocumented lock bus cycle behaviour.

I reviewed your suggestions before:

1. The lock behaviour *is* defined for main memory access by all bus
masters.
2. Uncached mappings are unworkable for this because we must never have
a page mapped with conflicting cache types - that's ugly, and plain
horrific on SMP.
3. Uncached has undefined semantics when racing a PCI master. Lock has
defined semantics. An uncached add #0 is permitted to read the memory
and then write it back as two different cycles and I suspect does.
4. The AMD BIOS guide requires both that LOCK is enabled by default and
that the "lock affects the external bus" bit is clear to enable locking
on the external bus.

The problems we would have would be if some smartarse chip designer
optimised lock addl #0 to a no-op, and the fact that we possibly ought to
wbinvd the page and read it again to check the ECC recovery worked.

> Which drivers? I don't think there is anything in tree. I went
> over all the drivers early in the x86-64 port.

eepro100 used to from memory but it now carefully does a 16bit I/O.

> I'm sure I would have noticed because they very likely needed inline
> assembly for this and this generally broke when moving to x86-64.

Or use xchg() which is implied lock on x86 so isn't magically made
non-atomic by the LOCK macros. I did a sweep for xchg and didn't see any
problem.

> DRM did some tricks, but generally not with the hardware I believe,
> only between user/kernel space.

It does it extensively for user/kernel. Whether it does it with the GPU
in user space I can't say. The cards I am familiar with do not.

Alan
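
The remark about xchg() rests on an x86 property: an xchg with a memory operand is locked implicitly, with or without a lock prefix, so stripping prefixes does not make it non-atomic. A minimal user-space sketch of an exchange-based flag follows. It uses the GCC/Clang `__atomic_exchange_n` builtin, which on x86 is normally compiled to an xchg instruction; the helper names are hypothetical and this is not the kernel's xchg() implementation.

```c
#include <assert.h>
#include <stddef.h>

/* Atomically store new_value into *word and return the previous contents. */
static inline unsigned int exchange_word(unsigned int *word, unsigned int new_value)
{
	assert(word != NULL);
	return __atomic_exchange_n(word, new_value, __ATOMIC_SEQ_CST);
}

/* Hypothetical test-and-set flag built on the exchange: 1 on success. */
static inline int try_acquire(unsigned int *lock_word)
{
	return exchange_word(lock_word, 1u) == 0u;
}

/* Release the flag with a plain atomic store. */
static inline void release(unsigned int *lock_word)
{
	assert(lock_word != NULL);
	__atomic_store_n(lock_word, 0u, __ATOMIC_RELEASE);
}
```

Because the atomicity comes from the instruction itself, code like this would be unaffected by a scheme that patches out or ignores explicit lock prefixes.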

2005-11-23 21:33:19

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Mer, 2005-11-23 at 22:13 +0100, Andi Kleen wrote:
> The idea was to turn LOCK on only if the process has any
> shared writable mapping and num_online_cpus() > 0.

That makes a lot of sense, and if we hit hardware that does funky stuff
then the driver can set a 'vma needs lock' bit for the same effect.

> Might be a bit costly to rewrite all the page tables for that case
> just to change the PAT index. A bit is nicer for that.

CPU insert/remove is performed how many times a second ? Or for that
matter why not just reload the PAT register and keep the index the
same ?

2005-11-23 21:36:53

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Alan Cox wrote:
>
> There is a much much better way to do it than just user space and
> without hitting cr3/cr4 - put "lock works" in the PAT and while we'll
> have to add PAT support which we need to do anyway we would get a world
> where on uniprocessor lock prefix only works on address targets we want
> it to - ie pci_alloc_consistent() pages.

No. That would be wrong.

The thing is, "lock" is useless EVEN ON SMP in user space 99% of the time.

Think of all the thread locking in libc - where 99% of all processes are
single-threaded, and it does nothing but slow things down.

Actual UP machines are going to go away - even ARM is going SMP, and in
the PC space, we'll have multi-core laptops probably being the rule rather
than the exception in a couple of years. So the kernel will need "lock" in
the foreseeable future, and optimizing for UP is a lost game.

But optimizing for a single _thread_ is not a lost game. I don't believe
that threaded applications are necessarily going to take over all that
much in a lot of areas. Sure, we'll have more threaded apps too, but we'll
continue to have tons more of performance-critical non-threaded things
like compilers etc.

And _that_ is worth optimizing for. General libraries that have to be able
to handle the threaded case dynamically, but that are often run with no
shared memory anywhere.

THAT is what I'd like to have CPU support for. Not for UP (it's going
away), and not for the kernel (it's never single-threaded).

Linus

2005-11-23 21:37:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 10:05:40PM +0000, Alan Cox wrote:
> > Might be a bit costly to rewrite all the page tables for that case
> > just to change the PAT index. A bit is nicer for that.
>
> CPU insert/remove is performed how many times a second ? Or for that
> matter why not just reload the PAT register and keep the index the
> same ?

For user space the primary trigger event would be "has any shared
writable mappings or multiple threads". Even on a real MP system it's
perfectly ok to run a program with no writable shared mappings with LOCK off.
Depending on the workload this transition could happen quite often.
Especially there is a worst case of an application allocating a few
GB of memory and then starting a new thread.

-Andi
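
The trigger condition described above can be written down as a small predicate. This is a sketch under stated assumptions, not kernel code: `struct task_sharing` and `lock_prefix_needed()` are hypothetical names, and the CPU test uses "> 1", following the correction made elsewhere in this thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical per-process summary of what could race with a locked access. */
struct task_sharing {
	unsigned int nr_threads;			/* threads sharing the address space */
	unsigned int nr_shared_writable_mappings;	/* MAP_SHARED + PROT_WRITE mappings */
};

/* Should the lock prefix have an effect for this process? */
static bool lock_prefix_needed(const struct task_sharing *t, unsigned int online_cpus)
{
	assert(t != NULL);
	assert(online_cpus >= 1);	/* zero online CPUs cannot be running this */

	if (online_cpus <= 1)
		return false;		/* no other CPU to race with */

	return t->nr_threads > 1 || t->nr_shared_writable_mappings > 0;
}
```

The predicate itself is cheap; the cost being debated is in acting on a change of its value, for example when a large single-threaded process calls clone() for the first time.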

2005-11-23 21:37:23

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] SMP alternatives


> CPU insert/remove is performed how many times a second ? Or for that
> matter why not just reload the PAT register and keep the index the
> same ?

you also want this for single-threaded apps, so that the glibc locking
stuff can skip the lock for single-threaded apps and non-shared memory


2005-11-23 21:43:47

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

> THAT is what I'd like to have CPU support for. Not for UP (it's going
> away), and not for the kernel (it's never single-threaded).

There is one reasonably interesting special case that will probably stay
around: single CPU guest in a virtualized environment.

-Andi

2005-11-23 21:46:57

by Jeff Garzik

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
> The idea was to turn LOCK on only if the process has any
> shared writable mapping and num_online_cpus() > 0.

Yep. Though I presume you mean "> 1".

One hopes that num_online_cpus() never reaches zero during runtime ;-)

Jeff


2005-11-23 21:48:51

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 01:36:08PM -0800, Linus Torvalds wrote:
> But optimizing for a single _thread_ is not a lost game. I don't believe
> that threaded applications are necessarily going to take over all that
> much in a lot of areas. Sure, we'll have more threaded apps too, but we'll
> continue to have tons more of performance-critical non-threaded things
> like compilers etc.
>
> And _that_ is worth optimizing for. General libraries that have to be able
> to handle the threaded case dynamically, but that are often run with no
> shared memory anywhere.
>
> THAT is what I'd like to have CPU support for. Not for UP (it's going
> away), and not for the kernel (it's never single-threaded).

I don't think I see the point. This would let you optimize for the
"multi-threaded, but hasn't created any threads yet" or even
"multi-threaded, but not right now" cases. But those really aren't the
interesting case to optimize for - that's the equivalent of supporting
CPU hotplug.

The interesting case is when you know at static link time that the
library is single-threaded, or even at dynamic link time. And it's
easy enough at both of those times to handle this. In many cases glibc
doesn't, because it's valid to dlopen libpthread.so, but that could be
accommodated - a simple matter of software.

--
Daniel Jacobowitz
CodeSourcery, LLC

2005-11-23 21:55:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Daniel Jacobowitz wrote:
>
> I don't think I see the point. This would let you optimize for the
> "multi-threaded, but hasn't created any threads yet" or even
> "multi-threaded, but not right now" cases. But those really aren't the
> interesting case to optimize for - that's the equivalent of supporting
> CPU hotplug.
>
> The interesting case is when you know at static link time that the
> library is single-threaded, or even at dynamic link time. And it's
> easy enough at both of those times to handle this. In many cases glibc
> doesn't, because it's valid to dlopen libpthread.so, but that could be
> accommodated - a simple matter of software.
>

No, you can never know that unless you can't call mmap().

-hpa

2005-11-23 22:03:32

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 01:53:59PM -0800, H. Peter Anvin wrote:
> Daniel Jacobowitz wrote:
> >
> >I don't think I see the point. This would let you optimize for the
> >"multi-threaded, but hasn't created any threads yet" or even
> >"multi-threaded, but not right now" cases. But those really aren't the
> >interesting case to optimize for - that's the equivalent of supporting
> >CPU hotplug.
> >
> >The interesting case is when you know at static link time that the
> >library is single-threaded, or even at dynamic link time. And it's
> >easy enough at both of those times to handle this. In many cases glibc
> >doesn't, because it's valid to dlopen libpthread.so, but that could be
> >accommodated - a simple matter of software.
> >
>
> No, you can never know that unless you can't call mmap().

Please explain what problem you see. If you use mmap to manually load
libpthread.so, and patch up its relocations without going to ld.so,
obviously you get to keep both pieces. Or are you talking about
synchronizing access to shared mmaped buffers?

This is different from what Linus was talking about precisely because
we can do it imperatively ("I know this program is single-threaded and
I'm telling you so" instead of "Hmm, this program hasn't called clone
yet").

It's not as technologically slick but I'd need a lot of convincing to
believe it wasn't just as useful; and it has the benefit of not
requiring new silicon.

--
Daniel Jacobowitz
CodeSourcery, LLC

2005-11-23 22:09:50

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Daniel Jacobowitz wrote:
>
> Please explain what problem you see. If you use mmap to manually load
> libpthread.so, and patch up its relocations without going to ld.so,
> obviously you get to keep both pieces. Or are you talking about
> synchronizing access to shared mmaped buffers?
>

Yes. Any shared mmaps may require working lock.

-hpa

2005-11-23 22:13:57

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Alan Cox wrote:
>
> On Mer, 2005-11-23 at 22:13 +0100, Andi Kleen wrote:
> > The idea was to turn LOCK on only if the process has any
> > shared writable mapping and num_online_cpus() > 0.
>
> That makes a lot of sense, and if we hit hardware that does funky stuff
> then the driver can set a 'vma needs lock' bit for the same effect.
>
> > Might be a bit costly to rewrite all the page tables for that case
> > just to change the PAT index. A bit is nicer for that.
>
> CPU insert/remove is performed how many times a second ? Or for that
> matter why not just reload the PAT register and keep the index the
> same ?

It's not about CPU insert/remove.

It's about a single-threaded process becoming multi-threaded, ie a simple
"clone()" operation (or doing a shared mmap).

So it needs to be _fast_.

I would strongly argue that it's not a TLB/PAT operation at all. It has
nothing to do with the address of the operation. It's a global bit, and
it's in the cr3 just because that's what gets reloaded on task switching.
But it could be in the CS register too, for all I care..

Linus

2005-11-23 22:15:51

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Andi Kleen wrote:
>
> > THAT is what I'd like to have CPU support for. Not for UP (it's going
> > away), and not for the kernel (it's never single-threaded).
>
> There is one reasonably interesting special case that will probably stay
> around: single CPU guest in a virtualized environment.

.. and then the _virtualizer_ should just set the bit.

However, quite frankly, virtualization is overhyped, in my opinion. And if
it forces people to run UP because of performance issues, it's simply not
acceptable for a lot of loads.

It's cool technology and all, but realistically..

Linus

2005-11-23 22:18:15

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Mer, 2005-11-23 at 13:36 -0800, Linus Torvalds wrote:
> > have to add PAT support which we need to do anyway we would get a world
> > where on uniprocessor lock prefix only works on address targets we want
> > it to - ie pci_alloc_consistent() pages.
>
> No. That would be wrong.
>
> The thing is, "lock" is useless EVEN ON SMP in user space 99% of the time.

Now I see what you are aiming at, yes that makes vast amounts of sense
and since AMD have the "no lock effect" bit for general case maybe they
can


2005-11-23 22:19:42

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>
> I don't think I see the point. This would let you optimize for the
> "multi-threaded, but hasn't created any threads yet" or even
> "multi-threaded, but not right now" cases. But those really aren't the
> interesting case to optimize for - that's the equivalent of supporting
> CPU hotplug.

NO.

There is not a _single_ compiler that is multi-threaded, and I'd argue
that there probably never will be. It's pointless.

There's a _lot_ of really performance-sensitive stuff that will NEVER EVER
be threaded. You may run a hundred copies of them at the same time, but
every single copy will be single-threaded.

And this will optimize that case in a BIG way.

This is _not_ about "CPU hotplug". This is _not_ about "threaded apps
before they are threaded". This is all about the fact that serious
computation is done single-threaded, and anybody who thinks that
single-threading is going away is so totally out to lunch that it's not
even fun.

And yes, Sun will die. Single-thread performance matters a hell of a lot,
and any company that bets that it doesn't, is a failure.

Linus

2005-11-23 22:21:06

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 02:19:08PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> >
> > I don't think I see the point. This would let you optimize for the
> > "multi-threaded, but hasn't created any threads yet" or even
> > "multi-threaded, but not right now" cases. But those really aren't the
> > interesting case to optimize for - that's the equivalent of supporting
> > CPU hotplug.
>
> NO.
>
> There is not a _single_ compiler that is multi-threaded, and I'd argue
> that there probably never will be. It's pointless.
>
> There's a _lot_ of really performance-sensitive stuff that will NEVER EVER
> be threaded. You may run a hundred copies of them at the same time, but
> every single copy will be single-threaded.
>
> And this will optimize that case in a BIG way.
>
> This is _not_ about "CPU hotplug". This is _not_ about "threaded apps
> before they are threaded". This is all about the fact that serious
> computation is done single-threaded, and anybody who thinks that
> single-threading is going away is so totally out to lunch that it's not
> even fun.

I get the feeling you didn't read my message at all. Let me try it
again.

Why should we use a silicon based solution for this, when I posit that
there are simpler and equally effective userspace solutions?

--
Daniel Jacobowitz
CodeSourcery, LLC

2005-11-23 22:22:21

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>
> Yes. Any shared mmaps may require working lock.

Not "any". Only writable shared mmap. Which is actually the rare case.

Even then, we might want to have such processes have a way to say "I don't
do futexes in this mmap" or similar. Quite often, writable shared mmaps
aren't interested in locked cycles - they are there to just write things
to disk, and all the serialization is done in the kernel when the user
does a "munmap()" or a "msync()".

Linus
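
The distinction drawn above, that only writable shared mappings can matter, comes down to a check on the standard mmap() arguments. The sketch below is illustrative only: `mapping_may_need_lock()` is a hypothetical helper, and it cannot tell a mapping used for futex-style locking from one that merely writes a file out, which is the refinement suggested in the second paragraph of the message.

```c
#include <assert.h>
#include <stdbool.h>
#include <sys/mman.h>

/* The check below assumes the two sharing modes use distinct bits. */
static_assert((MAP_SHARED & MAP_PRIVATE) == 0,
	      "MAP_SHARED and MAP_PRIVATE must not overlap");

/*
 * Hypothetical helper: could another process observe stores made through
 * a mapping created with these mmap() arguments?
 */
static bool mapping_may_need_lock(int prot, int flags)
{
	return (flags & MAP_SHARED) != 0 && (prot & PROT_WRITE) != 0;
}
```

Read-only shared mappings and private (copy-on-write) mappings both come out false, which matches the "Not any, only writable shared mmap" objection.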

2005-11-23 22:23:01

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 02:15:24PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 23 Nov 2005, Andi Kleen wrote:
> >
> > > THAT is what I'd like to have CPU support for. Not for UP (it's going
> > > away), and not for the kernel (it's never single-threaded).
> >
> > There is one reasonably interesting special case that will probably stay
> > around: single CPU guest in a virtualized environment.
>
> .. and then the _virtualizer_ should just set the bit.

That wouldn't work if it's limited to ring 3.

Also currently at least the Xen driver interfaces seem to
rely on lock, but perhaps that can be changed.


> However, quite frankly, virtualization is overhyped, in my opinion. And if
> it forces people to run UP because of performance issues, it's simply not
> acceptable for a lot of loads.

I don't think it'll force them to that, it will just be a common
use case. e.g. you start a separate VM to run your firewall in.
Do you really need it multithreaded?

-Andi

2005-11-23 22:22:59

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Alan Cox wrote:
> On Mer, 2005-11-23 at 13:36 -0800, Linus Torvalds wrote:
>
>>>have to add PAT support which we need to do anyway we would get a world
>>>where on uniprocessor lock prefix only works on address targets we want
>>>it to - ie pci_alloc_consistent() pages.
>>
>>No. That would be wrong.
>>
>>The thing is, "lock" is useless EVEN ON SMP in user space 99% of the time.
>
>
> Now I see what you are aiming at, yes that makes vast amounts of sense
> and since AMD have the "no lock effect" bit for general case maybe they
> can
>

What it really comes down to (virtualization or not!) is whether or not
the OS can guarantee that nothing else is messing with memory at the
same time.

This is potentially different from process to process (because of page
table differences) and from kernel to user space (because of the User
bit in the page tables.)

-hpa


2005-11-23 22:23:54

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 04:46:27PM -0500, Jeff Garzik wrote:
> Andi Kleen wrote:
> >The idea was to turn LOCK on only if the process has any
> >shared writable mapping and num_online_cpus() > 0.
>
> Yep. Though I presume you mean "> 1".

Yeah, > 1 of course.
-Andi

2005-11-23 22:26:25

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
> On Wed, Nov 23, 2005 at 02:15:24PM -0800, Linus Torvalds wrote:
>
>>
>>On Wed, 23 Nov 2005, Andi Kleen wrote:
>>
>>>>THAT is what I'd like to have CPU support for. Not for UP (it's going
>>>>away), and not for the kernel (it's never single-threaded).
>>>
>>>There is one reasonably interesting special case that will probably stay
>>>around: single CPU guest in a virtualized environment.
>>
>>.. and then the _virtualizer_ should just set the bit.
>
> That wouldn't work if it's limited to ring 3.
>
> Also currently at least the Xen driver interfaces seem to
> rely on lock, but perhaps that can be changed.
>

Well, with VTX or Pacifica virtualization is in ring 3. The fact that
Xen isn't is a workaround for current hardware, so when we're talking
about new hardware it's pointless.

What you really want is one bit for kernel mode (cpl 0-2) and one for
user mode (cpl 3).

-hpa

2005-11-23 22:31:59

by Pavel Machek

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Hi!


> >The idea was to turn LOCK on only if the process has any
> >shared writable mapping and num_online_cpus() > 0.
>
> Yep. Though I presume you mean "> 1".
>
> One hopes that num_online_cpus() never reaches zero during runtime ;-)

Actually num_online_cpus() is very useful -- suspend to RAM ;-))))).

Pavel
--
Thanks, Sharp!

2005-11-23 22:33:04

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

> Well, with VTX or Pacifica virtualization is in ring 3. The fact that

No it's not. The whole point is that there is no "ring compression".
The guest has all its normal rings, just the hypervisor has additional
"negative" rings.

In the current Xen x86-64 para virtualization setup the guest kernel
is in ring 3, but I hope VT/P. will do away with that because it
causes lots of issues.

> What you really want is one bit for kernel mode (cpl 0-2) and one for
> user mode (cpl 3).

Yes.

-Andi

2005-11-23 22:39:00

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
>>Well, with VTX or Pacifica virtualization is in ring 3. The fact that
>
> No it's not. The whole point is that there is no "ring compression".
> The guest has all its normal rings, just the hypervisor has additional
> "negative" rings.
>

Uhm... maybe we think of it differently, but typically I consider the
host rings (which is what I talked about above) as orthogonal to the
guest ring. To the host, the guest is just a process in ring 3.

-hpa

2005-11-23 22:40:56

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

> Uhm... maybe we think of it differently, but typically I consider the
> host rings (which is what I talked about above) as orthogonal to the
> guest ring. To the host, the guest is just a process in ring 3.

I don't think your thoughts match the terminology as used by Intel/AMD/Xen
at least.

-Andi

2005-11-23 22:53:46

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
>>Uhm... maybe we think of it differently, but typically I consider the
>>host rings (which is what I talked about above) as orthogonal to the
>>guest ring. To the host, the guest is just a process in ring 3.
>
> I don't think your thoughts match the terminology as used by Intel/AMD/Xen
> at least.

Perhaps not. The Intel terminology seems really confused, especially.

-hpa

2005-11-23 23:09:37

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>
> Why should we use a silicon based solution for this, when I posit that
> there are simpler and equally effective userspace solutions?

Name them.

In user space, doing things like clever run-time linking things is
actually horribly bad. It causes COW faults at startup, and/or makes the
compiler have to do indirections unnecessarily. Both of which actually
make caches less effective, because now processes that really effectively
do have exactly the same contents have them in different pages.

The other alternative (which apparently glibc actually does use) is to
dynamically branch over the lock prefixes, which actually works better:
it's more work dynamically, but it's much cheaper from a startup
standpoint and there's no memory duplication, so while it is the "stupid"
approach, it's actually better than the clever one.

The third alternative is to know at link-time that the process never does
anything threaded, but that needs more developer attention and
non-standard setups, and you _will_ get it wrong (some library will create
some thread without the developer even realizing). It also has the
duplicated library overhead (but at least now the duplication is just
twice, not "each process duplicates its own private pointer")

In short, there simply aren't any good alternatives. The end result is that
thread-safe libraries are always in practice thread-safe even on UP, even
though that serializes the CPU altogether unnecessarily.

I'm sure you can make up alternatives every time you hit one _particular_
library, but that just doesn't scale in the real world.

In contrast, the simple silicon support scales wonderfully well. Suddenly
libraries can be thread-safe _and_ efficient on UP too. You get to eat
your cake and have it too.

Linus
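
The second alternative listed above, branching over the locked operation at run time, can be sketched as follows. This is a hedged illustration and not glibc's code (whether glibc does this at all is disputed later in the thread): `process_is_multi_threaded` and `counter_add()` are hypothetical, and `__atomic_add_fetch` is the GCC/Clang builtin, which on x86 normally emits a lock-prefixed instruction.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical flag. A threading library would have to set it before the
 * first additional thread (or writable shared mapping) can touch the data.
 */
static bool process_is_multi_threaded;

/* Add delta to *counter and return the new value. */
static inline int counter_add(int *counter, int delta)
{
	assert(counter != NULL);

	if (process_is_multi_threaded)
		/* Contended case: pay for the locked read-modify-write. */
		return __atomic_add_fetch(counter, delta, __ATOMIC_SEQ_CST);

	/* Single-threaded case: a plain add, no bus lock. */
	*counter += delta;
	return *counter;
}
```

The trade-off is the one described in the message: every call pays for a test and a branch, but the library text stays identical and shareable across all processes.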

2005-11-23 23:11:27

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Andi Kleen wrote:
>
> I don't think it'll force them to that, it will just be a common
> use case. e.g. you start a separate VM to run your firewall in.
> Do you really need it multithreaded?

The question is: do you need a virtualized environment for it.

And the answer is: no.

I realize that you can make up an infinite number of things you _can_ use
virtualization for. That doesn't mean that people _will_.

Linus

2005-11-23 23:31:29

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Linus Torvalds <[email protected]> writes:

> On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>>
>> Yes. Any shared mmaps may require working lock.
>
> Not "any". Only writable shared mmap. Which is actually the rare case.
>
> Even then, we might want to have such processes have a way to say "I don't
> do futexes in this mmap" or similar. Quite often, writable shared mmaps
> aren't interested in locked cycles - they are there to just write things
> to disk, and all the serialization is done in the kernel when the user
> does a "munmap()" or a "msync()".

In fact, for being explicit we already have PROT_SEM on some architectures
to report whether we are going to use atomic operations in the mmap. For
x86 we would probably need to introduce a PROT_NOSEM, but it sounds
fairly straightforward to implement.

Eric

2005-11-23 23:44:08

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 03:08:59PM -0800, Linus Torvalds wrote:
>
>
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> >
> > Why should we use a silicon based solution for this, when I posit that
> > there are simpler and equally effective userspace solutions?
>
> Name them.
>
> In user space, doing things like clever run-time linking things is
> actually horribly bad. It causes COW faults at startup, and/or makes the
> compiler have to do indirections unnecessarily. Both of which actually
> make caches less effective, because now processes that really effectively
> do have exactly the same contents have them in different pages.

Those are the wrong ways of doing this in userspace. There are right
ways. For instance, tag the binary at link time "single-threaded".
Use dynamic linking and the existing hwcap mechanism to select
single-threaded libraries instead of the default ones. Your
single-threaded applications will no longer mmap the same copy of glibc
as your multi-threaded applications; this does make caching mildly less
effective but only if you have a single-threaded app and a
multi-threaded one fighting for CPU time.

> The other alternative (which apparently glibc actually does use) is to
> dynamically branch over the lock prefixes, which actually works better:
> it's more work dynamically, but it's much cheaper from a startup
> standpoint and there's no memory duplication, so while it is the "stupid"
> approach, it's actually better than the clever one.

Glibc does not do this to the best of my knowledge. It does select
different code paths in various places based on the presence of
multiple threads, but that's for cancellation, not for locking.

> The third alternative is to know at link-time that the process never does
> anything threaded, but that needs more developer attention and
> non-standard setups, and you _will_ get it wrong (some library will create
> some thread without the developer even realizing).

This is also a trivially solvable problem in userspace; you make the
dynamic linker enforce consistency of the tags.

> I'm sure you can make up alternatives every time you hit one _particular_
> library, but that just doesn't scale in the real world.

The number of userspace libraries that use atomic operations is, in
practice, quite small.

> In contrast, the simple silicon support scales wonderfully well. Suddenly
> libraries can be thread-safe _and_ efficient on UP too. You get to eat
> your cake and have it too.

By buying new hardware and only caring about people using the magic
architecture. No thanks.

Maybe I'll implement this some weekend.

--
Daniel Jacobowitz
CodeSourcery, LLC

2005-11-23 23:42:10

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Eric W. Biederman wrote:
>
> In fact, for being explicit we already have PROT_SEM on some architectures
> to report whether we are going to use atomic operations in the mmap. For
> x86 we would probably need to introduce a PROT_NOSEM, but it sounds
> fairly straightforward to implement.

PROT_SEM was a mistake, I feel. It's way too easy to get it wrong. You
have most architectures and environments that don't need it, and as a
result, applications simply won't have it in their sources.

I suspect that with MAP_SHARED + PROT_WRITE being pretty uncommon anyway,
we can probably find trivial patterns in the kernel. Like only one process
holding that file open - which is what you get with things that use mmap()
to write a new file (I think "ld" used to have a config option to write
files that way, for example).

And if we end up sometimes giving "lock" meaning even when it's not
needed, tough. The point of the simple hack is very much a "get 90% of the
advantage for very little effort".

Regardless, even if we get a flag like that (and the Intel people didn't
seem to dismiss the idea), it's likely more than a few years down the
line. So it's not like this is a pressing concern ;)

Linus

2005-11-24 00:00:29

by Linus Torvalds

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>
> Those are the wrong ways of doing this in userspace. There are right
> ways. For instance, tag the binary at link time "single-threaded".

And I mentioned exactly this. It's my third alternative.

And it doesn't work well, exactly because developers don't know if the
libraries they use are always single-threaded etc. More importantly, it
just doesn't happen that much. People do "make ; make install". Hopefully
from pretty standard sources. Having to tweak things so that a project
compiles with a magic flag on a particular distribution is simply not done
very much.

Linus

2005-11-24 00:27:39

by Jeffrey V. Merkey

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:

>On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>
>
>>Why should we use a silicon based solution for this, when I posit that
>>there are simpler and equally effective userspace solutions?
>>
>>
>
>Name them.
>
>In user space, doing things like clever run-time linking things is
>actually horribly bad. It causes COW faults at startup, and/or makes the
>compiler have to do indirections unnecessarily. Both of which actually
>make caches less effective, because now processes that really effectively
>do have exactly the same contents have them in different pages.
>
>The other alternative (which apparently glibc actually does use) is to
>dynamically branch over the lock prefixes, which actually works better:
>it's more work dynamically, but it's much cheaper from a startup
>standpoint and there's no memory duplication, so while it is the "stupid"
>approach, it's actually better than the clever one.
>
>

Using self-modifying code stubs will work, and Intel's architecture will
support it. This would be faster than waiting 2-3 years for Intel to spin
a processor rev. NetWare did something similar with global branch tables
for memory protection.


J


>The third alternative is to know at link-time that the process never does
>anything threaded, but that needs more developer attention and
>non-standard setups, and you _will_ get it wrong (some library will create
>some thread without the developer even realizing). It also has the
>duplicated library overhead (but at least now the duplication is just
>twice, not "each process duplicates its own private pointer")
>
>In short, there simply isn't any good alternatives. The end result is that
>thread-safe libraries are always in practice thread-safe even on UP, even
>though that serializes the CPU altogether unnecessarily.
>
>I'm sure you can make up alternatives every time you hit one _particular_
>library, but that just doesn't scale in the real world.
>
>In contrast, the simple silicon support scales wonderfully well. Suddenly
>libraries can be thread-safe _and_ efficient on UP too. You get to eat
>your cake and have it too.
>
> Linus

2005-11-24 00:55:24

by Jeff Garzik

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
>>THAT is what I'd like to have CPU support for. Not for UP (it's going
>>away), and not for the kernel (it's never single-threaded).
>
>
> There is one reasonably interesting special case that will probably stay
> around: single CPU guest in a virtualized environment.

There will continue to be tons of embedded uniprocessor Linux use...

It will take a while for Linux phones and watches to become multi-core.

Jeff



2005-11-24 01:02:38

by Jeff Garzik

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:
> The third alternative is to know at link-time that the process never does
> anything threaded, but that needs more developer attention and
> non-standard setups, and you _will_ get it wrong (some library will create
> some thread without the developer even realizing). It also has the
> duplicated library overhead (but at least now the duplication is just
> twice, not "each process duplicates its own private pointer")

Small data point: In a lot of gcc-related build processes, the
configure/makefile junk passes '-pthread' to the compiler/linker.

So a lot of programs in Linux distros are already built this way. The
bigger problem is with libraries, which cannot know ahead of time
whether the app is threaded or not, and therefore must assume threaded.

A few libs do things like glibc, others (like GLib) have an explicit
mylib_thread_init() called at program startup.

Jeff


2005-11-24 02:07:10

by Daniel Jacobowitz

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 03:59:52PM -0800, Linus Torvalds wrote:
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
> >
> > Those are the wrong ways of doing this in userspace. There are right
> > ways. For instance, tag the binary at link time "single-threaded".
>
> And I mentioned exactly this. It's my third alternative.
>
> And it doesn't work well, exactly because developers don't know if the
> libraries they use are always single-threaded etc. More importantly, it
> just doesn't happen that much. People do "make ; make install". Hopefully
> from pretty standard sources. Having to tweak things so that a project
> compiles with a magic flag on a particular distribution is simply not done
> very much.

But distributors (Debian included) do this all the time :-)

I'd even volunteer to get it done and pushed out and in use, if I was
as convinced of the benefits. For most applications, though, I'm still
sceptical.

--
Daniel Jacobowitz
CodeSourcery, LLC

2005-11-24 03:24:08

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Alan Cox wrote:

> On Mer, 2005-11-23 at 10:42 -0800, Linus Torvalds wrote:
>> Of course, if it's in one of the low 12 bits of %cr3, there would have to
>> be a "enable this bit" in %cr4 or something. Historically, you could write
>> any crap in the low bits, I think.
>
> There is a much much better way to do it than just user space and
> without hitting cr3/cr4 - put "lock works" in the PAT and while we'll
> have to add PAT support which we need to do anyway we would get a world
> where on uniprocessor lock prefix only works on address targets we want
> it to - ie pci_alloc_consistent() pages.

Given the CPU architecture it is unimplementable. Instructions are split
into microinstructions and they are executed out of order. PAT is looked up
when the LOAD microinstruction is executed. Imagine this:

MOV EDX, [address1]
LOCK ADD [address2], EDX

that is translated to

LOAD EDX, [address1]
LOAD TMP1, [address2]
ADD TMP1, EDX
STORE [address2], TMP1

... now LOAD finds LOCK attribute in PAT --- so it locks the bus, however
EDX is still not loaded. Now LOAD EDX can't execute because the bus is
locked and ADD and STORE can't execute because they're waiting for LOAD
EDX. Deadlock.

Locks are so slow not because they are locks (if the target is in L1
cache, they operate only on cache and don't go to bus at all), but because
they need to completely flush the microinstruction pool to avoid problems like
this. Of course Intel won't waste silicon in the execution engine for
instructions that execute so rarely, so they microcode them instead. So
lock detection is done at the decoder.

Mikulas

> Alan

2005-11-24 03:31:17

by Mikulas Patocka

[permalink] [raw]
Subject: Re: [patch] SMP alternatives



On Wed, 23 Nov 2005, Linus Torvalds wrote:

>
>
> On Wed, 23 Nov 2005, H. Peter Anvin wrote:
>>
>> Linus Torvalds wrote:
>>> What I suggested to Intel at the Developer Days is to have a MSR (or, better
>>> yet, a bit in the page table pointer %cr0) that disables "lock" in _user_
>>> space. Ie a lock would be a no-op when in CPL3, and only with certain
>>> processes.
>>
>> You mean %cr3, right?
>
> Yes.
>
> It _should_ be fairly easy to do something like that - just a simple
> global flag that gets set and makes CPL3 ignore lock prefixes. Even timing
> doesn't matter - if it takes a hundred cycles for the setting to take
> effect, we don't care, since you can't write %cr3 from user space anyway,
> and it will certainly take a hundred cycles (and a few serializing
> instructions) until we get to CPL3.
>
> I'd personally prefer it to be in %cr3, since we'd have to reload it on
> task switching, and that's one of the registers we load anyway. And it
> would make sense. But it could be in an MSR too.
>
> Of course, if it's in one of the low 12 bits of %cr3, there would have to
> be a "enable this bit" in %cr4 or something. Historically, you could write
> any crap in the low bits, I think.

Why should they waste their (already complex) decoding logic with that?
Why can't an application instead set a bit somewhere if it's running on
SMP and if it's threaded and branch to variants with and without lock
prefix?

(correctly predicted branch is even faster than some microcode to
determine the value of your bit)
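
A minimal sketch of that "set a bit and branch" idea, using C11 atomics rather than hand-written assembly. The flag app_is_threaded and the helpers app_mark_threaded() and counter_inc() are made-up names for illustration, not anything glibc exports, and the flag is assumed to be set before a second thread exists:

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical process-wide flag; assumed to be set once, before any
 * second thread is created. */
static bool app_is_threaded = false;

void app_mark_threaded(void)
{
    app_is_threaded = true;
}

/* Increment the counter and return the new value. */
int counter_inc(atomic_int *counter)
{
    if (app_is_threaded) {
        /* Atomic read-modify-write; on x86 this typically compiles to a
         * lock-prefixed instruction. */
        return atomic_fetch_add_explicit(counter, 1, memory_order_seq_cst) + 1;
    }

    /* Single-threaded path: separate relaxed load and store, so no
     * locked read-modify-write is needed. */
    int v = atomic_load_explicit(counter, memory_order_relaxed) + 1;
    atomic_store_explicit(counter, v, memory_order_relaxed);
    return v;
}
```

The cost is one well-predicted branch per operation, which is the trade-off being argued about in this thread.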

Mikulas

> Linus

2005-11-24 03:56:44

by H. Peter Anvin

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Mikulas Patocka wrote:
>
> Why should they waste their (already complex) decoding logic with that?
> Why can't an application instead set a bit somewhere if it's running on
> SMP and if it's threaded and branch to variants with and without lock
> prefix?
>

Why? Because it would make the same code run faster on their CPU than
on their competitor's. That's the kind of things that make x86 vendors
smile.

-hpa

2005-11-24 13:05:29

by Pádraig Brady

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:

>On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>
>
>>Why should we use a silicon based solution for this, when I posit that
>>there are simpler and equally effective userspace solutions?
>>
>>
>
>Name them.
>
>In user space, doing things like clever run-time linking things is
>actually horribly bad. It causes COW faults at startup, and/or makes the
>compiler have to do indirections unnecessarily. Both of which actually
>make caches less effective, because now processes that really effectively
>do have exactly the same contents have them in different pages.
>
>The other alternative (which apparently glibc actually does use) is to
>dynamically branch over the lock prefixes, which actually works better:
>it's more work dynamically, but it's much cheaper from a startup
>standpoint and there's no memory duplication, so while it is the "stupid"
>approach, it's actually better than the clever one.
>
>
Just a note to say glibc is getting better wrt locking.
Compare the results and trivial test program here:
http://lkml.org/lkml/2001/12/7/75
That showed that for glibc 2.2.4, getc & putc
were 669% slower than the unlocked versions.

4 years later and with 2.3.5-1ubuntu1, getc & putc
are only 230% slower than the unlocked versions:

$ dd bs=1MB count=100 if=/dev/zero | ./locked >/dev/null
100000000 bytes transferred in 3.709362 seconds (26958813 bytes/sec)
$ dd bs=1MB count=100 if=/dev/zero | ./unlocked >/dev/null
100000000 bytes transferred in 1.602427 seconds (62405339 bytes/sec)
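
For readers without the linked test program to hand, here is a rough sketch of what a locked/unlocked pair might look like. It is not the program from the 2001 post; copy_locked() and copy_unlocked() are illustrative names, and taking the stream locks once with flockfile() is an assumption about how one would normally use the *_unlocked calls safely:

```c
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>

/* Locked variant: every getc()/putc() call takes and releases the
 * stream lock internally. Returns bytes copied, or -1 on write error. */
long copy_locked(FILE *in, FILE *out)
{
    long n = 0;
    int c;

    while ((c = getc(in)) != EOF) {
        if (putc(c, out) == EOF)
            return -1;
        n++;
    }
    return n;
}

/* Unlocked variant: take each stream lock once up front, then use the
 * POSIX *_unlocked calls inside the loop. */
long copy_unlocked(FILE *in, FILE *out)
{
    long n = 0;
    int c;

    flockfile(in);
    flockfile(out);
    while ((c = getc_unlocked(in)) != EOF) {
        if (putc_unlocked(c, out) == EOF) {
            n = -1;
            break;
        }
        n++;
    }
    funlockfile(out);
    funlockfile(in);
    return n;
}
```

Wrapping either function in a main() that passes stdin and stdout gives something that can be driven by the dd pipelines above.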

Pádraig.

2005-11-24 13:12:57

by Arjan van de Ven

[permalink] [raw]
Subject: Re: [patch] SMP alternatives


> Just a note to say glibc is getting better wrt locking.
> Compare the results and trivial test program here:
> http://lkml.org/lkml/2001/12/7/75
> That showed that for glibc 2.2.4, getc & putc
> were 669% slower than the unlocked versions.
>
> 4 years later and with 2.3.5-1ubuntu1, getc & putc
> are only 230% slower than the unlocked versions:
>
> $ dd bs=1MB count=100 if=/dev/zero | ./locked >/dev/null
> 100000000 bytes transferred in 3.709362 seconds (26958813 bytes/sec)
> $ dd bs=1MB count=100 if=/dev/zero | ./unlocked >/dev/null
> 100000000 bytes transferred in 1.602427 seconds (62405339 bytes/sec)

this could also mean that the unlocked version has gotten slower ;)



2005-11-24 13:13:29

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

> 1. The lock behaviour *is* defined for main memory access by all bus
> masters.

For uncached memory, right?

> 2. Uncached mappings are unworkable for this because we must never have
> a page mapped with conflicting cache types - that's ugly, and plain
> horrific on SMP.

For kernel mapping change_page_attr() takes care of it,
and for user space memory following all mappings is the only
reliable way to find out which process needs to be killed
anyways - and when you do that you can as well unmap
or just kill.

> 3. Uncached has undefined semantics when racing a PCI master. Lock has
> defined semantics. An uncached add #0 is permitted to read the memory
> and then write it back as two different cycles and I suspect does.

Consider what happens with such a race: either the PCI master
gets a bus abort because it still sees the corrupted data,
or it already accesses the repaired data. Both are OK.

> 4. The AMD BIOS guide requires both that LOCK is enabled by default and
> that the "lock affects the external bus" bit is clear to enable locking
> on the external bus.

The "Linux guidelines" might be different.

-Andi

2005-11-24 13:32:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen <[email protected]> writes:

>> 2. Uncached mappings are unworkable for this because we must never have
>> a page mapped with conflicting cache types - that's ugly, and plain
>> horrific on SMP.
>
> For kernel mapping change_page_attr() takes care of it,
> and for user space memory following all mappings is the only
> reliable way to find out which process needs to be killed
> anyways - and when you do that you can as well unmap
> or just kill.

I think I see the source of the confusion. Scrubbing is the
process of taking data that is correctable and writing it back to
memory so that if a second correctable error occurs the net is still
corrected.

Directed killing of processes is something that must be done
inside a synchronous exception (like a machine check) because otherwise
it is so racy you don't know who has seen the bad data.

Eric

2005-11-24 13:39:11

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

> I think I see the source of the confusion. Scrubbing is the
> process of taking data that is correctable and writing it back to
> memory so that if a second correctable error occurs the net is still
> corrected.

That's supposed to be done by hardware, no?
At least the K8 has a hardware scrubber (although it's not always enabled)

> Directed killing of processes is something that must be done
> inside a synchronous exception (like a machine check) because otherwise
> it is so racy you don't know who has seen the bad data.

If you try to do it this way then the code will become such
a mess, if not impossible to write, that your chances to merge them
and get it right are very slim. The only sane way to do all the locking etc.
is to hand over the handling to a thread. While that makes the window
of misusing the data wider it's the only sane alternative vs not
doing it at all.

Also due to the way hardware works with machine checks usually being
async and not precise you have that window anyways, so it's
not even worse. Also consider multiple CPUs.

-Andi

2005-11-24 14:00:19

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen <[email protected]> writes:

>> I think I see the source of the confusion. Scrubbing is the
>> process of taking data that is correctable and writing it back to
>> memory so that if a second correctable error occurs the net is still
>> corrected.
>
> That's supposed to be done by hardware, no?
> At least the K8 has a hardware scrubber (although it's not always enabled)

Recent good implementations like the Opteron will do it for you.
Older or cheaper memory controllers will not.

Having an architecturally sane software scrubber as backup for
the hardware implementations is nice, and except in the cases
where someone disables the lock prefix it takes very little
code on x86.

Even on the Opteron you could theoretically have the case of a brain-dead
external memory controller, although that is not likely.

>> Directed killing of processes is something that must be done
>> inside a synchronous exception (like a machine check) because otherwise
>> it is so racy you don't know who has seen the bad data.
>
> If you try to do it this way then the code will become such
> a mess, if not impossible to write, that your chances to merge them
> and get it right are very slim. The only sane way to do all the locking etc.
> is to hand over the handling to a thread. While that makes the window
> of misusing the data wider it's the only sane alternative vs not
> doing it at all.
>
> Also due to the way hardware works with machine checks usually being
> async and not precise you have that window anyways, so it's
> not even worse. Also consider multiple CPUs.

First I don't have any code to do this, but I have thought about it.
The races are the primary reason I have never pushed for something
like this. With memory errors coming in as machine checks it is now
possible to do a correct version.

Essentially we are talking something with the complexity of a page
fault. All that must happen synchronously is the task must
be stopped, and flagged.

As for races every cpu that accesses that data should take
a synchronous exception. DMA should do something similar but I'm
not as familiar with that side of the problem. And because everything
takes an exception multiple cpu races are not a problem.

Of course there are still the memory errors that are so bad that
they don't even cause a machine check. Those are a real pain
to debug, and fix.

In this latter case, assuming the memory error is transient and
not hard, using a write-combine memory attribute when you write
to re-initialize the ECC state is the way to go. But remember
you must do it on the cpu that is part of the memory controller.
Otherwise something in the whole read/modify process will fail
to get the ECC state initialized properly on an Opteron.

Eric

2005-11-24 13:59:21

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Iau, 2005-11-24 at 14:13 +0100, Andi Kleen wrote:
> > 1. The lock behaviour *is* defined for main memory access by all bus
> > masters.
>
> For uncached memory, right?

For all memory. Same as processor to processor.

> > 2. Uncached mappings are unworkable for this because we must never have
> > a page mapped with conflicting cache types - that's ugly, and plain
> > horrific on SMP.
>
> For kernel mapping change_page_attr() takes care of it,
> and for user space memory following all mappings is the only
> reliable way to find out which process needs to be killed
> anyways - and when you do that you can as well unmap
> or just kill.

You are working from a kernel view address of a page that may be user
space. You don't need or want to kill anything because you are scrubbing
a corrected error.

>
> > 3. Uncached has undefined semantics when racing a PCI master. Lock has
> > defined semantics. An uncached add #0 is permitted to read the memory
> > and then write it back as two different cycles and I suspect does.
>
> Consider what happens with such a race: either the PCI master
> gets a bus abort because it still sees the corrupted data,
> or it already accesses the repaired data. Both are OK.

This is a correctable error so there would be no abort. And there is a
race if you think for a microsecond or two

Scrubber reads (add #0 load of input)
Bus master writes #FFFFFFFF
Scrubber writes back #0

The lock prefix ensures that doesn't occur.
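
As a rough illustration of the "locked add #0" scrub, here is a user-space sketch. It is not the EDAC code, which works on kernel mappings of physical pages; scrub_word() and scrub_region() are made-up names, and the non-x86 fallback is only an approximation since a compiler may not emit a real write-back for an add of zero:

```c
#include <stddef.h>
#include <stdint.h>

/* Read a word and write the same value back as one locked
 * read-modify-write, so no other bus master can slip a write in
 * between the read and the write-back. */
static inline void scrub_word(volatile uint32_t *p)
{
#if defined(__i386__) || defined(__x86_64__)
    __asm__ __volatile__("lock; addl $0,%0"
                         : "+m" (*p)
                         :
                         : "memory", "cc");
#else
    /* Assumption: closest portable stand-in; may not force a store. */
    (void)__atomic_fetch_add(p, 0, __ATOMIC_SEQ_CST);
#endif
}

/* Scrub a region word by word; returns the number of words touched. */
size_t scrub_region(volatile uint32_t *base, size_t words)
{
    for (size_t i = 0; i < words; i++)
        scrub_word(&base[i]);
    return words;
}
```

The contents are left unchanged; the point is only that the read and the write-back happen as one indivisible operation.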

> > 4. The AMD BIOS guide requires both that LOCK is enabled by default and
> > that the "lock affects the external bus" bit is clear to enable locking
> > on the external bus.
>
> The "Linux guidelines" might be different.

Then the EDAC code will reconfigure the registers back as expected by
the PC architecture. I've no problem with that and if EDAC is the only
person requiring a semantic that is more expensive it can flip the bits
back and forth. No need for anyone else to pay that cost.

Alan

2005-11-24 14:01:48

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Iau, 2005-11-24 at 14:39 +0100, Andi Kleen wrote:
> That's supposed to be done by hardware, no?

Varies immensely by system. Where there is a hardware scrubber and it is
enabled it will be used. One nice thing about K8 is the mem controller
is in the CPU so they all use the same driver (not yet merged)

> If you try to do it this way then the code will become such
> a mess, if not impossible to write, that your chances to merge them
> and get it right are very slim. The only sane way to do all the locking etc.
> is to hand over the handling to a thread. While that makes the window
> of misusing the data wider it's the only sane alternative vs not
> doing it at all.

It's utterly hideous because the usual 'ECC error' reporting technique
for an uncorrectable error is an NMI. Locks could be in any state at
this point and even the registers needing to be accessed are across PCI
and we could be half way through a PCI configuration cycle.

The -mm EDAC code works on the basic assumption that unrecovered ECC is
a system halter although that is configurable.

2005-11-24 14:22:18

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 02:34:07PM +0000, Alan Cox wrote:
> On Iau, 2005-11-24 at 14:39 +0100, Andi Kleen wrote:
> > That's supposed to be done by hardware, no?
>
> Varies immensely by system. Where there is a hardware scrubber and it is
> enabled it will be used. One nice thing about K8 is the mem controller
> is in the CPU so they all use the same driver (not yet merged)

What do you need a special driver for if the northbridge just
can do the scrubbing by itself?

> > If you try to do it this way then the code will become such
> > a mess, if not impossible to write, that your chances to merge them
> > and get it right are very slim. The only sane way to do all the locking etc.
> > is to hand over the handling to a thread. While that makes the window
> > of misusing the data wider it's the only sane alternative vs not
> > doing it at all.
>
> It's utterly hideous because the usual 'ECC error' reporting technique
> for an uncorrectable error is an NMI. Locks could be in any state at

On the modern systems I'm familiar with it's a machine check (although
not necessarily a recoverable one and there might be other bad
side effects)

> this point and even the registers needing to be accessed are across PCI
> and we could be half way through a PCI configuration cycle.
>
> The -mm EDAC code works on the basic assumption that unrecovered ECC is
> a system halter although that is configurable.

I don't know what you could do over the default code for K8 at least.
And on modern Intel server chipsets I would expect it also to not
be needed.

-Andi

2005-11-24 14:43:19

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
> What do you need a special driver for if the northbridge just
> can do the scrubbing by itself?

You need a driver to collect and report all the ECC single bit errors to
the user so that they can decide if they have problem hardware.

EDAC is more than one thing
- Control response to a fatal error
- Report non-fatal events for analysis/user decision making

Hardware scrubbing is good, but knowing the rate of non-fatal errors and
the trend in rate of errors is essential to planning and management of
systems.

> On the modern systems I'm familiar with it's a machine check (although
> not necessarily a recoverable one and there might be other bad
> side effects)

The Intel systems I have looked at generate an MCE on L2/L1/bus parity errors
but not on a RAM ECC error, as that is at the memory controller, not CPU, level.
That usually asserts NMI. Same for most older chips: PIII, AMD Athlon, etc.

> > The -mm EDAC code works on the basic assumption that unrecovered ECC is
> > a system halter although that is configurable.
>
> I don't know what you could do over the default code for K8 at least.
> And on modern Intel server chipsets I would expect it also to not
> be needed.

Varies a lot again. Hopefully that'll simplify as/when/if Intel put the
memory controller on CPU.

Alan

2005-11-24 14:55:21

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
> On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
> > What do you need a special driver for if the northbridge just
> > can do the scrubbing by itself?
>
> You need a driver to collect and report all the ECC single bit errors to
> the user so that they can decide if they have problem hardware.

Assuming the errors are logged to the standard machine check
architecture that's already done by mce.c. K8 does that definitely.

Take a look at mcelog at some point.
Your distro probably already sets it up by default to log to
/var/log/mcelog

>
> EDAC is more than one thing
> - Control response to a fatal error
> - Report non-fatal events for analysis/user decision making

x86-64 mce.c does all that. There was even a port to i386 around at some point.

-Andi

2005-11-24 15:10:54

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen <[email protected]> writes:

> On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
>> On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
>> > What do you need a special driver for if the northbridge just
>> > can do the scrubbing by itself?
>>
>> You need a driver to collect and report all the ECC single bit errors to
>> the user so that they can decide if they have problem hardware.
>
> Assuming the errors are logged to the standard machine check
> architecture that's already done by mce.c. K8 does that definitely.
>
> Take a look at mcelog at some point.
> Your distro probably already sets it up by default to log to
> /var/log/mcelog
>
>>
>> EDAC is more than one thing
>> - Control response to a fatal error
>> - Report non-fatal events for analysis/user decision making
>
> x86-64 mce.c does all that. There was even a port to i386 around at some point.

Right, on the k8 memory controller there is a lot of overlap
with what has already been implemented. For all other x86 memory
controllers the code is filling a large void. The current k8
code has been delayed for this reason.

Where the EDAC code goes beyond the current k8 facilities is the
decode to the dimm level so that the bad memory stick can be
easily identified.

One of the goals of the EDAC code is to work towards a unified
kernel architecture for this kind of error reporting. Currently every
architecture (if the errors are reported at all) handles this
differently, which makes it very hard to do something sane in
user space.

Eric

2005-11-24 15:30:14

by Alan

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Iau, 2005-11-24 at 08:09 -0700, Eric W. Biederman wrote:
> Where the EDAC code goes beyond the current k8 facilities is the
> decode to the dimm level so that the bad memory stick can be
> easily identified.

And also in finding/recording PCI parity errors (which will link nicely
into the IBM work for code to handle reported PCI errors).

2005-11-24 15:36:57

by Andi Kleen

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 08:09:07AM -0700, Eric W. Biederman wrote:
> Andi Kleen <[email protected]> writes:
>
> > On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
> >> On Iau, 2005-11-24 at 15:22 +0100, Andi Kleen wrote:
> >> > What do you need a special driver for if the northbridge just
> >> > can do the scrubbing by itself?
> >>
> >> You need a driver to collect and report all the ECC single bit errors to
> >> the user so that they can decide if they have problem hardware.
> >
> > Assuming the errors are logged to the standard machine check
> > architecture that's already done by mce.c. K8 does that definitely.
> >
> > Take a look at mcelog at some point.
> > Your distro probably already sets it up by default to log to
> > /var/log/mcelog
> >
> >>
> >> EDAC is more than one thing
> >> - Control response to a fatal error
> >> - Report non-fatal events for analysis/user decision making
> >
> > > x86-64 mce.c does all that. There was even a port to i386 around at some point.
>
> Right, on the k8 memory controller there is a lot of overlap
> with what has already been implemented. For all other x86 memory
> controllers the code is filling a large void.

At least the Lindenhurst (E7520) datasheet says the chipset can trigger
MCEs in the CPU (using a MCEERR# pin). I don't know if it's always
enabled, but the hardware seems to have the capability.
That's the oldest Intel server chipset supported with EM64T CPUs.

Only the threshold counters are not supported directly.


> The current k8
> code has been delayed for this reason.
>
> Where the EDAC code goes beyond the current k8 facilities is the
> decode to the dimm level so that the bad memory stick can be
> easily identified.

That would be nice to have, agreed. But I don't really know
how to do this without mainboard specific knowledge.
If you have something usable it's best to port it to mce.c
or perhaps mcelog

>
> One of the goals of the EDAC code is to work towards a unified
> kernel architecture for this kind of error reporting. Currently every
> architecture (if the errors are reported at all) handles this
> differently, which makes it very hard to do something sane in
> user space.

There is a clear case for being architecture specific here. Some
architectures - like PPC64 or IA64 - have good firmware support for it, so it's
best to use these facilities. On others like i386 and
x86-64 the x86-64 log architecture is good. I might be a bit
biased but I think it's very good and should be used on i386
at some point too. I don't see any need for more.

-Andi

2005-11-24 16:51:05

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [patch] SMP alternatives

Andi Kleen <[email protected]> writes:

> At least the Lindenhurst (E7520) datasheet says the chipset can trigger
> MCEs in the CPU (using a MCEERR# pin). I don't know if it's always
> enabled, but the hardware seems to have the capability.
> That's the oldest Intel server chipset supported with EM64T CPUs.
>
> Only the threshold counters are not supported directly.

I don't think that triggers on correctable errors, and I'm
not certain how useful the information reported is. But it
should be at least as good as an NMI :)

Truthfully it really isn't just server chipsets that are interesting
either. Anything that supports ECC or parity on memory is
interesting.

>> The current k8
>> code has been delayed for this reason.
>>
>> Where the EDAC code goes beyond the current k8 facilities is the
>> decode to the dimm level so that the bad memory stick can be
>> easily identified.
>
> That would be nice to have, agreed. But I don't really know
> how to do this without mainboard specific knowledge.
> If you have something usable it's best to port it to mce.c
> or perhaps mcelog

We do this for every memory controller EDAC supports, so yes
we know how to implement this. Merging the non-controversial
bits is coming. But it is certainly a goal to take the
best of the mce code and the EDAC code to generate a good
k8 driver.

Motherboard specific knowledge is really not required. All that
is really required is memory controller specific knowledge. With that
you can decode to the chip select level on most memory controllers.

Then you need just a little extra code (probably in user space) to map
the chip select to which dimm socket they go to on the motherboard.
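
That last user-space step could be as small as a per-board lookup table. A sketch, with everything in it assumed for illustration: the socket labels, the table contents, and the csrow_to_socket() helper are all hypothetical, and the layout simply assumes each DIMM occupies two chip-select rows:

```c
#include <stddef.h>

/* One entry per chip-select row reported by the memory controller driver. */
struct csrow_label {
    unsigned csrow;      /* chip-select row number from the driver */
    const char *socket;  /* silkscreen label on this (made-up) board */
};

/* Hypothetical board table: two chip selects per DIMM socket. */
static const struct csrow_label board_map[] = {
    { 0, "DIMM_A1" },
    { 1, "DIMM_A1" },
    { 2, "DIMM_A2" },
    { 3, "DIMM_A2" },
};

/* Return the socket label for a chip-select row, or NULL if unknown. */
const char *csrow_to_socket(unsigned csrow)
{
    for (size_t i = 0; i < sizeof board_map / sizeof board_map[0]; i++) {
        if (board_map[i].csrow == csrow)
            return board_map[i].socket;
    }
    return NULL;
}
```

Only the table is board-specific; the decode to a chip-select row stays in the memory controller driver.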

The memory controller knowledge pretty much needs to be a
kernel driver because reading the memory controller registers
can be a non-trivial exercise at times. At least one piece of
Intel seems to like recommending that BIOS developers turn off the PCI
device that has the registers.

> There is a clear case for being architecture specific here. Some
> architectures - like PPC64 or IA64 - have good firmware support for it, so it's
> best to use these facilities. On others like i386 and
> x86-64 the x86-64 log architecture is good. I might be a bit
> biased but I think it's very good and should be used on i386
> at some point too. I don't see any need for more.

The implementation clearly should be architecture specific, and
it will take more feeling out before any work can be done
to even think about unifying the interfaces.

Right now the goal is to simply get what has proven useful in the
real world merged.

Currently I do not see implementation problems but I do see
facilities not being as useful as they could be. Getting little
things like decoding to the dimm sorted out removes hours of
work figuring out which dimm has problems.

I also see a large number of errors that the hardware can detect
that are going unreported because the code just isn't there. There
are all kinds of bus errors that chipsets can report that are
just being ignored.

So now you know some of the hopes and dreams behind some of the EDAC
code and some more of the model. Hopefully this will lead to productive
conversations as the code is merged.

Eric

2005-11-24 17:48:55

by George Spelvin

Subject: Re: [patch] SMP alternatives

> I suspect that with MAP_SHARED + PROT_WRITE being pretty uncommon anyway,
> we can probably find trivial patterns in the kernel. Like only one process
> holding that file open - which is what you get with things that use mmap()
> to write a new file (I think "ld" used to have a config option to write
> files that way, for example).

Just a bit of practical experience: I use mmap() to write data a LOT,
because msync(MS_ASYNC) is the most portable way to do an async write.

There are two applications. First, helping the OS not fill up with
dirty pages. It's basically a way of saying "this page is not going to
be dirtied again for a long time".

Secondly, to reduce the latency of synchronous writes. If I need to
log operations durably, it helps to

1) fill the log pages, using MS_ASYNC as soon as the page is full
2) when committing a batch, use MS_SYNC to force data to disk
3) report batch successfully committed to stable storage

The aio_ routines are less widely supported, and some implementations have
very high overhead. They would allow me to keep working while a commit
is in progress, but the above is simple and reduces the burstiness of
I/O considerably.
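
A minimal sketch of the three steps above, assuming a POSIX system with mmap() and msync(). The names write_log() and LOG_PAGES are made up for illustration and are not from the poster's code:

```c
#define _DEFAULT_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

#define LOG_PAGES 4	/* hypothetical log size, in pages */

/*
 * Hypothetical helper: fill a log file through a shared mapping.
 * Returns 0 once the whole batch has been forced to disk, -1 on error.
 */
static int write_log(const char *path)
{
	long page = sysconf(_SC_PAGESIZE);
	assert(page > 0);
	size_t psize = (size_t)page;
	size_t len = psize * LOG_PAGES;
	int rc = 0;

	int fd = open(path, O_RDWR | O_CREAT | O_TRUNC, 0600);
	if (fd < 0)
		return -1;
	if (ftruncate(fd, (off_t)len) < 0) {
		close(fd);
		return -1;
	}
	char *log = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	if (log == MAP_FAILED) {
		close(fd);
		return -1;
	}

	for (int i = 0; i < LOG_PAGES; i++) {
		char *p = log + (size_t)i * psize;
		memset(p, 'A' + i, psize);	/* stand-in for real log records */
		/* 1) the page is full: ask for writeback without waiting */
		if (msync(p, psize, MS_ASYNC) < 0)
			rc = -1;
	}

	/* 2) commit the batch: wait until the data is on disk */
	if (msync(log, len, MS_SYNC) < 0)
		rc = -1;

	/* 3) only now may the caller report the batch as committed */
	munmap(log, len);
	close(fd);
	return rc;
}
```

A caller would report success to its clients only after write_log() returns 0. Most of the data should already be in flight by the time the MS_SYNC call runs, which is where the latency saving comes from.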

2005-11-24 18:24:51

by colin

Subject: Re: [patch] SMP alternatives

> For user space the primary trigger event would be "has any shared
> writable mappings or multiple threads". Even on a real MP systems it's
> perfectly ok to run a program with no writable shared mappings with LOCK off.
^ single-threaded

> Depending on the workload this transition could happen quite often.
> Especially there is a worst case of an application allocating a few
> GB of memory and then starting a new thread.

One more thing, that may be too cute to be practical, but is worth mentioning:
shared address space or shared mappings only require LOCK if the memory
is ACTIVELY shared, i.e. used by DMA or by another task that is running
right now.

If you have a process with a helper thread that's asleep 99% of the time,
the savings of running with LOCK off might be worth the occasional
IPI to enable it on the main thread on the rare occasions that the
helper wakes up on a different processor.

For example, imagine a threaded async DNS resolver that tracks
TTL and times out cache entries.

If you have heavier-weight mutual exclusion, you don't need LOCK.

2005-11-24 18:49:06

by Linus Torvalds

Subject: Re: [patch] SMP alternatives



On Thu, 24 Nov 2005, [email protected] wrote:
>
> > I suspect that with MAP_SHARED + PROT_WRITE being pretty uncommon anyway,
> > we can probably find trivial patterns in the kernel. Like only one process
> > holding that file open - which is what you get with things that use mmap()
> > to write a new file (I think "ld" used to have a config option to write
> > files that way, for example).
>
> Just a bit of practical experience: I use mmap() to write data a LOT,
> because msync(MS_ASYNC) is the most portable way to do an async write.

Sure. But I suspect that nobody else has that file open when you do so?

In other words, even your usage is something where the OS could tell that
you don't actually need atomic operations. It certainly gets slightly more
complicated (we'd need to trigger some special stuff if another process
does an mmap on it), but it's not conceptually very difficult to just
notice automatically and do the right thing(tm).

Now, if two programs are using mmap() to write to the same file at the
same time, then the kernel can't tell any more. But in that case, you
probably _do_ want atomic ops to be guaranteed, so not disabling them is
the right thing to do there.

Linus

2005-11-24 19:08:05

by Tim Hockin

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 03:15:24PM +0000, Alan Cox wrote:
> The Intel I have looked at generates MCE if the L2/L1/bus parity errors
> but not on a RAM ECC error as that is memory controller not CPU level.
> That usually asserts NMI. Same for most older chips PIII/AMD Athlon etc

Some BIOSes route that into SMI. The BIOS can then log the error and tell
the OS via the nmi_now bit. Uggh.

2005-11-24 19:10:28

by Tim Hockin

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 04:36:35PM +0100, Andi Kleen wrote:
> > The current k8
> > code has been delayed for this reason.
> >
> > Where the EDAC code goes beyond the current k8 facilities is the
> > decode to the dimm level so that the bad memory stick can be
> > easily identified.
>
> That would be nice to have agreed. But I don't really know
> how to do this without mainboard specific knowledge.
> If you have something usable it's best to port it to mce.c
> or perhaps mcelog

I'm curious about that too. Even with k8 you can get down to a
chip-select, but that doesn't necessarily map to a DIMM in any useful way,
unless you have some mobo knowledge. Are we going to need a new BIOS
table to map chip-selects onto DIMMs? :)

2005-11-24 19:15:07

by Andi Kleen

Subject: Re: [patch] SMP alternatives

> I'm curious about that too. Even with k8 you can get down to a
> chip-select, but that doesn't necessarily map to a DIMM in any useful way,
> unless you have some mobo knowledge. Are we going to need a new BIOS

Yeah that's my problem.

It's not theoretical. We had cases where someone had to go
through 10+ DIMMs on a big machine by trial and error to find
out which one is wrong. Very bad situation.

[Double plus bad if it wasn't actually any of the DIMMs that were
bad, but one of the VRMs on a big Opteron - it causes all the same
symptoms as a bad DIMM :/]

> table to map chip-selects onto DIMMs? :)

I proposed something like that - best with an ASCII string
("First DIMM on the top left corner") But getting such stuff into BIOS
is difficult and long winded.

-Andi

2005-11-24 19:14:55

by Tim Hockin

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 06:58:52AM -0700, Eric W. Biederman wrote:
> > That's supposed to be done by hardware, no?
> > At least the K8 has a hardware scrubber (although it's not always enabled)
>
> Recent good implementations like the Opteron will do it for you.
> Older or cheaper memory controllers will not.

Beware of errata - there's at least one erratum on Opteron which forces you
to choose between x4 (chipkill) ECC and scrubber. One or the other, but
not both. There are plenty of errata on the scrubber alone. Worse, if my
(brain)memory is correct, without the scrubber, correctable errors are
corrected on the fly, but never written back to DRAM.

> Having an architecturally sane software scrubber as backup for
> the hardware implementations is nice, and except in the cases
> where someone disables the lock prefix it takes very little
> code on x86.
>
> Even on the Opteron you could theoretically have the case of a brain-dead
> external memory controller, although that is not likely.

2005-11-24 19:22:32

by Tim Hockin

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 08:14:46PM +0100, Andi Kleen wrote:
> > I'm curious about that too. Even with k8 you can get down to a
> > chip-select, but that doesn't necessarily map to a DIMM in any useful way,
> > unless you have some mobo knowledge. Are we going to need a new BIOS
>
> Yeah that's my problem.
>
> It's not theoretical. We had cases where someone had to go
> through 10+ DIMMs on a big machine by trial and error to find
> out which one is wrong. Very bad situation.

I have the exact same problem right now. Part of our early bootup we run
a simplish memory test. Basically it's a "can the memory hold state"
test. If anything fails, we have to identify as exactly as possible WHICH
DIMM needs to be replaced, so the hardware ops people can do it at
assembly/test time.

We implemented AMD's reference algorithm, and made it work in the presence
of a hardware IO hole. It seems to work beautifully, but the last step is
turning a (node:chip-select) into a (node:dimm). Simple boards will use
simple mappings, but we can't know that without board specific info.
Especially with quad-rank DIMMs. :)

> > table to map chip-selects onto DIMMs? :)
>
> I proposed something like that - best with an ASCII string
> ("First DIMM on the top left corner") But getting such stuff into BIOS
> is difficult and long winded.

It would be easy enough to get into LinuxBIOS. :)

Seriously, this is work that is *long* overdue. I have been wanting to
look at this for over a year, but I have not had time.

Doing proper architecture and chipset-specific ECC/error handling which
ties into a bigger abstracted error system is going to be really nice.

2005-11-24 19:26:10

by Andi Kleen

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 11:16:36AM -0800, [email protected] wrote:
> On Thu, Nov 24, 2005 at 06:58:52AM -0700, Eric W. Biederman wrote:
> > > That's supposed to be done by hardware, no?
> > > At least the K8 has a hardware scrubber (although it's not always enabled)
> >
> > Recent good implementations like the Opteron will do it for you.
> > Older or cheaper memory controllers will not.
>
> Beware of errata - there's at least one erratum on Opteron which forces you
> to choose between x4 (chipkill) ECC and scrubber. One or the other, but
> not both. There are plenty of errata on the scrubber alone. Worse, if my
> (brain)memory is correct, without the scrubber, correctable errors are
> corrected on the fly, but never written back to DRAM.

All the scrub errata were fixed with E stepping AFAIK.

You have a point that using a sw scrubber might make sense on earlier
steppings though in case someone really wants chipkill.
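
For illustration only, a software scrubber pass can be sketched as a loop that does a locked read-modify-write of zero on every word. This is a user-space sketch, not the EDAC code: scrub_range() is a made-up name, a real scrubber would live in the kernel and walk physical memory, and whether and when corrected data actually reaches DRAM depends on the caches and the memory controller.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical helper: touch every 32-bit word in [start, start + words)
 * with an atomic read-modify-write that adds zero, so the stored value
 * is read and then written back unchanged.
 */
static void scrub_range(uint32_t *start, size_t words)
{
	assert(start != NULL || words == 0);

	for (size_t i = 0; i < words; i++) {
#if defined(__i386__) || defined(__x86_64__)
		/* locked add of 0: this is the lock prefix the thread is about */
		__asm__ __volatile__("lock; addl $0, %0" : "+m" (start[i]));
#else
		/* assumption: GCC/Clang builtin as a stand-in on other CPUs */
		(void)__atomic_fetch_add(&start[i], 0, __ATOMIC_SEQ_CST);
#endif
	}
}
```

The locked form matters because a plain load followed by a store could overwrite a value that another CPU changed in between. That is also why globally disabling the lock prefix would break this kind of scrubber.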

-Andi

2005-11-24 19:29:56

by Andi Kleen

Subject: Re: [patch] SMP alternatives

> We implemented AMD's reference algorithm, and made it work in the presence
> of a hardware IO hole. It seems to work beautifully, but the last step is
> turning a (node:chip-select) into a (node:dimm). Simple boards will use
> simple mappings, but we can't know that without board specific info.
> Especially with quad-rank DIMMs. :)

If you get something working it would be good if you could share the code
(even if it still needs to be tweaked)

>
> > > table to map chip-selects onto DIMMs? :)
> >
> > I proposed something like that - best with an ASCII string
> > ("First DIMM on the top left corner") But getting such stuff into BIOS
> > is difficult and long winded.
>
> It would be easy enough to get into LinuxBIOS. :)
>
> Seriously, this is work that is *long* overdue. I have been wanting to
> look at this for over a year, but I have not had time.
>
> Doing proper architecture and chipset-specific ECC/error handling which
> ties into a bigger abstracted error system is going to be really nice.

IMNSHO the x86-64 mce.c with its error log is a reasonable start. All
the smarts can be in user space and in mcelog.c. DIMM decoding
is a special case though because the information is really useful
to be printed onto the screen for fatal MCEs. So that one is better
in kernel space.

-Andi

2005-11-24 19:43:14

by Tim Hockin

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 08:29:53PM +0100, Andi Kleen wrote:
> > We implemented AMD's reference algorithm, and made it work in the presence
> > of a hardware IO hole. It seems to work beautifully, but the last step is
> > turning a (node:chip-select) into a (node:dimm). Simple boards will use
> > simple mappings, but we can't know that without board specific info.
> > Especially with quad-rank DIMMs. :)
>
> If you get something working it would be good if you could share the code
> (even if it still needs to be tweaked)

The below code works for us. Note that I did not implement the
node-interleaving parts of the AMD algorithm. If that matters, it should
be simple enough to do. The BKDG has good pseudo-code. The only thing it
gets absolutely wrong is the IO hole.

Let me know if you see any problems here. This is a userspace tool, but
should be trivial to adapt.


#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include "pci.h"	/* local helper (not shown): pci_open(), pci_read32(), pci_close() */

#define MAX_NODES 8
#define MAX_CS 8
#define NODE_DEV(n) (24+(n))
#define FOUR_GIG 0x01000000 // shifted to hold address[39..8]

static char *progname;

static void
usage(void)
{
	fprintf(stderr, "usage: %s <address>\n", progname);
}

/*
 * Map a CS to a pair of DIMM slots (for 128 bit operation). This mapping
 * is board-specific, and has the potential to be very ugly.
 */
static int cs_to_pair[MAX_CS][2] = {
	{ 0, 1 }, { 0, 1 },
	{ 2, 3 }, { 2, 3 },
	{ 4, 5 }, { 4, 5 },
	{ 6, 7 }, { 6, 7 },
};

int
main(int argc, char *argv[])
{
	unsigned long long raw_addr;
	uint32_t address;
	char *endp;
	int node;

	progname = argv[0];

	if (argc != 2) {
		usage();
		exit(1);
	}
	raw_addr = strtoull(argv[1], &endp, 0);
	if (endp == argv[1]) {
		usage();
		exit(2);
	}

	/*
	 * The address space is 40 bits (for now). We want to use 32 bit
	 * values, so we convert the input address to hold address[39..8].
	 * We'll use this format everywhere. This loses the
	 * low-order bits, so we keep a raw copy around.
	 */
	address = (raw_addr & 0xffffffffffULL) >> 8;

	/* find the node that holds this address */
	for (node = 0; node < MAX_NODES; node++) {
		int pci_dev;
		uint32_t tmp;
		int dram_en;
		uint32_t dram_base;
		uint32_t dram_limit;
		int hole_en;
		uint32_t hole_base;
		uint32_t hole_size;
		int cs;

		/*
		 * The DRAM map and IO hole info are in function 1 of each
		 * node.
		 */
		pci_dev = pci_open(0, NODE_DEV(node), 1);
		if (pci_dev < 0) {
			/* node does not exist */
			break;
		}

		/*
		 * DRAM_BASE and DRAM_LIMIT already hold address[39..8].
		 */
		tmp = pci_read32(pci_dev, 0x40+(node*8));
		dram_en = tmp & 0x3;
		dram_base = tmp & 0xffff0000;

		tmp = pci_read32(pci_dev, 0x44+(node*8));
		dram_limit = tmp | 0x0000ffff;

		/*
		 * HOLE_BASE holds address[35..4], so we convert it to hold
		 * address[39..8].
		 */
		tmp = pci_read32(pci_dev, 0xf0);
		hole_en = tmp & 0x1;
		hole_base = (tmp & 0xff000000) >> 8;
		hole_size = FOUR_GIG - hole_base;

		pci_close(pci_dev);

		if (!dram_en) {
			/* no DRAM here */
			continue;
		}

		if (address > dram_limit) {
			/* keep looking */
			continue;
		}

		/*
		 * The address must be on this node.
		 */

		/* is the address in the IO hole? */
		if (hole_en && address >= hole_base && address < FOUR_GIG) {
			/* no DRAM in the IO hole */
			break;
		}

		/* is the address >= 4GB on a node with a hole? */
		if (hole_en && address >= FOUR_GIG) {
			/* adjust address for the IO hole */
			address -= hole_size;
		}

		/* adjust address to be node-relative */
		address -= dram_base;

		/* store addr[35..4] */
		address <<= 4;

		/* The chip-select map is in function 2 of each node. */
		pci_dev = pci_open(0, NODE_DEV(node), 2);
		if (pci_dev < 0) {
			/* function 2 is not accessible; cannot decode further */
			break;
		}

		/* find the chip-select that has this address */
		for (cs = 0; cs < MAX_CS; cs++) {
			uint32_t cs_base;
			uint32_t cs_mask;

			/*
			 * CS_BASE and CS_MASK hold address[35..0], so we
			 * convert them to hold address[39..8].
			 */
			tmp = pci_read32(pci_dev, 0x40+(cs*4));
			if ((tmp & 0x1) == 0) {
				/* this CS is not enabled */
				continue;
			}
			cs_base = tmp & 0xffe0fe00;

			tmp = pci_read32(pci_dev, 0x60+(cs*4));
			cs_mask = (tmp | 0x001f01ff) & 0x3fffffff;

			/* this should handle interleaving, too */
			if ((address & ~cs_mask) == (cs_base & ~cs_mask)) {
				int *pair = cs_to_pair[cs];
				int dimm;

				/* which DIMM is this address on? */
				if ((raw_addr & (1<<3)) == 0) {
					dimm = pair[0];
				} else {
					dimm = pair[1];
				}

				printf("node %d, CS %d, DIMM %d\n",
				       node, cs, dimm);
				break;
			}
		}
		pci_close(pci_dev);
		break;
	}
	return 0;
}

2005-11-24 21:20:16

by Andi Kleen

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 11:44:59AM -0800, [email protected] wrote:
> On Thu, Nov 24, 2005 at 08:29:53PM +0100, Andi Kleen wrote:
> > > We implemented AMD's reference algorithm, and made it work in the presence
> > > of a hardware IO hole. It seems to work beautifully, but the last step is
> > > turning a (node:chip-select) into a (node:dimm). Simple boards will use
> > > simple mappings, but we can't know that without board specific info.
> > > Especially with quad-rank DIMMs. :)
> >
> > If you get something working it would be good if you could share the code
> > (even if it still needs to be tweaked)
>
> The below code works for us. Note that I did not implement the
> node-interleaving parts of the AMD algorithm. If that matters, it should
> be simple enough to do. The BKDG has good pseudo-code. The only thing it
> gets absolutely wrong is the IO hole.

Thanks. But without a per board DIMM mapping it's pretty useless, isn't it?

One could detect the IO hole by reading the IORR MSRs or alternatively
parsing the e820 map in /var/log/boot.msg

-Andi

2005-11-24 21:38:38

by Tim Hockin

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 10:20:00PM +0100, Andi Kleen wrote:
> > The below code works for us. Note that I did not implement the
> > node-interleaving parts of the AMD algorithm. If that matters, it should
> > be simple enough to do. The BKDG has good pseudo-code. The only thing it
> > gets absolutely wrong is the IO hole.
>
> Thanks. But without a per board DIMM mapping it's pretty useless, isn't it?

Exactly.

> One could detect the IO hole by reading the IORR MSRs or alternatively
> parsing the e820 map in /var/log/boot.msg

Why bother? In this process - turning a physical address into a DIMM,
you're poking at all the data anyway, just get the IO hole straight from
the chipset.

2005-11-24 22:32:24

by Ulrich Drepper

Subject: Re: [patch] SMP alternatives

On 11/23/05, Linus Torvalds <[email protected]> wrote:
> It _should_ be fairly easy to do something like that - just a simple
> global flag that gets set and makes CPL3 ignore lock prefixes.

This is a good first step. But the effectiveness will decrease
significantly in the near future. It can only work for
single-threaded processes without writable shared memory.

With the growing number of cores/threads the need to use parallelism
rises. With techniques like OpenMP the threshold to do this is
lowered significantly. The process-model, so much preferred on this
list over the thread model, requires shared memory, therefore also
eliminating the effectiveness of this functionality.

A real solution needs to be more fine grained. It is often known in
the userland code whether the specific word which is accessed using
atomic ops really can be shared. The POSIX interfaces, for instance,
require that all mutexes etc which are placed in shared memory are
attributed as such. Combine this with the knowledge about the number
of threads in use and the result is that even with writable shared
memory segments the lock prefix can be avoided. There are a whole
bunch of cases where we already do conditional locking. It's just
plain ugly and not as efficient as we would like it.
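
A rough sketch of what such conditional locking can look like at the C level, using C11 atomics. This is not glibc's code: the flag multiple_threads, the pshared argument and counter_add() are all assumed names for illustration, and the flag is assumed to be set before a second thread starts running.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical flag: set to true just before the first extra thread is created. */
static bool multiple_threads;

/*
 * Add n to *ctr and return the old value. The atomic (LOCK-prefixed on
 * x86) path is taken only when the word can really be seen by someone
 * else: another thread, or another process if it sits in shared memory.
 */
static int counter_add(atomic_int *ctr, int n, bool pshared)
{
	assert(ctr != NULL);

	if (!multiple_threads && !pshared) {
		/* nobody else can touch this word: plain load and store */
		int old = atomic_load_explicit(ctr, memory_order_relaxed);
		atomic_store_explicit(ctr, old + n, memory_order_relaxed);
		return old;
	}
	/* potentially shared: full atomic read-modify-write */
	return atomic_fetch_add(ctr, n);
}
```

The branch is the "plain ugly" part: every atomic operation pays for a test and a jump, and the pshared information has to be carried along with each object.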

So, implementing control with per-userlevel context will see rapidly
diminishing success and I'm wondering whether it is better to go for
something with a bit finer level of control.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2005-11-24 22:34:11

by Ulrich Drepper

Subject: Re: [patch] SMP alternatives

On 11/23/05, Daniel Jacobowitz <[email protected]> wrote:
> Those are the wrong ways of doing this in userspace. There are right
> ways. For instance, tag the binary at link time "single-threaded".

This works and the system is designed this way. But it's unlikely
that any distribution will ship code like this since the maintenance
is too problematic.


> Glibc does not do this to the best of my knowledge. It does select
> different code paths in various places based on the presence of
> multiple threads, but that's for cancellation, not for locking.

Wrong. Linus is right, we jump over the lock prefix. After a lot of
benchmarking I found this to be the fastest way and the Intel people
seemed to agree.


> This is also a trivially solvable problem in userspace; you make the
> dynamic linker enforce consistency of the tags.

This would require that potentially every single DSO is duplicated as
threaded and non-threaded. If you like this you might as well enter
the horror world of BSD with their libc_r. This will never fly, the
support costs are too high.


> The number of userspace libraries that use atomic operations is, in
> practice, quite small.

It's really not, and the number using them is growing.

--
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖

2005-11-24 22:40:18

by Alan

Subject: Re: [patch] SMP alternatives

On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
> I proposed something like that - best with an ASCII string
> ("First DIMM on the top left corner") But getting such stuff into BIOS
> is difficult and long winded.

Propose it to the desktop management people and get it into the DMI
standard. They already have entries for each memory slot, they already
have entries for descriptive strings for connectors. In fact you may
well be able to 'bend' the spec enough to do it as is.

2005-11-24 22:47:07

by Tim Hockin

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 11:12:14PM +0000, Alan Cox wrote:
> On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
> > I proposed something like that - best with an ASCII string
> > ("First DIMM on the top left corner") But getting such stuff into BIOS
> > is difficult and long winded.
>
> Propose it to the desktop management people and get it into the DMI
> standard. They already have entries for each memory slot, they already
> have entries for descriptive strings for connectors. In fact you may
> well be able to 'bend' the spec enough to do it as is.

There are enough fields that maybe one of them is loose enough to mean
this. It doesn't help us convince mobo vendors to support it, though.

2005-11-24 23:35:28

by Eric W. Biederman

Subject: Re: [patch] SMP alternatives

Andi Kleen <[email protected]> writes:

> Thanks. But without a per board DIMM mapping it's pretty useless, isn't it?

Nope.

Getting a per board chip select to DIMM mapping is fairly easy, you
just need a lookup table of:
(memory_controller, chip_select, channel, dimm_label)

Which if the motherboard vendor does not give it to you is pretty
straight forward to discover just by plugging a minimal memory configuration
into various slots. We can already query in software what the
motherboard is so keeping a table like this in user space is
not a problem.
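
As a sketch of such a table, assuming one array per motherboard model. The struct layout follows the tuple above; the board data, the labels and the function name lookup_dimm() are invented for illustration and do not describe any real board:

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

struct dimm_map_entry {
	int memory_controller;	/* node number on K8 */
	int chip_select;
	int channel;		/* which half of a 128-bit pair */
	const char *dimm_label;	/* silkscreen label on the board */
};

/* Hypothetical board: two chip selects per DIMM, two channels. */
static const struct dimm_map_entry example_board[] = {
	{ 0, 0, 0, "CPU0 DIMM A1" }, { 0, 0, 1, "CPU0 DIMM B1" },
	{ 0, 1, 0, "CPU0 DIMM A1" }, { 0, 1, 1, "CPU0 DIMM B1" },
	{ 0, 2, 0, "CPU0 DIMM A2" }, { 0, 2, 1, "CPU0 DIMM B2" },
	{ 0, 3, 0, "CPU0 DIMM A2" }, { 0, 3, 1, "CPU0 DIMM B2" },
};

#define EXAMPLE_BOARD_ENTRIES \
	(sizeof(example_board) / sizeof(example_board[0]))

/* Return the label for (mc, cs, ch), or NULL if the board table has no entry. */
static const char *lookup_dimm(const struct dimm_map_entry *tbl, size_t n,
			       int mc, int cs, int ch)
{
	assert(tbl != NULL || n == 0);

	for (size_t i = 0; i < n; i++) {
		if (tbl[i].memory_controller == mc &&
		    tbl[i].chip_select == cs &&
		    tbl[i].channel == ch)
			return tbl[i].dimm_label;
	}
	return NULL;
}
```

The kernel driver would only need to report (memory_controller, chip_select, channel); a user-space tool could then pick the table by motherboard identity and print the label.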

> One could detect the IO hole by reading the IORR MSRs or alternatively
> parsing the e820 map in /var/log/boot.msg

The problem is not detection but compensating for how it changes
the address.

I do agree that it would be nice if there was a standard for
BIOS's reporting this information. In LinuxBIOS it is one of
those TODO list items we never quite get to. But so far a user
space table has proved quite useful in practice.

Eric

2005-11-24 23:35:34

by Andi Kleen

Subject: Re: [patch] SMP alternatives

On Thu, Nov 24, 2005 at 02:48:25PM -0800, [email protected] wrote:
> On Thu, Nov 24, 2005 at 11:12:14PM +0000, Alan Cox wrote:
> > On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
> > > I proposed something like that - best with an ASCII string
> > > ("First DIMM on the top left corner") But getting such stuff into BIOS
> > > is difficult and long winded.
> >
> > Propose it to the desktop management people and get it into the DMI
> > standard. They already have entries for each memory slot, they already
> > have entries for descriptive strings for connectors. In fact you may
> > well be able to 'bend' the spec enough to do it as is.
>
> There are enough fields that maybe one of them is loose enough to mean
> this. It doesn't help us convince mobo vendors to support it, though.

With arbitrary desktop/laptop/etc. vendors it's pretty hopeless I agree.
But I suspect there is a chance at least on the server side. There
is only a limited number of companies working on server BIOSes
for their boards and they tend to be more receptive to Linux's need
because it's now a significant part of their market.

And it's clearly an obviously useful "RAS feature" which is
fully buzzword compatible and everything.

IMHO it's time that Linux gets more proactive regarding talking
to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
would be a good thing. I have at least one other extension I would like
BIOS vendors to support. Just would need to come up with a writeup
for a clearly defined specification.

-Andi

2005-11-24 23:42:16

by Alan

Subject: Re: [patch] SMP alternatives

On Gwe, 2005-11-25 at 00:35 +0100, Andi Kleen wrote:
> to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
> would be a good thing. I have at least one other extension I would like
> BIOS vendors to support. Just would need to come up with a writeup
> for a clearly defined specification.

I wrote one for 2.2 era but it never got updated. I've probably still
got it around somewhere.

2005-11-25 01:34:15

by H. Peter Anvin

Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
>
> IMHO it's time that Linux gets more proactive regarding talking
> to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
> would be a good thing. I have at least one other extension I would like
> BIOS vendors to support. Just would need to come up with a writeup
> for a clearly defined specification.
>

BIOS, and hardware. I think Alan wrote something up way long ago, but
it hasn't really been updated.

-hpa

2005-11-25 07:39:40

by Chris Wedgwood

Subject: Re: [patch] SMP alternatives

On Wed, Nov 23, 2005 at 01:36:08PM -0800, Linus Torvalds wrote:

> Actual UP machines are going to go away - even ARM is going SMP, and
> in the PC space, we'll have multi-core laptops probably being the
> rule rather than the exception in a couple of years.

CPUs in the embedded space could outnumber desktops & servers greatly
(cell phones, access points, routers, media players, etc). Most of
these will be UP for some time.

2005-11-25 17:37:40

by Linus Torvalds

Subject: Re: [patch] SMP alternatives



On Thu, 24 Nov 2005, Chris Wedgwood wrote:
>
> CPUs in the embedded space could outnumber desktops & servers greatly
> (cell phones, access points, routers, media players, etc). Most of
> these will be UP for some time.

That's not entirely clear either.

There are definite advantages to SMP even in the embedded space - or, to
put it more strongly: _especially_ in the embedded space.

Not only does power usage go up cubically with frequency (which means that
two cores are a lot more efficient than one at double the frequency), but
embedded space also often has some clear separation between tasks that
-can- be threaded (and often part of it has real-time characteristics, so
getting a core of its own can be a good thing). Often more so than in the
desktop space.

Now, obviously, in the "4- or 8-bit microcontroller" kind of embedded
space, SMP isn't going to be a big issue. But anything that already uses
an ARM, MIPS or a PowerPC-like chip, going SMP is not at all ridiculous.
That includes things like cellphones, where one core might be for
communication functions, and one for smartphones.

None of the cellphone manufacturers seem to be in the least interested in
doing a "phone only" solution. They can already do that cheaply, they
can't make much money off it, and they are all interested in features. And
it really _is_ more power-efficient to have, say, a dual-core 200MHz chip
than it is to have a single-core 300MHz one.
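
To put rough numbers on that, here is the arithmetic under the cubic rule of thumb stated above, taking power as cores * f^3 in arbitrary units and ignoring static leakage and everything outside the cores. The function name relative_power() is made up for this sketch:

```c
#include <assert.h>

/*
 * Relative power in arbitrary units, assuming power per core grows with
 * the cube of the clock frequency (a rule of thumb, not a measurement).
 */
static double relative_power(int cores, double mhz)
{
	assert(cores > 0 && mhz > 0.0);
	return cores * mhz * mhz * mhz;
}
```

With these assumptions relative_power(2, 200.0) is 16,000,000 and relative_power(1, 300.0) is 27,000,000, so the dual-core part draws roughly 59% of the single-core one while offering 400MHz of aggregate clock instead of 300MHz. Two cores at f against one core at 2f come out at 2f^3 versus 8f^3, a factor of four.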

Now, sometimes those SMP systems will actually be used as "tightly coupled
UP", where one of the CPU's is just basically a DSP. And from a power
efficiency standpoint, having specialized hardware (and thus _A_MP rather
than SMP) is obviously better, but in complex tasks - and communication
tends to be that - general-purpose is often desirable enough that people
will take the inefficiencies of a GP CPU over a fixed-function specialized
DSP-kind of environment.

But SMP is absolutely _not_ unusual in embedded. It's been there for years
already, and it's clearly moving downwards there too.

Linus

2005-11-25 20:20:45

by H. Peter Anvin

Subject: Re: [patch] SMP alternatives

Chris Wedgwood wrote:
> On Wed, Nov 23, 2005 at 01:36:08PM -0800, Linus Torvalds wrote:
>
>>Actual UP machines are going to go away - even ARM is going SMP, and
>>in the PC space, we'll have multi-core laptops probably being the
>>rule rather than the exception in a couple of years.
>
> CPUs in the embedded space could outnumber desktops & servers greatly
> (cell phones, access points, routers, media players, etc). Most of
> these will be UP for some time.

It's unlikely, though, that you'd have a need to run an SMP-compiled
kernel on these devices.

-hpa

2005-11-28 19:15:47

by Bill Davidsen

Subject: Re: [patch] SMP alternatives

Andi Kleen wrote:
> On Thu, Nov 24, 2005 at 02:48:25PM -0800, [email protected] wrote:
>
>>On Thu, Nov 24, 2005 at 11:12:14PM +0000, Alan Cox wrote:
>>
>>>On Iau, 2005-11-24 at 20:14 +0100, Andi Kleen wrote:
>>>
>>>>I proposed something like that - best with an ASCII string
>>>>("First DIMM on the top left corner") But getting such stuff into BIOS
>>>>is difficult and long winded.
>>>
>>>Propose it to the desktop management people and get it into the DMI
>>>standard. They already have entries for each memory slot, they already
>>>have entries for descriptive strings for connectors. In fact you may
>>>well be able to 'bend' the spec enough to do it as is.
>>
>>There are enough fields that maybe one of them is loose enough to mean
>>this. It doesn't help us convince mobo vendors to support it, though.
>
>
> With arbitrary desktop/laptop/etc. vendors it's pretty hopeless I agree.
> But I suspect there is a chance at least on the server side. There
> is only a limited number of companies working on server BIOSes
> for their boards and they tend to be more receptive to Linux's need
> because it's now a significant part of their market.

It would seem that the OEMs buying the board would like this feature,
since it could be incorporated into POST, diagnostic CDs, etc. And since
server owners are more likely to have a service contract, anything to
make service calls faster is a benefit to the system vendor.
>
> And it's clearly an obviously useful "RAS feature" which is
> fully buzzword compatible and everything.
>
> IMHO it's time that Linux gets more proactive regarding talking
> to BIOS vendors. Perhaps a generic "BIOS writers guide for Linux"
> would be a good thing. I have at least one other extension I would like
> BIOS vendors to support. Just would need to come up with a writeup
> for a clearly defined specification.

If someone handed them some good specs except for the table, I suspect
they would see the benefit. Independent BIOS writers compete for board
contracts, in-house writers want features with one time cost and every
time benefit, I think you're right that this would be a benefit to everyone.

Given that it seems so simple, is there a reason why this hasn't been
around for ages?

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-28 19:50:48

by Bill Davidsen

Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:
>
> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>
>>Why should we use a silicon based solution for this, when I posit that
>>there are simpler and equally effective userspace solutions?
>
>
> Name them.
>
> In user space, doing things like clever run-time linking things is
> actually horribly bad. It causes COW faults at startup, and/or makes the
> compiler have to do indirections unnecessarily. Both of which actually
> make caches less effective, because now processes that really effectively
> do have exactly the same contents have them in different pages.
>
> The other alternative (which apparently glibc actually does use) is to
> dynamically branch over the lock prefixes, which actually works better:
> it's more work dynamically, but it's much cheaper from a startup
> standpoint and there's no memory duplication, so while it is the "stupid"
> approach, it's actually better than the clever one.
>
> The third alternative is to know at link-time that the process never does
> anything threaded, but that needs more developer attention and
> non-standard setups, and you _will_ get it wrong (some library will create
> some thread without the developer even realizing). It also has the
> duplicated library overhead (but at least now the duplication is just
> twice, not "each process duplicates its own private pointer")
>
> In short, there simply isn't any good alternatives. The end result is that
> thread-safe libraries are always in practice thread-safe even on UP, even
> though that serializes the CPU altogether unnecessarily.
>
> I'm sure you can make up alternatives every time you hit one _particular_
> library, but that just doesn't scale in the real world.
>
> In contrast, the simple silicon support scales wonderfully well. Suddenly
> libraries can be thread-safe _and_ efficient on UP too. You get to eat
> your cake and have it too.

I believe that a hardware solution would also accommodate the case where
a program runs unthreaded for most of the processing, and only starts
threads to do the final stage "report generation" tasks, where that
makes sense. I don't believe that it helps in the case where init uses
threads and then reverts to a single thread for the balance of the task.
I can't think of anything which does that, so it's probably a
non-critical corner case, or something the thread library could correct.
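
For concreteness, the "dynamically branch over the lock prefixes" approach
quoted above can be sketched roughly as follows. This is only an
illustration under stated assumptions, not glibc's actual code: the flag
name process_is_threaded and the helper counter_inc() are hypothetical, and
the atomic path uses the GCC/Clang __atomic builtins rather than hand-written
assembly.

```c
#include <stdbool.h>

/* Hypothetical flag: a thread library would set this when the first
 * additional thread is created. Not a real glibc symbol. */
static bool process_is_threaded = false;

/* Increment *counter and return the new value. */
static int counter_inc(int *counter)
{
    if (!process_is_threaded) {
        /* Single-threaded: plain read-modify-write, no locked access. */
        return ++*counter;
    }
    /* Threaded: atomic add (GCC/Clang builtin; a locked instruction on x86). */
    return __atomic_add_fetch(counter, 1, __ATOMIC_SEQ_CST);
}
```

In a sketch like this the flag is only ever set, never cleared, which is
exactly the "threads at init, single thread afterwards" case: the process
keeps paying for the atomic path unless the thread library resets the flag
once the last extra thread has exited.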


--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-28 20:05:10

by Zachary Amsden

Subject: Re: [patch] SMP alternatives

Bill Davidsen wrote:

> Linus Torvalds wrote:
>
>>
>> In contrast, the simple silicon support scales wonderfully well.
>> Suddenly libraries can be thread-safe _and_ efficient on UP too. You
>> get to eat your cake and have it too.
>
>
> I believe that a hardware solution would also accommodate the case
> where a program runs unthreaded for most of the processing, and only
> starts threads to do the final stage "report generation" tasks, where
> that makes sense. I don't believe that it helps in the case where init
> uses threads and then reverts to a single thread for the balance of
> the task. I can't think of anything which does that, so it's probably
> a non-critical corner case, or something the thread library could
> correct.


Startup routine of a scientific app calls a multithreaded "fetch work"
routine, then crunches the data using a single thread. This could even
happen somewhere inside a library, so the application itself is unaware
that threads were ever invoked. This is not a far-fetched case.

You really need per-address object notions of "threadedness" when
talking about shared memory, since you may need shared memory to be
atomic, but operate on the heap in single threaded fashion.
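
One way to picture a per-object notion of "threadedness" is to carry the
decision with the object instead of with the process. The sketch below is
only an assumption about how that could look; struct counter, its shared
field and counter_add() are made-up names, and the atomic path relies on the
GCC/Clang __atomic builtins.

```c
#include <stdbool.h>

/* Hypothetical counter type that records whether it may be shared. */
struct counter {
    long value;
    bool shared;   /* true: lives in shared memory, needs atomic updates */
};

/* Add n to the counter and return the new value. */
static long counter_add(struct counter *c, long n)
{
    if (c->shared)
        return __atomic_add_fetch(&c->value, n, __ATOMIC_SEQ_CST);

    c->value += n;   /* private object: a plain update is enough */
    return c->value;
}
```

A counter placed in a shared mapping would be created with shared set, while
one allocated on a private heap would leave it clear and never take the
atomic path.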

Zach

2005-11-28 20:23:14

by Bill Davidsen

Subject: Re: [patch] SMP alternatives

Linus Torvalds wrote:
>
> On Thu, 24 Nov 2005, Chris Wedgwood wrote:
>
>>CPUs in the embedded space could outnumber desktops & servers greatly
>>(cell phones, access points, routers, media players, etc). Most of
>>these will be UP for some time.
>
>
> That's not entirely clear either.
>
> There are definite advantages to SMP even in the embedded space - or, to
> put it more strongly: _especially_ in the embedded space.
>
I would argue that there is no "the embedded space," but rather a set of
embedded spaces with various needs. Having worked doing industrial
control for three years and lunched with IC folks another decade, I'm
fairly sure that consumer goods are very different from real industrial
control, and realtime items (multimedia) are different from phones and
PDAs. Until the phone gets "swear at it" slow, features like voice
recognition are more important than doing voice to number lookup in 20ms
instead of 400ms. Cost and battery life matter a lot too, while the
media and IC markets are already attached to expensive stuff, so the
computer is a smaller fraction of the cost.

> None of the cellphone manufacturers seem to be in the least interested in
> doing a "phone only" solution. They can already do that cheaply, they
> can't make much money off it, and they are all interested in features. And
> it really _is_ more power-efficient to have, say, a dual-core 200MHz chip
> than it is to have a single-core 300MHz one.
>
> Now, sometimes those SMP systems will actually be used as "tightly coupled
> UP", where one of the CPU's is just basically a DSP. And from a power
> efficiency standpoint, having specialized hardware (and thus _A_MP rather
> than SMP) is obviously better, but in complex tasks - and communication
> tends to be that - general-purpose is often desirable enough that people
> will take the inefficiencies of a GP CPU over a fixed-function specialized
> DSP-kind of environment.
>
> But SMP is absolutely _not_ unusual in embedded. It's been there for years
> already, and it's clearly moving downwards there too.

Absolutely true, but that dual core 200 MHz chip probably draws more
power than a 200 MHz uni, etc. So there will probably be uni
applications for the foreseeable future, and any benefit in uni performance
will be useful.

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me

2005-11-28 22:46:07

by Jeffrey V. Merkey

Subject: Re: [patch] SMP alternatives

Bill Davidsen wrote:

> Linus Torvalds wrote:
>
>>
>> On Wed, 23 Nov 2005, Daniel Jacobowitz wrote:
>>
>>> Why should we use a silicon based solution for this, when I posit that
>>> there are simpler and equally effective userspace solutions?
>>
>>
>>
>> Name them.
>>
>> In user space, doing things like clever run-time linking things is
>> actually horribly bad. It causes COW faults at startup, and/or makes
>> the compiler have to do indirections unnecessarily. Both of which
>> actually make caches less effective, because now processes that
>> really effectively do have exactly the same contents have them in
>> different pages.
>>
>> The other alternative (which apparently glibc actually does use) is
>> to dynamically branch over the lock prefixes, which actually works
>> better: it's more work dynamically, but it's much cheaper from a
>> startup standpoint and there's no memory duplication, so while it is
>> the "stupid" approach, it's actually better than the clever one.
>>
>> The third alternative is to know at link-time that the process never
>> does anything threaded, but that needs more developer attention and
>> non-standard setups, and you _will_ get it wrong (some library will
>> create some thread without the developer even realizing). It also has
>> the duplicated library overhead (but at least now the duplication is
>> just twice, not "each process duplicates its own private pointer")
>>
>> In short, there simply isn't any good alternatives. The end result is
>> that thread-safe libraries are always in practice thread-safe even on
>> UP, even though that serializes the CPU altogether unnecessarily.
>>
>> I'm sure you can make up alternatives every time you hit one
>> _particular_ library, but that just doesn't scale in the real world.
>>
>> In contrast, the simple silicon support scales wonderfully well.
>> Suddenly libraries can be thread-safe _and_ efficient on UP too. You
>> get to eat your cake and have it too.
>
>
> I believe that a hardware solution would also accommodate the case
> where a program runs unthreaded for most of the processing, and only
> starts threads to do the final stage "report generation" tasks, where
> that makes sense. I don't believe that it helps in the case where init
> uses threads and then reverts to a single thread for the balance of
> the task. I can't think of anything which does that, so it's probably
> a non-critical corner case, or something the thread library could
> correct.
>
>
In 2-3 years we might actually see the hardware solution, maybe .... I
am skeptical Intel will move quickly on it. A software solution will get
out faster.

Jeff

2005-11-28 23:05:41

by Zachary Amsden

Subject: Re: [patch] SMP alternatives

Jeff V. Merkey wrote:

>
> In 2-3 years we might actually see the hardware solution, maybe ....
> I am skeptical Intel will move quickly on it. A software solution will
> get out faster.



I'm not sure a hardware solution is even the right thing - consider a
shared memory database process with a private heap. You really want
locks on the shared memory, and you really don't on the heap.

You need a way to type the lock semantics by memory region, and a
working hardware solution can not perform as well as a careful software
solution. As was pointed out earlier, you can't use memory type
attributes to infer lock semantics, you must assume them in the decoder
or implement complex deadlock detection and recovery in silicon.

I would be willing to bet that library users know best. Most cleanly
written libraries already have wrapper functions that can be used to
plug in needed libc functions like malloc, even file I/O. Even if they
don't, you can rewrap all of the imported functions. Using this, you
can isolate threaded libraries from single threaded applications, and
make sure the performance critical libraries use non-threaded
operations. You can even afford to use a medium heavy hammer and switch
from non-threaded to threaded dependent libraries every time you call a
thread-using library function, because by assumption, the majority of
performance critical code is going to be running single threaded.
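
As a rough sketch of the wrapper idea, a library could route its locking
through a small hook table and default to no-ops, so single-threaded callers
pay only an indirect call. Everything named here (struct lib_ops,
lib_set_ops(), lib_account(), count_call()) is hypothetical and stands in
for whatever hooks a real library exposes.

```c
#include <stddef.h>

/* Hypothetical hook table a library might expose to its users. */
struct lib_ops {
    void (*lock)(void *ctx);
    void (*unlock)(void *ctx);
    void *ctx;
};

/* Default hook for single-threaded callers: does nothing. */
static void noop(void *ctx) { (void)ctx; }

/* Example instrumented hook: counts how often it is invoked. */
static void count_call(void *ctx) { ++*(int *)ctx; }

static struct lib_ops current_ops = { noop, noop, NULL };

/* The application installs its own hooks, e.g. mutex wrappers. */
static void lib_set_ops(const struct lib_ops *ops) { current_ops = *ops; }

static long lib_total;

/* Library routine that guards its shared state through the hooks. */
static long lib_account(long n)
{
    long result;

    current_ops.lock(current_ops.ctx);
    lib_total += n;
    result = lib_total;
    current_ops.unlock(current_ops.ctx);
    return result;
}
```

A threaded application would call lib_set_ops() once at startup with real
lock and unlock functions; an application that knows it is single threaded
simply leaves the defaults in place.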

Zach

2005-11-28 23:08:48

by H. Peter Anvin

Subject: Re: [patch] SMP alternatives

Zachary Amsden wrote:
>
> You need a way to type the lock semantics by memory region, and a
> working hardware solution can not perform as well as a careful software
> solution. As was pointed out earlier, you can't use memory type
> attributes to infer lock semantics, you must assume them in the decoder
> or implement complex deadlock detection and recovery in silicon.
>

Sure you can. You just have to be prepared to take a microop exception
if you speculate incorrectly.

-hpa

2005-11-28 23:12:52

by Andi Kleen

Subject: Re: [patch] SMP alternatives

> I'm not sure a hardware solution is even the right thing - consider a
> shared memory database process with a private heap. You really want
> locks on the shared memory, and you really don't on the heap.
>
> You need a way to type the lock semantics by memory region, and a
> working hardware solution can not perform as well as a careful software
> solution. As was pointed out earlier, you can't use memory type

The problem is that nobody will change all the software.
Your careful software solution will only benefit a small
minority of performance conscious and well tuned programs.

The hardware solution might not be perfect, but has a good chance
to apply to 90% of the "don't care" programs and help them all a bit.
And every bit counts in the quest for more single thread performance.

And if someone wants to fine tune their programs they can
still change the software as much as they want.

-Andi

2005-11-28 23:30:06

by Zachary Amsden

Subject: Re: [patch] SMP alternatives

H. Peter Anvin wrote:

> Zachary Amsden wrote:
>
>>
>> You need a way to type the lock semantics by memory region, and a
>> working hardware solution can not perform as well as a careful
>> software solution. As was pointed out earlier, you can't use memory
>> type attributes to infer lock semantics, you must assume them in the
>> decoder or implement complex deadlock detection and recovery in silicon.
>>
>
> Sure you can. You just have to be prepared to take a microop
> exception if you speculate incorrectly.


I spoke silicon too heavy handedly. The complexity of the issue
disappears if you take an exception, but rewinding state prior to the
exception and reissuing is going to be less efficient than getting it
right the first time, which is something software can always guarantee.
You need to add more hardware for prediction to get it right all the
time, and it is not clear the cost of that hardware is justified when
software can always do the right thing.

Zach

2005-11-28 23:32:49

by H. Peter Anvin

Subject: Re: [patch] SMP alternatives

Zachary Amsden wrote:
>
> I spoke silicon too heavy handedly. The complexity of the issue
> disappears if you take an exception, but rewinding state prior to the
> exception and reissuing is going to be less efficient than getting it
> right the first time, which is something software can always guarantee.
> You need to add more hardware for prediction to get it right all the
> time, and it is not clear the cost of that hardware is justified when
> software can always do the right thing.
>

Taking exceptions is fine as long as you don't do it too often. I'm
starting to suspect that the only way to do this right all the time is
to have this be part of the page attributes, since it's region-specific.

-hpa

2005-11-29 14:46:06

by Bill Davidsen

Subject: Re: [patch] SMP alternatives

Daniel Jacobowitz wrote:
> On Wed, Nov 23, 2005 at 03:08:59PM -0800, Linus Torvalds wrote:

>>In contrast, the simple silicon support scales wonderfully well. Suddenly
>>libraries can be thread-safe _and_ efficient on UP too. You get to eat
>>your cake and have it too.
>
>
> By buying new hardware and only caring about people using the magic
> architecture. No thanks.

That is the problem, waiting for Intel to do hardware magic, or even to
decide IF they do it. Like assuming that everyone has SMP because a few
percent of the users have dual core chips. The majority of the market
will have SMP someday, but ignoring the current status isn't realistic.
>
> Maybe I'll implement this some weekend.

Love to see it, I'm only semi-convinced it can be done in a way which
actually produces significant benefits.
>

--
-bill davidsen ([email protected])
"The secret to procrastination is to put things off until the
last possible moment - but no longer" -me