2004-01-02 23:51:51

by Martin J. Bligh

Subject: BUG_ON(!cpus_equal(cpumask, tmp));

The bug is on "BUG_ON(!cpus_equal(cpumask, tmp));" in flush_tlb_others
This was from 2.6.1-rc1-mjb1, and seems to be a race on shutdown
(prints after "Power Down"), but I've no reason to believe it's specific
to the patchset I have - it's not an area I'm touching, I think.

I presume we've got a race between taking CPUs offline and the
tlbflush code ... flush_tlb_mm reads the value from mm->cpu_vm_mask,
and then presumably some other cpu changes cpu_online_map before it
gets to calling flush_tlb_others ... does that sound about right?
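
Roughly the window I have in mind (a sketch paraphrasing the i386 path,
not the exact code):

	/* CPU A, e.g. exit_mmap() -> flush_tlb_mm() */
	cpumask_t cpu_mask = mm->cpu_vm_mask;
	cpu_clear(smp_processor_id(), cpu_mask);	/* CPU B still set here */

	/* ... meanwhile CPU B goes offline and is cleared from cpu_online_map ... */

	flush_tlb_others(cpu_mask, mm, FLUSH_ALL);
	/* in flush_tlb_others():
	 *	cpus_and(tmp, cpu_mask, cpu_online_map);
	 *	BUG_ON(!cpus_equal(cpu_mask, tmp));	<- fires, B is still in cpu_mask
	 */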

M.

------------[ cut here ]------------
kernel BUG at smp.c:361!
invalid operand: 0000 [#1]
CPU: 0
EIP: 0060:[<c0115242>] Not tainted VLI
EFLAGS: 00010202
EIP is at flush_tlb_others+0x22/0xd0
eax: 00000000 ebx: c3932020 ecx: eca6bd60 edx: 00000004
esi: 00000000 edi: eca6bd80 ebp: eca6bd60 esp: ee4f3e58
ds: 007b es: 007b ss: 0068
Process halt (pid: 25304, threadinfo=ee4f2000 task=ee765330)
Stack: c0115366 00000004 eca6bd60 ffffffff 00000004 c01461ab eca6bd60 eca6bd60
ee765330 eca6bd80 00000000 c3932020 0000000b c011e1f4 eca6bd60 eca6bd60
c0121e8d eca6bd60 fee1dead ee4f2000 bffffe24 ee4f2000 c012c16a 00000000
fee1dead 08049960 00000092 ee4f3ed8 c011a4b8 f0191980 00000001 00000000
Call Trace:
[<c0115366>] flush_tlb_mm+0x76/0x7c
[<c01461ab>] exit_mmap+0x11f/0x1cc
[<c011e1f4>] mmput+0x50/0x70
[<c0121e8d>] do_exit+0x1b9/0x330
[<c012c16a>] sys_reboot+0x1f2/0x2f8
[<c011a4b8>] wake_up_state+0xc/0x10
[<c0129057>] kill_proc_info+0x37/0x4c
[<c0129150>] kill_something_info+0xe4/0xec
[<c012ae18>] sys_kill+0x54/0x5c
[<c0150a43>] filp_open+0x3b/0x5c
[<c0150e09>] sys_open+0x59/0x74
[<c02609ff>] syscall_call+0x7/0xb

Code: f0 0f b3 0d e4 04 36 c0 c3 8b 4c 24 08 8b 44 24 04 89 c2 85 d2 75 08 0f 0b 66 01 e7 a0 26 c0 89 d0 23 05 08 05 36 c0 39 c2 74 0e <0f> 0b 69 01 e7 a0 26 c0 8d b6 00 00 00 00 b8 00 e0 ff ff 21 e0


2004-03-30 00:19:16

by Andrew Morton

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

"Martin J. Bligh" <[email protected]> wrote:
>
> The bug is on "BUG_ON(!cpus_equal(cpumask, tmp));" in flush_tlb_others
> This was from 2.6.1-rc1-mjb1, and seems to be a race on shutdown
> (prints after "Power Down"), but I've no reason to believe it's specific
> to the patchset I have - it's not an area I'm touching, I think.
>
> I presume we've got a race between taking CPUs offline and the
> tlbflush code ... tlb_flush_mm reads the value from mm->cpu_vm_mask,
> and then presumably some other cpu changes cpu_online_map before it
> gets to calling flush_tlb_others ... does that sound about right?

Looks like it, yes. I don't think there's a sane way of fixing that - we'd
need the flush_tlb_others() caller to hold some lock which keeps the cpu
add/remove code away.

I'd propose removing the assertion?

2004-03-30 00:23:51

by Andrew Morton

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

Andrew Morton <[email protected]> wrote:
>
> "Martin J. Bligh" <[email protected]> wrote:
> >
> > The bug is on "BUG_ON(!cpus_equal(cpumask, tmp));" in flush_tlb_others
> > This was from 2.6.1-rc1-mjb1, and seems to be a race on shutdown
> > (prints after "Power Down"), but I've no reason to believe it's specific
> > to the patchset I have - it's not an area I'm touching, I think.
> >
> > I presume we've got a race between taking CPUs offline and the
> > tlbflush code ... tlb_flush_mm reads the value from mm->cpu_vm_mask,
> > and then presumably some other cpu changes cpu_online_map before it
> > gets to calling flush_tlb_others ... does that sound about right?
>
> Looks like it, yes. I don't think there's a sane way of fixing that - we'd
> need the flush_tlb_others() caller to hold some lock which keeps the cpu
> add/remove code away.
>
> I'd propose removing the assertion?

If the going-away CPU can still take IPIs we're OK, I think. If it cannot,
we have a problem. Can you do a s/BUG_ON/WARN_ON/ and run with that for a
while? Check that the warnings come out and the machine doesn't go crump?
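
That is, something along these lines in arch/i386/kernel/smp.c (untested
sketch):

	cpus_and(tmp, cpumask, cpu_online_map);
	WARN_ON(!cpus_equal(cpumask, tmp));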

2004-03-30 13:29:14

by Hariprasad Nellitheertha

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

Hello Andrew,

We faced this problem starting 2.6.3 while working on kexec.

The problem is because we now initialize cpu_vm_mask for init_mm with
CPU_MASK_ALL (from 2.6.3 onwards) which makes all bits in cpumask 1 (on SMP).
Hence BUG_ON(!cpus_equal(cpumask, tmp)) fails. The change to set
cpu_vm_mask to CPU_MASK_ALL was done to remove tlb flush optimizations
for ppc64.

I had posted a patch for this in the earlier thread. Reposting the same
here. This patch removes the assertion and uses "tmp" instead of cpumask.
Otherwise, we will end up sending IPIs to offline CPUs as well.

Comments please.

Regards, Hari


diff -Naur linux-2.6.3-before/arch/i386/kernel/smp.c linux-2.6.3/arch/i386/kernel/smp.c
--- linux-2.6.3-before/arch/i386/kernel/smp.c 2004-02-18 09:27:15.000000000 +0530
+++ linux-2.6.3/arch/i386/kernel/smp.c 2004-03-04 14:16:43.000000000 +0530
@@ -356,7 +356,8 @@
BUG_ON(cpus_empty(cpumask));

cpus_and(tmp, cpumask, cpu_online_map);
- BUG_ON(!cpus_equal(cpumask, tmp));
+ if(cpus_empty(tmp))
+ return;
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
BUG_ON(!mm);

@@ -371,12 +372,12 @@
flush_mm = mm;
flush_va = va;
#if NR_CPUS <= BITS_PER_LONG
- atomic_set_mask(cpumask, &flush_cpumask);
+ atomic_set_mask(tmp, &flush_cpumask);
#else
{
int k;
unsigned long *flush_mask = (unsigned long *)&flush_cpumask;
- unsigned long *cpu_mask = (unsigned long *)&cpumask;
+ unsigned long *cpu_mask = (unsigned long *)&tmp;
for (k = 0; k < BITS_TO_LONGS(NR_CPUS); ++k)
atomic_set_mask(cpu_mask[k], &flush_mask[k]);
}
@@ -385,7 +386,7 @@
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);

while (!cpus_empty(flush_cpumask))
/* nothing. lockup detection does not belong here */

On Mon, Mar 29, 2004 at 04:25:55PM -0800, Andrew Morton wrote:
> Andrew Morton <[email protected]> wrote:
> >
> > "Martin J. Bligh" <[email protected]> wrote:
> > >
> > > The bug is on "BUG_ON(!cpus_equal(cpumask, tmp));" in flush_tlb_others
> > > This was from 2.6.1-rc1-mjb1, and seems to be a race on shutdown
> > > (prints after "Power Down"), but I've no reason to believe it's specific
> > > to the patchset I have - it's not an area I'm touching, I think.
> > >
> > > I presume we've got a race between taking CPUs offline and the
> > > tlbflush code ... tlb_flush_mm reads the value from mm->cpu_vm_mask,
> > > and then presumably some other cpu changes cpu_online_map before it
> > > gets to calling flush_tlb_others ... does that sound about right?
> >
> > Looks like it, yes. I don't think there's a sane way of fixing that - we'd
> > need the flush_tlb_others() caller to hold some lock which keeps the cpu
> > add/remove code away.
> >
> > I'd propose removing the assertion?
>
> If the going-away CPU can still take IPIs we're OK, I think. If it cannot,
> we have a problem. Can you do a s/BUG_ON/WARN_ON/ and run with that for a
> while? Check that the warnings come out and the machine doesn't go crump?

--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore

2004-03-30 23:20:48

by Randy.Dunlap

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

On Tue, 30 Mar 2004 18:58:32 +0530 Hariprasad Nellitheertha wrote:

| Hello Andrew,
|
| We faced this problem starting 2.6.3 while working on kexec.
|
| The problem is because we now initialize cpu_vm_mask for init_mm with
| CPU_MASK_ALL (from 2.6.3 onwards) which makes all bits in cpumask 1 (on SMP).
| Hence BUG_ON(!cpus_equal(cpumask,tmp) fails. The change to set
| cpu_vm_mask to CPU_MASK_ALL was done to remove tlb flush optimizations
| for ppc64.
|
| I had posted a patch for this in the earlier thread. Reposting the same
| here. This patch removes the assertion and uses "tmp" instead of cpumask.
| Otherwise, we will end up sending IPIs to offline CPUs as well.
|
| Comments please.

I'll just say that kexec fails without this patch and works with
it applied, so I'd like to see it merged. If this patch isn't
acceptable, let's find out why and try to make one that is.

Thanks for the patch, Hari.

--
~Randy
"You can't do anything without having to do something else first."
-- Belefant's Law

2004-03-31 00:23:19

by Martin J. Bligh

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

>| We faced this problem starting 2.6.3 while working on kexec.
>|
>| The problem is because we now initialize cpu_vm_mask for init_mm with
>| CPU_MASK_ALL (from 2.6.3 onwards) which makes all bits in cpumask 1 (on SMP).
>| Hence BUG_ON(!cpus_equal(cpumask,tmp) fails. The change to set
>| cpu_vm_mask to CPU_MASK_ALL was done to remove tlb flush optimizations
>| for ppc64.
>|
>| I had posted a patch for this in the earlier thread. Reposting the same
>| here. This patch removes the assertion and uses "tmp" instead of cpumask.
>| Otherwise, we will end up sending IPIs to offline CPUs as well.
>|
>| Comments please.
>
> I'll just say that kexec fails without this patch and works with
> it applied, so I'd like to see it merged. If this patch isn't
> acceptable, let's find out why and try to make one that is.
>
> Thanks for the patch, Hari.

From discussions with Andy, it seems this still has the same race as before
just smaller. I don't see how we can fix this properly without having some
locking on cpu_online_map .... probably RCU as it's massively read-biased
and we don't want to pay a spinlock cost to read it.
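
Very roughly, the shape I'm thinking of (untested sketch; rcu_read_lock(),
rcu_read_unlock() and synchronize_kernel() are the existing RCU primitives,
everything else is hand-waving):

	/* reader, e.g. flush_tlb_others() */
	rcu_read_lock();
	cpus_and(tmp, cpumask, cpu_online_map);
	/* ... send IPIs to tmp and wait for the acks ... */
	rcu_read_unlock();

	/* writer, the cpu-offline path */
	cpu_clear(cpu, cpu_online_map);
	synchronize_kernel();	/* wait for everyone who saw the old map */
	/* now nobody can still be working from a stale cpu_online_map */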

M.

2004-03-31 00:37:36

by Andrew Morton

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

"Martin J. Bligh" <[email protected]> wrote:
>
> > I'll just say that kexec fails without this patch and works with
> > it applied, so I'd like to see it merged. If this patch isn't
> > acceptable, let's find out why and try to make one that is.
> >
> > Thanks for the patch, Hari.
>
> From discussions with Andy, it seems this still has the same race as before
> just smaller. I don't see how we can fix this properly without having some
> locking on cpu_online_map .... probably RCU as it's massively read-biased
> and we don't want to pay a spinlock cost to read it.

We do want to avoid adding stuff to the IPI path. If the going-away CPU
still responds to IPIs after it has gone away then do we actually need to
do anything? For x86, at least?

2004-03-31 00:59:29

by Martin J. Bligh

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

>> > I'll just say that kexec fails without this patch and works with
>> > it applied, so I'd like to see it merged. If this patch isn't
>> > acceptable, let's find out why and try to make one that is.
>> >
>> > Thanks for the patch, Hari.
>>
>> > From discussions with Andy, it seems this still has the same race as before
>> just smaller. I don't see how we can fix this properly without having some
>> locking on cpu_online_map .... probably RCU as it's massively read-biased
>> and we don't want to pay a spinlock cost to read it.
>
> We do want to avoid adding stuff to the IPI path.

OK, but my thinking was that the readlock of RCU is free (pretty much).

> If the going-away CPU still responds to IPIs after it has gone away
> then do we actually need to do anything? For x86, at least?

Eeek, you *really* don't want it responding to IPIs after it's been shut
down. It's dead, dead, dead, and we don't want it rolling over in its
gasping throes, and stabbing us in the back. Supposing we've kexeced or
otherwise rebooted the machine? If we've shut down the other cpus, we
should be able to reprogram the IDT, drop back out of paging mode to
reboot into the BIOS, or basically anything. Nothing short of an NMI or
a power cycle (aka Startup / INIT sequence) should arouse it.

And if we don't take IPIs, I really think we're playing Russian roulette
here ... I was under the *impression* that if you sent an IPI to a cpu
that was down and it didn't ack it, we'd hang. Forever. At least on a
P3-style APIC bus, when we're sending individual interrupts to each CPU,
not broadcast.

I made a similar patch, but I don't see how we can really fix it without
providing locking on cpu_online_map. Or else you have to do something
gross like mark the CPU offline, spin for 5 seconds, then actually take it down.
Ick. Maybe we could be smart with a 2-stage commit with 2 bitmaps or
something, but I think that basically evolves into RCU anyway, and just
reimplements it ;-(

Shared global data needs some form of locking, even if "oh we don't
write to that very often, I think we'll get away with it" ;-)

M.

2004-03-31 01:02:37

by Andy Whitcroft

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

--On 30 March 2004 16:22 -0800 "Martin J. Bligh" <[email protected]> wrote:

>>| We faced this problem starting 2.6.3 while working on kexec.
>>|
>>| The problem is because we now initialize cpu_vm_mask for init_mm with
>>| CPU_MASK_ALL (from 2.6.3 onwards) which makes all bits in cpumask 1 (on
>>| SMP). Hence BUG_ON(!cpus_equal(cpumask,tmp) fails. The change to set
>>| cpu_vm_mask to CPU_MASK_ALL was done to remove tlb flush optimizations
>>| for ppc64.
>>|
>>| I had posted a patch for this in the earlier thread. Reposting the same
>>| here. This patch removes the assertion and uses "tmp" instead of
>>| cpumask. Otherwise, we will end up sending IPIs to offline CPUs as
>>| well.
>>|
>>| Comments please.
>>
>> I'll just say that kexec fails without this patch and works with
>> it applied, so I'd like to see it merged. If this patch isn't
>> acceptable, let's find out why and try to make one that is.
>>
>> Thanks for the patch, Hari.
>
>> From discussions with Andy, it seems this still has the same race as
>> before
> just smaller. I don't see how we can fix this properly without having some
> locking on cpu_online_map .... probably RCU as it's massively read-biased
> and we don't want to pay a spinlock cost to read it.

Realistically the patch does two things. It removes the BUG_ON which was
causing issues, and attempts to cover for the case where the BUG_ON would
have triggered. However, as there is no locking on cpu_online_map, there
is nothing to prevent further cpus from leaving before we send the IPIs.
If the CPUs are gone, what would stop us from hanging here waiting for them
to acknowledge the invalidate? It would seem we want the cpu shutdown code
to continue to handle IPIs until we are sure that all other cpus have seen
us missing from cpu_online_map, i.e. that no other cpu is using a stale
cpu_online_map for IPIs.

Poking around in the IPI code, it does appear that all users of the IPI
interface are locked; for i386 that is either under tlbstate_lock or call_lock.
If we add a restriction that you must sample cpu_online_map within the
critical section, then the cpu which is leaving merely has to ensure that
there are no callers left in those critical sections, and then there are no
stale copies of cpu_online_map to worry about.

In this case we know smp_call_function is safe, since we are using it to
trigger the actions. Therefore I think we only need to ensure we are not in
the middle of a tlb invalidate. If we start the shutdown by removing
ourselves from cpu_online_map, then acquire and release tlbstate_lock, we
can be sure that everyone has seen us go. Then we can shut down and halt.

Something like this:

static void stop_this_cpu (void * dummy)
{
	/*
	 * Remove this CPU:
	 */
	cpu_clear(smp_processor_id(), cpu_online_map);
	spin_lock(&tlbstate_lock);
	spin_unlock(&tlbstate_lock);
	local_irq_disable();
	disable_local_APIC();
	if (cpu_data[smp_processor_id()].hlt_works_ok)
		for(;;) __asm__("hlt");
	for (;;);
}

No additional code in the IPI path? What have I missed?

-apw

2004-03-31 01:09:03

by Andrew Morton

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

"Martin J. Bligh" <[email protected]> wrote:
>
> I made a similar patch, but I don't see how we can really fix it without
> providing locking on cpu_online_map.

Are we missing something here?

Why does, for example, smp_send_reschedule() not have the same problem?
Because we've gone around and correctly removed all references to the CPU
from the scheduler data structures before offlining it.

But we're not doing that in the mm code, right? Should we not be taking
mmlist_lock and running around knocking this CPU out of everyone's
cpu_vm_mask before offlining it?
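
Something like this, perhaps (untested sketch; clear_cpu_from_vm_masks() is
a made-up name, and it assumes init_mm.mmlist still links together the
mm_structs we care about):

	/* hypothetical helper for the cpu-offline path */
	static void clear_cpu_from_vm_masks(int cpu)
	{
		struct mm_struct *mm;

		spin_lock(&mmlist_lock);
		list_for_each_entry(mm, &init_mm.mmlist, mmlist)
			cpu_clear(cpu, mm->cpu_vm_mask);
		spin_unlock(&mmlist_lock);
	}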

2004-03-31 01:24:43

by Martin J. Bligh

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

--On Tuesday, March 30, 2004 17:11:04 -0800 Andrew Morton <[email protected]> wrote:

> "Martin J. Bligh" <[email protected]> wrote:
>>
>> I made a similar patch, but I don't see how we can really fix it without
>> providing locking on cpu_online_map.
>
> Are we missing something here?
>
> Why does, for example, smp_send_reschedule() not have the same problem?
> Because we've gone around and correctly removed all references to the CPU
> from the scheduler data structures before offlining it.
>
> But we're not doing that in the mm code, right? Should we not be taking
> mmlist_lock and running around knocking this CPU out of everyone's
> cpu_vm_mask before offlining it?

I think we're assuming that we don't have to because the problem is fixed
by the "cpus_and(tmp, cpumask, cpu_online_map)" in flush_tlb_others so we
don't have to. Except it's racy, and doesn't work.

It would seem to me that your suggestion would fix it. But isn't locking
cpu_online_map both simpler and (most importantly) more generic? I can't
imagine that we don't use this elsewhere ... suppose for instance we took
a timer interrupt, causing a scheduler rebalance, and moved a process to
an offline CPU at that point? Isn't any user of smp_call_function also racy?

Not locking it just seems fundamentally dangerous to me ... maybe I'm
wrong though.

M.

2004-03-31 01:36:38

by Andrew Morton

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

"Martin J. Bligh" <[email protected]> wrote:
>
> --On Tuesday, March 30, 2004 17:11:04 -0800 Andrew Morton <[email protected]> wrote:
>
> > "Martin J. Bligh" <[email protected]> wrote:
> >>
> >> I made a similar patch, but I don't see how we can really fix it without
> >> providing locking on cpu_online_map.
> >
> > Are we missing something here?
> >
> > Why does, for example, smp_send_reschedule() not have the same problem?
> > Because we've gone around and correctly removed all references to the CPU
> > from the scheduler data structures before offlining it.
> >
> > But we're not doing that in the mm code, right? Should we not be taking
> > mmlist_lock and running around knocking this CPU out of everyone's
> > cpu_vm_mask before offlining it?
>
> I think we're assuming that we don't have to because the problem is fixed
> by the "cpus_and(tmp, cpumask, cpu_online_map)" in flush_tlb_others so we
> don't have to. Except it's racy, and doesn't work.

And it's a kludge, to work around dangling references to a CPU which has
gone away.

> It would seem to me that your suggestion would fix it. But isn't locking
> cpu_online_map both simpler and (most importantly) more generic? I can't
> imagine that we don't use this elsewhere ... suppose for instance we took
> a timer interrupt, causing a scheduler rebalance, and moved a process to
> an offline CPU at that point? Isn't any user of smp_call_function also racy?

If we have to add any fastpath locking to cope with CPU removal or reboot
then it's time to make CONFIG_HOTPLUG_CPU dependent upon CONFIG_BROKEN.

yes, cpu_online_map should be viewed as a reference to the going-away CPU
for smp_call_function purposes. However the CPU takedown code appears to
do the right thing: it removes the cpu from cpu_online_map first, then does
the stop_machine() thing which should ensure that all other CPUs have
completed any cross-CPU call which they were doing, yes?


2004-03-31 01:51:52

by Martin J. Bligh

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

>> I think we're assuming that we don't have to because the problem is fixed
>> by the "cpus_and(tmp, cpumask, cpu_online_map)" in flush_tlb_others so we
>> don't have to. Except it's racy, and doesn't work.
>
> And it's a kludge, to work around dangling references to a CPU which has
> gone away.

Yes ;-)

>> It would seem to me that your suggestion would fix it. But isn't locking
>> cpu_online_map both simpler and (most importantly) more generic? I can't
>> imagine that we don't use this elsewhere ... suppose for instance we took
>> a timer interrupt, causing a scheduler rebalance, and moved a process to
>> an offline CPU at that point? Isn't any user of smp_call_function also racy?
>
> If we have to add any fastpath locking to cope with CPU removal or reboot
> then it's time to make CONFIG_HOTPLUG_CPU dependent upon CONFIG_BROKEN.

Yeah, but as we've proved, it's not just hotplug, it's shutdown. And I don't
think we can make that depend on CONFIG_BROKEN ;-) I don't see a *read*
side RCU lock as an imposition on the fastpath (for reading cpu_online_map),
and I don't care if writing to cpu_online_map is slower. A spinlock would
be crappy, yes ...

> yes, cpu_online_map should be viewed as a reference to the going-away CPU
> for smp_call_function purposes. However the CPU takedown code appears to
> do the right thing: it removes the cpu from cpu_online_map first, then does
> the stop_machine() thing which should ensure that all other CPUs have
> completed any cross-CPU call which they were doing, yes?

Andy almost managed to convince me that the smp_call_function stuff is safe,
based on call_lock exclusion. Except that we read the online cpu count outside
it ... but that's trivial to fix, we just move it inside the lock (patch
below - untested, but trivial).

He also pointed out that we could fairly easily fix the tlb stuff by
taking the tlb lock before taking a cpu offline. Which still doesn't
make me desperately comfortable ... but then he's smarter than me ;-)
To me it comes down to ... do we want to lock the damned thing, or fix
all the callers to be really, really careful?

diff -purN -X /home/mbligh/.diff.exclude virgin/arch/i386/kernel/smp.c smp_call_function/arch/i386/kernel/smp.c
--- virgin/arch/i386/kernel/smp.c 2004-03-11 14:33:36.000000000 -0800
+++ smp_call_function/arch/i386/kernel/smp.c 2004-03-30 17:43:34.000000000 -0800
@@ -514,10 +514,7 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
+ int cpus;

data.func = func;
data.info = info;
@@ -527,6 +524,10 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ return 0;
+
call_data = &data;
mb();


2004-03-31 04:43:40

by Hariprasad Nellitheertha

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

There are actually two different problems here, and the BUG_ON is
hit in both cases.

1) In INIT_MM(), we now do this

.cpu_vm_mask = CPU_MASK_ALL,

With this, if we enter flush_tlb_others with init_mm, all bits except the one
corresponding to the current cpu are set. For example, on my uniprocessor (UP) machine
running an SMP kernel with NR_CPUS=8, cpumask is 0xfe. The BUG_ON is hit
even if we are doing nothing related to shutdown or cpu-offlining (like
when we are just loading a new kernel into memory using kexec).

2) The issue of a race between taking CPUs offline and the tlbflush code. The
discussions have been centered around this issue.

The patch I sent across, though, is completely targeted towards issue 1.
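
To make issue 1 concrete, with the numbers from the UP example above
(NR_CPUS=8, only cpu 0 present and online, flushing init_mm from cpu 0):

	/*
	 * init_mm.cpu_vm_mask = CPU_MASK_ALL               -> 0xff
	 * cpumask = cpu_vm_mask with our own bit cleared   -> 0xfe
	 * cpu_online_map                                    -> 0x01
	 * tmp = cpumask & cpu_online_map                    -> 0x00
	 *
	 * cpus_equal(cpumask, tmp) is false, so the BUG_ON fires even
	 * though no cpu is being taken offline at all.
	 */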

Regards, Hari

On Tue, Mar 30, 2004 at 05:51:13PM -0800, Martin J. Bligh wrote:
> >> I think we're assuming that we don't have to because the problem is fixed
> >> by the "cpus_and(tmp, cpumask, cpu_online_map)" in flush_tlb_others so we
> >> don't have to. Except it's racy, and doesn't work.
> >
> > And it's a kludge, to work around dangling references to a CPU which has
> > gone away.
>
> Yes ;-)
>
> >> It would seem to me that your suggestion would fix it. But isn't locking
> >> cpu_online_map both simpler and (most importantly) more generic? I can't
> >> imagine that we don't use this elsewhere ... suppose for instance we took
> >> a timer interrupt, causing a scheduler rebalance, and moved a process to
> >> an offline CPU at that point? Isn't any user of smp_call_function also racy?
> >
> > If we have to add any fastpath locking to cope with CPU removal or reboot
> > then it's time to make CONFIG_HOTPLUG_CPU dependent upon CONFIG_BROKEN.
>
> Yeah, but as we've proved, it's not just hotplug, it's shutdown. And I don't
> think we can make that depend on CONFIG_BROKEN ;-) I don't see a *read*
> side RCU lock as an imposition on the fastpath (for reading cpu_online_map),
> and I don't care if writing to cpu_online_map is slower. A spinlock would
> be crappy, yes ...
>
> > yes, cpu_online_map should be viewed as a reference to the going-away CPU
> > for smp_call_function purposes. However the CPU takedown code appears to
> > do the right thing: it removes the cpu from cpu_online_map first, then does
> > the stop_machine() thing which should ensure that all other CPUs have
> > completed any cross-CPU call which they were doing, yes?
>
> Andy almost managed to convince me that the smp_call_function stuff is safe,
> based on call_lock exclusion. Except that we count that cpu stuff outside
> it ... but that's trivial to fix, we just move it inside the lock (patch
> below - untested, but trivial).
>
> He also pointed out that we could fairly easily fix the tlb stuff by
> taking the tlb lock before taking a cpu offline. Which still doesn't
> make me desperately comfortable ... but then he's smarter than me ;-)
> To me it comes down to ... do we want to lock the damned thing, or fix
> all the callers to be really, really careful?
>
> diff -purN -X /home/mbligh/.diff.exclude virgin/arch/i386/kernel/smp.c smp_call_function/arch/i386/kernel/smp.c
> --- virgin/arch/i386/kernel/smp.c 2004-03-11 14:33:36.000000000 -0800
> +++ smp_call_function/arch/i386/kernel/smp.c 2004-03-30 17:43:34.000000000 -0800
> @@ -514,10 +514,7 @@ int smp_call_function (void (*func) (voi
> */
> {
> struct call_data_struct data;
> - int cpus = num_online_cpus()-1;
> -
> - if (!cpus)
> - return 0;
> + int cpus;
>
> data.func = func;
> data.info = info;
> @@ -527,6 +524,10 @@ int smp_call_function (void (*func) (voi
> atomic_set(&data.finished, 0);
>
> spin_lock(&call_lock);
> + cpus = num_online_cpus()-1;
> + if (!cpus)
> + return 0;
> +
> call_data = &data;
> mb();
>
>

--
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore

2004-04-01 00:34:32

by Andy Whitcroft

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

I think we all agree that the reason that we need Hari's change
is that the underlying cpu shutdown code is not clearing out all
references to the cpu from the VM system; and that in an ideal
world we should fix that. But this is v2.6 and we need something
which fixes the issue. Hari's patch is a very good starting point,
as it closes the window to almost nothing. But there is still a
small window in which we can end up hung during shutdown if the
timing is wrong as cpu_online_map can change whilst we are in
the process of sending an IPI. I think we also agree that we
don't want to add any additional active locking to VM hot path
for the remote tlb invalidate if at all possible.

In a previous email I outlined a possible solution in which we
change shutdown to guarantee that if you are in the middle of
sending an IPI all cpus online at the start of that IPI will
respond, even if they are in the process of going offline.

Below is a patch to 2.6.5-rc3 which implements this idea, subtly
modifying Hari's solution and incorporating Martin's changes.
I have added copious comments as we are essentially open-coding
RCU semantics.

I have only tested this on i386 and only shutdown (!) tested it.
If there is anyone who can reliably reproduce this issue could
try this patch I would appreciate it.

Review/Comments always appreciated.

-apw

---
smp.c | 47 +++++++++++++++++++++++++++++++++++++----------
1 files changed, 37 insertions(+), 10 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/kernel/smp.c current/arch/i386/kernel/smp.c
--- reference/arch/i386/kernel/smp.c 2004-03-11 20:47:12.000000000 +0000
+++ current/arch/i386/kernel/smp.c 2004-04-01 02:01:11.000000000 +0100
@@ -355,8 +355,6 @@ static void flush_tlb_others(cpumask_t c
*/
BUG_ON(cpus_empty(cpumask));

- cpus_and(tmp, cpumask, cpu_online_map);
- BUG_ON(!cpus_equal(cpumask, tmp));
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
BUG_ON(!mm);

@@ -367,16 +365,24 @@ static void flush_tlb_others(cpumask_t c
* detected by the NMI watchdog.
*/
spin_lock(&tlbstate_lock);
+
+ /* Subtle, mask the request mask with the currently online cpu's.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus_and(tmp, cpumask, cpu_online_map);
+ if(cpus_empty(tmp))
+ return;

flush_mm = mm;
flush_va = va;
#if NR_CPUS <= BITS_PER_LONG
- atomic_set_mask(cpumask, &flush_cpumask);
+ atomic_set_mask(tmp, &flush_cpumask);
#else
{
int k;
unsigned long *flush_mask = (unsigned long *)&flush_cpumask;
- unsigned long *cpu_mask = (unsigned long *)&cpumask;
+ unsigned long *cpu_mask = (unsigned long *)&tmp;
for (k = 0; k < BITS_TO_LONGS(NR_CPUS); ++k)
atomic_set_mask(cpu_mask[k], &flush_mask[k]);
}
@@ -385,7 +391,7 @@ static void flush_tlb_others(cpumask_t c
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);

while (!cpus_empty(flush_cpumask))
/* nothing. lockup detection does not belong here */
@@ -514,11 +520,8 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
-
+ int cpus;
+
data.func = func;
data.info = info;
atomic_set(&data.started, 0);
@@ -527,6 +530,15 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+
+ /* Subtle, get the current number of online cpus.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ goto out_unlock;
+
call_data = &data;
mb();

@@ -540,6 +552,7 @@ int smp_call_function (void (*func) (voi
if (wait)
while (atomic_read(&data.finished) != cpus)
barrier();
+out_unlock:
spin_unlock(&call_lock);

return 0;
@@ -551,6 +564,20 @@ static void stop_this_cpu (void * dummy)
* Remove this CPU:
*/
cpu_clear(smp_processor_id(), cpu_online_map);
+
+ /* Subtle, IPI users assume that they will be able to get IPI's
+ * though to the cpus listed in cpu_online_map. To ensure this
+ * we add the requirement that they check cpu_online_map within
+ * the IPI critical sections. Here we remove ourselves from the
+ * map, then ensure that all other cpus have left the relevant
+ * critical sections since the change. We do this by aquiring
+ * the relevant section locks, if we have them none else is in
+ * them. Once this is done we can go offline. */
+ spin_lock(&tlbstate_lock);
+ spin_unlock(&tlbstate_lock);
+ spin_lock(&tlbstate_lock);
+ spin_unlock(&tlbstate_lock);
+
local_irq_disable();
disable_local_APIC();
if (cpu_data[smp_processor_id()].hlt_works_ok)


2004-04-01 05:03:08

by Srivatsa Vaddagiri

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

On Thu, Apr 01, 2004 at 01:31:15AM +0000, Andy Whitcroft wrote:
> spin_lock(&tlbstate_lock);
> +
> + /* Subtle, mask the request mask with the currently online cpu's.
> + * Sample this under the lock; cpus in the the middle of going
> + * offline will wait until there is noone in this critical section
> + * before disabling IPI handling. */
> + cpus_and(tmp, cpumask, cpu_online_map);
> + if(cpus_empty(tmp))
> + return;

Hmm ..Doesn't it need to drop tlbstate_lock before returning?

> + /* Subtle, IPI users assume that they will be able to get IPI's
> + * though to the cpus listed in cpu_online_map. To ensure this
> + * we add the requirement that they check cpu_online_map within
> + * the IPI critical sections. Here we remove ourselves from the
> + * map, then ensure that all other cpus have left the relevant
> + * critical sections since the change. We do this by aquiring
> + * the relevant section locks, if we have them none else is in
> + * them. Once this is done we can go offline. */
> + spin_lock(&tlbstate_lock);
> + spin_unlock(&tlbstate_lock);
> + spin_lock(&tlbstate_lock);
> + spin_unlock(&tlbstate_lock);

The second lock should be call_lock?


--


Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017

2004-04-01 08:47:37

by Paul Jackson

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

Hari,

I see code similar to what you are changing, also in the file:

arch/x86_64/kernel/smp.c

Do your considerations apply here as well?

===

static void flush_tlb_others(cpumask_t cpumask, struct mm_struct *mm,
unsigned long va)
{
cpumask_t tmp;
/* ... */
BUG_ON(cpus_empty(cpumask));
cpus_and(tmp, cpumask, cpu_online_map);
BUG_ON(!cpus_equal(tmp, cpumask));
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
if (!mm)
BUG();
/* ... */

--
I won't rest till it's the best ...
Programmer, Linux Scalability
Paul Jackson <[email protected]> 1.650.933.1373

2004-04-01 11:37:47

by Andy Whitcroft

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

--On 01 April 2004 10:34 +0530 Srivatsa Vaddagiri <[email protected]> wrote:

> Hmm ..Doesn't it need to drop tlbstate_lock before returning?
> The second lock should be call_lock?

Yes and yes. I don't know how Andrew copes with 300-odd patches;
I don't seem to be able to keep track of the versions on 3 of them.
Seems I sent out an old version. Doh. Explicit version numbers
from now on.

Below is a tested version of the patch. If anyone can reproduce the
issue I would be interested in knowing if this passes a reboot on
that system.

Apologies for the confusion. And thanks for reviewing!

-apw

---
smp.c | 48 ++++++++++++++++++++++++++++++++++++++----------
1 files changed, 38 insertions(+), 10 deletions(-)

diff -upN reference/arch/i386/kernel/smp.c current/arch/i386/kernel/smp.c
--- reference/arch/i386/kernel/smp.c 2004-03-11 20:47:12.000000000 +0000
+++ current/arch/i386/kernel/smp.c 2004-04-01 12:35:40.000000000 +0100
@@ -355,8 +355,6 @@ static void flush_tlb_others(cpumask_t c
*/
BUG_ON(cpus_empty(cpumask));

- cpus_and(tmp, cpumask, cpu_online_map);
- BUG_ON(!cpus_equal(cpumask, tmp));
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
BUG_ON(!mm);

@@ -367,16 +365,24 @@ static void flush_tlb_others(cpumask_t c
* detected by the NMI watchdog.
*/
spin_lock(&tlbstate_lock);
+
+ /* Subtle, mask the request mask with the currently online cpu's.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus_and(tmp, cpumask, cpu_online_map);
+ if(cpus_empty(tmp))
+ goto out_unlock;

flush_mm = mm;
flush_va = va;
#if NR_CPUS <= BITS_PER_LONG
- atomic_set_mask(cpumask, &flush_cpumask);
+ atomic_set_mask(tmp, &flush_cpumask);
#else
{
int k;
unsigned long *flush_mask = (unsigned long *)&flush_cpumask;
- unsigned long *cpu_mask = (unsigned long *)&cpumask;
+ unsigned long *cpu_mask = (unsigned long *)&tmp;
for (k = 0; k < BITS_TO_LONGS(NR_CPUS); ++k)
atomic_set_mask(cpu_mask[k], &flush_mask[k]);
}
@@ -385,7 +391,7 @@ static void flush_tlb_others(cpumask_t c
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);

while (!cpus_empty(flush_cpumask))
/* nothing. lockup detection does not belong here */
@@ -393,6 +399,7 @@ static void flush_tlb_others(cpumask_t c

flush_mm = NULL;
flush_va = 0;
+out_unlock:
spin_unlock(&tlbstate_lock);
}

@@ -514,11 +521,8 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
-
+ int cpus;
+
data.func = func;
data.info = info;
atomic_set(&data.started, 0);
@@ -527,6 +531,15 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+
+ /* Subtle, get the current number of online cpus.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ goto out_unlock;
+
call_data = &data;
mb();

@@ -540,6 +553,7 @@ int smp_call_function (void (*func) (voi
if (wait)
while (atomic_read(&data.finished) != cpus)
barrier();
+out_unlock:
spin_unlock(&call_lock);

return 0;
@@ -551,6 +565,20 @@ static void stop_this_cpu (void * dummy)
* Remove this CPU:
*/
cpu_clear(smp_processor_id(), cpu_online_map);
+
+ /* Subtle, IPI users assume that they will be able to get IPI's
+ * though to the cpus listed in cpu_online_map. To ensure this
+ * we add the requirement that they check cpu_online_map within
+ * the IPI critical sections. Here we remove ourselves from the
+ * map, then ensure that all other cpus have left the relevant
+ * critical sections since the change. We do this by aquiring
+ * the relevant section locks, if we have them none else is in
+ * them. Once this is done we can go offline. */
+ spin_lock(&tlbstate_lock);
+ spin_unlock(&tlbstate_lock);
+ spin_lock(&call_lock);
+ spin_unlock(&call_lock);
+
local_irq_disable();
disable_local_APIC();
if (cpu_data[smp_processor_id()].hlt_works_ok)

2004-04-01 13:58:59

by Hariprasad Nellitheertha

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

Hi Paul,

On Thu, Apr 01, 2004 at 12:42:51AM -0800, Paul Jackson wrote:
> Hari,
>
> I see code similar to what you are changing, also in the file:
>
> arch/x86_64/kernel/smp.c
>
> Do your considerations apply here as well?

I think they do, although I can't recreate the problem to verify (no kexec
on x86_64).

The attached patch is just an extension of Andy's patch to cover x86_64.

Regards, Hari
-
Hariprasad Nellitheertha
Linux Technology Center
India Software Labs
IBM India, Bangalore


diff -Naurp before/arch/i386/kernel/smp.c after/arch/i386/kernel/smp.c
--- before/arch/i386/kernel/smp.c 2004-03-30 08:55:32.000000000 +0530
+++ after/arch/i386/kernel/smp.c 2004-04-01 18:26:22.000000000 +0530
@@ -355,8 +355,6 @@ static void flush_tlb_others(cpumask_t c
*/
BUG_ON(cpus_empty(cpumask));

- cpus_and(tmp, cpumask, cpu_online_map);
- BUG_ON(!cpus_equal(cpumask, tmp));
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
BUG_ON(!mm);

@@ -367,16 +365,24 @@ static void flush_tlb_others(cpumask_t c
* detected by the NMI watchdog.
*/
spin_lock(&tlbstate_lock);
+
+ /* Subtle, mask the request mask with the currently online cpu's.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus_and(tmp, cpumask, cpu_online_map);
+ if(cpus_empty(tmp))
+ goto out_unlock;

flush_mm = mm;
flush_va = va;
#if NR_CPUS <= BITS_PER_LONG
- atomic_set_mask(cpumask, &flush_cpumask);
+ atomic_set_mask(tmp, &flush_cpumask);
#else
{
int k;
unsigned long *flush_mask = (unsigned long *)&flush_cpumask;
- unsigned long *cpu_mask = (unsigned long *)&cpumask;
+ unsigned long *cpu_mask = (unsigned long *)&tmp;
for (k = 0; k < BITS_TO_LONGS(NR_CPUS); ++k)
atomic_set_mask(cpu_mask[k], &flush_mask[k]);
}
@@ -385,7 +391,7 @@ static void flush_tlb_others(cpumask_t c
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);

while (!cpus_empty(flush_cpumask))
/* nothing. lockup detection does not belong here */
@@ -393,6 +399,7 @@ static void flush_tlb_others(cpumask_t c

flush_mm = NULL;
flush_va = 0;
+out_unlock:
spin_unlock(&tlbstate_lock);
}

@@ -514,11 +521,8 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
-
+ int cpus;
+
data.func = func;
data.info = info;
atomic_set(&data.started, 0);
@@ -527,6 +531,15 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+
+ /* Subtle, get the current number of online cpus.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ goto out_unlock;
+
call_data = &data;
mb();

@@ -540,6 +553,7 @@ int smp_call_function (void (*func) (voi
if (wait)
while (atomic_read(&data.finished) != cpus)
barrier();
+out_unlock:
spin_unlock(&call_lock);

return 0;
@@ -551,6 +565,20 @@ static void stop_this_cpu (void * dummy)
* Remove this CPU:
*/
cpu_clear(smp_processor_id(), cpu_online_map);
+
+ /* Subtle, IPI users assume that they will be able to get IPI's
+ * though to the cpus listed in cpu_online_map. To ensure this
+ * we add the requirement that they check cpu_online_map within
+ * the IPI critical sections. Here we remove ourselves from the
+ * map, then ensure that all other cpus have left the relevant
+ * critical sections since the change. We do this by aquiring
+ * the relevant section locks, if we have them none else is in
+ * them. Once this is done we can go offline. */
+ spin_lock(&tlbstate_lock);
+ spin_unlock(&tlbstate_lock);
+ spin_lock(&call_lock);
+ spin_unlock(&call_lock);
+
local_irq_disable();
disable_local_APIC();
if (cpu_data[smp_processor_id()].hlt_works_ok)
diff -Naurp before/arch/x86_64/kernel/smp.c after/arch/x86_64/kernel/smp.c
--- before/arch/x86_64/kernel/smp.c 2004-03-30 08:55:37.000000000 +0530
+++ after/arch/x86_64/kernel/smp.c 2004-04-01 18:36:56.000000000 +0530
@@ -243,8 +243,6 @@ static void flush_tlb_others(cpumask_t c
* - mask must exist :)
*/
BUG_ON(cpus_empty(cpumask));
- cpus_and(tmp, cpumask, cpu_online_map);
- BUG_ON(!cpus_equal(tmp, cpumask));
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
if (!mm)
BUG();
@@ -256,22 +254,31 @@ static void flush_tlb_others(cpumask_t c
* detected by the NMI watchdog.
*/
spin_lock(&tlbstate_lock);
+
+ /* Subtle, mask the request mask with the currently online cpu's.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus_and(tmp, cpumask, cpu_online_map);
+ if(cpus_empty(tmp))
+ goto out_unlock;

flush_mm = mm;
flush_va = va;
- cpus_or(flush_cpumask, cpumask, flush_cpumask);
+ cpus_or(flush_cpumask, tmp, flush_cpumask);

/*
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);

while (!cpus_empty(flush_cpumask))
mb(); /* nothing. lockup detection does not belong here */;

flush_mm = NULL;
flush_va = 0;
+out_unlock:
spin_unlock(&tlbstate_lock);
}

@@ -399,10 +406,7 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
+ int cpus;

data.func = func;
data.info = info;
@@ -412,6 +416,16 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+
+ /* Subtle, get the current number of online cpus.
+ * Sample this under the lock; cpus in the the middle of going
+ * offline will wait until there is noone in this critical section
+ * before disabling IPI handling. */
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ goto out_unlock;
+
+
call_data = &data;
wmb();
/* Send a message to all other CPUs and wait for them to respond */
@@ -424,6 +438,7 @@ int smp_call_function (void (*func) (voi
if (wait)
while (atomic_read(&data.finished) != cpus)
barrier();
+out_unlock:
spin_unlock(&call_lock);

return 0;
@@ -435,6 +450,20 @@ void smp_stop_cpu(void)
* Remove this CPU:
*/
cpu_clear(smp_processor_id(), cpu_online_map);
+
+ /* Subtle, IPI users assume that they will be able to get IPI's
+ * though to the cpus listed in cpu_online_map. To ensure this
+ * we add the requirement that they check cpu_online_map within
+ * the IPI critical sections. Here we remove ourselves from the
+ * map, then ensure that all other cpus have left the relevant
+ * critical sections since the change. We do this by aquiring
+ * the relevant section locks, if we have them none else is in
+ * them. Once this is done we can go offline. */
+ spin_lock(&tlbstate_lock);
+ spin_unlock(&tlbstate_lock);
+ spin_lock(&call_lock);
+ spin_unlock(&call_lock);
+
local_irq_disable();
disable_local_APIC();
local_irq_enable();

2004-04-02 18:37:20

by Randy.Dunlap

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

On Thu, 01 Apr 2004 12:38:40 +0100 Andy Whitcroft wrote:

| --On 01 April 2004 10:34 +0530 Srivatsa Vaddagiri <[email protected]> wrote:
|
| > Hmm ..Doesn't it need to drop tlbstate_lock before returning?
| > The second lock should be call_lock?
|
| Yes and Yes. I don't know how Andrew copes with 300 odd patches.
| I don't seem to be able to keep track of the versions on 3 of them?
| Seems I sent out an old version. Doh. Explicit version numbers
| from now on.
|
| Below is tested version of the patch. If anyone can reproduce the
| issue I would be interested in knowing if this passes a reboot on
| that system.
|
| Apologies for the confusion. And thanks for reviewing!


This version works well, thank you. Without it I still see the
BUG_ON() in smp.c (line 359).


I noted a few comment corrections and style changes below.
Want a patch for them instead?


| @@ -367,16 +365,24 @@ static void flush_tlb_others(cpumask_t c
| * detected by the NMI watchdog.
| */
| spin_lock(&tlbstate_lock);
| +
| + /* Subtle, mask the request mask with the currently online cpu's.
| + * Sample this under the lock; cpus in the the middle of going
x.x
| + * offline will wait until there is noone in this critical section
| + * before disabling IPI handling. */
| + cpus_and(tmp, cpumask, cpu_online_map);
| + if(cpus_empty(tmp))
if (cpus_empty(tmp))
| + goto out_unlock;


| @@ -527,6 +531,15 @@ int smp_call_function (void (*func) (voi
| atomic_set(&data.finished, 0);
|
| spin_lock(&call_lock);
| +
| + /* Subtle, get the current number of online cpus.
| + * Sample this under the lock; cpus in the the middle of going
x.x
| + * offline will wait until there is noone in this critical section
| + * before disabling IPI handling. */


| @@ -551,6 +565,20 @@ static void stop_this_cpu (void * dummy)
| * Remove this CPU:
| */
| cpu_clear(smp_processor_id(), cpu_online_map);
| +
| + /* Subtle, IPI users assume that they will be able to get IPI's
| + * though to the cpus listed in cpu_online_map. To ensure this
through
| + * we add the requirement that they check cpu_online_map within
| + * the IPI critical sections. Here we remove ourselves from the
| + * map, then ensure that all other cpus have left the relevant
| + * critical sections since the change. We do this by aquiring
acquiring
| + * the relevant section locks, if we have them none else is in
noone
| + * them. Once this is done we can go offline. */


--
~Randy
(Again. Sometimes I think ln -s /usr/src/linux/.config .signature)

2004-04-03 01:46:47

by Andy Whitcroft

Subject: Re: BUG_ON(!cpus_equal(cpumask, tmp));

Below is the latest version of the cpu shutdown race fix (cpu_race_R3).
Mostly a merging of some changes to in-code commentary (thanks to Randy)
and the addition of the x86_64 architecture portion (contributed by Hari).
Of course also based on Hari's original patch.

Feedback from Randy Dunlap is that this patch is at least as good as
previous solutions. Testing on x86_64 in general would be appreciated.

Andrew, would you consider this for testing in -mm?

-apw

---
During the shutdown process cpus are taken offline. If this process overlaps
with an ongoing TLB flush we can attempt to flush a cpu which is already
offline. This triggers a BUG_ON in flush_tlb_others(). This occurs because
we do not remove an outgoing cpu from the VM state.

This patch does two things. Firstly, it validates the cpu mask passed
for the flush against the current online cpu mask, preventing us from
flushing cpus which are no longer valid. Secondly, it ensures that the
cpus found to be online at this stage cannot go completely offline before
the TLB flush is complete.
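
In outline, the handshake the patch implements looks like this (sketch only;
the real code is in the diff below):

	/* IPI sender: flush_tlb_others() (smp_call_function() is analogous
	 * with call_lock) */
	spin_lock(&tlbstate_lock);
	cpus_and(tmp, cpumask, cpu_online_map);	/* sample the map under the lock */
	if (cpus_empty(tmp))
		goto out_unlock;
	send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);
	/* ... wait for the acks ... */
out_unlock:
	spin_unlock(&tlbstate_lock);

	/* cpu going offline: stop_this_cpu() / smp_stop_cpu() */
	cpu_clear(smp_processor_id(), cpu_online_map);
	spin_lock(&tlbstate_lock);	/* drain anyone who sampled the old map */
	spin_unlock(&tlbstate_lock);
	spin_lock(&call_lock);
	spin_unlock(&call_lock);
	local_irq_disable();
	disable_local_APIC();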

---
i386/kernel/smp.c | 48 ++++++++++++++++++++++++++++++++++++++----------
x86_64/kernel/smp.c | 46 +++++++++++++++++++++++++++++++++++++---------
2 files changed, 75 insertions(+), 19 deletions(-)

diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/i386/kernel/smp.c current/arch/i386/kernel/smp.c
--- reference/arch/i386/kernel/smp.c 2004-03-11 20:47:12.000000000 +0000
+++ current/arch/i386/kernel/smp.c 2004-04-03 02:52:53.000000000 +0100
@@ -355,8 +355,6 @@ static void flush_tlb_others(cpumask_t c
*/
BUG_ON(cpus_empty(cpumask));

- cpus_and(tmp, cpumask, cpu_online_map);
- BUG_ON(!cpus_equal(cpumask, tmp));
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
BUG_ON(!mm);

@@ -367,16 +365,24 @@ static void flush_tlb_others(cpumask_t c
* detected by the NMI watchdog.
*/
spin_lock(&tlbstate_lock);
+
+ /* Subtle, mask the request mask with the currently online cpu's.
+ * Sample this under the lock; cpus in the middle of going offline
+ * will wait until there is noone in this critical section before
+ * disabling IPI handling. */
+ cpus_and(tmp, cpumask, cpu_online_map);
+ if(cpus_empty(tmp))
+ goto out_unlock;

flush_mm = mm;
flush_va = va;
#if NR_CPUS <= BITS_PER_LONG
- atomic_set_mask(cpumask, &flush_cpumask);
+ atomic_set_mask(tmp, &flush_cpumask);
#else
{
int k;
unsigned long *flush_mask = (unsigned long *)&flush_cpumask;
- unsigned long *cpu_mask = (unsigned long *)&cpumask;
+ unsigned long *cpu_mask = (unsigned long *)&tmp;
for (k = 0; k < BITS_TO_LONGS(NR_CPUS); ++k)
atomic_set_mask(cpu_mask[k], &flush_mask[k]);
}
@@ -385,7 +391,7 @@ static void flush_tlb_others(cpumask_t c
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);

while (!cpus_empty(flush_cpumask))
/* nothing. lockup detection does not belong here */
@@ -393,6 +399,7 @@ static void flush_tlb_others(cpumask_t c

flush_mm = NULL;
flush_va = 0;
+out_unlock:
spin_unlock(&tlbstate_lock);
}

@@ -514,11 +521,8 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
-
+ int cpus;
+
data.func = func;
data.info = info;
atomic_set(&data.started, 0);
@@ -527,6 +531,15 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+
+ /* Subtle, get the current number of online cpus.
+ * Sample this under the lock; cpus in the middle of going offline
+ * will wait until there is noone in this critical section before
+ * disabling IPI handling. */
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ goto out_unlock;
+
call_data = &data;
mb();

@@ -540,6 +553,7 @@ int smp_call_function (void (*func) (voi
if (wait)
while (atomic_read(&data.finished) != cpus)
barrier();
+out_unlock:
spin_unlock(&call_lock);

return 0;
@@ -551,6 +565,20 @@ static void stop_this_cpu (void * dummy)
* Remove this CPU:
*/
cpu_clear(smp_processor_id(), cpu_online_map);
+
+ /* Subtle, IPI users assume that they will be able to get IPI's
+ * through to the cpus listed in cpu_online_map. To ensure this
+ * we add the requirement that they check cpu_online_map within
+ * the IPI critical sections. Here we remove ourselves from the
+ * map, then ensure that all other cpus have left the relevant
+ * critical sections since the change. We do this by acquiring
+ * the relevant section locks, if we have them noone else is in
+ * them. Once this is done we can go offline. */
+ spin_lock(&tlbstate_lock);
+ spin_unlock(&tlbstate_lock);
+ spin_lock(&call_lock);
+ spin_unlock(&call_lock);
+
local_irq_disable();
disable_local_APIC();
if (cpu_data[smp_processor_id()].hlt_works_ok)
diff -X /home/apw/lib/vdiff.excl -rupN reference/arch/x86_64/kernel/smp.c current/arch/x86_64/kernel/smp.c
--- reference/arch/x86_64/kernel/smp.c 2004-01-09 06:59:19.000000000 +0000
+++ current/arch/x86_64/kernel/smp.c 2004-04-03 03:14:01.000000000 +0100
@@ -243,8 +243,6 @@ static void flush_tlb_others(cpumask_t c
* - mask must exist :)
*/
BUG_ON(cpus_empty(cpumask));
- cpus_and(tmp, cpumask, cpu_online_map);
- BUG_ON(!cpus_equal(tmp, cpumask));
BUG_ON(cpu_isset(smp_processor_id(), cpumask));
if (!mm)
BUG();
@@ -257,21 +255,30 @@ static void flush_tlb_others(cpumask_t c
*/
spin_lock(&tlbstate_lock);

+ /* Subtle, mask the request mask with the currently online cpu's.
+ * Sample this under the lock; cpus in the middle of going offline
+ * will wait until there is noone in this critical section before
+ * disabling IPI handling. */
+ cpus_and(tmp, cpumask, cpu_online_map);
+ if(cpus_empty(tmp))
+ goto out_unlock;
+
flush_mm = mm;
flush_va = va;
- cpus_or(flush_cpumask, cpumask, flush_cpumask);
+ cpus_or(flush_cpumask, tmp, flush_cpumask);

/*
* We have to send the IPI only to
* CPUs affected.
*/
- send_IPI_mask(cpumask, INVALIDATE_TLB_VECTOR);
+ send_IPI_mask(tmp, INVALIDATE_TLB_VECTOR);

while (!cpus_empty(flush_cpumask))
mb(); /* nothing. lockup detection does not belong here */;

flush_mm = NULL;
flush_va = 0;
+out_unlock:
spin_unlock(&tlbstate_lock);
}

@@ -399,11 +406,8 @@ int smp_call_function (void (*func) (voi
*/
{
struct call_data_struct data;
- int cpus = num_online_cpus()-1;
-
- if (!cpus)
- return 0;
-
+ int cpus;
+
data.func = func;
data.info = info;
atomic_set(&data.started, 0);
@@ -412,6 +416,15 @@ int smp_call_function (void (*func) (voi
atomic_set(&data.finished, 0);

spin_lock(&call_lock);
+
+ /* Subtle, get the current number of online cpus.
+ * Sample this under the lock; cpus in the middle of going offline
+ * will wait until there is noone in this critical section before
+ * disabling IPI handling. */
+ cpus = num_online_cpus()-1;
+ if (!cpus)
+ goto out_unlock;
+
call_data = &data;
wmb();
/* Send a message to all other CPUs and wait for them to respond */
@@ -424,6 +437,7 @@ int smp_call_function (void (*func) (voi
if (wait)
while (atomic_read(&data.finished) != cpus)
barrier();
+out_unlock:
spin_unlock(&call_lock);

return 0;
@@ -435,6 +449,20 @@ void smp_stop_cpu(void)
* Remove this CPU:
*/
cpu_clear(smp_processor_id(), cpu_online_map);
+
+ /* Subtle, IPI users assume that they will be able to get IPI's
+ * through to the cpus listed in cpu_online_map. To ensure this
+ * we add the requirement that they check cpu_online_map within
+ * the IPI critical sections. Here we remove ourselves from the
+ * map, then ensure that all other cpus have left the relevant
+ * critical sections since the change. We do this by acquiring
+ * the relevant section locks, if we have them noone else is in
+ * them. Once this is done we can go offline. */
+ spin_lock(&tlbstate_lock);
+ spin_unlock(&tlbstate_lock);
+ spin_lock(&call_lock);
+ spin_unlock(&call_lock);
+
local_irq_disable();
disable_local_APIC();
local_irq_enable();