Hi Andi
Attached are patches that apply on top of 2.6.12-rc4-mm2 to provide
cpu hotplug support. It seems to hold up to stress (make -j's) on a 4-way
HT system.
FYI: I am sending this via quilt mail; hopefully it makes its way through ok.
If not, please pardon the noise and I will resend the usual way.
- I tested by booting with maxcpus=1; you can then kick each cpu online.
* I also manually migrate interrupts to the new CPU, and ensure they are
moved off when that same cpu is being offlined.
* Seems to have trouble with CONFIG_SCHED_SMT; I need to look into this
later.
- Offline works as well
Andi: You had mentioned that you would prefer not to replace the broadcast IPI
with the mask version, for performance reasons. Currently the mask version seems to
be the cleanest way to handle this without taking a sledgehammer to the cpu_up process.
If there are known performance issues we are worried about, we should look
for an alternative mechanism.
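For reference, the mask version I have in mind amounts to roughly this inside
__smp_call_function() (untested sketch; today the code simply does
send_IPI_allbutself(CALL_FUNCTION_VECTOR)):

    /* untested sketch: target only the cpus that are actually online,
     * instead of the ALLBUT shortcut broadcast */
    cpumask_t mask = cpu_online_map;

    cpu_clear(smp_processor_id(), mask);
    send_IPI_mask(mask, CALL_FUNCTION_VECTOR);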
TBD:
0. Verify with CONFIG_NUMA (not done yet)
1. ACPI integration for physical cpu hotplug support.
2. Replace fixup_irqs() with a correct implementation. The current one works fine
for now, but has the potential to miss interrupts (in the works).
Cheers,
Ashok Raj
On Fri, May 20, 2005 at 03:16:22PM -0700, Ashok Raj wrote:
> Andi: You had mentioned that you would prefer not to replace the broadcast IPI
> with the mask version, for performance reasons. Currently the mask version seems to
> be the cleanest way to handle this without taking a sledgehammer to the cpu_up process.
I already put a sledgehammer to __cpu_up with that last
patch. Some more hammering surely wouldn't be a big deal. Unlike i386,
we actually still have a chance to test all relevant platforms, so I
don't think it is a big issue.
What changes did you plan?
P.S.: An alternative would be to define a new genapic subarch that
you only enable when you detect cpuhotplug support at boot.
Again, just commenting on the text, not the patch, sorry.
-Andi
On Mon, May 23, 2005 at 06:40:46PM +0200, Andi Kleen wrote:
> On Fri, May 20, 2005 at 03:16:22PM -0700, Ashok Raj wrote:
> > Andi: You had mentioned that you would prefer not to replace the broadcast IPI
> > with the mask version, for performance reasons. Currently the mask version seems to
> > be the cleanest way to handle this without taking a sledgehammer to the cpu_up process.
>
> I already put a sledgehammer to __cpu_up with that last
Yours was a good sledgehammer :-) the way it should have been done,
but it carried over legacy boot code from i386 that wasn't pretty. The one I am
referring to is pretty darn slow, and I think many won't like slowing down the
whole system just to add a new cpu.
> patch. Some more hammering surely wouldn't be a big deal. Unlike i386,
> we actually still have a chance to test all relevant platforms, so I
> don't think it is a big issue.
>
> What changes did you plan?
The only other workable alternative would be to use the stop_machine()-like
mechanism that we use to atomically update cpu_online_map. This means we
execute a high-priority thread on all cpus, bringing the system to its knees before
just adding a new cpu. On very large systems this will definitely be
visible.
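To give an idea of what I mean, the heavyweight variant would be something along
these lines (rough sketch from memory, untested; __set_cpu_online_bit is a made-up
helper):

    /* stop_machine_run() is from <linux/stop_machine.h>; the helper name is mine */
    static int __set_cpu_online_bit(void *cpup)
    {
            cpu_set((long)cpup, cpu_online_map);
            return 0;
    }

    /* in the cpu-up path: freezes every online cpu just to flip one bit */
    stop_machine_run(__set_cpu_online_bit, (void *)(long)cpu, NR_CPUS);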
Just curious, what performance benefit did you allude to that would be lost
if we don't use the shortcut IPI version?
>
> P.S.: An alternative would be to define a new genapic subarch that
> you only enable when you detect cpuhotplug support at boot.
>
There is currently no generic way at the platform level to find out whether
something is hotplug capable, other than adding cmdline options
etc.
Also FYI: the ACPI folks are experimenting with using CPU hotplug for suspend/resume
support. So hotplug support may be required not just on platforms that physically
support it, but also for other related uses.
Cheers,
Ashok Raj
- Open Source Technology Center
On Mon, May 23, 2005 at 09:54:51AM -0700, Ashok Raj wrote:
> On Mon, May 23, 2005 at 06:40:46PM +0200, Andi Kleen wrote:
> > On Fri, May 20, 2005 at 03:16:22PM -0700, Ashok Raj wrote:
> > > Andi: You had mentioned that you would prefer not to replace the broadcast IPI
> > > with the mask version, for performance reasons. Currently the mask version seems to
> > > be the cleanest way to handle this without taking a sledgehammer to the cpu_up process.
> >
> > I already put a sledgehammer to __cpu_up with that last
>
> Yours was a good sledgehammer :-) the way it should have been done,
> but it carried over legacy boot code from i386 that wasn't pretty. The one I am
> referring to is pretty darn slow, and I think many won't like slowing down the
> whole system just to add a new cpu.
>
> > patch. Some more hammering surely wouldn't be a big deal. Unlike i386,
> > we actually still have a chance to test all relevant platforms, so I
> > don't think it is a big issue.
> >
> > What changes did you plan?
>
> The only other workable alternative would be to use the stop_machine()-like
> mechanism that we use to atomically update cpu_online_map. This means we
> execute a high-priority thread on all cpus, bringing the system to its knees before
That is not nice, agreed.
> just adding a new cpu. On very large systems this will definitely be
> visible.
I still don't quite get why it is not enough to keep interrupts
off until the CPU enters idle. Currently we enable them briefly
in the middle of the initialization (which is already dangerous
because interrupts can see half-initialized state like an out-of-date TSC),
but I hope to get rid of that soon too. With the full startup
done with interrupts disabled (CLI), would your problems be gone?
>
> Just curious, what performance benefit did you allude to that would be lost
> if we don't use the shortcut IPI version?
I am worried about the TLB flush interrupt. I used to have
some workloads in 2.4 that stressed it very badly (e.g. a process
with a working set just above physical memory: it would constantly
fault in new pages while on another CPU kswapd would unmap
and age pages, leading to a constant flood of flush IPIs). Another
case is COW in a multithreaded process. You always have to flush
all the other CPUs there.
Even smp_call_function is a bit of an issue in slab-intensive
loads because the per-CPU slab cache relies on it. I don't
think it is as big an issue as the flush above, but it would
still be better to keep it fast.
> > P.S.: An alternative would be to define a new genapic subarch that
> > you only enable when you detect cpuhotplug support at boot.
> >
>
> There is currently no generic way at the platform level to find out whether
> something is hotplug capable, other than adding cmdline options
> etc.
When you have the command line option you can do it. Later I guess
you will have a way to get it from ACPI (e.g. CPUs in tables
but marked inactive etc.).
> Also FYI: the ACPI folks are experimenting with using CPU hotplug for suspend/resume
> support. So hotplug support may be required not just on platforms that physically
> support it, but also for other related uses.
I am aware of that. Also the virtualization people will likely use it.
-Andi
On Mon, May 23, 2005 at 07:12:12PM +0200, Andi Kleen wrote:
> > The only other workable alternative would be to use the stop_machine()-like
> > mechanism that we use to atomically update cpu_online_map. This means we
> > execute a high-priority thread on all cpus, bringing the system to its knees before
>
> That is not nice, agreed.
>
> > just adding a new cpu. On very large systems this will definitely be
> > visible.
>
> I still don't quite get why it is not enough to keep interrupts
> off until the CPU enters idle. Currently we enable them briefly
> in the middle of the initialization (which is already dangerous
> because interrupts can see half-initialized state like an out-of-date TSC),
> but I hope to get rid of that soon too. With the full startup
> done with interrupts disabled (CLI), would your problems be gone?
>
I think so; if we can ensure none are delivered to the partially-up cpu,
we are probably covered.
I am not 100% sure about the above either: if smp_call_function
is started with 3 cpus initially, and 1 more just came up, the counts in
the call_data struct could reach 3 because the new cpu
received this broadcast as well, and we might quit the wait too early.
Sending only to the relevant cpus removes that ambiguity.
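For context, this is roughly the path I am worried about (paraphrased from
arch/x86_64/kernel/smp.c, not exact):

    /* the call_data_struct as it is in smp.c today (from memory): */
    struct call_data_struct {
            void (*func) (void *info);
            void *info;
            atomic_t started;
            atomic_t finished;
            int wait;
    };

    /* caller side, after sending the IPI to "cpus" other cpus
     * ("data" is the on-stack call_data_struct): */
    while (atomic_read(&data.started) != cpus)
            cpu_relax();
    /* an ack from a cpu that was never counted in "cpus" can make this
     * loop exit before all of the intended cpus have actually run func */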
[Vatsa would know this better, since he was the corner-case man back then :-)]
On Mon, May 23, 2005 at 10:40:46AM -0700, Ashok Raj wrote:
> I am not 100% sure about the above either: if smp_call_function
> is started with 3 cpus initially, and 1 more just came up, the counts in
> the call_data struct could reach 3 because the new cpu
> received this broadcast as well, and we might quit the wait too early.
True.
> Sending only to the relevant cpus removes that ambiguity.
Or grab the 'call_lock' before setting the upcoming cpu in the online_map.
This should also avoid the race when a CPU is coming online.
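Something like this in the bring-up path is what I have in mind (sketch only;
call_lock is, I believe, currently static to arch/x86_64/kernel/smp.c, so it would
need a small helper or export):

    spin_lock(&call_lock);
    cpu_set(cpu, cpu_online_map);   /* from now on smp_call_function() counts us */
    spin_unlock(&call_lock);
    /* smp_call_function() holds call_lock while it sends the IPI and waits,
     * so it either sees us online and targets us, or does not see us at all */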
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
On Tue, May 24, 2005 at 11:16:17AM +0530, Srivatsa Vaddagiri wrote:
> On Mon, May 23, 2005 at 10:40:46AM -0700, Ashok Raj wrote:
> > I am not 100% sure about the above either: if smp_call_function
> > is started with 3 cpus initially, and 1 more just came up, the counts in
> > the call_data struct could reach 3 because the new cpu
> > received this broadcast as well, and we might quit the wait too early.
>
> True.
Thanks for confirming.
I think the window would be a little narrower, but not completely
closed, even if we bring the cpu up entirely with interrupts disabled.
>
> > Sending only to the relevant cpus removes that ambiguity.
>
> Or grab the 'call_lock' before setting the upcoming cpu in the online_map.
> This should also avoid the race when a CPU is coming online.
>
We do this today in the x86_64 case when we set this upcoming cpu in
cpu_online_map. But the issue is that with an ipi broadcast, it's an ugly
situation where we don't know if the new cpu is receiving it as well, and
we don't have real control there.
I converted to using send_IPI_mask(cpu_online_map, CALL_FUNCTION_VECTOR)
instead of send_IPI_allbutself(CALL_FUNCTION_VECTOR) in
__smp_call_function(), apart from taking the call_lock before setting the
bit in online_map.
Since Andi is concerned about tlb flush interrupt performance in the 8-cpu-or-less
case, I am planning to temporarily use a boot command-line option, or to choose this
behaviour automatically if CONFIG_HOTPLUG_CPU is set, until we can
find a clean solution that satisfies everyone.
Cheers,
Ashok Raj
On Mon, May 23, 2005 at 11:01:06PM -0700, Ashok Raj wrote:
> We do this today in the x86_64 case when we set this upcoming cpu in
> cpu_online_map. But the issue is that with an ipi broadcast, it's an ugly
I don't know about x86-64, but at least on x86 an ipi broadcast will be sent
to _all_ CPUs present, right? I mean the h/w does not know about the offline
CPUs and will send to them also. This could be a problem for the offline
CPUs when they come online, since they can take a spurious IPI (unless
there is support in h/w to clear pending IPIs before doing STI).
> situation where we don't know if the new cpu is receiving it as well, and
> we don't have real control there.
>
> I converted to using send_IPI_mask(cpu_online_map, CALL_FUNCTION_VECTOR)
> instead of send_IPI_allbutself(CALL_FUNCTION_VECTOR) in
> __smp_call_function(), apart from taking the call_lock before setting the
> bit in online_map.
>
> Since Andi is concerned about tlb flush interrupt performance in the 8-cpu-or-less
> case, I am planning to temporarily use a boot command-line option, or to choose this
> behaviour automatically if CONFIG_HOTPLUG_CPU is set, until we can
> find a clean solution that satisfies everyone.
I think this should be the recommended approach (using send_IPI_mask if
CONFIG_HOTPLUG_CPU is defined) unless the h/w is intelligent enough
to avoid sending IPIs to offline CPUs or supports clearing
pending interrupts.
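Roughly what that would look like inside __smp_call_function() (sketch, not
against any particular tree):

    #ifdef CONFIG_HOTPLUG_CPU
            {
                    /* mask version: never targets a cpu that is not (yet) online */
                    cpumask_t mask = cpu_online_map;

                    cpu_clear(smp_processor_id(), mask);
                    send_IPI_mask(mask, CALL_FUNCTION_VECTOR);
            }
    #else
            /* keep the cheap shortcut broadcast where cpus never come and go */
            send_IPI_allbutself(CALL_FUNCTION_VECTOR);
    #endif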
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
On Mon, May 23, 2005 at 10:40:46AM -0700, Ashok Raj wrote:
> On Mon, May 23, 2005 at 07:12:12PM +0200, Andi Kleen wrote:
> > > The only other workable alternative would be to use the stop_machine()-like
> > > mechanism that we use to atomically update cpu_online_map. This means we
> > > execute a high-priority thread on all cpus, bringing the system to its knees before
> >
> > That is not nice, agreed.
> >
> > > just adding a new cpu. On very large systems this will definitely be
> > > visible.
> >
> > I still don't quite get why it is not enough to keep interrupts
> > off until the CPU enters idle. Currently we enable them briefly
> > in the middle of the initialization (which is already dangerous
> > because interrupts can see half-initialized state like an out-of-date TSC),
> > but I hope to get rid of that soon too. With the full startup
> > done with interrupts disabled (CLI), would your problems be gone?
> >
>
> I think so; if we can ensure none are delivered to the partially-up cpu,
> we are probably covered.
You mean not delivered to its APIC, or not delivered as a visible
interrupt in the instruction stream?
The latter can be ensured, the former cannot. I guess if the former is a problem
you could add a function to ack all pending interrupts after the initial sti.
e.g. we can assume the CPU will deliver everything pending within two
instructions after the sti, and when there are interrupts left in the APIC
you can ack them. But why would they not be raised as real interrupts
at this point anyway?
> I am not 100% sure about the above either: if smp_call_function
> is started with 3 cpus initially, and 1 more just came up, the counts in
> the call_data struct could reach 3 because the new cpu
> received this broadcast as well, and we might quit the wait too early.
In the worst case an smp_call_function would be delayed for the whole
boot-up time of a new CPU, which should be quite bounded. The longest
delay in there is probably the bogomips calibration, but I believe
Venkatesh recently sped that up greatly anyway, so it should not be
an issue anymore. If the delay is < 1s that is probably tolerable.
Or am I missing some shade of the problem you are worried about?
-Andi
> > Sending only to the relevant cpus removes that ambiguity.
>
> Or grab the 'call_lock' before setting the upcoming cpu in the online_map.
> This should also avoid the race when a CPU is coming online.
Such a simple solution would be great.
-Andi
On Tue, May 24, 2005 at 01:48:12PM +0200, Andi Kleen wrote:
> On Mon, May 23, 2005 at 10:40:46AM -0700, Ashok Raj wrote:
> > I think so; if we can ensure none are delivered to the partially-up cpu,
> > we are probably covered.
>
> You mean not delivered to its APIC, or not delivered as a visible
> interrupt in the instruction stream?
I think Ashok is referring to the former.
>
> The latter can be ensured, the former cannot. I guess if the former is a problem
> you could add a function to ack all pending interrupts after the initial sti.
I don't think it is that simple.
First, do we handle the pending interrupts as normal interrupts? For example, in
the case of the call-function interrupt, do we read the call_data_struct, which may
be stale if we are handling a stale IPI (an IPI that was sent to us but whose caller
didn't wait for us to read the call_data_struct)?
Ideally we have to drop (not handle) such stale interrupts. That
means adding a flag check in the interrupt handler: if the flag is set, the interrupt
is dropped, otherwise it is handled. This flag may be set by the upcoming
CPU before doing STI and cleared by it a brief time after STI.
And this may affect the fast path.
Alternately we could register dummy handlers for the various IRQs, sti, wait
for pending IRQs to be flushed and then register the real handlers (and
_then_ set the CPU in the online map?).
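The flag idea would look something like this (names are made up, completely
untested):

    static DEFINE_PER_CPU(int, drop_stale_ipis);    /* made-up name */

    asmlinkage void smp_call_function_interrupt(void)
    {
            ack_APIC_irq();
            if (__get_cpu_var(drop_stale_ipis))
                    return;         /* IPI was queued while we were offline;
                                     * call_data may be stale, don't touch it */
            /* ... existing handling: read call_data, bump started,
             * run func, bump finished ... */
    }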
Secondly, I think there is another subtle race if we broadcast IPIs. Consider
this:
CPU0                                    CPU1

                                        Puts itself Offline
smp_call_function
broadcasts IPI
                                        IPI queued on CPU1

                                        Wants to come online now
                                        spin_lock(call_lock)
spin_lock(call_lock)                    cpu_set(cpu1, online_map);
                                        spin_unlock(call_lock);
broadcast IPI
wait_for_all_cpus
(including CPU1)
                                        sti();
                                        gets IPI, but decides to not handle
                                        it since it may be a stale IPI
This causes CPU0 to wait forever? Not sure if I have missed some h/w subtlety
here.
This problem could probably be prevented if cpu1 is set in the online_map
after the sti.
>
> e.g. we can assume the CPU will deliver everything pending within two
> instructions after the sti, and when there are interrupts left in the APIC
> you can ack them. But why would they not be raised as real interrupts
> at this point anyway?
>
>
> > I am not 100% sure about the above either: if smp_call_function
> > is started with 3 cpus initially, and 1 more just came up, the counts in
> > the call_data struct could reach 3 because the new cpu
> > received this broadcast as well, and we might quit the wait too early.
>
> In the worst case an smp_call_function would be delayed for the whole
> boot-up time of a new CPU, which should be quite bounded. The longest
> delay in there is probably the bogomips calibration, but I believe
> Venkatesh recently sped that up greatly anyway, so it should not be
> an issue anymore. If the delay is < 1s that is probably tolerable.
>
> Or am I missing some shade of the problem you are worried about?
The problem is that at the time smp_call_function was running on CPU0, say,
only CPU1 and CPU2 were online. So it decides that it has to wait for
2 CPUs to ack the IPI. Next it broadcasts the IPI. At this particular
point, CPU3 is coming up and is yet to do an STI. Meanwhile, CPU1 and CPU2
ACK CPU0's call_function IPI. Hence smp_call_function on CPU0 decides to
return, which makes the call_data_struct stale (since it was on the stack). After
this, CPU3 does an STI and now gets the call_function IPI. If it were to handle
this IPI normally, it would read the stale call_data_struct and may
crash. Ideally, CPU3 should have dropped that stale IPI.
I think doing a send_IPI_mask is a safe solution for this problem,
unless the h/w avoids broadcasting IPIs to offline CPUs.
--
Thanks and Regards,
Srivatsa Vaddagiri,
Linux Technology Center,
IBM Software Labs,
Bangalore, INDIA - 560017
On Tue, May 24, 2005 at 02:23:30PM +0530, Srivatsa Vaddagiri wrote:
> On Mon, May 23, 2005 at 11:01:06PM -0700, Ashok Raj wrote:
> > We do this today in the x86_64 case when we set this upcoming cpu in
> > cpu_online_map. But the issue is that with an ipi broadcast, it's an ugly
>
> I don't know about x86-64, but at least on x86 an ipi broadcast will be sent
> to _all_ CPUs present, right? I mean the h/w does not know about the offline
> CPUs and will send to them also. This could be a problem for the offline
> CPUs when they come online, since they can take a spurious IPI (unless
> there is support in h/w to clear pending IPIs before doing STI).
x86-64 works the same here.
The hardware does not clear pending IPIs AFAIK, but software could
do that manually during cpu bootup. Races can be avoided by taking
call_lock and the tlb flush lock while doing that.
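i.e. the bring-up path could do something along these lines (sketch; the lock
names are from memory and ack_pending_ipis() is a hypothetical helper):

    spin_lock(&call_lock);          /* no new call-function IPIs while we clean up */
    spin_lock(&tlbstate_lock);      /* no new TLB-flush IPIs either */
    ack_pending_ipis();             /* hypothetical: ack/discard whatever was
                                     * latched in the APIC while we were offline */
    cpu_set(cpu, cpu_online_map);
    spin_unlock(&tlbstate_lock);
    spin_unlock(&call_lock);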
-Andi