nr_processes() returns the sum of the per cpu counter process_counts for
all online CPUs. This counter is incremented for the current CPU on
fork() and decremented for the current CPU on exit(). Since a process
does not necessarily fork and exit on the same CPU the process_count for
an individual CPU can be either positive or negative and effectively has
no meaning in isolation.
Therefore calculating the sum of process_counts over only the online
CPUs omits the processes which were started or stopped on any CPU which
has since been unplugged. Only the sum of process_counts across all
possible CPUs has meaning.
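To illustrate the arithmetic, here is a stand-alone sketch (ordinary
userspace C, not kernel code) using the per-CPU values from the debug
trace below: each counter only has meaning as part of the whole, and
dropping an unplugged CPU from the sum produces nonsense.

#include <stdio.h>

int main(void)
{
	/* Per-CPU process_counts as in the trace below. */
	long counts[4] = { 90, 1030, -900, -136 };
	long total_all = 0, total_online = 0;
	int cpu;

	for (cpu = 0; cpu < 4; cpu++)
		total_all += counts[cpu];		/* 84: the real process count */

	for (cpu = 0; cpu < 4; cpu++)
		if (cpu != 1)				/* pretend CPU1 was unplugged */
			total_online += counts[cpu];	/* -946: meaningless */

	printf("all CPUs: %ld, online only: %ld\n", total_all, total_online);
	return 0;
}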
The only caller of nr_processes() is proc_root_getattr() which
calculates the number of links to /proc as
stat->nlink = proc_root.nlink + nr_processes();
You don't have to be all that unlucky for nr_processes() to return a
negative value leading to a negative number of links (or rather, an
apparently enormous number of links). If this happens then you can get
failures where things like "ls /proc" start to fail because they got an
-EOVERFLOW from some stat() call.
Example with some debugging inserted to show what goes on:
# ps haux|wc -l
nr_processes: CPU0: 90
nr_processes: CPU1: 1030
nr_processes: CPU2: -900
nr_processes: CPU3: -136
nr_processes: TOTAL: 84
proc_root_getattr. nlink 12 + nr_processes() 84 = 96
84
# echo 0 >/sys/devices/system/cpu/cpu1/online
# ps haux|wc -l
nr_processes: CPU0: 85
nr_processes: CPU2: -901
nr_processes: CPU3: -137
nr_processes: TOTAL: -953
proc_root_getattr. nlink 12 + nr_processes() -953 = -941
75
# stat /proc/
nr_processes: CPU0: 84
nr_processes: CPU2: -901
nr_processes: CPU3: -137
nr_processes: TOTAL: -954
proc_root_getattr. nlink 12 + nr_processes() -954 = -942
File: `/proc/'
Size: 0 Blocks: 0 IO Block: 1024 directory
Device: 3h/3d Inode: 1 Links: 4294966354
Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
Access: 2009-11-03 09:06:55.000000000 +0000
Modify: 2009-11-03 09:06:55.000000000 +0000
Change: 2009-11-03 09:06:55.000000000 +0000
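As an aside, the huge Links value above is simply the negative nlink
reinterpreted as an unsigned 32-bit number; a two-line check:

#include <stdio.h>

int main(void)
{
	int nlink = 12 + (-954);		/* -942, as in the trace above */
	printf("%u\n", (unsigned int)nlink);	/* prints 4294966354 */
	return 0;
}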
I'm not 100% convinced that the per_cpu regions remain valid for offline
CPUs, although my testing suggests that they do. If not then I think the
correct solution would be to aggregate the process_count for a given CPU
into a global base value in cpu_down().
This bug appears to pre-date the transition to git and may even have
been present in linux-2.6.0-test7-bk3, since the code Rusty patched in
http://lwn.net/Articles/64773/ looks to have been wrong already.
Signed-off-by: Ian Campbell <[email protected]>
diff --git a/kernel/fork.c b/kernel/fork.c
index 4b36858..7af7217 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -89,7 +89,7 @@ int nr_processes(void)
 	int cpu;
 	int total = 0;
 
-	for_each_online_cpu(cpu)
+	for_each_possible_cpu(cpu)
 		total += per_cpu(process_counts, cpu);
 
 	return total;
On Tue, 3 Nov 2009, Ian Campbell wrote:
>
> Therefore calculating the sum of process_counts over only the online
> CPUs omits the processes which were started or stopped on any CPU which
> has since been unplugged. Only the sum of process_counts across all
> possible CPUs has meaning.
I think your patch is fine, but I also suspect that we could/should
instead fix it by doing something like
per_cpu(process_counts, this_cpu) += per_cpu(process_counts, down_cpu);
per_cpu(process_counts, down_cpu) = 0;
in the CPU off-lining case after having quieted 'down_cpu'. That way we
could always use online CPUs rather than possible CPUs. I think that
just sounds nicer.
So I'll apply the patch, but I get the feeling it could be done better.
Linus
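To make the folding Linus describes concrete, here is a rough, untested
sketch in the style of the CPU-hotplug notifiers of the time; the
callback name is made up, registration is omitted, and it would sit in
kernel/fork.c next to the existing process_counts counter:

#include <linux/cpu.h>
#include <linux/notifier.h>
#include <linux/percpu.h>

static int process_counts_cpu_callback(struct notifier_block *nfb,
					unsigned long action, void *hcpu)
{
	long dead_cpu = (long)hcpu;
	int this_cpu;

	if (action == CPU_DEAD || action == CPU_DEAD_FROZEN) {
		this_cpu = get_cpu();
		/* Nothing can touch the dead CPU's counter any more,
		 * so fold it into the current CPU's counter. */
		per_cpu(process_counts, this_cpu) +=
			per_cpu(process_counts, dead_cpu);
		per_cpu(process_counts, dead_cpu) = 0;
		put_cpu();
	}
	return NOTIFY_OK;
}

With something like this in place, nr_processes() could keep summing
over online CPUs only.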
* Ian Campbell <[email protected]> wrote:
> I'm not 100% convinced that the per_cpu regions remain valid for
> offline CPUs, although my testing suggests that they do. If not then I
> think the correct solution would be to aggregate the process_count for
> a given CPU into a global base value in cpu_down().
Sidenote: percpu areas currently are kept allocated on x86.
That might change in the future though, especially with virtual systems
where the possible range of CPUs can be very high - without us
necessarily wanting to pay the percpu area allocation price for it. I.e.
dynamic deallocation of percpu areas is something that could happen in
the future.
> This bug appears to pre-date the transition to git and may even have
> been present in linux-2.6.0-test7-bk3, since the code Rusty patched in
> http://lwn.net/Articles/64773/ looks to have been wrong already.
Nice one. I'm wondering why it was not discovered for such a long time.
That count can go out of sync easily, and we frequently offline cpus
during suspend/resume, and /proc lookup failures will be noticed in
general. How come nobody ran into this? And I'm wondering how you have
run into this - running cpu hotplug stress-tests with Xen guests - or
via pure code review?
Ingo
On Tue, 3 Nov 2009, Ingo Molnar wrote:
> Sidenote: percpu areas currently are kept allocated on x86.
They must be kept allocated for all possible cpus. Arch code cannot decide
to not allocate per cpu areas.
Search for "for_each_possible_cpu" in the source tree if you want more
detail.
> That might change in the future though, especially with virtual systems
> where the possible range of CPUs can be very high - without us
> necessarily wanting to pay the percpu area allocation price for it. I.e.
> dynamic deallocation of percpu areas is something that could happen in
> the future.
Could be good but would not be as easy as you may think since core code
assumes that possible cpus have per cpu areas configured. There will be
the need for additional notifiers and more complex locking if we want to
have percpu areas for online cpus only. Per cpu areas are permanent at
this point which is a good existence guarantee that avoids all sorts of
complex scenarios.
> Nice one. I'm wondering why it was not discovered for such a long time.
Cpu hotplug is rarely used (what you listed are rare and unusual cases)
and therefore online cpus == possible cpus == present cpus.
On Tue, Nov 03, 2009 at 01:34:32PM -0500, Christoph Lameter wrote:
> On Tue, 3 Nov 2009, Ingo Molnar wrote:
>
> > Sidenote: percpu areas currently are kept allocated on x86.
>
> They must be kept allocated for all possible cpus. Arch code cannot decide
> to not allocate per cpu areas.
>
> Search for "for_each_possible_cpu" in the source tree if you want more
> detail.
Here are a few in my area:
kernel/rcutorture.c srcu_torture_stats 523 for_each_possible_cpu(cpu) {
kernel/rcutorture.c rcu_torture_printk 800 for_each_possible_cpu(cpu) {
kernel/rcutorture.c rcu_torture_init 1127 for_each_possible_cpu(cpu) {
kernel/rcutree.c RCU_DATA_PTR_INIT 1518 for_each_possible_cpu(i) { \
kernel/rcutree_trace.c PRINT_RCU_DATA 73 for_each_possible_cpu(_p_r_d_i) \
kernel/rcutree_trace.c print_rcu_pendings 237 for_each_possible_cpu(cpu) {
kernel/srcu.c srcu_readers_active_idx 64 for_each_possible_cpu(cpu)
> > That might change in the future though, especially with virtual systems
> > where the possible range of CPUs can be very high - without us
> > necessarily wanting to pay the percpu area allocation price for it. I.e.
> > dynamic deallocation of percpu areas is something that could happen in
> > the future.
>
> Could be good but would not be as easy as you may think since core code
> assumes that possible cpus have per cpu areas configured. There will be
> the need for additional notifiers and more complex locking if we want to
> have percpu areas for online cpus only. Per cpu areas are permanent at
> this point which is a good existence guarantee that avoids all sorts of
> complex scenarios.
Indeed.
I will pick on the last one. I would need to track all the srcu_struct
structures. Each such structure would need an additional counter.
In the CPU_DYING notifier, SRCU would need to traverse all srcu_struct
structures, zeroing the dying CPU's count and adding the old value
to the additional counter. This is safe because CPU_DYING happens in
stop_machine_run() context.
Then srcu_readers_active_idx() would need to initialize "sum" to the
additional counter rather than to zero.
Not all -that- bad, and similar strategies could likely be put in place
for the other six offenders in RCU.
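To make that concrete, a rough sketch; the offline_sum[] field is
hypothetical and the srcu_struct access pattern is paraphrased from the
2.6.32-era kernel/srcu.c:

/* Called from a CPU_DYING notifier, i.e. inside stop_machine(), so
 * nothing else is using the dying CPU's counters. */
static void srcu_offline_cpu(struct srcu_struct *sp, int cpu)
{
	int idx;

	for (idx = 0; idx < 2; idx++) {
		sp->offline_sum[idx] +=
			per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
		per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx] = 0;
	}
}

/* srcu_readers_active_idx() would then start from that base value and
 * could safely sum over online CPUs only. */
static int srcu_readers_active_idx(struct srcu_struct *sp, int idx)
{
	int cpu;
	int sum = sp->offline_sum[idx];

	for_each_online_cpu(cpu)
		sum += per_cpu_ptr(sp->per_cpu_ref, cpu)->c[idx];
	return sum;
}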
Another class of problems would arise from code that does not actually
access an offline CPU's per-CPU variables, but instead implicitly expects
the values to persist across an offline-online event pair. The various
rcu_data per-CPU structures would need some fixups when the CPU came
back online.
One way to approach this would be to have two types of per-CPU variable,
one type with current semantics, and another type that can go away
when the corresponding CPU goes offline. This latter type probably
needs to be set back to the initial values when the corresponding
CPU comes back online.
Of course, given an "easy way out", one might expect most people to
opt for the old-style per-CPU variables. On the other hand, how
much work do we want to do to save (say) four bytes?
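For illustration, the two flavours might be declared something like
this; DEFINE_PER_CPU_TRANSIENT is invented here and does not exist:

/* Today's semantics: storage for every possible CPU, never freed, so
 * offline CPUs keep their values. */
DEFINE_PER_CPU(unsigned long, process_counts);

/* Proposed flavour: storage may go away when the CPU goes offline and
 * is re-initialised to the given value when it comes back online. */
DEFINE_PER_CPU_TRANSIENT(unsigned long, some_stat, 0);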
> > Nice one. I'm wondering why it was not discovered for such a long time.
>
> Cpu hotplug is rarely used (what you listed are rare and unusual cases)
> and therefore online cpus == possible cpus == present cpus.
Though it is not unusual for "possible cpus" to be quite a bit larger
than "online cpus"...
Thanx, Paul
On Tue, 3 Nov 2009 08:41:14 pm Ian Campbell wrote:
> nr_processes() returns the sum of the per cpu counter process_counts for
> all online CPUs. This counter is incremented for the current CPU on
> fork() and decremented for the current CPU on exit(). Since a process
> does not necessarily fork and exit on the same CPU the process_count for
> an individual CPU can be either positive or negative and effectively has
> no meaning in isolation.
>
> Therefore calculating the sum of process_counts over only the online
> CPUs omits the processes which were started or stopped on any CPU which
> has since been unplugged. Only the sum of process_counts across all
> possible CPUs has meaning.
>
> The only caller of nr_processes() is proc_root_getattr() which
> calculates the number of links to /proc as
> stat->nlink = proc_root.nlink + nr_processes();
>
> You don't have to be all that unlucky for nr_processes() to return a
> negative value leading to a negative number of links (or rather, an
> apparently enormous number of links). If this happens then you can get
> failures where things like "ls /proc" start to fail because they got an
> -EOVERFLOW from some stat() call.
>
> Example with some debugging inserted to show what goes on:
> # ps haux|wc -l
> nr_processes: CPU0: 90
> nr_processes: CPU1: 1030
> nr_processes: CPU2: -900
> nr_processes: CPU3: -136
> nr_processes: TOTAL: 84
> proc_root_getattr. nlink 12 + nr_processes() 84 = 96
> 84
> # echo 0 >/sys/devices/system/cpu/cpu1/online
> # ps haux|wc -l
> nr_processes: CPU0: 85
> nr_processes: CPU2: -901
> nr_processes: CPU3: -137
> nr_processes: TOTAL: -953
> proc_root_getattr. nlink 12 + nr_processes() -953 = -941
> 75
> # stat /proc/
> nr_processes: CPU0: 84
> nr_processes: CPU2: -901
> nr_processes: CPU3: -137
> nr_processes: TOTAL: -954
> proc_root_getattr. nlink 12 + nr_processes() -954 = -942
> File: `/proc/'
> Size: 0 Blocks: 0 IO Block: 1024 directory
> Device: 3h/3d Inode: 1 Links: 4294966354
> Access: (0555/dr-xr-xr-x) Uid: ( 0/ root) Gid: ( 0/ root)
> Access: 2009-11-03 09:06:55.000000000 +0000
> Modify: 2009-11-03 09:06:55.000000000 +0000
> Change: 2009-11-03 09:06:55.000000000 +0000
>
> I'm not 100% convinced that the per_cpu regions remain valid for offline
> CPUs, although my testing suggests that they do.
Yep. And so code should usually start with for_each_possible_cpu() then:
> If not then I think the
> correct solution would be to aggregate the process_count for a given CPU
> into a global base value in cpu_down().
If it proves to be an issue.
Acked-by: Rusty Russell <[email protected]>
Thanks!
Rusty.
* Christoph Lameter <[email protected]> wrote:
> On Tue, 3 Nov 2009, Ingo Molnar wrote:
>
> > Sidenote: percpu areas currently are kept allocated on x86.
>
> They must be kept allocated for all possible cpus. Arch code cannot
> decide to not allocate per cpu areas.
>
> Search for "for_each_possible_cpu" in the source tree if you want more
> detail.
This has come up in the past. On-demand deallocation of percpu areas is
a future possibility that might be useful - see previous percpu
discussions on lkml for details.
All in all, the conclusion is that we don't want it yet (on-demand
allocation of them is perfectly fine for now) - I just wanted to mention
that this is the _current_ situation and an implementational choice, a
judgement that could change in the future.
The more 'virtual' our abstraction of CPUs becomes, the less acceptable
the notion of permanent allocation will be.
Dynamic allocation concepts are natural and the new, generic percpu
allocator makes it more feasible to eventually deallocate percpu areas.
(Of course various for_each_possible_cpu() assumptions will have to be
fixed in that case.)
Ingo
On Tue, 2009-11-03 at 16:07 +0000, Ingo Molnar wrote:
>
> > This bug appears to pre-date the transition to git and may even
> > have been present in linux-2.6.0-test7-bk3, since the code Rusty
> > patched in http://lwn.net/Articles/64773/ looks to have been wrong
> > already.
>
> Nice one. I'm wondering why it was not discovered for such a long
> time. That count can go out of sync easily, and we frequently offline
> cpus during suspend/resume, and /proc lookup failures will be noticed
> in general.
I think most people probably don't run for all that long with CPUs
unplugged. In the suspend/resume case the unplugs are fairly transient,
and apart from the suspend/resume itself the system is fairly idle while
the CPUs are unplugged. Note that once you plug all the CPUs back in
the problem goes away again.
I can't imagine very many things pay any attention to st_nlink
for /proc anyway, so as long as the stat itself succeeds things will
trundle on.
> How come nobody ran into this? And I'm wondering how you have run
> into this - running cpu hotplug stress-tests with Xen guests - or via
> pure code review?
We run our Xen domain 0 with a single VCPU by unplugging the others on
boot. We only actually noticed the issue when we switched our installer
to do the same for consistency. The installer uses uclibc and IIRC (the
original discovery was a little while ago) it was using an older variant
of stat(2) whose st_nlink field is not wide enough to represent the
bogus value, and so it returned -EOVERFLOW.
My guess is that most systems these days have a libc which uses a newer
variant of stat(2) which is able to represent the large (but invalid)
value so stat() succeeds and since nothing ever actually looks at the
st_nlink field (at least for /proc) things keep working.
Ian.
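For illustration only - this is not uclibc's actual code - a C library
converting the kernel's stat result into an old struct stat with a
narrow link-count field has to give up when the value does not fit,
which is where the -EOVERFLOW comes from:

#include <errno.h>

/* Hypothetical conversion helper for an old stat(2) ABI. */
static int put_old_nlink(unsigned long kernel_nlink, unsigned short *old_nlink)
{
	if (kernel_nlink > 0xffff) {
		errno = EOVERFLOW;
		return -1;
	}
	*old_nlink = (unsigned short)kernel_nlink;
	return 0;
}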
On Tue, 3 Nov 2009, Paul E. McKenney wrote:
> > Cpu hotplug is rarely used (what you listed are rare and unusual cases)
> > and therefore online cpus == possible cpus == present cpus.
>
> Though it is not unusual for "possible cpus" to be quite a bit larger
> than "online cpus"...
The only reason for possible cpus to be larger than online cpus is if you
can actually plug in more processors while the system is running. Typically
a larger possible cpus count seems to be some sort of vendor BIOS screwup
where the vendor advertises how many processors the maximum configuration
of a certain type of hardware would support if one were to buy that config.
On Wed, Nov 04, 2009 at 02:37:06PM -0500, Christoph Lameter wrote:
> On Tue, 3 Nov 2009, Paul E. McKenney wrote:
>
> > > Cpu hotplug is rarely used (what you listed are rare and unusual cases)
> > > and therefore online cpus == possible cpus == present cpus.
> >
> > Though it is not unusual for "possible cpus" to be quite a bit larger
> > than "online cpus"...
>
> The only reason for possible cpus to be larger than online cpus is if you
> can actually plug in more processors while the system is running. Typically
> a larger possible cpus count seems to be some sort of vendor BIOS screwup
> where the vendor advertises how many processors the maximum configuration
> of a certain type of hardware would support if one were to buy that config.
Or if you are running as a guest OS on top of some sort of hypervisor.
Thanx, Paul
On Wed, 4 Nov 2009 05:04:32 am Christoph Lameter wrote:
> On Tue, 3 Nov 2009, Ingo Molnar wrote:
>
> > Sidenote: percpu areas currently are kept allocated on x86.
>
> They must be kept allocated for all possible cpus. Arch code cannot decide
> to not allocate per cpu areas.
>
> Search for "for_each_possible_cpu" in the source tree if you want more
> detail.
Yeah, handling onlining/offlining of cpus is a hassle for most code.
But I can see us wanting abstractions for counters which handle being per-cpu
or per-node and doing the folding etc. automagically. It's best that this
be done by looking at all the existing users to see if there's a nice API
which would cover 90%.
Cheers,
Rusty.
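A footnote to Rusty's last point: the kernel already has one counter
abstraction along these lines, percpu_counter (lib/percpu_counter.c),
whose CPU-hotplug callback folds a dead CPU's delta back into a central
count so that summing online CPUs stays correct. A sketch of what
nr_processes() might look like on top of it - just an illustration, not
a proposal, and the helper names are made up:

#include <linux/percpu_counter.h>

static struct percpu_counter process_counts;
/* percpu_counter_init(&process_counts, 0) would run once at boot. */

static inline void count_fork(void)
{
	percpu_counter_inc(&process_counts);
}

static inline void count_exit(void)
{
	percpu_counter_dec(&process_counts);
}

int nr_processes(void)
{
	/* Sums the central count plus the online CPUs' deltas; offlined
	 * CPUs' contributions were already folded into the centre. */
	return (int)percpu_counter_sum(&process_counts);
}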