When reading from /proc/stat we allocate a large buffer to maximise
the chances of the results being from a single run and thus internally
consistent. This buffer is currently sized at 128 * num_possible_cpus()
which, on kernels configured to handle large systems (256 CPUs and up),
results in an order-4 allocation or larger. Once system memory becomes
fragmented, allocations of that order can no longer be guaranteed,
leading to read failures caused by allocation failures.
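To make the numbers concrete, this is roughly how the current sizing works
out for a 256-CPU configuration (a userspace sketch of the arithmetic; the
nr_irqs value is just a placeholder assumption):

#include <stdio.h>

int main(void)
{
	unsigned long page_size = 4096;		/* typical x86 page size */
	unsigned long possible_cpus = 256;	/* num_possible_cpus() on such a config */
	unsigned long nr_irqs = 256;		/* placeholder; the real value varies */
	unsigned long size = 1024 + 128 * possible_cpus + 2 * nr_irqs;
	unsigned int order = 0;

	/* round up to a power-of-two number of pages, as the buddy allocator does */
	while ((page_size << order) < size)
		order++;

	printf("size = %lu bytes -> order-%u allocation (%lu bytes)\n",
	       size, order, page_size << order);	/* ~33kB -> order 4 (64kB) */
	return 0;
}

kmalloc() rounds the request up to the next power-of-two slab size, so the
~33kB request ends up backed by a 64kB, order-4 allocation.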
There seem to be two issues in play here. Firstly, the allocation is
vastly oversized in the common case, as we only consume the buffer in
proportion to num_online_cpus(). Secondly, regardless of size, we should
not require allocations larger than PAGE_ALLOC_COSTLY_ORDER, as
allocations above that order are significantly more likely to fail.
The following patch addresses both of these issues. Does that make sense
generally? It seemed to stop top complaining wildly for the reporter,
at least.
-Stefan
---
From a329ad61fbd26990b294f3b35a31ec80ffab35bb Mon Sep 17 00:00:00 2001
From: Stefan Bader <[email protected]>
Date: Wed, 14 May 2014 12:58:37 +0200
Subject: [PATCH] fs/stat: Reduce memory requirements for stat_open
When reading /proc/stat the stat_open function currently sizes its
internal buffer at:
  1024 + 128 * num_possible_cpus() + 2 * nr_irqs
This is to maximise the chances of the result returned to userspace
being a single, internally consistent result. With CONFIG_NR_CPUS sized
for larger configs this buffer balloons rapidly; at 256 CPUs we end up
with at least 33kB, which translates into an order-4 allocation (64kB).
This triggered random errors in top when reading /proc/stat, caused by
memory allocation failures.
In reality the buffer is only consumed in proportion to
num_online_cpus(), so in the common case it makes much more sense to
size the buffer based on that. Secondly, regardless of size, we should
not require allocations larger than PAGE_ALLOC_COSTLY_ORDER, as
allocations above that order are significantly more likely to fail.
As the code already bounds the buffer size by the maximum kmalloc()
allocation size, we are already relying on the seq_file buffering when
that bound is exceeded. This will also protect us from overflowing
should CPUs come online mid-read.
We do not attempt to fix the potential inconsistencies that this
existing use of the seq_file buffering introduces.
BugLink: http://bugs.launchpad.net/bugs/1319244
Signed-off-by: Stefan Bader <[email protected]>
---
fs/proc/stat.c | 6 +++---
1 file changed, 3 insertions(+), 3 deletions(-)
diff --git a/fs/proc/stat.c b/fs/proc/stat.c
index 9d231e9..9498cf7 100644
--- a/fs/proc/stat.c
+++ b/fs/proc/stat.c
@@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
static int stat_open(struct inode *inode, struct file *file)
{
- size_t size = 1024 + 128 * num_possible_cpus();
+ size_t size = 1024 + 128 * num_online_cpus();
char *buf;
struct seq_file *m;
int res;
@@ -193,8 +193,8 @@ static int stat_open(struct inode *inode, struct file *file)
size += 2 * nr_irqs;
/* don't ask for more than the kmalloc() max size */
- if (size > KMALLOC_MAX_SIZE)
- size = KMALLOC_MAX_SIZE;
+ if (size > (PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER))
+ size = PAGE_SIZE << PAGE_ALLOC_COSTLY_ORDER;
buf = kmalloc(size, GFP_KERNEL);
if (!buf)
return -ENOMEM;
--
1.7.9.5
On Thu, Jun 12, 2014 at 03:00:17PM +0200, Stefan Bader wrote:
> When reading from /proc/stat we allocate a large buffer to maximise
> the chances of the results being from a single run and thus internally
> consistent. This buffer is currently sized at 128 * num_possible_cpus()
> which, on kernels configured to handle large systems (256 CPUs and up),
> results in an order-4 allocation or larger. Once system memory becomes
> fragmented, allocations of that order can no longer be guaranteed,
> leading to read failures caused by allocation failures.
>
> There seem to be two issues in play here. Firstly, the allocation is
> vastly oversized in the common case, as we only consume the buffer in
> proportion to num_online_cpus(). Secondly, regardless of size, we should
> not require allocations larger than PAGE_ALLOC_COSTLY_ORDER, as
> allocations above that order are significantly more likely to fail.
>
> The following patch addresses both of these issues. Does that make sense
> generally? It seemed to stop top complaining wildly for the reporter,
> at least.
Hi Stefan,
see also https://lkml.org/lkml/2014/5/21/341
and one possible solution:
https://lkml.org/lkml/2014/5/30/191
and the other one:
https://lkml.org/lkml/2014/6/12/92
https://lkml.org/lkml/2014/6/12/107
Thanks,
Heiko
On 12.06.2014 15:41, Heiko Carstens wrote:
> On Thu, Jun 12, 2014 at 03:00:17PM +0200, Stefan Bader wrote:
>> When reading from /proc/stat we allocate a large buffer to maximise
>> the chances of the results being from a single run and thus internally
>> consistent. This buffer is currently sized at 128 * num_possible_cpus()
>> which, on kernels configured to handle large systems (256 CPUs and up),
>> results in an order-4 allocation or larger. Once system memory becomes
>> fragmented, allocations of that order can no longer be guaranteed,
>> leading to read failures caused by allocation failures.
>>
>> There seem to be two issues in play here. Firstly, the allocation is
>> vastly oversized in the common case, as we only consume the buffer in
>> proportion to num_online_cpus(). Secondly, regardless of size, we should
>> not require allocations larger than PAGE_ALLOC_COSTLY_ORDER, as
>> allocations above that order are significantly more likely to fail.
>>
>> The following patch addresses both of these issues. Does that make sense
>> generally? It seemed to stop top complaining wildly for the reporter,
>> at least.
>
> Hi Stefan,
>
> see also https://lkml.org/lkml/2014/5/21/341
Hi Heiko,
Doh, so you guys have been hit by that before. And I had missed the fact that
single_open is special, which makes my change to the upper limit do the wrong
thing. While long-term changing this to vmalloc or iterative reads sounds
better, maybe the change from possible to online CPUs would be more acceptable
as a quick fix... Not sure. At least this gives the matter a bit of attention
again, and it is not only affecting zSeries; x86 is starting to see hardware
that requires a similarly high number of possible CPUs... :)
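As far as I can tell, when show() overflows the buffer, fs/seq_file.c frees
it and retries with a doubled kmalloc(), so clamping only the initial
allocation just moves the oversized allocation into that retry. A small
userspace model of that behaviour (a sketch, not the kernel code):

#include <stdio.h>
#include <stdlib.h>

static size_t show_stat_len = 34304;	/* pretend /proc/stat output size, ~33kB */

int main(void)
{
	size_t size = 32768;		/* initial buffer, clamped to order-3 */
	char *buf = malloc(size);

	while (buf && show_stat_len > size) {	/* show() did not fit into the buffer */
		free(buf);
		size <<= 1;		/* seq_file doubles the size and retries */
		printf("retrying with a %zu byte buffer\n", size);
		buf = malloc(size);	/* back to a larger-than-COSTLY_ORDER request */
	}
	free(buf);
	return 0;
}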
-Stefan
>
> and one possible solution:
>
> https://lkml.org/lkml/2014/5/30/191
>
> and the other one:
>
> https://lkml.org/lkml/2014/6/12/92
> https://lkml.org/lkml/2014/6/12/107
>
> Thanks,
> Heiko
>
On Thu, 12 Jun 2014, Stefan Bader wrote:
> Doh, so you guys have been hit by that before. And I had missed the fact that
> single_open is special, which makes my change to the upper limit do the wrong
> thing. While long-term changing this to vmalloc or iterative reads sounds
> better, maybe the change from possible to online CPUs would be more acceptable
> as a quick fix... Not sure. At least this gives the matter a bit of attention
> again, and it is not only affecting zSeries; x86 is starting to see hardware
> that requires a similarly high number of possible CPUs... :)
>
Heiko's patches that should fix this problem are now in -mm, so it would
be nice to know if there are any existing issues that haven't been
addressed yet with your bug report. See
http://www.ozlabs.org/~akpm/mmotm/
On 25.06.2014 01:44, David Rientjes wrote:
> On Thu, 12 Jun 2014, Stefan Bader wrote:
>
>> Doh, so you guys have been hit by that before. And I had missed the fact that
>> single_open is special, which makes my change to the upper limit do the wrong
>> thing. While long-term changing this to vmalloc or iterative reads sounds
>> better, maybe the change from possible to online CPUs would be more acceptable
>> as a quick fix... Not sure. At least this gives the matter a bit of attention
>> again, and it is not only affecting zSeries; x86 is starting to see hardware
>> that requires a similarly high number of possible CPUs... :)
>>
>
> Heiko's patches that should fix this problem are now in -mm, so it would
> be nice to know if there are any existing issues that haven't been
> addressed yet with your bug report. See
> http://www.ozlabs.org/~akpm/mmotm/
>
Heiko and I both hit the same issue. Since some x86 hardware also reaches a lot
of CPUs (hyperthreads included), we bumped the possible number of CPUs to 256,
at least for the 64-bit kernel. And that resulted in failed accesses to
/proc/stat when memory became fragmented.
So the first patch will avoid this on most systems. I have not seen the problem
myself, but I would expect our reporter to be happy with 1/2 already. For really
excessive hardware 2/2 will close the gap.
Since this is not a critical bug, I am fine with 3.17, too. I have not done so
yet, but I could let our reporter try the patches (which probably would not
verify the second part). I was just waiting to see whether the code settles
down to these changes.
-Stefan
On Wed, 25 Jun 2014, Stefan Bader wrote:
> Heiko and I both hit the same issue. Since some x86 hardware also reaches a lot
> of CPUs (hyperthreads included), we bumped the possible number of CPUs to 256,
> at least for the 64-bit kernel. And that resulted in failed accesses to
> /proc/stat when memory became fragmented.
> So the first patch will avoid this on most systems. I have not seen the problem
> myself, but I would expect our reporter to be happy with 1/2 already. For really
> excessive hardware 2/2 will close the gap.
> Since this is not a critical bug, I am fine with 3.17, too. I have not done so
> yet, but I could let our reporter try the patches (which probably would not
> verify the second part). I was just waiting to see whether the code settles
> down to these changes.
>
Ok, thanks. Looks like
proc-stat-convert-to-single_open_size.patch
fs-seq_file-fallback-to-vmalloc-allocation.patch
are destined for 3.17.
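For reference, the /proc/stat side of the first one should end up looking
roughly like this (a sketch inferred from the patch title, not the exact
upstream diff):

/* fs/proc/stat.c */
static int stat_open(struct inode *inode, struct file *file)
{
	size_t size = 1024 + 128 * num_online_cpus();

	/* rough space estimate for the interrupt counters */
	size += 2 * nr_irqs;
	/* let seq_file allocate a buffer of the requested size instead of
	 * doing the kmalloc() here; the companion patch then lets seq_file
	 * fall back to vmalloc when that allocation fails */
	return single_open_size(file, show_stat, NULL, size);
}

That keeps the sizing hint but moves the buffer allocation, and its failure
handling, into one place in seq_file.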
On Wed, 25 Jun 2014 15:52:30 -0700 (PDT) David Rientjes <[email protected]> wrote:
> On Wed, 25 Jun 2014, Stefan Bader wrote:
>
> > Heiko and I both hit the same issue. Since some x86 hardware also reaches a lot
> > of CPUs (hyperthreads included), we bumped the possible number of CPUs to 256,
> > at least for the 64-bit kernel. And that resulted in failed accesses to
> > /proc/stat when memory became fragmented.
> > So the first patch will avoid this on most systems. I have not seen the problem
> > myself, but I would expect our reporter to be happy with 1/2 already. For really
> > excessive hardware 2/2 will close the gap.
> > Since this is not a critical bug, I am fine with 3.17, too. I have not done so
> > yet, but I could let our reporter try the patches (which probably would not
> > verify the second part). I was just waiting to see whether the code settles
> > down to these changes.
> >
>
> Ok, thanks. Looks like
>
> proc-stat-convert-to-single_open_size.patch
> fs-seq_file-fallback-to-vmalloc-allocation.patch
>
> are destined for 3.17.
Actually I've bumped them into the 3.16 queue, cc stable.
On Thu, Jun 12, 2014 at 03:00:17PM +0200, Stefan Bader wrote:
> When reading from /proc/stat we allocate a large buffer to maximise
> the chances of the results being from a single run and thus internally
> consistent. This buffer is currently sized at 128 * num_possible_cpus()
> which, on kernels configured to handle large systems (256 CPUs and up),
> results in an order-4 allocation or larger. Once system memory becomes
> fragmented, allocations of that order can no longer be guaranteed,
> leading to read failures caused by allocation failures.
> @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
>
> static int stat_open(struct inode *inode, struct file *file)
> {
> - size_t size = 1024 + 128 * num_possible_cpus();
> + size_t size = 1024 + 128 * num_online_cpus();
> char *buf;
> struct seq_file *m;
> int res;
Old thread, and already solved in the meantime, but note that
CONFIG_NR_CPUS _should_ have no bearing on num_possible_cpus().
The arch (x86 does this) should detect at boot time the maximum number of
CPUs the actual hardware supports and set num_possible_cpus() to that
number. So your typical laptop will mostly have num_possible_cpus() <= 4,
even though CONFIG_NR_CPUS could be 4k.
Of course, if you actually do put 256+ CPUs in your system, well, then
the difference between possible and online isn't going to help either.
If on the other hand your 'board' reports it can hold 256 CPUs while in
fact it cannot, go kick your vendor in the nuts.
On 08.07.2014 15:09, Peter Zijlstra wrote:
> On Thu, Jun 12, 2014 at 03:00:17PM +0200, Stefan Bader wrote:
>> When reading from /proc/stat we allocate a large buffer to maximise
>> the chances of the results being from a single run and thus internally
>> consistent. This buffer is currently sized at 128 * num_possible_cpus()
>> which, on kernels configured to handle large systems (256 CPUs and up),
>> results in an order-4 allocation or larger. Once system memory becomes
>> fragmented, allocations of that order can no longer be guaranteed,
>> leading to read failures caused by allocation failures.
>
>> @@ -184,7 +184,7 @@ static int show_stat(struct seq_file *p, void *v)
>>
>> static int stat_open(struct inode *inode, struct file *file)
>> {
>> - size_t size = 1024 + 128 * num_possible_cpus();
>> + size_t size = 1024 + 128 * num_online_cpus();
>> char *buf;
>> struct seq_file *m;
>> int res;
>
> Old thread, and already solved in the meantime, but note that
> CONFIG_NR_CPUS _should_ have no bearing on num_possible_cpus().
>
> The arch (x86 does this) should detect at boot time the maximum number of
> CPUs the actual hardware supports and set num_possible_cpus() to that
> number. So your typical laptop will mostly have num_possible_cpus() <= 4,
> even though CONFIG_NR_CPUS could be 4k.
>
> Of course, if you actually do put 256+ CPUs in your system, well, then
> the difference between possible and online isn't going to help either.
>
> If on the other hand your 'board' reports it can hold 256 CPUs while in
> fact it cannot, go kick your vendor in the nuts.
>
Odd. So at least for the reporter the board must be doing that. Looking at it
the other way round, at least back when we made the change, some of the big
hoovers that end up claiming 64 or so CPUs (including threads) would not boot
with too low a CONFIG_NR_CPUS. So that seems to have changed at some point.
But at least this would explain why there have not been too many reports.
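An easy way to check what a given box actually claims would be a quick
debug print along these lines (a hypothetical helper, not part of any of
the patches):

#include <linux/cpumask.h>
#include <linux/init.h>
#include <linux/printk.h>

/* Compare the compile-time limit with what the firmware/arch code actually
 * registered as possible and with what is currently online. */
static int __init report_cpu_counts(void)
{
	pr_info("NR_CPUS=%d possible=%u online=%u\n",
		NR_CPUS, num_possible_cpus(), num_online_cpus());
	return 0;
}
late_initcall(report_cpu_counts);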
Thanks for the detail,
Stefan