I just did some profiling of a simple "make test" in the git repo, and
was surprised by the top kernel offender: get_vmalloc_info() showed up
at roughly 4% cpu use.
It turns out that bash ends up reading /proc/meminfo on every single
activation, and "make test" is basically just running a huge
collection of shell scripts. You can verify by just doing

strace -o trace sh -c "echo"

to see what bash does on your system. I suspect it's actually glibc,
because a quick google finds the function "get_phys_pages()" that just
looks at the "MemTotal" line (or possibly get_avphys_pages(), which
looks at the "MemFree" line).
Ok, so bash is insane for caring so deeply that it does this
regardless of anything else. But what else is new - user space does
odd things. It's practically a truism.
My gut feel for this is that we should just rate-limit this and cache
the vmalloc information for a fraction of a second or something. Maybe
we could expose total memory sizes in some more efficient format, but
it's not like existing binaries will magically de-crapify themselves,
so just speeding up meminfo sounds like a good thing.
Maybe we could even cache the whole seqfile buffer - Al? How painful
would something like that be? Although from the profiles, it's really
just the vmalloc info gathering that shows up as actually wasting CPU
cycles..
Linus
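[For context, what get_phys_pages() and get_avphys_pages() apparently boil down to is opening /proc/meminfo and scanning for the right line on every single call. A simplified userspace sketch of that pattern (not glibc's actual code; the helper name is made up):]

#include <stdio.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper, not glibc's implementation: scan /proc/meminfo
 * for a given field and convert its kB value to pages.
 * pages_from_meminfo("MemTotal:") ~ get_phys_pages(),
 * pages_from_meminfo("MemFree:")  ~ get_avphys_pages(). */
static long pages_from_meminfo(const char *field)
{
        FILE *f = fopen("/proc/meminfo", "r");
        char line[128];
        long kb = -1;

        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f)) {
                /* lines look like "MemTotal:       16326492 kB" */
                if (strncmp(line, field, strlen(field)) == 0) {
                        sscanf(line + strlen(field), "%ld", &kb);
                        break;
                }
        }
        fclose(f);
        return kb < 0 ? -1 : kb / (sysconf(_SC_PAGESIZE) / 1024);
}

[Every call re-opens and re-parses the whole file, so every shell startup turns into a full /proc/meminfo read, vmalloc statistics and all.]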
On Wed, 12 Aug 2015 20:29:34 -0700 Linus Torvalds <[email protected]> wrote:
> I just did some profiling of a simple "make test" in the git repo, and
> was surprised by the top kernel offender: get_vmalloc_info() showed up
> at roughly 4% cpu use.
>
> It turns out that bash ends up reading /proc/meminfo on every single
> activation, and "make test" is basically just running a huge
> collection of shell scripts. You can verify by just doing
>
> strace -o trace sh -c "echo"
>
> to see what bash does on your system. I suspect it's actually glibc,
> because a quick google finds the function "get_phys_pages()" that just
> looks at the "MemTotal" line (or possibly get_avphys_pages(), which
> looks at the "MemFree" line).
And bash surely isn't interested in vmalloc stats. Putting all these
things in the same file wasn't the smartest thing we've ever done.
> Ok, so bash is insane for caring so deeply that it does this
> regardless of anything else. But what else is new - user space does
> odd things. It's practically a truism.
>
> My gut feel for this is that we should just rate-limit this and cache
> the vmalloc information for a fraction of a second or something. Maybe
> we could expose total memory sizes in some more efficient format, but
> it's not like existing binaries will magically de-crapify themselves,
> so just speeding up meminfo sounds like a good thing.
>
> Maybe we could even cache the whole seqfile buffer - Al? How painful
> would something like that be? Although from the profiles, it's really
> just the vmalloc info gathering that shows up as actually wasting CPU
> cycles..
>
Do your /proc/meminfo vmalloc numbers actually change during that build?
Mine don't. Perhaps we can cache the most recent vmalloc_info and
invalidate that cache whenever someone does a vmalloc/vfree/etc.
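[As a rough sketch, that invalidation idea could look something like the snippet below. Illustrative only: the static names are made up, there is no locking or memory ordering here, and it assumes the existing struct vmalloc_info / get_vmalloc_info() pair.]

#include <linux/vmalloc.h>

/* Illustrative sketch, not a real patch: made-up names, no ordering. */
static struct vmalloc_info cached_vmi;
static bool cached_vmi_stale = true;

/* Would be called from vmalloc()/vfree() and friends after they
 * change the vmap area lists. */
static void vmalloc_info_invalidate(void)
{
        cached_vmi_stale = true;
}

/* Would be called from the /proc/meminfo path instead of calling
 * get_vmalloc_info() directly. */
static void get_vmalloc_info_cached(struct vmalloc_info *vmi)
{
        if (cached_vmi_stale) {
                /* Clear the flag first so a racing vmalloc/vfree
                 * re-dirties the cache for the next reader. */
                cached_vmi_stale = false;
                get_vmalloc_info(&cached_vmi);
        }
        *vmi = cached_vmi;
}

[As the rest of the thread points out, doing this for real means deciding how much of that raciness is acceptable.]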
On Wed, Aug 12, 2015 at 9:00 PM, Andrew Morton
<[email protected]> wrote:
>
> Do your /proc/meminfo vmalloc numbers actually change during that build?
> Mine don't. Perhaps we can cache the most recent vmalloc_info and
> invalidate that cache whenever someone does a vmalloc/vfree/etc.
Sure, that works too.
Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
pretty strange in setup_vmalloc_vm(), for example. If that newly
allocated "va" that we haven't even exposed to anybody yet has its
address or size changed, we're screwed in so many ways.
I get the feeling this file should be rewritten. But that's not going
to happen. The "let's just cache the last value for one jiffy" seemed
to be the minimal fixup to it.
Linus
On Thu, Aug 13 2015, Linus Torvalds <[email protected]> wrote:
> On Wed, Aug 12, 2015 at 9:00 PM, Andrew Morton
> <[email protected]> wrote:
>>
>> Do your /proc/meminfo vmalloc numbers actually change during that build?
>> Mine don't. Perhaps we can cache the most recent vmalloc_info and
>> invalidate that cache whenever someone does a vmalloc/vfree/etc.
>
> Sure, that works too.
>
> Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
> pretty strange in setup_vmalloc_vm(), for example. If that newly
> allocated "va" that we haven't even exposed to anybody yet has its
> address or size changed, we're screwed in so many ways.
>
> I get the feeling this file should be rewritten. But that's not going
> to happen. The "let's just cache the last value for one jiffy" seemed
> to be the minimal fixup to it.
>
I think it's simpler and better to fix glibc. Looking at the history,
the code for get_[av]phys_pages was added in 1996 (in the git repo
commit 845dcb57), with comments such as
/* Return the number of pages of physical memory in the system. There
   is currently (as of version 2.0.21) no system call to determine the
   number. It is planned for the 2.1.x series to add this, though.
and
/* XXX Here will come a test for the new system call. */
And that syscall seems to be sysinfo(). So even though sysinfo() also
returns much more than required, it is still more than an order of
magnitude faster than reading and parsing /proc/meminfo (in the quick
microbench I threw together):
#include <stdio.h>
#include <sys/sysinfo.h>
#include <rdtsc.h>

void do_get_phys_pages(void)
{
        get_phys_pages();
}
void do_get_avphys_pages(void)
{
        get_avphys_pages();
}
void do_sysinfo(void)
{
        struct sysinfo info;
        sysinfo(&info);
}

void time_this(const char *name, void (*f)(void), int rep)
{
        int i;
        unsigned long start, stop;
        start = rdtsc();
        for (i = 0; i < rep; ++i)
                f();
        stop = rdtsc();
        printf("%-20s\t%d\t%lu\t%.1f\n", name, rep, stop-start, (double)(stop-start)/rep);
}
#define time_this(f, rep) time_this(#f, do_ ## f, rep)

int main(void)
{
        time_this(sysinfo, 1);
        time_this(get_phys_pages, 1);
        time_this(get_avphys_pages, 1);
        time_this(sysinfo, 1);
        time_this(get_phys_pages, 1);
        time_this(get_avphys_pages, 1);
        time_this(sysinfo, 10);
        time_this(get_phys_pages, 10);
        time_this(get_avphys_pages, 10);
        return 0;
}
$ ./sysinfo
sysinfo                  1      6056      6056.0
get_phys_pages           1    226744    226744.0
get_avphys_pages         1     84480     84480.0
sysinfo                  1      2288      2288.0
get_phys_pages           1     73216     73216.0
get_avphys_pages         1     76692     76692.0
sysinfo                 10      6856       685.6
get_phys_pages          10    626936     62693.6
get_avphys_pages        10    604440     60444.0
Rasmus
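[For comparison, a sysinfo()-based equivalent of the two glibc calls is only a few lines. A sketch, not a proposed glibc patch; it just shows the arithmetic, using the documented totalram/freeram/mem_unit fields of struct sysinfo:]

#include <sys/sysinfo.h>
#include <unistd.h>

/* Hypothetical replacement: derive page counts from sysinfo() instead
 * of parsing /proc/meminfo. */
static long phys_pages_via_sysinfo(int free_only)
{
        struct sysinfo info;
        unsigned long long bytes;

        if (sysinfo(&info) != 0)
                return -1;
        /* mem_unit is the size in bytes of the totalram/freeram units */
        bytes = (unsigned long long)(free_only ? info.freeram : info.totalram)
                * info.mem_unit;
        return bytes / sysconf(_SC_PAGESIZE);
}

[One syscall, no string parsing, and no detour through the vmalloc statistics.]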
On Thu, Aug 13, 2015 at 12:42 AM, Rasmus Villemoes
<[email protected]> wrote:
>
> I think it's simpler and better to fix glibc.
Well, I certainly don't disagree, but at the same time I suspect that
(a) many distros will not update glibc very aggressively and (b) we
should fix the unnecessarily expensive kernel operation regardless.
I think that just printing out the ASCII numbers in /proc/meminfo
ought to be a lot more expensive than computing them. Which is clearly
not the case now, because we do that crazy expensive dynamic
computation every time, even though the information hardly ever
changes (and nobody even really cares about the values).
So I'd prefer to fix the kernel for existing insane users (because the
kernel is just doing stupid things that it didn't expect anybody to
care about), _and_ fix glibc.
Linus
On Wed, Aug 12, 2015 at 10:52 PM, Linus Torvalds
<[email protected]> wrote:
>
> I get the feeling this file should be rewritten. But that's not going
> to happen. The "let's just cache the last value for one jiffy" seemed
> to be the minimal fixup to it.
Here's a totally untested patch (I'll reboot and test soon - it does
at least compile for me).
Notice the total lack of locking, which means that it's fundamentally
racy. I really can't find it inside myself to care. Introducing those
vmalloc fields was a mistake to begin with, any races here are "ok, we
get values that were valid at some point, but it might have been a
second ago".
And I also can't find it in myself to care about the "on 32-bit,
jiffies wraps in 49 days if HZ is 1000". If somebody carefully avoids
ever reading /proc/meminfo for 49 days, and then reads it in _just_
the right second, and gets a really stale value, I'm just going to do
my honey badger impression.
Because we really shouldn't have added the vmalloc information to
/proc/meminfo to begin with, and nobody ever cares about those values
anyway.
Comments?
Linus
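[The patch itself isn't quoted in the thread, but a minimal sketch of the approach described above (reuse the last result for up to a second, no locking, knowingly racy) could look like this; the static names are illustrative rather than taken from the actual patch:]

#include <linux/jiffies.h>
#include <linux/vmalloc.h>

/* Illustrative sketch only, not the actual patch. */
static struct vmalloc_info cached_vmi;
static unsigned long cached_vmi_jiffies;

static void get_vmalloc_info_cached(struct vmalloc_info *vmi)
{
        unsigned long now = jiffies;

        /* Deliberately lockless and racy: a value up to a second old
         * (or, across a 32-bit jiffies wrap, occasionally older) is
         * considered acceptable here. Unsigned subtraction keeps the
         * reuse window to HZ jiffies. */
        if (now - cached_vmi_jiffies >= HZ) {
                get_vmalloc_info(&cached_vmi);
                cached_vmi_jiffies = now;
        }
        *vmi = cached_vmi;
}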
On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
<[email protected]> wrote:
>
> Here's a totally untested patch (I'll reboot and test soon - it does
> at least compile for me).
Seems to work for me. Anybody want to object to the admittedly disgusting patch?
Linus
On Thu, 13 Aug 2015, Linus Torvalds wrote:
> On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Here's a totally untested patch (I'll reboot and test soon - it does
> > at least compile for me).
>
> Seems to work for me. Anybody want to object to the admittedly disgusting patch?
>
Rather than a time based approach, why not invalidate when it is known to
change, either on the next call to get_vmalloc_info() or as part of the
vmalloc operation itself? If the numbers don't change during your test,
this would seem better and I don't think any vmalloc() function is
intended to be hot.
On Thu, Aug 13, 2015 at 12:50 PM, David Rientjes <[email protected]> wrote:
>
> Rather than a time based approach, why not invalidate when it is known to
> change, either on the next call to get_vmalloc_info() or as part of the
> vmalloc operation itself?
I started doing that, looking at all the users of vmap_area_lock, and
decided that it's just too messy for me. I wanted something minimal
and obvious. The vmap_area handling is not obvious, especially since
the whole vmalloc statistics don't just depend on the vmap area list,
but also take the individual flags into account (ie it counts
VM_LAZY_FREE[ING] entries as having already been removed etc).
So I started looking at actually turning the vmap_area_lock into a
seqlock or even a rwlock (because some of the users are pure readers)
just to clarify things, and that wasn't hard per se, but I threw up my
hands in disgust. The code that modifies the thing is full of "take
lock, look up, drop lock, do something, take lock again" etc.
Yes, we could just sprinkle "invalidate_cache = 1" statements around
in all the places that can cause this, but that would be pretty ad-hoc
and ugly too. And since the reader side is actually entirely lockless
(currently using rcu, with my patch it has the additional lockless and
racy cache reading), to do it *right* you'd actually need memory
ordering rules at a minimum. So it would be fairly subtle too.
In contrast, my "just base it on time" sure as hell isn't subtle. It's
not great, but at least it's obvious and the crazy logic is
localized..
So I'd be all for somebody actually taking the time to do it right.
I'm pretty sure nobody really cares enough, though.
Willing to prove me wrong?
Linus
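[For the record, "doing it right" with an invalidation flag would mean something along these lines. This is only a sketch of the ordering problem being referred to, with made-up names; the pairing here is exactly the subtle part, not settled code:]

#include <linux/atomic.h>
#include <linux/vmalloc.h>

/* Illustrative sketch of a lockless invalidation flag with explicit
 * ordering -- not code from this thread. */
static struct vmalloc_info cached_vmi;
static atomic_t vmi_stale = ATOMIC_INIT(1);

/* Writer side: after vmalloc()/vfree() have updated their lists. */
static void vmalloc_info_dirty(void)
{
        /* Make the list updates visible before readers see the flag. */
        smp_wmb();
        atomic_set(&vmi_stale, 1);
}

/* Reader side: the /proc/meminfo path. */
static void get_vmalloc_info_cached(struct vmalloc_info *vmi)
{
        /* Clear the flag *before* recomputing, so an update that races
         * with the recomputation re-dirties the cache and the next
         * reader recomputes again. atomic_xchg() provides the ordering
         * needed against the smp_wmb() above. */
        if (atomic_xchg(&vmi_stale, 0))
                get_vmalloc_info(&cached_vmi);

        /* Concurrent readers can still hand out a slightly stale or
         * torn cached_vmi; the position in this thread is that such
         * races are acceptable for these statistics. */
        *vmi = cached_vmi;
}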
On Wed, 12 Aug 2015 22:52:44 -0700 Linus Torvalds <[email protected]> wrote:
> Looking at that mm/vmalloc.c file, the locking is pretty odd. It looks
> pretty strange in setup_vmalloc_vm(), for example. If that newly
> allocated "va" that we haven't even exposed to anybody yet has its
> address or size changed, we're screwed in so many ways.
>
> I get the feeling this file should be rewritten. But that's not going
> to happen.
Nick Piggin started doing a rewrite in 2008 but it was never completed
(https://lwn.net/Articles/285341/).
On Thu, Aug 13, 2015 at 11:32 AM, Linus Torvalds
<[email protected]> wrote:
> And I also can't find it in myself to care about the "on 32-bit,
> jiffies wraps in 49 days if HZ is 1000". If somebody carefully avoids
> ever reading /proc/meminfo for 49 days, and then reads it in _just_
> the right second, and gets a really stale value, I'm just going to do
> my honey badger impression.
could you at least care enough to write that as
if (time_after(now, last + HZ)) {
-Tony
On Thu, Aug 13, 2015 at 2:15 PM, Tony Luck <[email protected]> wrote:
>
> could you at least care enough to write that as
>
> if (time_after(now, last + HZ)) {
Absolutely not. That would be wrong.
"time_after()" covers half the "unsigned long" space (very much on
purpose). We want the cached case to trigger _only_ for the much
tighter case of exactly 1000 jiffies.
Linus
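[To spell out the difference between the two tests (an illustrative snippet, not taken from the patch):]

#include <linux/jiffies.h>

/* 'last' is the jiffies value recorded when the cache was refreshed. */

static bool cache_is_fresh_time_after(unsigned long now, unsigned long last)
{
        /* !time_after(now, last + HZ) is true for half of the unsigned
         * long space, so a badly stale 'last' can keep looking "fresh"
         * for up to ~2^31 jiffies. */
        return !time_after(now, last + HZ);
}

static bool cache_is_fresh_window(unsigned long now, unsigned long last)
{
        /* Plain unsigned subtraction: fresh only while 'now' lies in
         * [last, last + HZ), i.e. a window of exactly HZ jiffies
         * (modulo the once-per-49-days wrap dismissed above). */
        return now - last < HZ;
}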