On 2.6 based systems, the top command utilizes /proc/[pid]/wchan to
determine WCHAN symbol name information. This information is provided
by the kernel function kallsyms_lookup(), which expands a stem-compressed
list of kernel symbols in order to provide a symbol name. The decompression
is relatively slow, performing a lot of throw-away work in the form of
strncpy() invocations whose results are overwritten immediately. On systems
with many processes, this results in the kernel consuming a large amount of
time on top's behalf performing the decompression. For example, on a
large SGI Altix system with 4500 current tasks, there is approximately
8 seconds worth of work to be done with each top screen update.
The following patch adds a result cache to kallsyms_lookup(), cacheing
previously decompressed symbol names and using them on subsequent
kallsyms_lookup() calls. With this change, top with a 1 second update
now consumes only about 60% of a CPU on the previously mentioned sytem,
as opposed to the prior 800%. A trivial test program which repeatedly
opens/reads/closes /proc/1/wchan improves from 160 iterations per second
to 30000.
The cache size of 32 items was chosen experimentally. 16 was too few
to cache all the results on even a small system with approximatley 100
tasks, and 64 was too large for all but the most extreme cases. Cacheing
was chosen instead of improving the expansion algorithm due to the large
number of strlen() invocations which would have been necessary with that
solution.
Further work obviously needs to be done for top itself to reduce CPU
utilization even further, but this is a large first step. Some quick
experiments indicate that the slow path has now moved to other areas
of code, which will be addressed in due course.
Signed-off-by: Brent Casavant <[email protected]>
--- linux.orig/kernel/kallsyms.c 2004-03-16 14:13:30.000000000 -0600
+++ linux/kernel/kallsyms.c 2004-06-18 14:28:17.000000000 -0500
@@ -56,6 +56,16 @@
return module_kallsyms_lookup_name(name);
}
+/* Cache of uncompressed symbol names from kallsyms_lookup() */
+#define NAMECACHE_SIZE 32
+static int namecache_index;
+static struct {
+ unsigned long addr;
+ unsigned long nextsym;
+ char name[127];
+} namecache[NAMECACHE_SIZE];
+static rwlock_t namecache_lock;
+
/* Lookup an address. modname is set to NULL if it's in the kernel. */
const char *kallsyms_lookup(unsigned long addr,
unsigned long *symbolsize,
@@ -74,6 +84,22 @@
unsigned long symbol_end;
char *name = kallsyms_names;
+ /* Search for a cached response */
+ read_lock(&namecache_lock);
+ for (i = 0; i < NAMECACHE_SIZE; i++) {
+ if (namecache[i].addr &&
+ namecache[i].addr <= addr &&
+ addr < namecache[i].nextsym) {
+ strncpy(namebuf, namecache[i].name, 127);
+ *symbolsize = namecache[i].nextsym - namecache[i].addr;
+ *modname = NULL;
+ *offset = addr - namecache[i].addr;
+ read_unlock(&namecache_lock);
+ return namebuf;
+ }
+ }
+ read_unlock(&namecache_lock);
+
/* They're sorted, we could be clever here, but who cares? */
for (i = 0; i < kallsyms_num_syms; i++) {
if (kallsyms_addresses[i] > kallsyms_addresses[best] &&
@@ -99,6 +125,15 @@
*symbolsize = symbol_end - kallsyms_addresses[best];
*modname = NULL;
*offset = addr - kallsyms_addresses[best];
+
+ /* Cache the results */
+ write_lock(&namecache_lock);
+ namecache[namecache_index].addr = kallsyms_addresses[best];
+ namecache[namecache_index].nextsym = symbol_end;
+ strncpy(namecache[namecache_index].name, namebuf, 127);
+ namecache_index = (namecache_index + 1) % NAMECACHE_SIZE;
+ write_unlock(&namecache_lock);
+
return namebuf;
}
--
Brent Casavant [email protected] Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
On Fri, 2004-06-18 at 22:03, Brent Casavant wrote:
Hi Brent
just something that cought my eye.
> +static rwlock_t namecache_lock;
should be
static rwlock_t namecache_lock = RW_LOCK_UNLOCKED;
--
/Martin
On Fri, 18 Jun 2004, Martin Josefsson wrote:
> On Fri, 2004-06-18 at 22:03, Brent Casavant wrote:
>
> Hi Brent
>
> just something that cought my eye.
>
> > +static rwlock_t namecache_lock;
>
> should be
>
> static rwlock_t namecache_lock = RW_LOCK_UNLOCKED;
Ah, thank you. Fixed. I'll queue up any other catches before resending.
Thanks,
Brent
--
Brent Casavant [email protected] Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
On Fri, 18 Jun 2004 15:03:00 -0500
Brent Casavant <[email protected]> wrote:
> On 2.6 based systems, the top command utilizes /proc/[pid]/wchan to
> determine WCHAN symbol name information. This information is provided
> by the kernel function kallsyms_lookup(), which expands a stem-compressed
That sounds more like a bug in your top to me. /proc/*/wchan itself
does not access kallsyms, it just outputs a number.
My top doesn't do that.
Are you saying your top reads /proc/kallsyms on each redisplay?
That sounds completely wrong - it should only read the file once
and cache it and then look the numbers it is reading from wchan
in the cache.
Doing the cache in the kernel is the wrong place. This should be fixed
in user space.
As an unrelated comment: i would suggest to avoid rwlocks until
absolutely needed. They are a lot slower than regular spinlocks.
-Andi
On Friday, June 18, 2004 6:03 pm, Andi Kleen wrote:
> On Fri, 18 Jun 2004 15:03:00 -0500
>
> Brent Casavant <[email protected]> wrote:
> > On 2.6 based systems, the top command utilizes /proc/[pid]/wchan to
> > determine WCHAN symbol name information. This information is provided
> > by the kernel function kallsyms_lookup(), which expands a stem-compressed
>
> That sounds more like a bug in your top to me. /proc/*/wchan itself
> does not access kallsyms, it just outputs a number.undisclosed-recipients:;
No, it outputs a string:
jbarnes@mill:~$ cat /proc/1/wchan
do_select
I haven't looked, but it's probably dependent on CONFIG_KALLSYMS.
> Doing the cache in the kernel is the wrong place. This should be fixed
> in user space.
Sure, but that would be a change in behavior. It's arguably the right thing
to do though.
Jesse
On Sat, 19 Jun 2004, Andi Kleen wrote:
> On Fri, 18 Jun 2004 15:03:00 -0500
> Brent Casavant <[email protected]> wrote:
>
> > On 2.6 based systems, the top command utilizes /proc/[pid]/wchan to
> > determine WCHAN symbol name information. This information is provided
> > by the kernel function kallsyms_lookup(), which expands a stem-compressed
>
> That sounds more like a bug in your top to me. /proc/*/wchan itself
> does not access kallsyms, it just outputs a number.
Strange. All the 2.6 boxes I have access to output a symbol name.
I've checked on Debian, SLES9, and Fedora Core 2. It's not
impossible that I have the same whacked-out configuration on all of
them, but they definitely output symbol names, not just numbers.
> My top doesn't do that.
Hmm. procps version 3.1.3 introduced the use of /proc/*/wchan where
possible. I'm using 2.0.13 normally, which appears to have the same
behavior (can't find a changelong for the 2 series at the moment).
Is it your recommendation that I take this up with the maintainers of
procps?
> Are you saying your top reads /proc/kallsyms on each redisplay?
> That sounds completely wrong - it should only read the file once
> and cache it and then look the numbers it is reading from wchan
> in the cache.
It would appear so from the performance improvements I observed upon
implementing this cache.
> Doing the cache in the kernel is the wrong place. This should be fixed
> in user space.
Hard to disagree with that -- as long as we're getting an address
out of the wchan file. But that seems to be a point of some debate. :)
I'll have to get to the bottom of that next Monday (too tired to
think straight at the moment).
> As an unrelated comment: i would suggest to avoid rwlocks until
> absolutely needed. They are a lot slower than regular spinlocks.
The lock isn't very highly contended at all, and assuming the cache
hits almost all of the time, the rwlock seemed better as it would
cause the minimum amount of waiting. That said, my first implementation
used spinlocks, and the difference in performance was lost in the
noise -- meaning we could really go either way and make little
difference which style of lock was used.
Thanks,
Brent
--
Brent Casavant [email protected] Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
On Fri, 18 Jun 2004 19:26:39 -0400
Jesse Barnes <[email protected]> wrote:
> On Friday, June 18, 2004 6:03 pm, Andi Kleen wrote:
> > On Fri, 18 Jun 2004 15:03:00 -0500
> >
> > Brent Casavant <[email protected]> wrote:
> > > On 2.6 based systems, the top command utilizes /proc/[pid]/wchan to
> > > determine WCHAN symbol name information. This information is provided
> > > by the kernel function kallsyms_lookup(), which expands a stem-compressed
> >
> > That sounds more like a bug in your top to me. /proc/*/wchan itself
> > does not access kallsyms, it just outputs a number.undisclosed-recipients:;
>
> No, it outputs a string:
> jbarnes@mill:~$ cat /proc/1/wchan
> do_select
Indeed. I looked at /proc/self/wchan, but of course that is 0 because
the process is running.
But there is numerical wchan anyways - just get it from /proc/*/stat
That is what all 2.4 based tops always used. I bet they still
have the fallback code for that around.The 2.6 change
will just be to read the symbol table from /proc/kallsyms instead
of from the System.map file.
>
> > Doing the cache in the kernel is the wrong place. This should be fixed
> > in user space.
>
> Sure, but that would be a change in behavior. It's arguably the right thing
> to do though.
Change what behaviour? I argue that doing it in the kernel is the wrong
thing.
-Andi
On Fri, 18 Jun 2004 18:31:27 -0500
Brent Casavant <[email protected]> wrote:
> On Sat, 19 Jun 2004, Andi Kleen wrote:
>
> > On Fri, 18 Jun 2004 15:03:00 -0500
> > Brent Casavant <[email protected]> wrote:
> >
> > > On 2.6 based systems, the top command utilizes /proc/[pid]/wchan to
> > > determine WCHAN symbol name information. This information is provided
> > > by the kernel function kallsyms_lookup(), which expands a stem-compressed
> >
> > That sounds more like a bug in your top to me. /proc/*/wchan itself
> > does not access kallsyms, it just outputs a number.
>
> Strange. All the 2.6 boxes I have access to output a symbol name.
> I've checked on Debian, SLES9, and Fedora Core 2. It's not
> impossible that I have the same whacked-out configuration on all of
> them, but they definitely output symbol names, not just numbers.
Yes, that was my mistake. I looked at /proc/self/wchan, which is
probably the only one who outputs a number ...
But anyways you can get the number from /proc/*/stat like in 2.4
>
> > My top doesn't do that.
>
> Hmm. procps version 3.1.3 introduced the use of /proc/*/wchan where
> possible. I'm using 2.0.13 normally, which appears to have the same
> behavior (can't find a changelong for the 2 series at the moment).
> Is it your recommendation that I take this up with the maintainers of
> procps?
Yes. Let them revert to the 2.4 code again, only change that they
should read /proc/kallsyms instead of /boot/System.map
>
> > Are you saying your top reads /proc/kallsyms on each redisplay?
> > That sounds completely wrong - it should only read the file once
> > and cache it and then look the numbers it is reading from wchan
> > in the cache.
>
> It would appear so from the performance improvements I observed upon
> implementing this cache.
I straced a fairly recent SUSE top and it didn't access wchan
in normal operation.
>
> > As an unrelated comment: i would suggest to avoid rwlocks until
> > absolutely needed. They are a lot slower than regular spinlocks.
>
> The lock isn't very highly contended at all, and assuming the cache
> hits almost all of the time, the rwlock seemed better as it would
> cause the minimum amount of waiting. That said, my first implementation
When it starts to get contended the read locks tend to be slower
because they have to bounce more cache lines around.
-Andi
On Friday, June 18, 2004 7:56 pm, Andi Kleen wrote:
> > Sure, but that would be a change in behavior. It's arguably the right
> > thing to do though.
>
> Change what behaviour? I argue that doing it in the kernel is the wrong
> thing.
/proc/<pid>/wchan reports the function name as a string. You're arguing that
doing that in the kernel is the wrong thing to do, right? If so, it would
make sense to change its behavior. Either way, I guess we can fix top to
use /proc/<pid>/stat instead, and lookup symbols in an external System.map
file.
Jesse
Andi Kleen writes:
> Brent Casavant writes:
>> On 2.6 based systems, the top command utilizes /proc/[pid]/wchan to
>> determine WCHAN symbol name information. This information is provided
>> by the kernel function kallsyms_lookup(), which expands a stem-compressed
>
> That sounds more like a bug in your top to me. /proc/*/wchan itself
> does not access kallsyms, it just outputs a number.
>
> My top doesn't do that.
To get a working WCHAN column on a 2.6.xx kernel, your top
must do that. You also must have CONFIG_KALLSYMS enabled.
> Are you saying your top reads /proc/kallsyms on each redisplay?
> That sounds completely wrong - it should only read the file once
> and cache it and then look the numbers it is reading from wchan
> in the cache.
>
> Doing the cache in the kernel is the wrong place. This should be fixed
> in user space.
No way, because:
1. kernel modules may be loaded or unloaded at any time
2. the /proc/*/wchan files don't provide both name and address
I'd be happy to make top (and the rest of procps) use a cache
if those problems were addressed. I need a signal sent on
module load/unload, or a real /proc/kallsyms st_mtime that I
can poll. I also need to have the numeric wchan address in
the /proc/*/wchan file, such that it is reliably the same
thing as the function name already found there.
Jesse Barnes writes:
> /proc/<pid>/wchan reports the function name as a string.
> You're arguing that doing that in the kernel is the wrong
> thing to do, right? If so, it would make sense to change
> its behavior. Either way, I guess we can fix top to use
> /proc/<pid>/stat instead, and lookup symbols in an external
> System.map file.
This is exactly what top does when /proc/*/wchan is missing.
It does not work well, due to kernel modules, and also due
to the inability to reliably determine if a System.map file
matches the currently running kernel. Also, simply parsing
that file is slow due to the size of it.
Brent Casavant also writes:
> The cache size of 32 items was chosen experimentally. 16 was
> too few to cache all the results on even a small system with
> approximatley 100 tasks, and 64 was too large for all but the
> most extreme cases.
I found that you need about 64 when using a simple hash,
but going to 256 is really cheap so why not? The old procps
symbol cache code (now unused) worked well. I notice that
you are scanning an array for ranges; it is better to hash
on exact address matches. Also, failures get cached and no
attempt was made to keep multiple items in a hash bucket.
Here it is. The rest of the code should be obvious.
typedef struct symb {
const char *name;
unsigned KLONG addr;
} symb;
...
static const symb fail = { "?", 0 };
...
static symb hashtable[256];
...
hash = (address >> 4) & 0xff;
> Further work obviously needs to be done for top itself to reduce CPU
> utilization even further, but this is a large first step. Some quick
> experiments indicate that the slow path has now moved to other areas
> of code, which will be addressed in due course.
Jim and I have already used profilers on top. He mostly used
gprof. I mostly used gcc's ability to call some custom code on
function entry and exit, counting either the function calls or
the CPU cycles they consumed. So, lots of luck to you, but...
(eh, this assumes you're using the all-new top in procps 3.x.x)
If you want things fast, provide the DWARF2 debug info
needed to make use of /dev/mem. >:-) The LKCD project
might have a patch for this; they were converting from
STABS to DWARF2 last I heard.
Brent Casavant later writes:
> Hmm. procps version 3.1.3 introduced the use of /proc/*/wchan
> where possible. I'm using 2.0.13 normally, which appears to
> have the same behavior (can't find a changelong for the 2
> series at the moment).
Oh. Well. There's your problem. The procps-3.x.x code will only
read /proc/*/wchan for the processes shown on your screen.
It also doesn't corrupt wchan with a buffer overflow.
Albert Cahalan <[email protected]> wrote:
>
>> Are you saying your top reads /proc/kallsyms on each redisplay?
>> That sounds completely wrong - it should only read the file once
>> and cache it and then look the numbers it is reading from wchan
>> in the cache.
>>
>> Doing the cache in the kernel is the wrong place. This should be fixed
>> in user space.
>
> No way, because:
>
> 1. kernel modules may be loaded or unloaded at any time
We seem to have coped alright under 2.4.
--
Visit Openswan at http://www.openswan.org/
Email: Herbert Xu ~{PmV>HI~} <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt
Albert Cahalan <[email protected]> wrote:
>
> > Doing the cache in the kernel is the wrong place. This should be fixed
> > in user space.
>
> No way, because:
>
> 1. kernel modules may be loaded or unloaded at any time
Poll /proc/modules?
> 2. the /proc/*/wchan files don't provide both name and address
/proc/stat has the address, /proc/kallsyms gives the symbol?
> I'd be happy to make top (and the rest of procps) use a cache
> if those problems were addressed. I need a signal sent on
> module load/unload, or a real /proc/kallsyms st_mtime that I
> can poll. I also need to have the numeric wchan address in
> the /proc/*/wchan file, such that it is reliably the same
> thing as the function name already found there.
Updating mtime on /proc/modules may be more logical, or even both.
Or put a modprobe sequence number somewhere in /proc and look for changes
in that.
It definitely needs to be fixed, but it doesn't seem that adding code to
the kernel is needed.
On Sat, 2004-06-19 at 00:51, Herbert Xu wrote:
> Albert Cahalan <[email protected]> wrote:
> >
> >> Are you saying your top reads /proc/kallsyms on each redisplay?
> >> That sounds completely wrong - it should only read the file once
> >> and cache it and then look the numbers it is reading from wchan
> >> in the cache.
> >>
> >> Doing the cache in the kernel is the wrong place. This should be fixed
> >> in user space.
> >
> > No way, because:
> >
> > 1. kernel modules may be loaded or unloaded at any time
>
> We seem to have coped alright under 2.4.
Not really. We coped alright long ago. Things were
getting worse year by year, with modules becoming
more popular and their symbols not 100% exported.
On Sat, 2004-06-19 at 06:06, Andrew Morton wrote:
> Albert Cahalan <[email protected]> wrote:
>>> Doing the cache in the kernel is the wrong place.
>>> This should be fixed in user space.
>>
>> No way, because:
>>
>> 1. kernel modules may be loaded or unloaded at any time
>
> Poll /proc/modules?
Ugh. I like the sequence number suggestion better.
This is about speeding things up after all.
A signal that I could register to get would be best of all.
>> 2. the /proc/*/wchan files don't provide both name and address
>
> /proc/stat has the address, /proc/kallsyms gives the symbol?
Eh, why did we add the /proc/*/wchan files then if not
to use them?
One problem with parsing /proc/kallsyms is that I can't mmap
it and do a binary search. (the System.map was done this way)
It looks like I'd have to read half a megabyte of text, then
sort it for later use. It's highly desirable that the procps
tools not eat lots of memory or suffer high start-up costs.
>> I'd be happy to make top (and the rest of procps) use a cache
>> if those problems were addressed. I need a signal sent on
>> module load/unload, or a real /proc/kallsyms st_mtime that I
>> can poll. I also need to have the numeric wchan address in
>> the /proc/*/wchan file, such that it is reliably the same
>> thing as the function name already found there.
>
> Updating mtime on /proc/modules may be more logical, or even both.
>
> Or put a modprobe sequence number somewhere in /proc and
> look for changes in that.
>
> It definitely needs to be fixed, but it doesn't seem that
> adding code to the kernel is needed.
I'm not so sure anything needs to be fixed, save for SGI upgrading
to a more modern procps. There are many more important things:
Supply a real SUSv3-compliant %CPU
Supply whole-process CPU usage data
Add a /proc/*/tty symlink
Allocated PIDs by least-recently-used for safety
Add an "adopted child" flag (for processes reparented to init)
Supply the sleep time (in seconds for example)
Supply jobc, the "count of processes qualifying PGID for job control"
Supply the raw task_struct and the needed debug info
On Sat, 19 Jun 2004, Albert Cahalan wrote:
> I'm not so sure anything needs to be fixed, save for SGI upgrading
> to a more modern procps. There are many more important things:
OK, I looked into this more closely, and gave procps 3.2.1 a spin.
This gives us a similar speedup (top now only consumes 60% of a CPU)
to that which I obtained by cacheing symbol lookups and using an old
procps. This difference is certainly explainable by top now only
obtaining wchan information for displayed processes.
60% is far better than 800%, so this is certainly progress. However
60% is also still quite a bit of CPU time. I'll spend some cycles
trying to whittle it down some more, but I'm not all that hopeful.
Thanks for the discussion, it was certainly enlightening.
Brent
--
Brent Casavant [email protected] Forget bright-eyed and
Operating System Engineer http://www.sgi.com/ bushy-tailed; I'm red-
Silicon Graphics, Inc. 44.8562N 93.1355W 860F eyed and bushy-haired.
On Tuesday, June 22, 2004 3:45 pm, Brent Casavant wrote:
> On Sat, 19 Jun 2004, Albert Cahalan wrote:
> > I'm not so sure anything needs to be fixed, save for SGI upgrading
> > to a more modern procps. There are many more important things:
>
> OK, I looked into this more closely, and gave procps 3.2.1 a spin.
> This gives us a similar speedup (top now only consumes 60% of a CPU)
> to that which I obtained by cacheing symbol lookups and using an old
> procps. This difference is certainly explainable by top now only
> obtaining wchan information for displayed processes.
>
> 60% is far better than 800%, so this is certainly progress. However
> 60% is also still quite a bit of CPU time. I'll spend some cycles
> trying to whittle it down some more, but I'm not all that hopeful.
Is the 60% mainly kernel or user time now? There were some good ideas in this
thread that we could use to speed up symbol table searching. And if opening
and reading /proc/kallsyms is taking a significant portion of the time, we
could probably trim it down too by only reading it once and polling for
module insertion/removal.
Jesse