2003-02-20 16:23:54

by Ingo Molnar

Subject: [patch] procfs/procps threading performance speedup, 2.5.62


Lots of people have requested that threads should show up in /proc again,
to be able to look at per-thread CPU usage, activity, and generally, to
ease the debugging of threaded apps.

the main problem with threads in /proc is that there's a big slowdown when
using lots of threads. Here are some runtime numbers in seconds, under
2.5.62, on a 525MHz PIII box:

# of threads (*):        1000     2000     4000     8000    16000
-----------------------------------------------------------------
ps                       0.77     1.52     3.13     6.44    13.81
ps -axm                  0.93     1.85     3.78     7.73    18.37
top -d 0 -n 0            0.75     1.53     3.23     7.60    22.12

[ (*): the system is completely idle, all threads are sleeping. Overhead
is combined system and userspace overhead measured via 'time'.]

ie. the overhead is really massive, eg. with 16K threads running, procps
is basically unusable for any administration or debugging work. And there
are users that want 50K or more threads. So clearly, this state of procfs
and procps is unacceptable - who would use 'top' to check a system's state
if a single screen-refresh takes 22 seconds?!

in the above timings, only 'ps -axm' actually displays every thread; all
other commands produce only a few lines of output. The reason for the
overhead is two-fold:

1) there's significant kernel overhead in reading large /proc directories:
the overhead of many readdir()'s is O(N^2). The main overhead is in
get_pid_list(), which has to loop over an increasing number of threads
to find the next intended batch of PIDs.

to fix this overhead i've introduced a 'lookup cursor' cookie, which is
cached in filp->private_data, across readdir() [getdents64()] calls. If
the cursor matches then we skip all the overhead of skipping threads. If
the cursor is not available then we fall back to the old-style skipping
algorithm.

2) procps is forced to parse every thread in /proc to build up accurate
'process CPU usage' counters. The parsing and accessing of every
/proc/PID/stat file is necessary because CPU statistics are scattered
across all threads.

the fix for this is two-fold. First, it must be possible for procps to
separate 'threads' from 'processes' without having to go into 16 thousand
directories. I solved this by prefixing 'threads' (ie. non-group-leader
threads) with a dot ('.') character in the /proc listing:

$ ls -a /proc
. 16994 .17078 412 7 execdomains locks stat
.. 16995 .17079 460 8 filesystems meminfo swaps
1 17031 .17080 469 9 fs misc sys
16864 17033 .17081 5 92 ide mounts sysvipc
16866 17034 .17082 515 buddyinfo interrupts mtrr tty
16867 17072 17113 516 bus iomem net uptime
16946 .17073 2 517 cmdline ioports partitions version
16948 .17074 3 518 cpuinfo irq profile vmstat
16949 .17075 390 519 devices kcore scsi
16989 .17076 4 520 dma kmsg self
16992 .17077 400 6 driver loadavg slabinfo

the .17073 ... .17082 entries belong to the thread-group 17072.

The key here is for procps to be able to parse threads without having to
call into the kernel 16K times. The dot-approach also has the added
benefit of 'hiding' threads in the default 'ls /proc' listing.
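as a rough illustration of the userspace side (a minimal sketch, not code
from Alex's procps patch), a tool that only needs processes can now filter
on the first character of the directory name alone:

#include <ctype.h>
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* list only thread-group leaders: their /proc entries start with a
 * digit, while non-leader threads show up as ".<tid>" and can be
 * skipped without ever opening /proc/<pid>/stat. */
int main(void)
{
        DIR *proc = opendir("/proc");
        struct dirent *de;

        if (!proc)
                return 1;
        while ((de = readdir(proc)) != NULL) {
                if (!isdigit((unsigned char)de->d_name[0]))
                        continue;  /* skips ".", "..", ".<tid>", "cpuinfo", ... */
                printf("pid %ld\n", strtol(de->d_name, NULL, 10));
        }
        closedir(proc);
        return 0;
}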

the other change needed was the ability to read cumulative CPU usage
statistics from the thread group leader. I've introduced 4 new fields in
/proc/PID/stat for that purpose; the kernel keeps them up to date across
fork/exit and in the timer interrupt - it's very low-overhead.
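to show how the new fields can be consumed, here is a minimal sketch
(assuming, as in the patch below, that the four group_* counters are
simply appended as the last four fields of /proc/PID/stat; the helper
name is made up):

#include <stdio.h>

/* read the whole-process CPU times (in clock ticks) of a thread group
 * leader, assuming group_utime/group_stime/group_cutime/group_cstime
 * are the last four fields of /proc/<pid>/stat. */
static int read_group_times(int pid, unsigned long t[4])
{
        char path[64], buf[1024], *p;
        size_t len;
        FILE *f;
        int i;

        snprintf(path, sizeof(path), "/proc/%d/stat", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        len = fread(buf, 1, sizeof(buf) - 1, f);
        fclose(f);
        buf[len] = '\0';

        /* walk back over the last four space-separated fields, so a
         * space inside the comm field cannot confuse the parser */
        p = buf + len;
        for (i = 0; i < 4; i++) {
                while (p > buf && (p[-1] == ' ' || p[-1] == '\n'))
                        p--;
                while (p > buf && p[-1] != ' ')
                        p--;
        }
        return sscanf(p, "%lu %lu %lu %lu",
                      &t[0], &t[1], &t[2], &t[3]) == 4 ? 0 : -1;
}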

the attached patch, against 2.5.62-BK, implements these kernel features.

Alex Larsson has modified procps for these new kernel capabilities, the
new procps package (or the patch against upstream procps) can be
downloaded from:

http://people.redhat.com/alexl/procps/

here are the performance measurements (with stock procps+procfs numbers in
parentheses):

# of threads:            1000     2000     4000     8000    16000
-----------------------------------------------------------------
ps                       0.02     0.03     0.03     0.03     0.04
                       (0.77)   (1.52)   (3.13)   (6.44)  (13.81)

ps -axm                  0.89     1.72     3.40     6.87    15.57
                       (0.93)   (1.85)   (3.78)   (7.73)  (18.37)

top -d 0 -n 0            0.11     0.12     0.12     0.13     0.16
                       (0.75)   (1.53)   (3.23)   (7.60)  (22.12)

eg. with 16K threads running in the background, a single 'top'
screen-refresh got more than 130 times faster. A simple 'ps' got more than
340 times faster ... Even the 'ps -axm' (which displays all threads)
command got faster, due to the pid-cursor. But even with just 1000 threads
running, a simple 'ps' is 30 times faster, and top refresh is 6 times
faster.

another advantage of this approach is that old procps is fully compatible
with the new kernel, and new procps is fully compatible with old kernels.
Plus everything is still encoded in the ASCII namespace; no binary
interfaces are used.

the patch works just fine on my boxes. (the patch is also included in the
threading backport, in the rawhide 2.4 kernel, and has been in use for a
couple of weeks already.)

Ingo

--- linux/include/linux/sched.h.orig
+++ linux/include/linux/sched.h
@@ -359,6 +359,7 @@ struct task_struct {
unsigned long it_real_incr, it_prof_incr, it_virt_incr;
struct timer_list real_timer;
unsigned long utime, stime, cutime, cstime;
+ unsigned long group_utime, group_stime, group_cutime, group_cstime;
u64 start_time;
/* mm fault and swap info: this can arguably be seen as either mm-specific or thread-specific */
unsigned long min_flt, maj_flt, nswap, cmin_flt, cmaj_flt, cnswap;
@@ -640,6 +641,9 @@ extern void kick_if_running(task_t * p);
* Careful: do_each_thread/while_each_thread is a double loop so
* 'break' will not work as expected - use goto instead.
*/
+#define __do_each_thread(g, t) \
+ for ( ; (g = t = next_task(g)) != &init_task ; ) do
+
#define do_each_thread(g, t) \
for (g = t = &init_task ; (g = t = next_task(g)) != &init_task ; ) do

--- linux/fs/proc/array.c.orig
+++ linux/fs/proc/array.c
@@ -336,7 +336,7 @@ int proc_pid_stat(struct task_struct *ta
read_unlock(&tasklist_lock);
res = sprintf(buffer,"%d (%s) %c %d %d %d %d %d %lu %lu \
%lu %lu %lu %lu %lu %ld %ld %ld %ld %ld %ld %llu %lu %ld %lu %lu %lu %lu %lu \
-%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu\n",
+%lu %lu %lu %lu %lu %lu %lu %lu %d %d %lu %lu %lu %lu %lu %lu\n",
task->pid,
task->comm,
state,
@@ -381,8 +381,12 @@ int proc_pid_stat(struct task_struct *ta
task->exit_signal,
task_cpu(task),
task->rt_priority,
- task->policy);
- if(mm)
+ task->policy,
+ task->group_utime,
+ task->group_stime,
+ task->group_cutime,
+ task->group_cstime);
+ if (mm)
mmput(mm);
return res;
}
--- linux/fs/proc/base.c.orig
+++ linux/fs/proc/base.c
@@ -876,6 +876,12 @@ static unsigned name_to_int(struct dentr

if (len > 1 && *name == '0')
goto out;
+ /*
+ * Deal with dot-aliases for threads.
+ */
+ if (name[0] == '.')
+ name++, len--;
+
while (len-- > 0) {
unsigned c = *name++ - '0';
if (c > 9)
@@ -1155,31 +1161,50 @@ out:
* tasklist lock while doing this, and we must release it before
* we actually do the filldir itself, so we use a temp buffer..
*/
-static int get_pid_list(int index, unsigned int *pids)
+static int get_pid_list(int index, int *pids, struct file *filp)
{
- struct task_struct *p;
- int nr_pids = 0;
+ int nr_pids = 0, pid = 0, pid_cursor = (int)filp->private_data;
+ struct task_struct *g = NULL, *p = NULL;

- index--;
read_lock(&tasklist_lock);
- for_each_process(p) {
- int pid = p->pid;
- if (!pid)
- continue;
+ if (pid_cursor) {
+ p = find_task_by_pid(pid_cursor);
+ g = p->group_leader;
+ }
+ if (!p) {
+ g = p = &init_task;
+ index--;
+ } else
+ index = 0;
+
+ goto inside;
+
+ __do_each_thread(g, p) {
if (--index >= 0)
continue;
- pids[nr_pids] = pid;
+ pid = p->pid;
+ if (!pid)
+ BUG();
+ if (p->tgid != p->pid)
+ pids[nr_pids] = -pid;
+ else
+ pids[nr_pids] = pid;
nr_pids++;
if (nr_pids >= PROC_MAXPIDS)
- break;
- }
+ goto out;
+inside:
+ ;
+ } while_each_thread(g, p);
+out:
+ filp->private_data = (void *)pid;
read_unlock(&tasklist_lock);
+
return nr_pids;
}

int proc_pid_readdir(struct file * filp, void * dirent, filldir_t filldir)
{
- unsigned int pid_array[PROC_MAXPIDS];
+ int pid_array[PROC_MAXPIDS];
char buf[PROC_NUMBUF];
unsigned int nr = filp->f_pos - FIRST_PROCESS_ENTRY;
unsigned int nr_pids, i;
@@ -1192,14 +1217,16 @@ int proc_pid_readdir(struct file * filp,
nr++;
}

- nr_pids = get_pid_list(nr, pid_array);
+ nr_pids = get_pid_list(nr, pid_array, filp);

for (i = 0; i < nr_pids; i++) {
- int pid = pid_array[i];
+ int pid = abs(pid_array[i]);
ino_t ino = fake_ino(pid,PROC_PID_INO);
unsigned long j = PROC_NUMBUF;

do buf[--j] = '0' + (pid % 10); while (pid/=10);
+ if (pid_array[i] < 0)
+ buf[--j] = '.';

if (filldir(dirent, buf+j, PROC_NUMBUF-j, filp->f_pos, ino, DT_DIR) < 0)
break;
--- linux/kernel/timer.c.orig
+++ linux/kernel/timer.c
@@ -657,7 +657,10 @@ static inline void do_process_times(stru
unsigned long psecs;

psecs = (p->utime += user);
+ p->group_leader->group_utime += user;
psecs += (p->stime += system);
+ p->group_leader->group_stime += system;
+
if (psecs / HZ > p->rlim[RLIMIT_CPU].rlim_cur) {
/* Send SIGXCPU every second.. */
if (!(psecs % HZ))
--- linux/kernel/exit.c.orig
+++ linux/kernel/exit.c
@@ -92,6 +92,8 @@ void release_task(struct task_struct * p

p->parent->cutime += p->utime + p->cutime;
p->parent->cstime += p->stime + p->cstime;
+ p->parent->group_leader->group_cutime += p->utime + p->cutime;
+ p->parent->group_leader->group_cstime += p->stime + p->cstime;
p->parent->cmin_flt += p->min_flt + p->cmin_flt;
p->parent->cmaj_flt += p->maj_flt + p->cmaj_flt;
p->parent->cnswap += p->nswap + p->cnswap;
--- linux/kernel/fork.c.orig
+++ linux/kernel/fork.c
@@ -830,6 +830,9 @@ static struct task_struct *copy_process(
p->tty_old_pgrp = 0;
p->utime = p->stime = 0;
p->cutime = p->cstime = 0;
+ p->group_utime = p->group_stime = 0;
+ p->group_cutime = p->group_cstime = 0;
+
p->array = NULL;
p->lock_depth = -1; /* -1 = no lock */
p->start_time = get_jiffies_64();


2003-02-20 16:37:12

by Miquel van Smoorenburg

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

In article <[email protected]>,
Ingo Molnar <[email protected]> wrote:
>the fix for this is two-fold. First, it must be possible for procps to
>separate 'threads' from 'processes' without having to go into 16 thousand
>directories. I solved this by prefixing 'threads' (ie. non-group-leader
>threads) with a dot ('.') character in the /proc listing:

Why not put threads belonging to a thread group into /proc/17072/threads ?

> $ ls -a /proc

I'm seeing 17072 as a group-leader and the 'threads' as .17073 etc.
When you put all 'threads' into /proc/17072/threads, you'd get
/proc/17072/threads/17072, /proc/17072/threads/17073, etc.
/proc/17072 would show stats for the whole threads group, while
/proc/17072/threads/17072 would show stats for just that thread.

>the .17073 ... .17082 entries belong to the thread-group 17072.

Yuck ;(

>The key here is for procps to be able to parse threads without having to
>call into the kernel 16K times. The dot-approach also has the added
>benefit of 'hiding' threads in the default 'ls /proc' listing.

What is against /proc/<pid>/threads ?

>the other change needed was the ability to read cumulative CPU usage
>statistics from the thread group leader. I've introduced 4 new fields in
>/proc/PID/stat for that purpose; the kernel keeps them up to date across
>fork/exit and in the timer interrupt - it's very low-overhead.

That would also be solved with /proc/<pid> and /proc/<pid>/threads/<thread>

>another advantage of this approach is that old procps is fully compatible
>with the new kernel, and new procps is fully compatible with old kernels.

That would also be the case with /proc/<pid>/threads

Mike.
--
Anyone who is capable of getting themselves made President should
on no account be allowed to do the job -- Douglas Adams.

2003-02-20 16:45:40

by Daniel Jacobowitz

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

On Thu, Feb 20, 2003 at 04:47:15PM +0000, Miquel van Smoorenburg wrote:
> In article <[email protected]>,
> Ingo Molnar <[email protected]> wrote:
> >the fix for this is two-fold. First, it must be possible for procps to
> >separate 'threads' from 'processes' without having to go into 16 thousand
> >directories. I solved this by prefixing 'threads' (ie. non-group-leader
> >threads) with a dot ('.') character in the /proc listing:
>
> Why not put threads belonging to a thread group into /proc/17072/threads ?

I'm also inclined to think this is a good idea. We could even choose
to use /proc/17072/lwp, to match Solaris, but it's probably not worth
it since the semantics are different.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2003-02-20 16:58:51

by Linus Torvalds

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Thu, 20 Feb 2003, Ingo Molnar wrote:
>
> the main problem with threads in /proc is that there's a big slowdown when
> using lots of threads.

Well, part of the problem (I think) is that you added all the threads to
the same main directory.

Putting a "." in front of the name doesn't fix the /proc level directory
scalability issues, it only means that you can avoid some of the user-
level scalability ones.

So to offset that bad design, you then add other cruft, like the lookup
cursor and the "." marker. Which is not a bad idea in itself, but I claim
that if you'd made the directory structure saner you wouldn't have needed
it in the first place.

It would just be _so_ much nicer if the threads would show up as
subdirectories ie /proc/<tgid>/<tid>/xxx. More scalable, more readable,
and just generally more sane.

Linus

2003-02-20 16:50:07

by Nikita Danilov

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

Ingo Molnar writes:
>

[...]

>
> to fix this overhead i've introduced a 'lookup cursor' cookie, which is
> cached in filp->private_data, across readdir() [getdents64()] calls. If
> the cursor matches then we skip all the overhead of skipping threads. If
> the cursor is not available then we fall back to the old-style skipping
> algorithm.

Shouldn't filp->private_data be cleared on lseek? It looks like the lookup
cursor is never cleared once set, so readdir will always go forward
independently of ->f_pos updates. Note that the glibc implementation of
readdir() (on top of getdents64()) does call lseek on the
directory. So does seekdir(3).
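something along these lines would probably do (an untested sketch,
assuming the pid cursor stays in ->private_data and that the pid
directory gets its own llseek method wired into its file_operations):

/* drop the cached pid cursor whenever userspace seeks the directory,
 * so the next readdir() honours f_pos again instead of silently
 * continuing from the cached pid */
static loff_t proc_pid_dir_llseek(struct file *filp, loff_t offset, int origin)
{
        filp->private_data = NULL;
        return default_llseek(filp, offset, origin);
}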

>
> 2) procps is forced to parse every thread in /proc to build up accurate
> 'process CPU usage' counters. The parsing and accessing of every

Nikita.

2003-02-20 17:05:12

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Thu, 20 Feb 2003, Linus Torvalds wrote:

> > the main problem with threads in /proc is that there's a big slowdown when
> > using lots of threads.
>
> Well, part of the problem (I think) is that you added all the threads to
> the same main directory.
>
> Putting a "." in front of the name doesn't fix the /proc level directory
> scalability issues, it only means that you can avoid some of the user-
> level scalability ones.
>
> So to offset that bad design, you then add other cruft, like the lookup
> cursor and the "." marker. Which is not a bad idea in itself, but I
> claim that if you'd made the directory structure saner you wouldn't have
> needed it in the first place.

with the dot-marker it's not scaling badly at all - you can see the
numbers in the mail; all of the command variants (even a simple 'ps')
still read the _full_ /proc directory. (they just don't go further than
that, because the readdir() output is enough to filter stuff.)

> It would just be _so_ much nicer if the threads would show up as
> subdirectories ie /proc/<tgid>/<tid>/xxx. More scalable, more readable,
> and just generally more sane.

Al says that this cannot be done sanely, and is fraught with security
problems. I'd vote for it if it were possible. Al?

but, if you worry about the scalability of large /proc directories - it's
not bad at all.

Ingo

2003-02-20 17:14:15

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


eg. with 16,000 threads in /proc, "ls /proc" is still fast:

real:0m0.032s user:0m0.007s sys:0m0.024s

without those threads, it's:

real:0m0.014s user:0m0.004s sys:0m0.010s

15 msecs difference. Even a simple 'ls' reads the full /proc directory (all
16K+ entries). So performance-wise there's not a big difference.

architecture-wise there is a difference, and i'd be the last one arguing
against a tree-based approach to thread groups. It's much easier to find
threads belonging to a single 'process' via /proc this way - although no
functionality in procps has or requires such a feature currently.

Ingo

2003-02-20 17:15:33

by Linus Torvalds

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Thu, 20 Feb 2003, Linus Torvalds wrote:
>
> It would just be _so_ much nicer if the threads would show up as
> subdirectories ie /proc/<tgid>/<tid>/xxx. More scalable, more readable,
> and just generally more sane.

It shouldn't even be all that much harder. You only really need to add the
"lookup()" and "readdir()" logic to the pid-fd's, and they both should be
fairly straightforward, ie something like the appended should do the
lookup() part.

(UNTESTED! NOT COMPILED! PROBABLY HORRIBLY BUGGY! CAVEAT USER! CONCEPTUAL
CODE ONLY! YOU GET THE IDEA! I'M GETTING HOARSE FROM ALL THE SHOUTING!)

Linus

---
===== base.c 1.39 vs edited =====
--- 1.39/fs/proc/base.c Sat Feb 15 19:30:17 2003
+++ edited/base.c Thu Feb 20 09:18:15 2003
@@ -964,10 +964,15 @@
struct task_struct *task = proc_task(dir);
struct pid_entry *p;
struct proc_inode *ei;
+ unsigned long tid;

error = -ENOENT;
inode = NULL;

+ tid = name_to_int(dentry);
+ if (tid != ~0U)
+ goto thread_lookup;
+
for (p = base_stuff; p->name; p++) {
if (p->len != dentry->d_name.len)
continue;
@@ -1052,6 +1057,30 @@
if (!proc_task(dentry->d_inode)->pid)
d_drop(dentry);
return NULL;
+
+thread_lookup: {
+ struct task_struct *thread;
+ read_lock(&tasklist_lock);
+ thread = task;
+ while ((thread = next_thread(thread)) != task) {
+ if (thread->pid == tid)
+ goto found_thread;
+ }
+ read_unlock(&tasklist_lock);
+ return ERR_PTR(-ENOENT);
+
+found_thread:
+ get_task_struct(thread);
+ read_unlock(&tasklist_lock);
+
+ inode = proc_pid_make_inode(sb, thread, 0x800000);
+ put_task_struct(thread);
+ if (!inode)
+ return ERR_PTR(-ENOENT);
+ dentry->d_op = dentry->d_parent->d_op;
+ d_add(dentry, inode);
+ return NULL;
+}

out:
return ERR_PTR(error);

2003-02-20 17:22:18

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Thu, 20 Feb 2003, Jeff Garzik wrote:

> > > It would just be _so_ much nicer if the threads would show up as
> > > subdirectories ie /proc/<tgid>/<tid>/xxx. More scalable, more readable,
> > > and just generally more sane.
> >
> > Al says that this cannot be done sanely, and is fraught with security
> > problems. I'd vote for it if it were possible. Al?
>
> Having the kernel automatically manage creation/destruction of
> directories is the sticking point, AFAIK.

it already has to create/destroy the main PID directory, so i cannot see a
big difference.

> Why not use the "squid method"? Create directories 00..FF, and sort the
> pids/tids into buckets that way. Then you are not creating and
> destroying directories all the time.

yuck ...

Ingo

2003-02-20 17:17:47

by Linus Torvalds

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Thu, 20 Feb 2003, Ingo Molnar wrote:
>
> Al says that this cannot be done sanely, and is fraught with security
> problems. I'd vote for it if it were possible. Al?

I seriously doubt it. It's all exactly the same as the _current_
/proc/<pid> stuff, it just shows up in a different place.

> but, if you worry about the scalability of large /proc directories - it's
> not bad at all.

I worry about the _sanity_ of it, and it basically makes no sense to
iterate over every single thread, when you should always be able to just
iterate over processes (and then within the process - only when you want
to - iterate over the threads of _that_ process).

Linus

2003-02-20 17:20:59

by Jeff Garzik

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

On Thu, Feb 20, 2003 at 06:14:38PM +0100, Ingo Molnar wrote:
> On Thu, 20 Feb 2003, Linus Torvalds wrote:
> > It would just be _so_ much nicer if the threads would show up as
> > subdirectories ie /proc/<tgid>/<tid>/xxx. More scalable, more readable,
> > and just generally more sane.
>
> Al says that this cannot be done sanely, and is fraught with security
> problems. I'd vote for it if it were possible. Al?

Having the kernel automatically manage creation/destruction of
directories is the sticking point, AFAIK.

Why not use the "squid method"? Create directories 00..FF, and sort the
pids/tids into buckets that way. Then you are not creating and
destroying directories all the time.
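(roughly, a task's path would then be derived from the low byte of its
pid - a purely hypothetical layout, nothing implements this:)

#include <stdio.h>

/* hypothetical path of a task under a 00..FF bucket scheme:
 * /proc/<low byte of pid, in hex>/<pid> */
static void proc_bucket_path(char *buf, size_t size, int pid)
{
        snprintf(buf, size, "/proc/%02x/%d", pid & 0xff, pid);
}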

Jeff



2003-02-20 17:43:49

by Linus Torvalds

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Thu, 20 Feb 2003, Jeff Garzik wrote:
>
> Having the kernel automatically manage creation/destruction of
> directories is the sticking point, AFAIK.

That _used_ to be an issue, since we didn't actually create the dentry at
fork time. But we do all that maintenance anyway these days, I think.

Linus

2003-02-20 19:04:06

by Daniel Jacobowitz

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

On Thu, Feb 20, 2003 at 09:20:38AM -0800, Linus Torvalds wrote:
>
> On Thu, 20 Feb 2003, Linus Torvalds wrote:
> >
> > It would just be _so_ much nicer if the threads would show up as
> > subdirectories ie /proc/<tgid>/<tid>/xxx. More scalable, more readable,
> > and just generally more sane.
>
> It shouldn't even be all that much harder. You only really need to add the
> "lookup()" and "readdir()" logic to the pid-fd's, and they both should be
> fairly straightforward, ie something like the appended should do the
> lookup() part.
>
> (UNTESTED! NOT COMPILED! PROBABLY HORRIBLY BUGGY! CAVEAT USER! CONCEPTUAL
> CODE ONLY! YOU GET THE IDEA! I'M GETTING HOARSE FROM ALL THE SHOUTING!)

It'd be a little (very little) more complex, but can I once again
suggest /proc/<tgid>/threads/<tid> instead of /proc/<tgid>/<tid>/xxx?


--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2003-02-21 07:28:46

by Oliver Xymoron

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

On Thu, Feb 20, 2003 at 09:22:42AM -0800, Linus Torvalds wrote:
>
> On Thu, 20 Feb 2003, Ingo Molnar wrote:
> >
> > Al says that this cannot be done sanely, and is fraught with security
> > problems. I'd vote for it if it were possible. Al?
>
> I seriously doubt it. It's all exactly the same as the _current_
> /proc/<pid> stuff, it just shows up in a different place.

Well perhaps that was just Al's way of saying the current stuff is
broken too. I think we could get into trouble reparenting or exec'ing
really quickly though.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2003-02-22 20:46:27

by Albert Cahalan

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

On Thu, 2003-02-20 at 12:23, Ingo Molnar wrote:

> architecture-wise there is a difference, and i'd be
> the last one arguing against a tree-based approach to
> thread groups. It's much easier to find threads belonging
> to a single 'process' via /proc this way - although no
> functionality in procps has or requires such a feature currently.

Nope, the /proc/123/threads/246/stat approach is required.
Without this, procps is forced to read _all_ tasks to
group threads together. This is slow, prone to race conditions,
more vulnerable to kernel bugs, and a memory hog.

FYI, thread grouping is required even if whole-process
information is available. Many "ps" output formats need
grouping, and it is desirable for many more.

I might as well mention that whole-process information
includes the four fault counters and some indication
that wchan data is multi-valued (a '*' must be displayed
in that case). There may be more I haven't spotted yet.

Note that the recent /proc/*/wchan addition was botched.
Caching is prevented due to race conditions. This could
be fixed by changing the file format to contain:
number, function name
with optional:
function address, file name, module name
(next time, discuss such changes with an experienced
procps developer first)


2003-02-22 21:44:49

by Andrew Morton

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

Albert Cahalan <[email protected]> wrote:
>
> Note that the recent /proc/*/wchan addition was botched.
> Caching is prevented due to race conditions. This could
> be fixed by changing the file format to contain:
> number, function name

There is not enough detail here for it to be fixed.

What are the race conditions?

What is "number"?

2003-02-22 22:58:13

by Albert Cahalan

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

On Sat, 2003-02-22 at 16:55, Andrew Morton wrote:
> Albert Cahalan <[email protected]> wrote:

>> Note that the recent /proc/*/wchan addition was botched.
>> Caching is prevented due to race conditions. This could
>> be fixed by changing the file format to contain:
>> number, function name
>
> There is not enough detail here for it to be fixed.
>
> What are the race conditions?
>
> What is "number"?

By "number" I mean the numeric wchan. ("ps -o nwchan")
This value is provided in the /proc/*/stat files.

Most tools are stuck reading /proc/*/stat most of
the time. So the nwchan value is being read already.
You can't cache a mapping from nwchan to wchan,
because the values may change from the time you
read /proc/*/stat to the time you read /proc/*/wchan.
This forces reading /proc/*/wchan, even though top
has already seen the same value before and already
has the numeric version of it.

One more thing is needed. There should be a counter,
incremented every time a module is loaded or unloaded.
This allows the cache to be invalidated as needed.
I suppose /proc/stat is a decent place for the counter,
either at the end (safest) or right before intr.

Caching makes a major difference for wchan. Look how
few values are seen: ps -eonwchan= | sort -u | wc -l

Since the current /proc/*/wchan format isn't yet in
a stable kernel release, breaking the format wouldn't
be too bad. I like "%08x %s \n" and "%016lx %s \n"
(note the space on the end) for a nice fast parser and
ease of adding more fields. If you have something against
that space, then "%s %08x\n" and "%s %016lx\n" would do.
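to show what the consumer side would look like, here is a rough sketch
assuming the "%016lx %s \n" variant (the cache here is only a toy):

#include <stdio.h>

struct wchan_entry {
        unsigned long addr;
        char name[64];
};

static struct wchan_entry cache[256];   /* toy fixed-size cache */
static unsigned int ncached;

/* resolve a numeric wchan (already read from /proc/<pid>/stat) to a
 * function name, touching /proc/<pid>/wchan only on a cache miss */
static const char *wchan_name(int pid, unsigned long nwchan)
{
        char path[64];
        unsigned long addr;
        unsigned int i;
        FILE *f;

        for (i = 0; i < ncached; i++)
                if (cache[i].addr == nwchan)
                        return cache[i].name;
        if (ncached >= sizeof(cache) / sizeof(cache[0]))
                return "?";             /* toy cache is full */

        snprintf(path, sizeof(path), "/proc/%d/wchan", pid);
        f = fopen(path, "r");
        if (!f)
                return "?";
        if (fscanf(f, "%lx %63s", &addr, cache[ncached].name) == 2) {
                fclose(f);
                cache[ncached].addr = addr;
                i = ncached++;
                /* if the task moved to another wchan between the stat
                 * and wchan reads, keep the entry but don't return a
                 * stale name for this task */
                return addr == nwchan ? cache[i].name : "?";
        }
        fclose(f);
        return "?";
}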

Any thoughts on the other stuff? Different modules
and files could have functions with the same name.
Maybe some info about this should be available, at
least if it doesn't slow things down much.


2003-02-23 18:39:43

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On 22 Feb 2003, Albert Cahalan wrote:

> > architecture-wise there is a difference, and i'd be
> > the last one arguing against a tree-based approach to
> > thread groups. It's much easier to find threads belonging
> > to a single 'process' via /proc this way - although no
> > functionality in procps has or requires such a feature currently.
>
> Nope, the /proc/123/threads/246/stat approach is required. Without this,
> procps is forced to read _all_ tasks to group threads together. This is
> slow, prone to race conditions, more vulnerable to kernel bugs, and a
> memory hog.

actually, what you mention does not happen in practice. We coded it up and
it works, and with tons of threads around it performs a few orders of
magnitude faster than any other stuff available so far. So the question
here is 'only' interface/approach cleanliness, not actual performance
difference. Sure, we could shave off another millisec or two, but the
performance problems are off the radar already.

> Note that the recent /proc/*/wchan addition was botched.

(fyi, i have nothing to do with that change, so spare your insults for
someone else.)

> (next time, discuss such changes with an experienced procps developer
> first)

(given that this whole area was left alone in a state like this for years
i'm not quite sure how you can still sit so high on your horse.)

Ingo

2003-02-23 21:41:23

by Albert D. Cahalan

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

Ingo Molnar writes:
> On 22 Feb 2003, Albert Cahalan wrote:

>>> architecture-wise there is a difference, and i'd be
>>> the last one arguing against a tree-based approach to
>>> thread groups. It's much easier to find threads belonging
>>> to a single 'process' via /proc this way - although no
>>> functionality in procps has or requires such a feature currently.
>>
>> Nope, the /proc/123/threads/246/stat approach is required.
>> Without this, procps is forced to read _all_ tasks to group
>> threads together. This is slow, prone to race conditions,
>> more vulnerable to kernel bugs, and a memory hog.
>
> actually, what you mention does not happen in practice.
> We coded it up and it works,

Surely you realize that I have seen this code?

Proper "ps m" behavior groups threads in the output.
The hacked-up procps being used by Red Hat fails to
do this. You could change that... at the cost of reading
all processes and sorting them. There goes all of your
performance improvement.

I hope you will eliminate the Red Hat bias in future patches.
Using the subdirectory approach certainly won't hurt you.
In time, you will realize that it is needed, or you will
switch over to using procps 3. Several distributions have
recently switched; I doubt you can hold out forever.

(I'd still like to know why Red Hat broke "sort -n" once.
What is it with you guys mucking up the most basic tools?)

BTW, proper thread support in ps includes the "-T" and
"-L" options. There's more too. I know this; nobody
doing procps-2 work has a clue. You're living in the
same Linux-only world that made procps-1 so hated for
incompatibility with the rest of the world. Sys admins
like to write scripts that run on everything they use.

>> (next time, discuss such changes with an experienced
>> procps developer first)
>
> (given that this whole area was left alone in a state
> like this for years i'm not quite sure how you can still
> sit so high on your horse.)

What area, in what state? If you mean procps in general:

I've been maintaining upstream procps for Debian for years.
The project was kept quiet out of politeness to the former
maintainer, who didn't want to let go of the project. Now
that the former maintainer is clearly not coming back, I've
put my tree at http://procps.sf.net/ for all to use. I am
after all the original author of ps itself, so I'm one to
know what is required for correct and efficient operation.

If you mean the threading support specifically:

The kernel does not provide proper support for grouping
threads by process. Hacks exist to group them anyway,
but such hacks will falsely group similar tasks and will
fail to group tasks due to race conditions. The hacks are
also slow. As none of this is acceptable in a critical
system tool, task grouping is not currently available.

There's a need for responsibility here. One shouldn't just
throw unreliable hacks into a critical system tool.

The FAQ covers some of this and more.
http://procps.sourceforge.net/faq.html

Linux-kernel readers should note that only procps-3 can
handle the very latest kernels. No procps-2 release is
able to do so, in more than one way.


2003-02-24 09:18:12

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Sun, 23 Feb 2003, Albert D. Cahalan wrote:

> Surely you realize that I have seen this code?

have you actually tried it?

> Proper "ps m" behavior groups threads in the output. The hacked-up
> procps being used by Red Hat fails to do this. You could change that...
> at the cost of reading all processes and sorting them. There goes all of
> your performance improvement.

Albert, the new code properly reads all threads and sorts them, in the
"ps m" case. Had you truly read my emails you'd notice where the overhead
lies, and what steps were taken to get rid of it.

Ingo

2003-02-24 09:28:49

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Sun, 23 Feb 2003, Albert D. Cahalan wrote:

> The kernel does not provide proper support for grouping threads by
> process. Hacks exist to group them anyway, but such hacks will falsely
> group similar tasks and will fail to group tasks due to race conditions.
> The hacks are also slow. As none of this is acceptable in a critical
> system tool, task grouping is not currently available.

Albert, here you make the mistake of connecting 'thread directories' to
'raceless task state displaying'. There's no such connection, and there's
no way to get rid of said 'race conditions'. There's just no way, and in
fact no desire to get a completely race-free task state from /proc. It's
simply not possible. readdir() and /proc are just fundamentally non-atomic
- i challenge you to show me a single mainstream OS in existence that can
display a completely race-free and accurate snapshot of task state.

here's an offer: propose a sane /proc based mechanism that is totally
race-free, and i'll implement it. If you think that thread-directories
solve the races then you are dead wrong. (Btw., i too would like to see
thread directories, and if you look at the first NPTL patches you'll see
that i've proposed them as the superior solution.) This is a hard problem,
and if you think it's "just" the matter of kernel support then you vastly
underestimate the complexity involved.

Ingo

2003-02-24 11:02:10

by Albert D. Cahalan

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

Ingo Molnar writes:
> On Sun, 23 Feb 2003, Albert D. Cahalan wrote:

>> Surely you realize that I have seen this code?
>
> have you actually tried it?
>
>> Proper "ps m" behavior groups threads in the output. The hacked-up
>> procps being used by Red Hat fails to do this. You could change that...
>> at the cost of reading all processes and sorting them. There goes all of
>> your performance improvement.
>
> Albert, the new code properly reads all threads and sorts them, in the
> "ps m" case. Had you truly read my emails you'd notice where the overhead
> lies, and what steps were taken to get rid of it.

I see that for default output, unpatched procps-2.x.xx is
forced to read and sort all tasks. Ignoring fault counts
and wchan for the moment, you made this work unnecessary by
adding some whole-process values. You also sped up /proc
directory operations in general, so not "all" I guess.

In the "ps m" and "ps -m" cases, you revert to the
"loose tasks" behavior. Neither "ps -L" nor "ps -T"
have been implemented. In some of these cases, the
threads should be grouped. Without subdirectories,
user code must do something bloated and slow to get
the grouping. (unless you care to guarantee the order
of tasks in a /proc directory listing, and relying on
such an ordering would prevent some other optimizations)

By grouping I mean something roughly like this:

PID TID
123 123
123 222
444 444
444 456

Not this:

PID TID
123 123
444 444
123 222
444 456

Header names and meanings may vary, etc. I'm not about to
dig out my notes regarding correct behavior for this email.

Anyway, to quote Linus:

----- begin -----
Well, part of the problem (I think) is that you added all the threads to
the same main directory.

Putting a "." in front of the name doesn't fix the /proc level directory
scalability issues, it only means that you can avoid some of the user-
level scalability ones.

So to offset that bad design, you then add other cruft, like the lookup
cursor and the "." marker. Which is not a bad idea in itself, but I claim
that if you'd made the directory structure saner you wouldn't have needed
it in the first place.

It would just be _so_ much nicer if the threads would show up as
subdirectories ie /proc/<tgid>/<tid>/xxx. More scalable, more readable,
and just generally more sane.
----- end -----

2003-02-24 11:16:56

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Mon, 24 Feb 2003, Albert D. Cahalan wrote:

> By grouping I mean something roughly like this:
>
> PID TID
> 123 123
> 123 222
> 444 444
> 444 456

Albert, do you realize the simple fact that the procps enhancements we did
change absolutely nothing for the 'ps m' case? All thread PIDs are still
scanned and sorted.

And mind you, thread-directories do not change much in this area - the
PIDs within the thread-directory will still be largely unsorted, and it
will not make the reading & sorting of 20K threads any faster.

> Anyway, to quote Linus:

and to quote myself:

> [...] i'd be the last one arguing against a tree-based approach to
> thread groups. It's much easier to find threads belonging to a single
> 'process' via /proc this way [...]

so i'm in full agreement with Linus. And here's an email of mine from more
than 7 months ago:

> and yet another issue, what do you think about not listing every
> CLONE_THREAD_GROUP thread in /proc, but to list them in the group
> leader's directory - something like /proc/12345/threads/567/... This
> will remove much of the runtime overhead of massively threaded
> applications, where there are lots of native threads.

so in fact _i_ came up with this whole issue 7 months ago. I just don't
share many of _your_ largely bogus arguments that seem to miss the point.
Can we finally stop this storm in a teapot?

Ingo

2003-02-24 12:03:35

by Albert D. Cahalan

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

Ingo Molnar writes:
> On Sun, 23 Feb 2003, Albert D. Cahalan wrote:

>> The kernel does not provide proper support for grouping threads by
>> process. Hacks exist to group them anyway, but such hacks will falsely
>> group similar tasks and will fail to group tasks due to race conditions.
>> The hacks are also slow. As none of this is acceptable in a critical
>> system tool, task grouping is not currently available.
>
> Albert, here you make the mistake of connecting 'thread directories'
> to 'raceless task state displaying'. There's no such connection,
> and there's no way to get rid of said 'race conditions'.

There are races with serious consequences, and ones without.
That we have race conditions is no excuse to be adding more!
I never did claim that /proc would provide an atomic snapshot.
I don't even claim that the race conditions are because of
your code, but you were passing up an opportunity to fix a few.

As long as there is a way to determine all tasks belonging
to a POSIX process (TIDs in a TGID, or whatever) without
reading damn near everything, then things can be sane.

> here's an offer: propose a sane /proc based mechanism that
> is totally race-free, and i'll implement it. If you think
> that thread-directories solve the races then you are dead wrong.

All or nothing? I would value a few improvements.

Thread directories are one. Alternatively, add per-task lists
of tasks with the same VM or a mapping file.

I just went looking for that syscall which was supposed to
allow opening a file relative to a directory file descriptor.
Hmmm, not there. Was that one shot down for some security
concern related to file descriptor passing? It would help
with the process death and PID-wrap-around races at least.

Per-task UUID values could help. Reset one on exec, and one
for every operation that would set an existing ID.

> (Btw., i too would like to see thread directories, and if
> you look at the first NPTL patches you'll see that i've
> proposed them as the superior solution.) This is a hard
> problem, and if you think it's "just" the matter of kernel
> support then you vastly underestimate the complexity involved.

There are many problems, some of which are hard.

2003-02-24 12:19:24

by Albert D. Cahalan

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62

> Albert, do you realize the simple fact that the procps
> enhancements we did change absolutely nothing for the 'ps m'
> case? All thread PIDs are still scanned and sorted.

That is a contradiction. There is no sorting with "ps m".
Any ordering is from the /proc directory itself.

Sorting is not default because of the memory requirements
and because there have been many kernel bugs that cause
ps to hang when it hits a particular process. Sorting may
mean that ps hangs or is killed before producing anything.

> And mind you, thread-directories do not change much in
> this area - the PIDs within the thread-directory will
> still be largely unsorted, and it will not make the
> reading & sorting of 20K threads any faster.

That's OK. I need in-kernel grouping, not in-kernel sorting.
It's fine to mix up thread order, but bad to interleave the
threads of unrelated processes.

> so in fact _i_ came up with this whole issue 7 months ago. I just don't
> share many of _your_ largely bogus arguments that seem to miss the point.
> Can we finally stop this storm in a teapot?

Glad to, assuming you understand the importance of grouping.
I hope I have now done a better job of explaining it.

2003-02-24 12:35:48

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Mon, 24 Feb 2003, Albert D. Cahalan wrote:

> Sorting is not default because of the memory requirements and because
> there have been many kernel bugs that cause ps to hang when it hits a
> particular process. Sorting may mean that ps hangs or is killed before
> producing anything.

in fact there was an unconditional qsort() done after scanning all tasks,
until i pointed it out to Alex. It never caused any problems, and sorting
never showed up as a source of overhead in the profiler, so i'm not sure
why you insist on sorting so much.

if procps hangs then that's a bug in procps. I'm not quite sure what you
mean by 'it hits a particular process'.

Ingo

2003-02-24 12:32:26

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


On Mon, 24 Feb 2003, Albert D. Cahalan wrote:

> It's fine to mix up thread order, but bad to interleave the threads of
> unrelated processes.

this is very simple to do, and does not necessitate thread-directories.
There's a PID and a Tgid field in /proc/PID/status. Just link the task to
the Tgid-task, and you have instant access to all threads per Tgid. In the
'grouped output' case you have to scan & access all threads anyway. Ok?
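eg. something along these lines on the procps side gives the grouping key
(a hypothetical helper, not code from the posted procps patch):

#include <stdio.h>

/* return the thread group id of a task by reading the Tgid: line of
 * /proc/<pid>/status; tasks with an equal return value belong to the
 * same process and can be grouped together in the output */
static int proc_tgid(int pid)
{
        char path[64], line[128];
        int tgid = -1;
        FILE *f;

        snprintf(path, sizeof(path), "/proc/%d/status", pid);
        f = fopen(path, "r");
        if (!f)
                return -1;
        while (fgets(line, sizeof(line), f))
                if (sscanf(line, "Tgid: %d", &tgid) == 1)
                        break;
        fclose(f);
        return tgid;
}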

(but this is way offtopic. The changes we posted address the normal case
of process-listing. (no -m option.) If there are tons of threads around
then any discussed variant of 'ps -m' will be slow.)

Ingo

2003-02-24 12:42:37

by Ingo Molnar

Subject: Re: [patch] procfs/procps threading performance speedup, 2.5.62


Albert, as a contributor to procps, have you ever run a high number of
threads (eg. more than 10K) and tested procps against that workload? Just
try it - start up eg. 10,000 threads (which just sleep) under 2.5.62, and
time all the various procps utilities. Then you'll see first-hand where
all the overhead lies.

Ingo