2002-11-02 01:29:45

by Andrew Morton

Subject: Re: dcache_rcu [performance results]

Dipankar Sarma wrote:
>
> [ dcache-rcu ]
>
> Anton (Blanchard) did some benchmarking with this
> in a 24-way ppc64 box and the results showed why we need this patch.
> Here are some performance comparisons based on a multi-user benchmark
> that Anton ran with vanilla 2.5.40 and 2.5.40-mm.
>
> http://lse.sourceforge.net/locking/dcache/summary.png
>
> base = 2.5.40
> base-nops = 2.5.40 but ps command in benchmark scripts commented out
> mm = 2.5.40-mm
> mm-nops = 2.5.40-mm but ps command in benchmark scripts commented out
>

I'm going to need some help understanding what's going on in
there. I assume the test is SDET (there, I said it), which
simulates lots of developers doing developer things on a multiuser
machine. Lots of compiling, groffing, etc.

Why does the removal of `ps' from the test script make such a huge
difference? That's silly, and we should fix it.

And it appears that dcache-rcu made a ~10% difference on a 24-way PPC64,
yes? That is nice, and perhaps we should take that, but it is not a
tremendous speedup.

Thanks.


2002-11-02 09:11:12

by Dipankar Sarma

Subject: Re: dcache_rcu [performance results]

On Fri, Nov 01, 2002 at 05:36:03PM -0800, Andrew Morton wrote:
> Dipankar Sarma wrote:
> > [ dcache-rcu ]
> > Anton (Blanchard) did some benchmarking with this
> > in a 24-way ppc64 box and the results showed why we need this patch.
> > Here are some performance comparisons based on a multi-user benchmark
> > that Anton ran with vanilla 2.5.40 and 2.5.40-mm.
> >
> > http://lse.sourceforge.net/locking/dcache/summary.png
> >
> simulates lots of developers doing developer things on a multiuser
> machine. Lots of compiling, groffing, etc.
>
> Why does the removal of `ps' from the test script make such a huge
> difference? That's silly, and we should fix it.

I have uploaded the profiles from Anton's benchmark runs -

http://lse.sourceforge.net/locking/dcache/results/2.5.40/200-base.html
http://lse.sourceforge.net/locking/dcache/results/2.5.40/200-base-nops.html
http://lse.sourceforge.net/locking/dcache/results/2.5.40/200-mm.html
http://lse.sourceforge.net/locking/dcache/results/2.5.40/200-mm-nops.html

A quick comparison of the base and base-nops profiles shows this -

base :

 Hits   Percentage   Function
75185     100.00     total
11215      14.92     path_lookup
 8578      11.41     atomic_dec_and_lock
 5763       7.67     do_lookup
 5745       7.64     proc_pid_readlink
 4344       5.78     page_remove_rmap
 2144       2.85     page_add_rmap
 1587       2.11     link_path_walk
 1531       2.04     proc_check_root
 1461       1.94     save_remaining_regs
 1345       1.79     inode_change_ok
 1236       1.64     ext2_free_blocks
 1215       1.62     ext2_new_block
 1067       1.42     d_lookup

base-no-ps :

 Hits   Percentage   Function
50895     100.00     total
 8222      16.15     page_remove_rmap
 3837       7.54     page_add_rmap
 2222       4.37     save_remaining_regs
 1618       3.18     release_pages
 1533       3.01     pSeries_flush_hash_range
 1446       2.84     do_page_fault
 1343       2.64     find_get_page
 1273       2.50     copy_page
 1228       2.41     copy_page_range
 1186       2.33     path_lookup
 1186       2.33     pSeries_insert_hpte
 1171       2.30     atomic_dec_and_lock
 1152       2.26     zap_pte_range
  841       1.65     do_generic_file_read

Clearly dcache_lock is the killer when the 'ps' command is used in
the benchmark. My guess (without looking at the 'ps' code) is that
it has to open/close a lot of files in /proc, and that increases
the number of acquisitions of dcache_lock. The increased number of
acquisitions adds to cache line bouncing and contention.

I should add that this is a general trend we see in all workloads
that do a lot of opens/closes, so much so that performance is very
sensitive to how close to / your application's working directory
is. You would get much better system time if you compile a kernel
in /linux as compared to, say, /home/fs01/users/akpm/kernel/linux ;-)

> And it appears that dcache-rcu made a ~10% difference on a 24-way PPC64,
> yes? That is nice, and perhaps we should take that, but it is not a
> tremendous speedup.

Hmm.. based on Anton's graph it looked more like ~25% difference for
60 or more scripts. At 200 scripts it is ~27.6%. Without the ps
command, it seems more like ~4%.


Thanks
Dipankar

2002-11-04 17:28:41

by Martin J. Bligh

Subject: Re: dcache_rcu [performance results]

> Clearly dcache_lock is the killer when the 'ps' command is used in
> the benchmark. My guess (without looking at the 'ps' code) is that
> it has to open/close a lot of files in /proc, and that increases
> the number of acquisitions of dcache_lock. The increased number of
> acquisitions adds to cache line bouncing and contention.

Strace it - IIRC it does 5 opens per PID. Vomit.

M.

2002-11-04 23:54:28

by jw schultz

Subject: Re: dcache_rcu [performance results]

On Mon, Nov 04, 2002 at 09:29:14AM -0800, Martin J. Bligh wrote:
> > Clearly dcache_lock is the killer when the 'ps' command is used in
> > the benchmark. My guess (without looking at the 'ps' code) is that
> > it has to open/close a lot of files in /proc, and that increases
> > the number of acquisitions of dcache_lock. The increased number of
> > acquisitions adds to cache line bouncing and contention.
>
> Strace it - IIRC it does 5 opens per PID. Vomit.

I just did, had the same reaction. This is ugly.

It opens stat, statm, status, cmdline and environ apparently
regardless of what will be in the output. At least environ
will fail on most pids if you aren't root, saving on some of
the overhead. It compounds this by doing so for every pid
even if you have explicitly requested only one pid by
number.
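
Roughly, the per-pid syscall pattern looks like this -- a sketch of
the shape strace shows, not the actual procps source:

/* What a ps-like tool ends up doing for every pid it looks at.
 * Each open() walks /proc/<pid>/<name> through the dcache, taking
 * dcache_lock along the way. */
#include <stdio.h>
#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

static void slurp(pid_t pid, const char *name)
{
        char path[64], buf[4096];
        int fd;

        snprintf(path, sizeof(path), "/proc/%d/%s", pid, name);
        fd = open(path, O_RDONLY);
        if (fd < 0)
                return;         /* e.g. environ fails for other users' pids */
        read(fd, buf, sizeof(buf));
        close(fd);
}

static void one_pid(pid_t pid)
{
        /* five opens, five reads, five closes -- per pid, per ps run */
        slurp(pid, "stat");
        slurp(pid, "statm");
        slurp(pid, "status");
        slurp(pid, "cmdline");
        slurp(pid, "environ");
}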

Clearly ps could do with a cleanup. There is no reason to
read environ if it wasn't asked for. Deciding which files
are needed based on the command line options would be a
start.

I'm thinking that ps, top and company are good reasons to
make an exception of one value per file in proc. Clearly
open+read+close of 3-5 "files" each extracting data from
task_struct isn't more efficient than one "file" that
generates the needed data one field per line.

Don't get me wrong. I believe in the one field per file
rule but ps &co are the exception that proves (tests) the
rule. Especially on the heavily laden systems with
tens of thousands of tasks. We could do with something
between /dev/kmem and five files per pid.

--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-11-05 01:13:11

by Martin J. Bligh

Subject: ps performance sucks (was Re: dcache_rcu [performance results])

> Clearly ps could do with a cleanup. There is no reason to
> read environ if it wasn't asked for. Deciding which files
> are needed based on the command line options would be a
> start.
>
> I'm thinking that ps, top and company are good reasons to
> make an exception of one value per file in proc. Clearly
> open+read+close of 3-5 "files" each extracting data from
> task_struct isn't more efficient than one "file" that
> generates the needed data one field per line.

I think it's pretty trivial to make /proc/<pid>/psinfo, which
dumps the garbage from all five files in one place. Which makes
it 5 times better, but it still sucks.

> Don't get me wrong. I believe in the one field per file
> rule but ps &co are the exception that proves (tests) the
> rule. Especially on the heavily laden systems with
> tens of thousands of tasks. We could do with something
> between /dev/kmem and five files per pid.

I had a very brief think about this at the weekend, seeing
if I could make a big melting pot /proc/psinfo file that did
seqfile and read everything out in one go, using seq_file
internally to interate over the tasklist. The most obvious
problem that sprung to mind seems to be the tasklist locking -
you obviously can't just hold a lock over the whole thing.
As I know very little about that, I'll let someone else suggest
how to do this, but I'm prepared to do the grunt work of implementing
it if need be.
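
Something like this skeleton is what I had in mind (the names and the
output format are made up; seq_file calls start/stop around every
read() chunk, so the lock isn't held across syscalls, but tasks coming
and going between chunks can be missed or repeated -- which is exactly
the part I don't know how to do properly):

#include <linux/seq_file.h>
#include <linux/sched.h>

static void *psinfo_start(struct seq_file *m, loff_t *pos)
{
        struct task_struct *p;
        loff_t n = *pos;

        /* restarting from the head for each chunk is itself O(tasks) */
        read_lock(&tasklist_lock);
        for_each_process(p)
                if (n-- == 0)
                        return p;
        return NULL;
}

static void *psinfo_next(struct seq_file *m, void *v, loff_t *pos)
{
        struct task_struct *p = next_task((struct task_struct *)v);

        (*pos)++;
        return p == &init_task ? NULL : p;
}

static void psinfo_stop(struct seq_file *m, void *v)
{
        read_unlock(&tasklist_lock);
}

static int psinfo_show(struct seq_file *m, void *v)
{
        struct task_struct *p = v;

        /* made-up line format: pid, comm, state */
        seq_printf(m, "%d %s %ld\n", p->pid, p->comm, p->state);
        return 0;
}

static struct seq_operations psinfo_op = {
        .start  = psinfo_start,
        .next   = psinfo_next,
        .stop   = psinfo_stop,
        .show   = psinfo_show,
};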

M.

2002-11-05 03:51:28

by Werner Almesberger

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

Martin J. Bligh wrote:
> I had a very brief think about this at the weekend, seeing
> if I could make a big melting pot /proc/psinfo file

You could take a more radical approach. Since the goal of such
a psinfo file would be to accelerate access to information
that's already available elsewhere, you can do away with many
of the niceties of procfs, e.g.

- no need to be human-readable (e.g. binary or hex dump may
make sense in this case)
- may use other operations than just open and read (e.g.
do an initial write to select what should be read)
- you may cache previous responses and only output deltas
(not sure if this is useful - all you'd save is the
actual copy to user space)

Actually, I think attempting to just make it brutally efficient,
no matter how much nastiness you amass doing that, might be a
good approach for a first version. Then, if people are
disgusted, you can make things nicer, and keep track of how
much performance you're losing.

Example:

First write says "pid,comm". Internally, this gets translated
to 0x8c+0x04, 0x2ee+0x10 (offset+length). Next read returns
"pid 4,comm 16" (include the name, so you can indicate fields
the kernel doesn't recognize). Then, kmalloc 20*tasks bytes,
lock, copy the fields from struct task_struct, unlock, let the
stuff be read by user space, kfree. Adjacent fields can be
optimized to single byte strings at setup time.
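
From the user side that would look something like this (the file name
and the split between the header read and the data read are my guesses;
only the field-selection idea is the point):

#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

#define MAX_TASKS 65536                 /* made up */

int main(void)
{
        char hdr[128], *buf;
        ssize_t n;
        int fd = open("/proc/psinfo", O_RDWR);  /* hypothetical name */

        if (fd < 0)
                return 1;

        /* select the fields we want */
        write(fd, "pid,comm", 8);

        /* kernel answers "pid 4,comm 16" -- names echoed back with
         * lengths, so fields it doesn't recognize can be spotted */
        n = read(fd, hdr, sizeof(hdr) - 1);
        hdr[n > 0 ? n : 0] = '\0';
        printf("layout: %s\n", hdr);

        /* then the packed records, 4 + 16 = 20 bytes per task */
        buf = malloc(20 * MAX_TASKS);
        n = read(fd, buf, 20 * MAX_TASKS);
        /* ... walk buf in 20-byte strides ... */
        free(buf);
        close(fd);
        return 0;
}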

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2002-11-05 04:19:56

by jw schultz

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Mon, Nov 04, 2002 at 05:14:19PM -0800, Martin J. Bligh wrote:
> I think it's pretty trivial to make /proc/<pid>/psinfo, which
> dumps the garbage from all five files in one place. Which makes
> it 5 times better, but it still sucks.

And I'd still keep environ separate. I'm inclined to think
ps should never have presented it in the first place.
This is the direction I (for what it's worth) favor.

> > Don't get me wrong. I believe in the one field per file
> > rule but ps &co are the exception that proves (tests) the
> > rule. Especially on the heavily laden systems with
> > tens of thousands of tasks. We could do with something
> > between /dev/kmem and five files per pid.
>
> I had a very brief think about this at the weekend, seeing
> if I could make a big melting pot /proc/psinfo file that did
> seqfile and read everything out in one go, using seq_file
> internally to iterate over the tasklist. The most obvious
> problem that sprung to mind seems to be the tasklist locking -
> you obviously can't just hold a lock over the whole thing.
> As I know very little about that, I'll let someone else suggest
> how to do this, but I'm prepared to do the grunt work of implementing
> it if need be.

Yep, can't hold the lock across syscalls. That would be
quite a bit of data to hold in a per fd buffer. Think of
the big iron with tons of processes.

The other way I could see this working is to present it as a
sparse file. ps (or whatever) would first get a list of
pids, then iterate over them using lseek to set the file
offset to pid * CONSTANT_SIZE, and read would return
something smaller than CONSTANT_SIZE bytes. If the pid no
longer exists, read would return 0.
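
Roughly, with made-up numbers:

#include <fcntl.h>
#include <unistd.h>
#include <sys/types.h>

#define CONSTANT_SIZE 512       /* fixed slot per pid -- made up */

/* fd is the (hypothetical) sparse per-pid file, pids[] was fetched
 * beforehand from whatever hands out the pid list */
static void scan(int fd, const pid_t *pids, int npids)
{
        char rec[CONSTANT_SIZE];
        int i;

        for (i = 0; i < npids; i++) {
                ssize_t n;

                lseek(fd, (off_t)pids[i] * CONSTANT_SIZE, SEEK_SET);
                n = read(fd, rec, sizeof(rec));
                if (n == 0)
                        continue;       /* pid has gone away */
                /* parse the (smaller than CONSTANT_SIZE) record */
        }
}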

I really hate this idea. It stinks almost as much as
/dev/kmem.


--
________________________________________________________________
J.W. Schultz Pegasystems Technologies
email address: [email protected]

Remember Cernan and Schmitt

2002-11-05 04:35:46

by Erik Andersen

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Tue Nov 05, 2002 at 12:57:45AM -0300, Werner Almesberger wrote:
> Martin J. Bligh wrote:
> > I had a very brief think about this at the weekend, seeing
> > if I could make a big melting pot /proc/psinfo file
>
> You could take a more radical approach. Since the goal of such
> a psinfo file would be to accelerate access to information
> that's already available elsewhere, you can do away with many
> of the niceties of procfs, e.g.
>
> - no need to be human-readable (e.g. binary or hex dump may
> make sense in this case)
> - may use other operations than just open and read (e.g.

Hehe. You just reinvented my old /dev/ps driver. :)

http://www.busybox.net/cgi-bin/cvsweb/busybox/examples/kernel-patches/devps.patch.9_25_2000?rev=1.2&content-type=text/vnd.viewcvs-markup

This is what Linus has to say on the subject:

I do dislike /dev/ps mightily. If the problem is that /proc
is too large, then the right solution is to just clean up
/proc. Which is getting done. And yes, /proc will be larger
than /dev/ps, but I still find that preferable to having two
incompatible ways to do the same thing.

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2002-11-05 05:41:11

by Martin J. Bligh

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

> Hehe. You just reinvented my old /dev/ps driver. :)

Indeed, sounds much more like a /dev thing than a /proc thing
at this point ;-)

> http://www.busybox.net/cgi-bin/cvsweb/busybox/examples/kernel-patches/devps.patch.9_25_2000?rev=1.2&content-type=text/vnd.viewcvs-markup
>
> This is what Linus has to say on the subject:
>
> ... If the problem is that /proc
> is too large, then the right solution is to just clean up
> /proc. Which is getting done. And yes, /proc will be larger
> than /dev/ps, but I still find that preferable to having two
> incompatible ways to do the same thing.

Ummm ... how do we make /proc smaller than 1 file to open per PID?
It's pretty easy to get it down that far. But it still sucks.

> I do dislike /dev/ps mightily.

Well it can't be any worse than the current crap. At least it'd
stand a chance in hell of scaling a little bit. So I took a very
quick look ... what syscalls are you reduced to per pid, one ioctl
and one read?

M.



2002-11-05 05:48:15

by Martin J. Bligh

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

> And I'd still keep environ separate. I'm inclined to think
> ps should never have presented it in the first place.
> This is the direction I (for what it's worth) favor.

If it doesn't need it then sure, otherwise just dump whatever
it needs in there. The separate files would still be there too.

> Yep, can't hold the lock across syscalls. That would be
> quite a bit of data to hold in a per fd buffer. Think of
> the big iron with tons of processes.

I *have* the big iron with tons of processes ;-) That's why
I care ...

> The other way i could see this working is to present it as a
> sparse file. ps (or whatever) would first get a list of
> pids then iterate over them using lseek to set the file
> offset to pid * CONSTANT_SIZE and read would return
> something smaller than CONSTANT_SIZE bytes. If the pid no
> longer exists return 0.
>
> I really hate this idea. It stinks almost as much as
> /dev/kmem.

Well if we want to be gross and efficient, we could just compile
a kmem-diving dynamic library with every kernel compile and stick
it in /boot or somewhere. Mildly less extreme is a flat index file
for the data you need a la System.map. Then just open /dev/kmem
and grab what you want. Walking the tasklist with no locking would
be an interesting challenge, but probably not insurmountable.
That's how things like ps always used to work IIRC.

M.

2002-11-05 05:53:12

by Alexander Viro

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])



On Mon, 4 Nov 2002, Martin J. Bligh wrote:

> > I do dislike /dev/ps mightily.
>
> Well it can't be any worse than the current crap. At least it'd
> stand a chance in hell of scaling a little bit. So I took a very
> quick look ... what syscalls are you reduced to per pid, one ioctl
> and one read?

Oh, yes it can. Easily.
* device is not network-transparent - even in principle
* restricting data access would be harder - welcome to suid or
sgid country
* real killer: you think Albert would fail to produce equally
crappy code and equally crappy behaviour? Yeah, right.

2002-11-05 06:02:34

by Martin J. Bligh

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

>> Well it can't be any worse than the current crap. At least it'd
>> stand a chance in hell of scaling a little bit. So I took a very
>> quick look ... what syscalls are you reduced to per pid, one ioctl
>> and one read?
>
> Oh, yes it can. Easily.
> * device is not network-transparent - even in principle

Is that really a major issue for ps?

> * restricting data access would be harder - welcome to suid or
> sgid country

I can live with that level of pain if my benchmark doesn't get
driven into the wall by the tools that are meant to be monitoring
it ...

I'm sure there are bigger rocks to be thrown at it as well, and
ugly critters to be found under those rocks, but I don't see anything
insurmountable here yet. Whereas opening billions of files is just
unworkable.

Better still, you seem like an excellent candidate to propose a good
design that's efficient and workable?

> * real killer: you think Albert would fail to produce equally
> crappy code and equally crappy behaviour? Yeah, right.

Heh ;-)
A hostile takeover might suffice here, if necessary ...

M.

2002-11-05 06:08:22

by Werner Almesberger

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

Erik Andersen wrote:
> Hehe. You just reinvented my old /dev/ps driver. :)

Hmm, you still need 2+#pids operations, while my approach could
essentially do everything in a single read. Besides, many people
don't exactly love ioctls.

Furthermore, you'd need to version your big struct pid_info,
while my approach wouldn't have problems if fields are added or
removed (only if their content changes).

Two advantages of your approach are that the amount of data
cached in the kernel is limited to max(sizeof(pid_t)*#pids,
sizeof(struct pid_info)), and that you do less work under
tasklist_lock.

> I do dislike /dev/ps mightily. If the problem is that /proc
> is too large, then the right solution is to just clean up
> /proc. Which is getting done. And yes, /proc will be larger
> than /dev/ps, but I still find that preferable to having two
> incompatible ways to do the same thing.

Hmm yes, so he might not like my /proc/grab-taskdata-FAST
either :)

- Werner

--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/

2002-11-05 06:06:53

by Erik Andersen

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Mon Nov 04, 2002 at 09:44:07PM -0800, Martin J. Bligh wrote:
> > Hehe. You just reinvented my old /dev/ps driver. :)
>
> Indeed, sounds much more like a /dev thing than a /proc thing
> at this point ;-)
>
> > http://www.busybox.net/cgi-bin/cvsweb/busybox/examples/kernel-patches/devps.patch.9_25_2000?rev=1.2&content-type=text/vnd.viewcvs-markup
> >
> > This is what Linus has to say on the subject:
> >
> > ... If the problem is that /proc
> > is too large, then the right solution is to just clean up
> > /proc. Which is getting done. And yes, /proc will be larger
> > than /dev/ps, but I still find that preferable to having two
> > incompatible ways to do the same thing.
>
> Ummm ... how do we make /proc smaller than 1 file to open per PID?
> It's pretty easy to get it down that far. But it still sucks.
>
> > I do dislike /dev/ps mightily.
>
> Well it can't be any worse than the current crap. At least it'd
> stand a chance in hell of scaling a little bit. So I took a very
> quick look ... what syscalls are you reduced to per pid, one ioctl
> and one read?

As I implemented it, it was one ioctl per pid... Of course
it could be easily modified to be one syscall, one read from
the /dev/ps char device, or similar...

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2002-11-05 06:09:45

by Robert Love

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Tue, 2002-11-05 at 00:59, Alexander Viro wrote:

> Oh, yes it can. Easily.
> * device is not network-transparent - even in principle
> * restricting data access would be harder - welcome to suid or
> sgid country
> * real killer: you think Albert would fail to produce equally
> crappy code and equally crappy behaviour? Yeah, right.

Well I think Rik and I can handle it in our tree :)

But I agree - I do not care much for this /dev idea either.

Robert Love

2002-11-05 21:03:55

by kaih

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

[email protected] (Martin J. Bligh) wrote on 04.11.02 in <1118170000.1036458859@flay>:

> I had a very brief think about this at the weekend, seeing
> if I could make a big melting pot /proc/psinfo file that did
> seqfile and read everything out in one go, using seq_file
> internally to iterate over the tasklist. The most obvious
> problem that sprung to mind seems to be the tasklist locking -
> you obviously can't just hold a lock over the whole thing.

Well, one thing is to make certain you can actually do it with one or two
system calls. Say, one system call to figure out how big a buffer is
necessary (essentially, #tasks*size), then one read with a suitably-sized
buffer. Then have a loop in the kernel that drops the lock as often as
necessary, and otherwise puts it all in the buffer in one go. (If the
#tasks grows too fast so it overruns the buffer even with some slack given
in advance, tough, have a useful return code to indicate that and let ps
retry.)
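
In user space that's just a retry loop, something like this (the two
calls are placeholders for whatever the interface would actually be):

#include <stdlib.h>

#define RECORD_SIZE 128                         /* made up */

extern long psinfo_count(void);                 /* placeholder: "how big?" call */
extern long psinfo_read(void *buf, size_t len); /* placeholder: bulk read,
                                                   negative if buf overran */

static char *grab_tasks(long *bytes)
{
        for (;;) {
                long n = psinfo_count();
                size_t len = (n + n / 4 + 16) * RECORD_SIZE;    /* some slack */
                char *buf = malloc(len);
                long got = psinfo_read(buf, len);

                if (got >= 0) {
                        *bytes = got;
                        return buf;     /* caller parses got bytes */
                }
                free(buf);              /* #tasks outgrew the slack - retry */
        }
}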

I briefly thought about mmap, but I don't think that actually buys
anything.

MfG Kai

2002-11-05 21:27:07

by Erik Andersen

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Tue Nov 05, 2002 at 09:57:00PM +0200, Kai Henningsen wrote:
> [email protected] (Martin J. Bligh) wrote on 04.11.02 in <1118170000.1036458859@flay>:
>
> > I had a very brief think about this at the weekend, seeing
> > if I could make a big melting pot /proc/psinfo file that did
> > seqfile and read everything out in one go, using seq_file
> > internally to iterate over the tasklist. The most obvious
> > problem that sprung to mind seems to be the tasklist locking -
> > you obviously can't just hold a lock over the whole thing.
>
> Well, one thing is to make certain you can actually do it with one or two
> system calls. Say, one system call to figure out how big a buffer is
> necessary (essentially, #tasks*size), then one read with a suitably-sized
> buffer. Then have a loop in the kernel that drops the lock as often as
> necessary, and otherwise puts it all in the buffer in one go. (If the
> #tasks grows too fast so it overruns the buffer even with some slack given
> in advance, tough, have a useful return code to indicate that and let ps
> retry.)
>
> I briefly thought about mmap, but I don't think that actually buys
> anything.

Once again, reminds me of my /dev/ps driver, which had the
following ioctls:

#define DEVPS_GET_NUM_PIDS 0xeba1 /* Get a list of all PIDs */
#define DEVPS_GET_PID_LIST 0xeba2 /* Get a list of all PIDs */
#define DEVPS_GET_PID_INFO 0xeba3 /* Get info about a specific PID */
#define DEVPS_GET_CURRENT_PID 0xeba4 /* Get the current PID */

So a user space ps app would call DEVPS_GET_NUM_PIDS to find out
how many processes there are, then it would allocate some memory
(and would allocate some extra just in case some new processes
were to start up; the kernel would truncate things if we gave it
too little space). Then ps would grab the pid list by calling
the DEVPS_GET_PID_LIST ioctl, and then for each item in the list
it would call DEVPS_GET_PID_INFO. Assuming that call was
successful, ps would print out a line of output and move on to
the next pid in the list.
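
In other words, roughly this (ioctl numbers as above; the struct layout
and how the buffer size gets passed to DEVPS_GET_PID_LIST are from
memory and may not match the patch exactly):

#include <fcntl.h>
#include <stdlib.h>
#include <sys/ioctl.h>
#include <sys/types.h>
#include <unistd.h>

#define DEVPS_GET_NUM_PIDS      0xeba1
#define DEVPS_GET_PID_LIST      0xeba2
#define DEVPS_GET_PID_INFO      0xeba3

struct pid_info { pid_t pid; char comm[16]; /* ... */ };  /* stand-in */

int main(void)
{
        int fd = open("/dev/ps", O_RDONLY);
        int npids, i;
        pid_t *pids;

        ioctl(fd, DEVPS_GET_NUM_PIDS, &npids);
        npids += 32;                            /* slack for new tasks */
        pids = calloc(npids, sizeof(pid_t));
        ioctl(fd, DEVPS_GET_PID_LIST, pids);    /* kernel truncates to fit */

        for (i = 0; i < npids && pids[i]; i++) {
                struct pid_info info = { .pid = pids[i] };

                if (ioctl(fd, DEVPS_GET_PID_INFO, &info) == 0)
                        ;                       /* print one line for this pid */
        }
        free(pids);
        close(fd);
        return 0;
}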

The idea need not be implemented using ioctl and binary structures
(which were the things Linus objected to).

The same thing could be easily done using flat ascii...

-Erik

--
Erik B. Andersen http://codepoet-consulting.com/
--This message was written using 73% post-consumer electrons--

2002-11-05 21:59:00

by Karim Yaghmour

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])


I'm not sure why people are trying to make pigs fly, but if you
really need in-depth information regarding a process or a set
of processes, you should be looking at something that's been
designed from the ground up to actually carry this weight, which
is exactly what LTT is about. Using this approach, all the
accounting gets to be done in user-space. It's like using
"top -q" without the actual disadvantage of killing your system.

Karim

===================================================
Karim Yaghmour
[email protected]
Embedded and Real-Time Linux Expert
===================================================

2002-11-05 22:43:49

by Peter Chubb

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

>>>>> "Martin" == Martin J Bligh <[email protected]> writes:

>> And I'd still keep environ separate. I'm inclined to think ps
>> should never have presented it in the first place. This is the
>> direction I (for what it's worth) favor.

Martin> If it doesn't need it then sure, otherwise just dump whatever
Martin> it needs in there. The separate files would still be there
Martin> too.


Why not make an mmapable file in /proc with everything in it?
It can be sparse, so have the first part be a bitmap saying which
processes are active, then just look at the bits you need.
That gets rid of all but the open and mmap system calls.
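
Something like this on the reading side (sizes and the file name are
invented; the point is that only the open and the mmap are syscalls):

#include <fcntl.h>
#include <sys/mman.h>
#include <unistd.h>

#define MAX_PIDS        32768
#define BITMAP_BYTES    (MAX_PIDS / 8)
#define RECORD_SIZE     256             /* invented */

static void scan(void)
{
        size_t len = BITMAP_BYTES + (size_t)MAX_PIDS * RECORD_SIZE;
        int fd = open("/proc/psinfo", O_RDONLY);        /* hypothetical */
        unsigned char *map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, 0);
        int pid;

        for (pid = 0; pid < MAX_PIDS; pid++) {
                const unsigned char *rec;

                if (!(map[pid / 8] & (1 << (pid % 8))))
                        continue;       /* bit clear: no such process */
                rec = map + BITMAP_BYTES + (size_t)pid * RECORD_SIZE;
                /* ... pull the wanted fields out of rec ... */
                (void)rec;
        }
        munmap(map, len);
        close(fd);
}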

Peter C

2002-11-05 23:10:36

by bert hubert

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Tue, Nov 05, 2002 at 04:06:21PM -0800, Martin J. Bligh wrote:

> The locking of walking the tasklist seems non-trivial, but we may well
> end up with something like that. By the time you finish, it looks more
> like a /dev device thing than /proc (which I'm fine with), and looks

Can people just oprofile this instead of guessing? Opening a file is not
very expensive anymore, so if ps is noticeably slow, it must be something
else.

'To measure is to know'

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-11-05 23:04:50

by Martin J. Bligh

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

>>> And I'd still keep environ separate. I'm inclined to think ps
>>> should never have presented it in the first place. This is the
>>> direction I (for what it's worth) favor.
>
> Martin> If it doesn't need it then sure, otherwise just dump whatever
> Martin> it needs in there. The separate files would still be there
> Martin> too.
>
> Why not make an mmapable file in /proc with everything in it?
> It can be sparse, so have the first part a bit map to say what
> processes are active, then just look at the bits you need.
> That gets rid of all but the open and mmap system call.

The locking of walking the tasklist seems non-trivial, but we may well
end up with something like that. By the time you finish, it looks more
like a /dev device thing than /proc (which I'm fine with), and looks
rather like the patch that was posted earlier in this thread. Ideally
I'd like fewer ioctls than that, and lower syscall overhead, but it
seems much closer to functional than the current system.

M.

2002-11-06 00:03:34

by bert hubert

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Tue, Nov 05, 2002 at 04:57:47PM -0800, Martin J. Bligh wrote:

> > 'To measure is to know'
>
> Errm... we have profiled it. Look at the subject line ... this started
> off as a dcache_rcu discussion. The dcache lookup ain't cheap, for
> starters, but that's not really the problem ... it's O(number of tasks),
> which sucks.

Ok - but if opening a few files is the problem, the solution is not to roll
those files into one but to figure out why opening the files is slow in the
first place.

Apologies for misinterpreting some of the comments I saw, I'm a bit hyper
from my futex documenting spree.

Regards,

bert

--
http://www.PowerDNS.com Versatile DNS Software & Services
http://lartc.org Linux Advanced Routing & Traffic Control HOWTO

2002-11-05 23:58:14

by Martin J. Bligh

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

>> The locking of walking the tasklist seems non-trivial, but we may well
>> end up with something like that. By the time you finish, it looks more
>> like a /dev device thing than /proc (which I'm fine with), and looks
>
> Can people just oprofile this instead of guessing? Opening a file is not
> very expensive anymore, so if ps is noticeably slow, it must be something
> else.
>
> 'To measure is to know'

Errm... we have profiled it. Look at the subject line ... this started
off as a dcache_rcu discussion. The dcache lookup ain't cheap, for
starters, but that's not really the problem ... it's O(number of tasks),
which sucks.

M.

2002-11-06 00:21:44

by Martin J. Bligh

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

>> Errm... we have profiled it. Look at the subject line ... this started
>> off as a dcache_rcu discussion. The dcache lookup ain't cheap, for
>> starters, but that's not really the problem ... it's O(number of tasks),
>> which sucks.
>
> Ok - but if opening a few files is the problem, the solution is not to roll
> those files into one but to figure out why opening the files is slow in the
> first place.

It's not a few files if you have large numbers of tasks. It's an
interface that fundamentally wasn't designed to scale, and futzing
around tweaking the thing isn't going to cut it, it needs a different
design. I'm not proposing throwing out any of the old simple interfaces,
just providing something efficient as a data gathering interface for
those people who wish to use it.

M.

2002-11-06 00:28:13

by Alexander Viro

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])



On Tue, 5 Nov 2002, Martin J. Bligh wrote:

> It's not a few files if you have large numbers of tasks. It's an
> interface that fundamentally wasn't designed to scale, and futzing
> around tweaking the thing isn't going to cut it, it needs a different
> design. I'm not proposing throwing out any of the old simple interfaces,
> just providing something efficient as a data gathering interface for
> those people who wish to use it.

That's odd, to say the least. Userland side is at least linear by
number of tasks, regardless of the way you gather information. So
I really wonder how opening O(number of tasks) files can show up
when you scale the things up - pure userland parts will grow at
least as fast as that.

2002-11-08 00:27:10

by Rusty Russell

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Tue, 5 Nov 2002 19:34:47 -0500 (EST)
Alexander Viro <[email protected]> wrote:
> On Tue, 5 Nov 2002, Martin J. Bligh wrote:
>
> > It's not a few files if you have large numbers of tasks. It's an
> > interface that fundamentally wasn't designed to scale, and futzing
> > around tweaking the thing isn't going to cut it, it needs a different
> > design. I'm not proposing throwing out any of the old simple interfaces,
> > just providing something efficient as a data gathering interface for
> > those people who wish to use it.
>
> That's odd, to say the least. Userland side is at least linear by
> number of tasks, regardless of the way you gather information. So
> I really wonder how opening O(number of tasks) files can show up
> when you scale the things up - pure userland parts will grow at
> least as fast as that.

And I look forward to your "du" interface in the kernel, too. It uses
this terrible method of statting every file!

Now, according to wli, there's a real problem with starvation by saturating
the read side of the tasklist_lock so eg. the write_lock_irq(&tasklist_lock)
in exit.c's release_task causes a CPU to spin for ages (forever?) with
interrupts off. This needs fixing, be it RCU or making that particular
lock give way to writers, or some other effect.

But the "ps takes too much time from my benchmarks" is a "don't do that".
Or you may really hate the /dev/kmem hack, but it's simple and effective.
The "hundreds of users doing ps/top" is easily solved with a "ps" daemon.

Fixing top is probably a worthwhile thing to do, and you'll note that after
the first iteration it should be rarely necessary to read every "stat"
proc file.

Optimize userspace before putting a hack into the kernel, *please*!

Rusty.
--
there are those who do and those who hang on and you don't see too
many doers quoting their contemporaries. -- Larry McVoy

2002-11-08 03:56:38

by William Lee Irwin III

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Thu, Nov 07, 2002 at 11:06:13PM +1100, Rusty Russell wrote:
> Now, according to wli, there's a real problem with starvation by saturating
> the read side of the tasklist_lock so eg. the write_lock_irq(&tasklist_lock)
> in exit.c's release_task causes a CPU to spin for ages (forever?) with
> interrupts off. This needs fixing, be it RCU or making that particular
> lock give way to writers, or some other effect.

One way to at least "postpone" having to do things like making a fair
tasklist_lock is to make readers well-behaved. /proc/ is the worst
remaining offender left with its quadratic (!) get_pid_list(). After
"kernel, you're being bad and spinning in near-infinite loops with the
tasklist_lock readlocked" is (completely?) solved, then we can wait for
boxen with higher cpu counts to catch fire anyway when the arrival rate
of readers * hold time of readers > 1, which will happen because arrival
rates are O(cpus), and cpus will grow without bound as machines advance.

I'm not sure RCU would help this any; I'd be very much afraid of the
writes being postponed indefinitely or just too long in the presence
of what's essentially perpetually in-progress read access. Does RCU
have a guarantee of forward progress for writers?

Thanks,
Bill

2002-11-08 04:12:19

by Robert Love

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Thu, 2002-11-07 at 22:57, William Lee Irwin III wrote:

> One way to at least "postpone" having to do things like making a fair
> tasklist_lock is to make readers well-behaved. /proc/ is the worst
> remaining offender left with its quadratic (!) get_pid_list(). After
> "kernel, you're being bad and spinning in near-infinite loops with the
> tasklist_lock readlocked" is (completely?) solved, then we can wait for
> boxen with higher cpu counts to catch fire anyway when the arrival rate
> of readers * hold time of readers > 1, which will happen because arrival
> rates are O(cpus), and cpus will grow without bound as machines advance.
>
> I'm not sure RCU would help this any; I'd be very much afraid of the
> writes being postponed indefinitely or just too long in the presence
> of what's essentially perpetually in-progress read access. Does RCU
> have a guarantee of forward progress for writers?

I am not sure I like the idea of RCU for the tasklist_lock.

I do agree 100% with your first point, though - the problem is
ill-behaved readers. I think the writing vs. reading is such that the
rw-lock we have now is fine, we just need to make e.g. /proc play way
way more fair.

Robert Love

2002-11-08 04:20:22

by Daniel Jacobowitz

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Thu, Nov 07, 2002 at 11:17:21PM -0500, Robert Love wrote:
> On Thu, 2002-11-07 at 22:57, William Lee Irwin III wrote:
>
> > One way to at least "postpone" having to do things like making a fair
> > tasklist_lock is to make readers well-behaved. /proc/ is the worst
> > remaining offender left with its quadratic (!) get_pid_list(). After
> > "kernel, you're being bad and spinning in near-infinite loops with the
> > tasklist_lock readlocked" is (completely?) solved, then we can wait for
> > boxen with higher cpu counts to catch fire anyway when the arrival rate
> > of readers * hold time of readers > 1, which will happen because arrival
> > rates are O(cpus), and cpus will grow without bound as machines advance.
> >
> > I'm not sure RCU would help this any; I'd be very much afraid of the
> > writes being postponed indefinitely or just too long in the presence
> > of what's essentially perpetually in-progress read access. Does RCU
> > have a guarantee of forward progress for writers?
>
> I am not sure I like the idea of RCU for the tasklist_lock.
>
> I do agree 100% with your first point, though - the problem is
> ill-behaved readers. I think the writing vs. reading is such that the
> rw-lock we have now is fine, we just need to make e.g. /proc play way
> way more fair.

If you consider the atomicity that readdir() in /proc needs (that is:
almost none; it needs a guarantee that it will not miss a task which is
alive both before and after the opendir; but that's about it) then
there should be an easy way to augment the pidhash for this sort of
search.

--
Daniel Jacobowitz
MontaVista Software Debian GNU/Linux Developer

2002-11-08 04:24:57

by William Lee Irwin III

Subject: Re: ps performance sucks (was Re: dcache_rcu [performance results])

On Thu, 2002-11-07 at 22:57, William Lee Irwin III wrote:
>> I'm not sure RCU would help this any; I'd be very much afraid of the
>> writes being postponed indefinitely or just too long in the presence
>> of what's essentially perpetually in-progress read access. Does RCU
>> have a guarantee of forward progress for writers?

On Thu, Nov 07, 2002 at 11:17:21PM -0500, Robert Love wrote:
> I am not sure I like the idea of RCU for the tasklist_lock.
> I do agree 100% with your first point, though - the problem is
> ill-behaved readers. I think the writing vs. reading is such that the
> rw-lock we have now is fine, we just need to make e.g. /proc play way
> way more fair.

This is only feasible for small numbers of cpus. Any compensation
provided by algorithmic improvements on the read-side is outweighed by
NR_CPUS. Making readers well-behaved only solves half of the issue.
Whether the other half is addressable in a 2.6.x time scale is an open
question, but a question I'd like to see answered in favor of fixing
the livelocks sooner rather than later (esp. as 2.7+ issues are unlikely
to be resolved in line with hardware release schedules). I have a
strong bias toward code which works everywhere, all the time.

Bill