2006-08-22 08:36:06

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: [RFC][PATCH] ps command race fix take2 [1/4] list token

This is ps command race fix take2. Unfortunately, against 2.6.18-rc4.
I'll rebase this to appropriate kernel if O.K. (I think this is RFC)

This patch implements Paul Jackson's idea, 'inserting false link in task list'.

Good point of this approach is cost of searching task is O(N) (N=num of tgids).
Bad point is lock and kmalloc/kfree.
I didin't modified thread_list and cpuset's proc list, maybe future work.

If searching pid bitmap is better, please take Erics.

consists of 4 patches.
- listoken.patch -- implements token in list
- tasklist.patch -- make task list to use list_with_token
- profile.patch -- fix oprofile
- proc_pid_readdir.patch -- implements proc_pid_readdir for /proc/<pid>

Works well on x86/SMP system.

-Kame
==

listtoken , a helper for walking a list by intermittent access.

When we walk a list intermittently and the list is being modified at the
same time, it's hard to remember our position in it.

With this list token, a user can remember where he is reading by inserting
a token in the list.

Now this is designed just to support /proc/<pid> readdir() access. Not very
rich functions are supporetd.

Signed-Off-By: KAMEZAWA Hiroyuki <[email protected]>


include/linux/listtoken.h | 62 ++++++++++++++++++++++++++++++++++++++++++++++
lib/Makefile | 2 -
lib/listtoken.c | 25 ++++++++++++++++++
3 files changed, 88 insertions(+), 1 deletion(-)

Index: linux-2.6.18-rc4/include/linux/listtoken.h
===================================================================
--- /dev/null
+++ linux-2.6.18-rc4/include/linux/listtoken.h
@@ -0,0 +1,62 @@
+/*
+ helper routine for intermittent list walker.
+ token saves position in a list.
+*/
+
+
+#ifndef _LINUX_LISTTOKEN_H
+#define _LINUX_LISTTOKEN_H
+
+#ifdef __KERNEL__
+#include <linux/list.h>
+#include <linux/rcupdate.h>
+
+struct list_token {
+ struct list_head list;
+ long token; /* 0: data 1: token */
+ struct rcu_head rcu; /* useful when walking list with RCU */
+};
+
+#define LIST_TOKEN_INIT(token) {LIST_HEAD_INIT((token).list),0,}
+
+static inline void
+list_add_rcu_token(struct list_token *new, struct list_token *head)
+{
+ list_add_rcu(&new->list, &head->list);
+}
+
+static inline void
+list_add_tail_rcu_token(struct list_token *new, struct list_token *head)
+{
+ list_add_tail_rcu(&new->list, &head->list);
+}
+
+static inline void
+list_del_rcu_token(struct list_token *token)
+{
+ list_del_rcu(&token->list);
+}
+
+/* returns next valid ent in a list skip token. */
+static inline struct list_token *
+list_next_rcu_skiptoken(struct list_token *ent)
+{
+ do {
+ ent = list_entry(rcu_dereference(ent->list.next),
+ struct list_token, list);
+ } while (ent->token);
+ return ent;
+}
+/* set *token* before specfied list entry */
+static inline void
+insert_list_token_rcu(struct list_token *token, struct list_token *ent)
+{
+ token->token = 1; /* mark this as token */
+ list_add_tail_rcu_token(token, ent);
+}
+
+/* remove token */
+extern void remove_list_token_rcu(struct list_token *ent);
+
+#endif
+#endif
Index: linux-2.6.18-rc4/lib/Makefile
===================================================================
--- linux-2.6.18-rc4.orig/lib/Makefile
+++ linux-2.6.18-rc4/lib/Makefile
@@ -5,7 +5,7 @@
lib-y := errno.o ctype.o string.o vsprintf.o cmdline.o \
bust_spinlocks.o rbtree.o radix-tree.o dump_stack.o \
idr.o div64.o int_sqrt.o bitmap.o extable.o prio_tree.o \
- sha1.o
+ sha1.o listtoken.o

lib-$(CONFIG_SMP) += cpumask.o

Index: linux-2.6.18-rc4/lib/listtoken.c
===================================================================
--- /dev/null
+++ linux-2.6.18-rc4/lib/listtoken.c
@@ -0,0 +1,25 @@
+/*
+ implements token to remember the place in a list. useful for
+ traversing busily modified list by intermittent access
+*/
+#include <linux/types.h>
+#include <linux/gfp.h>
+#include <linux/listtoken.h>
+#include <linux/slab.h>
+#include <linux/module.h>
+/*
+ * for freeing RCU token.
+ */
+static void delayed_free_token(struct rcu_head *rhp)
+{
+ kfree(container_of(rhp, struct list_token , rcu));
+}
+
+void remove_list_token_rcu(struct list_token *token)
+{
+ BUG_ON(!token->token);
+ list_del_rcu_token(token);
+ call_rcu(&token->rcu, delayed_free_token);
+}
+
+EXPORT_SYMBOL_GPL(remove_list_token_rcu);






2006-08-22 11:06:38

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

KAMEZAWA Hiroyuki <[email protected]> writes:
>
> listtoken , a helper for walking a list by intermittent access.
>
> When we walk a list intermittently and the list is being modified at the
> same time, it's hard to remember our position in it.
>
> With this list token, a user can remember where he is reading by inserting
> a token in the list.

I think the header/code needs much more comments so that other people
can still make sense of it later. Even with the commit log it's not
quite clear how it works.

Also locking needs to be explained. I suppose only one user is allowed
to use a token at one time?

But overall it's a good idea to have a generic facility like this.
I suppose it will be a more common problem in the future.

-Andi

2006-08-22 11:20:49

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

On 22 Aug 2006 13:06:34 +0200
Andi Kleen <[email protected]> wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
> >
> > listtoken , a helper for walking a list by intermittent access.
> >
> > When we walk a list intermittently and the list is being modified at the
> > same time, it's hard to remember our position in it.
> >
> > With this list token, a user can remember where he is reading by inserting
> > a token in the list.
>
> I think the header/code needs much more comments so that other people
> can still make sense of it later. Even with the commit log it's not
> quite clear how it works.
>
Okay, make them more informative.

> Also locking needs to be explained.
Yes, with RCU. we need appropriate lock.

> I suppose only one user is allowed to use a token at one time?
>
Yes, I expext that a user uses a token. But several tokens can be inserted to a list.

> But overall it's a good idea to have a generic facility like this.
> I suppose it will be a more common problem in the future.
>

Thanks,
-Kame

2006-08-22 16:56:51

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

KAMEZAWA Hiroyuki <[email protected]> writes:

> This is ps command race fix take2. Unfortunately, against 2.6.18-rc4.
> I'll rebase this to appropriate kernel if O.K. (I think this is RFC)
>
> This patch implements Paul Jackson's idea, 'inserting false link in task list'.

Currently the tasklist_lock is one of the more highly contended locks in
the kernel. Adding an extra place it is taken is undesirable.
If could see a better algorithm for sending a signal to all processes
in a process groups we could remove the tasklist_lock entirely.

In addition you only solves half the readdir problems. You don't solve
the seek problem which is returning to an offset you had been to
before. A relatively rare case but...

> Good point of this approach is cost of searching task is O(N) (N=num of tgids).
> Bad point is lock and kmalloc/kfree.
> I didin't modified thread_list and cpuset's proc list, maybe future work.
>
> If searching pid bitmap is better, please take Erics.

My patch at least needs a good changelog but I believe it will work
better and can be further improved with a better pid data structure
if there is actually a problem there. Given that I don't take
any locks it should be much friendlier at scale, and the code
was simpler.

However I will miss a few newly forked processes and I don't think your
technique will miss any. Still neither will miss a process that
existed the entire time.

If nothing else I think it was worth posting so we could contrast the two.

Eric

2006-08-22 22:23:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

On Tue, 22 Aug 2006 10:56:08 -0600
[email protected] (Eric W. Biederman) wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > This is ps command race fix take2. Unfortunately, against 2.6.18-rc4.
> > I'll rebase this to appropriate kernel if O.K. (I think this is RFC)
> >
> > This patch implements Paul Jackson's idea, 'inserting false link in task list'.
>
> Currently the tasklist_lock is one of the more highly contended locks in
> the kernel. Adding an extra place it is taken is undesirable.
yes. taking lock is a probem.
I know current readdir() uses 8192 bytes buffer for getdents64(). Then,
maybe write-lock will be acquired all-tgids/400+ times for inserting token
(in 32bit system).

> If could see a better algorithm for sending a signal to all processes
> in a process groups we could remove the tasklist_lock entirely.
>
??
Sorry, could you explain more ?

> In addition you only solves half the readdir problems. You don't solve
> the seek problem which is returning to an offset you had been to
> before. A relatively rare case but...
>
Ah, I should add lseek handler for proc root. Okay.

> > Good point of this approach is cost of searching task is O(N) (N=num of tgids).
> > Bad point is lock and kmalloc/kfree.
> > I didin't modified thread_list and cpuset's proc list, maybe future work.
> >
> > If searching pid bitmap is better, please take Erics.
>
> My patch at least needs a good changelog but I believe it will work
> better and can be further improved with a better pid data structure
> if there is actually a problem there. Given that I don't take
> any locks it should be much friendlier at scale, and the code
> was simpler.
yes. it has several good points and simple.
My patch's point is just using task_list if we can, because it exists for keeping
all tasks(tgids).

>
> However I will miss a few newly forked processes and I don't think your
> technique will miss any. Still neither will miss a process that
> existed the entire time.
>
> If nothing else I think it was worth posting so we could contrast the two.
>
please post again. I think comparing the two is good.
I will post take3 with improved comments and lseek handler, and so on.

-Kame

2006-08-23 06:12:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Tue, 22 Aug 2006 10:56:08 -0600
> [email protected] (Eric W. Biederman) wrote:
>
>> KAMEZAWA Hiroyuki <[email protected]> writes:
>>
>> > This is ps command race fix take2. Unfortunately, against 2.6.18-rc4.
>> > I'll rebase this to appropriate kernel if O.K. (I think this is RFC)
>> >
>> > This patch implements Paul Jackson's idea, 'inserting false link in task
> list'.
>>
>> Currently the tasklist_lock is one of the more highly contended locks in
>> the kernel. Adding an extra place it is taken is undesirable.
> yes. taking lock is a probem.
> I know current readdir() uses 8192 bytes buffer for getdents64(). Then,
> maybe write-lock will be acquired all-tgids/400+ times for inserting token
> (in 32bit system).
>
>> If could see a better algorithm for sending a signal to all processes
>> in a process groups we could remove the tasklist_lock entirely.
>>
> ??
> Sorry, could you explain more ?

The core problem is not when there is a single user. The problem is
that no matter how large the system gets we have a single lock. So it
gets increasingly contended.

I almost removed the tasklist_lock from all read paths. But as it
happens sending a signal to a process group is an atomic operation
with respect to fork so that path has to take the lock, or else
we get places where "kill -9 -pgrp" fails to kill every process in
the process group. Which is even worse.


>> In addition you only solves half the readdir problems. You don't solve
>> the seek problem which is returning to an offset you had been to
>> before. A relatively rare case but...
>>
> Ah, I should add lseek handler for proc root. Okay.

Hmm. Possibly. Mostly what I was thinking is that a token in the
list simply cannot solve the problem of a guaranteeing lseek to a
previous position works. I really haven't looked closely on
how you handle that case.

>> > Good point of this approach is cost of searching task is O(N) (N=num of
> tgids).
>> > Bad point is lock and kmalloc/kfree.
>> > I didin't modified thread_list and cpuset's proc list, maybe future work.
>> >
>> > If searching pid bitmap is better, please take Erics.
>>
>> My patch at least needs a good changelog but I believe it will work
>> better and can be further improved with a better pid data structure
>> if there is actually a problem there. Given that I don't take
>> any locks it should be much friendlier at scale, and the code
>> was simpler.
> yes. it has several good points and simple.
> My patch's point is just using task_list if we can, because it exists for
> keeping
> all tasks(tgids).

One of the reasons I have an issue with it, is that with the
impending introduction of multiple pid spaces is that the task list
really isn't what we want to traverse.

>> However I will miss a few newly forked processes and I don't think your
>> technique will miss any. Still neither will miss a process that
>> existed the entire time.
>>
>> If nothing else I think it was worth posting so we could contrast the two.
>>
> please post again. I think comparing the two is good.
> I will post take3 with improved comments and lseek handler, and so
> on.

I intend to, I'm unfortunately busy in another direction at the
moment.

Eric

2006-08-23 08:31:04

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

On Wed, 23 Aug 2006 00:11:17 -0600
[email protected] (Eric W. Biederman) wrote:

> KAMEZAWA Hiroyuki <[email protected]> writes:
>
> > On Tue, 22 Aug 2006 10:56:08 -0600
> > [email protected] (Eric W. Biederman) wrote:
> The core problem is not when there is a single user. The problem is
> that no matter how large the system gets we have a single lock. So it
> gets increasingly contended.

> I almost removed the tasklist_lock from all read paths. But as it
> happens sending a signal to a process group is an atomic operation
> with respect to fork so that path has to take the lock, or else
> we get places where "kill -9 -pgrp" fails to kill every process in
> the process group. Which is even worse.
>
Hmm, maybe tasklist_lock covers too wide area.
we can add some other (RCU) lock just for linked list from init_task.tasks.
And pid_alive() will help people who want to access not stale task.

Now, job in fork() is
- set cpu allowed
- set parent
- attach pgid, sid
- add to linked list from init_task
- attach pid

Then, adding for_each_alive_process() and new (RCU) lock for
linked_list_from_init_task_lock (divide lock) and change job as

- set cpu allowed
- set parent
- attach pgid, sid
- attach pid
new_list_writelock()
- add to linked list
new_list_writeunlock()

may reduce contention. for_each_alive_process() will do

rcu_readlock()
for (task =....)
if (!pid_alive(task))
continue;
rcu_readunlock();

Is this bad ?

> >> In addition you only solves half the readdir problems. You don't solve
> >> the seek problem which is returning to an offset you had been to
> >> before. A relatively rare case but...
> >>
> > Ah, I should add lseek handler for proc root. Okay.
>
> Hmm. Possibly. Mostly what I was thinking is that a token in the
> list simply cannot solve the problem of a guaranteeing lseek to a
> previous position works. I really haven't looked closely on
> how you handle that case.
>
I'll try some. But lseek on directory, which is modified at any moment, cannot
work stable anyway.

> > My patch's point is just using task_list if we can, because it exists for
> > keeping
> > all tasks(tgids).
>
> One of the reasons I have an issue with it, is that with the
> impending introduction of multiple pid spaces is that the task list
> really isn't what we want to traverse.
>
Yes, scanning the whole space is not good.
I think this can be handlerd by task_lists per pid-space.
Is pidmap is maintained per pid-space ?

-Kame

2006-08-23 09:55:19

by Avi Kivity

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

[email protected] wrote:
>
> I almost removed the tasklist_lock from all read paths. But as it
> happens sending a signal to a process group is an atomic operation
> with respect to fork so that path has to take the lock, or else
> we get places where "kill -9 -pgrp" fails to kill every process in
> the process group. Which is even worse.
>

Can't that be fixed by adding a per-pgrp lock, and having both
fork()/clone() and kill(-pgrp) take that lock?

--
error compiling committee.c: too many arguments to function

2006-08-23 11:36:14

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

KAMEZAWA Hiroyuki <[email protected]> writes:

> On Wed, 23 Aug 2006 00:11:17 -0600
> [email protected] (Eric W. Biederman) wrote:
>
>> KAMEZAWA Hiroyuki <[email protected]> writes:
>>
>> > On Tue, 22 Aug 2006 10:56:08 -0600
>> > [email protected] (Eric W. Biederman) wrote:
>> The core problem is not when there is a single user. The problem is
>> that no matter how large the system gets we have a single lock. So it
>> gets increasingly contended.
>
>> I almost removed the tasklist_lock from all read paths. But as it
>> happens sending a signal to a process group is an atomic operation
>> with respect to fork so that path has to take the lock, or else
>> we get places where "kill -9 -pgrp" fails to kill every process in
>> the process group. Which is even worse.
>>
> Hmm, maybe tasklist_lock covers too wide area.
> we can add some other (RCU) lock just for linked list from init_task.tasks.
> And pid_alive() will help people who want to access not stale task.
>
> Now, job in fork() is
> - set cpu allowed
> - set parent
> - attach pgid, sid
> - add to linked list from init_task
> - attach pid
>
> Then, adding for_each_alive_process() and new (RCU) lock for
> linked_list_from_init_task_lock (divide lock) and change job as
>
> - set cpu allowed
> - set parent
> - attach pgid, sid
> - attach pid
> new_list_writelock()
> - add to linked list
> new_list_writeunlock()
>
> may reduce contention. for_each_alive_process() will do
>
> rcu_readlock()
> for (task =....)
> if (!pid_alive(task))
> continue;
> rcu_readunlock();
>
> Is this bad ?

What you are proposing is to reduce contention by having several different
locks for each of the global data structures. The process tree, the task list,
the process groups and the sessions. If this was all a really hot spot
then doing that might be useful. In practice it doesn't really
address the problem that we have. We have global locks we need to
take, and we have global data structures we need to modify. The
result are global cache line bounces every time you write to the data
structure.

With purely reads on such a global data structure you are probably ok,
with respect to scalability as you can get shared cache lines.
Writing to a global data structure that is heavily read will generally
scale badly, simply because of cache line bounces.

So until some clever person figures out how to refactor the data
structures so we they are not global it is probably best not modify
them if we can avoid it.

>> >> In addition you only solves half the readdir problems. You don't solve
>> >> the seek problem which is returning to an offset you had been to
>> >> before. A relatively rare case but...
>> >>
>> > Ah, I should add lseek handler for proc root. Okay.
>>
>> Hmm. Possibly. Mostly what I was thinking is that a token in the
>> list simply cannot solve the problem of a guaranteeing lseek to a
>> previous position works. I really haven't looked closely on
>> how you handle that case.
>>
> I'll try some. But lseek on directory, which is modified at any moment, cannot
> work stable anyway.

It can work as well as anything else in readdir. It can ensure that you don't
miss things that haven't been added or deleted during the while you are in
the middle of readdir. I'm just after the usual Single Unix Spec/POSIX guarantees.
The same thing that are missing in the current readdir implementation.

I think the ideal case for a readdir would be a sequence number that
always increases, that is assigned each time a new directory entry (in
our case a task) is created. This will ensure we see everything new
but not everything old. The problem with that approach is that
all such counters eventually wrap. The core problem is that we have
to give out an indefinite number of tokens that last for an indefinite
amount of time.

If we relax the rules and allow ourselves to miss new directory
entries then anything that provides a total ordering of the
entries, allows us to export the key by which the total ordering
is defined to user space, and provides the concept of finding the
entry at or after this key provides us with what we need to implement
a solid readdir implementation complete with supporting seeks.

>> > My patch's point is just using task_list if we can, because it exists for
>> > keeping
>> > all tasks(tgids).
>>
>> One of the reasons I have an issue with it, is that with the
>> impending introduction of multiple pid spaces is that the task list
>> really isn't what we want to traverse.
>>
> Yes, scanning the whole space is not good.
> I think this can be handlerd by task_lists per pid-space.
> Is pidmap is maintained per pid-space ?

It will be. It pretty much has to be, and the actual accessor function
did not actually assume you were using the pid bitmap. Merely that you
could traverse through pids in order.

Eric

2006-08-23 12:12:49

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

Avi Kivity <[email protected]> writes:

> [email protected] wrote:
>>
>> I almost removed the tasklist_lock from all read paths. But as it
>> happens sending a signal to a process group is an atomic operation
>> with respect to fork so that path has to take the lock, or else
>> we get places where "kill -9 -pgrp" fails to kill every process in
>> the process group. Which is even worse.
>>
>
> Can't that be fixed by adding a per-pgrp lock, and having both fork()/clone()
> and kill(-pgrp) take that lock?

Possibly. The core issue though is that you still need to take a lock and
a big group can be as bad as just about anything else. So all you do with
a per group lock is you change the odds of hitting the problem and make the
code a little more complicated. For the small systems that most people have
I don't believe the tasklist_lock shows up at all.

If someone can find a data structure that I could use on two independent
machines to create processes in the same process group and still allow atomic
kill behavior between those two machines I think we would have something that
could be made to scale very well.

Until the point where I see the truly better data structure or that people
who can actually see problems with the lock start to fix it. I think
it is not to modify the data structure more than necessary, at runtime.

Modifying the global task list in the middle of readdir looks like it will
allow user space simply by running top with a fast update frequency to
cause problems for people on bigger machines. Which is really the
wrong direction to go.

Eric

2006-08-23 12:47:35

by Kamezawa Hiroyuki

[permalink] [raw]
Subject: Re: [RFC][PATCH] ps command race fix take2 [1/4] list token

On Wed, 23 Aug 2006 05:35:08 -0600
[email protected] (Eric W. Biederman) wrote:

> What you are proposing is to reduce contention by having several different
> locks for each of the global data structures.
not for each, just a lock for a list for for_each_process ;)
About cache bounsing, it's problem if heavy.
In my plan, fork/exit/proc_readdir will have write lock of
for_each_process_write_lock. talking this again after take3 will be good.
If I'm very lucky, I'll find some another way..

> >> >> In addition you only solves half the readdir problems. You don't solve
> >> >> the seek problem which is returning to an offset you had been to
> >> >> before. A relatively rare case but...
> >> >>
> >> > Ah, I should add lseek handler for proc root. Okay.
> >>
> >> Hmm. Possibly. Mostly what I was thinking is that a token in the
> >> list simply cannot solve the problem of a guaranteeing lseek to a
> >> previous position works. I really haven't looked closely on
> >> how you handle that case.
> >>
> > I'll try some. But lseek on directory, which is modified at any moment, cannot
> > work stable anyway.
>
> It can work as well as anything else in readdir. It can ensure that you don't
> miss things that haven't been added or deleted during the while you are in
> the middle of readdir. I'm just after the usual Single Unix Spec/POSIX guarantees.
> The same thing that are missing in the current readdir implementation.
>
BTW, what position means at lseek() in directory ?
bytes ? implementation dependent ?

I'm thinking of implementing "position" as offset in task list.
Hmm..about lseek(), it's obvious that searching in a table has an advantage.
we cannot define position with list.
What will you do if user moves f->pos to not-used-position.

I have no complaint about pidmap scanning next_tgid() unless it doesn't scan
all over the world.

-Kame