In the pid virtualization thread it was noticed that currently in
the kernel we have variables that hold pids of processes or
process groups that we later intend to send signals to.
It was suggested that we instead use pointers to struct task_struct
and reference count them to get around this problem.
The problem with that is that each file descriptor in the system has one
of these variables in f_owner, an instance of struct fown_struct. Holding
a task_struct reference there would allow a user to bypass the limits on
the number of processes and on the amount of memory they may consume,
by way of invisible zombie processes.
For kernel threads, where things are tightly coordinated, this is not an
issue, and holding a reference to the task struct of a kernel thread
seems the sane thing to do.
The other problem was that pointers to struct task_struct don't map
well to process groups.
I believe the solution to these problems is to introduce a small data
structure that can be reference counted and holds a pointer to a
struct task_struct. The struct task_struct holds a pointer back to this
small data structure, which is used to clear the task pointer
when the struct task_struct is freed.
The following series of patches implements the data structure and
the routines for manipulating it, and applies it in two common
cases in the kernel that demonstrate its usefulness.
Eric
Holding a reference to a task for purposes of possibly sending
it a signal later can potentially pin large amounts of memory.
Not holding a reference to a task but instead using pids can
result in the wrong task getting the signal.
struct task_ref and the associated tref functions should get
around this problem. struct task_ref is a three-word reference counted
structure that can be pointed at instead of a task, and it in
turn points at the task for you. This structure is
updated whenever a task exits so that it is always pointing
at a valid task or holds a NULL pointer.
In addition task_ref has a type field and is associated
with a type of pid. This allows it to track process groups
and sessions as well as individual task structures.
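For illustration, a subsystem that today stores a pid_t for later
signalling could use the interface roughly like this (a sketch only;
notify_ref and the helper names are made up, not part of the patches):

#include <linux/sched.h>
#include <linux/pid.h>

/* Starts out pointing at init_tref, i.e. "no task remembered". */
static struct task_ref *notify_ref = TASK_REF_INIT;

/* Remember the calling task so it can be signalled later. */
static void remember_current(void)
{
	tref_set(&notify_ref, tref_get_by_task(current, PIDTYPE_PID));
}

/* Signal the remembered task, if it still exists. */
static void poke(int sig)
{
	struct task_struct *task;

	read_lock(&tasklist_lock);
	task = notify_ref->task;	/* NULL once the task has exited */
	if (task)
		send_sig(sig, task, 1);
	read_unlock(&tasklist_lock);
}

/* Drop the reference; notify_ref goes back to &init_tref. */
static void forget(void)
{
	tref_clear(&notify_ref);
}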
Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/pid.h | 2 +
include/linux/sched.h | 27 ++++++++++++++++
kernel/fork.c | 8 +++++
kernel/pid.c | 82 +++++++++++++++++++++++++++++++++++++++++++++++++
4 files changed, 119 insertions(+), 0 deletions(-)
29bb70ab97b015cb393aa13ee30cd179b755d1c1
diff --git a/include/linux/pid.h b/include/linux/pid.h
index 5b2fcb1..e206149 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -17,6 +17,8 @@ struct pid
struct hlist_node pid_chain;
/* list of pids with the same nr, only one of them is in the hash */
struct list_head pid_list;
+ /* Does a weak references of this type exsit to the task struct? */
+ struct task_ref *ref;
};
#define pid_task(elem, type) \
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 8645ae1..12f3cc5 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -252,6 +252,33 @@ arch_get_unmapped_area_topdown(struct fi
extern void arch_unmap_area(struct mm_struct *, unsigned long);
extern void arch_unmap_area_topdown(struct mm_struct *, unsigned long);
+struct task_ref
+{
+ atomic_t count;
+ enum pid_type type;
+ struct task_struct *task;
+};
+
+/* Note to read a usable value task value from struct task_ref
+ * the tasklist_lock must be held. The atomic property of single
+ * word reads will keep any vaule you read consistent but it doesn't
+ * protect you from the race of the task exiting on another cpu and
+ * having it's task_struct freed or reused. Holding the tasklist_lock
+ * prevents the task from going away as you derference the task pointer.
+ */
+
+extern struct task_ref init_tref;
+#define TASK_REF_INIT (&init_tref)
+#define TASK_REF(name) \
+ struct task_ref *name = TASK_REF_INIT
+
+extern void tref_put(struct task_ref *ref);
+extern struct task_ref *tref_get(struct task_ref *ref);
+extern struct task_ref *tref_get_by_task(task_t *task, enum pid_type type);
+extern struct task_ref *tref_get_by_pid(int who, enum pid_type type);
+extern void tref_set(struct task_ref **dst, struct task_ref *ref);
+extern void tref_clear(struct task_ref **dst);
+
#if NR_CPUS >= CONFIG_SPLIT_PTLOCK_CPUS
/*
* The mm counters are not protected by its page_table_lock,
diff --git a/kernel/fork.c b/kernel/fork.c
index 4ae8cfc..0b0df9c 100644
--- a/kernel/fork.c
+++ b/kernel/fork.c
@@ -157,6 +157,7 @@ void __init fork_init(unsigned long memp
static struct task_struct *dup_task_struct(struct task_struct *orig)
{
+ int type;
struct task_struct *tsk;
struct thread_info *ti;
@@ -179,6 +180,13 @@ static struct task_struct *dup_task_stru
/* One for us, one for whoever does the "release_task()" (usually parent) */
atomic_set(&tsk->usage,2);
atomic_set(&tsk->fs_excl, 0);
+
+ /* Initially there are no weak references to this task */
+ for(type = 0; type < PIDTYPE_MAX; type++) {
+ tsk->pids[type].nr = 0;
+ tsk->pids[type].ref = TASK_REF_INIT;
+ }
+
return tsk;
}
diff --git a/kernel/pid.c b/kernel/pid.c
index 7890867..7c40310 100644
--- a/kernel/pid.c
+++ b/kernel/pid.c
@@ -168,21 +168,31 @@ int fastcall attach_pid(task_t *task, en
static fastcall int __detach_pid(task_t *task, enum pid_type type)
{
+ task_t *task_next;
struct pid *pid, *pid_next;
+ struct task_ref *ref;
int nr = 0;
pid = &task->pids[type];
+ ref = pid->ref;
if (!pid->nr)
goto out;
if (!hlist_unhashed(&pid->pid_chain)) {
if (list_empty(&pid->pid_list)) {
+ if (ref)
+ ref->task = NULL;
nr = pid->nr;
hlist_del_rcu(&pid->pid_chain);
} else {
+ task_next = pid_task(pid->pid_list.next, type);
pid_next = list_entry(pid->pid_list.next,
struct pid, pid_list);
+ /* Update the reference to point at the next task */
+ if (ref)
+ ref->task = task_next;
+ pid_next->ref = ref;
/* insert next pid from pid_list to hash */
hlist_replace_rcu(&pid->pid_chain,
&pid_next->pid_chain);
@@ -223,6 +233,78 @@ task_t *find_task_by_pid_type(int type,
EXPORT_SYMBOL(find_task_by_pid_type);
+struct task_ref init_tref = {
+ .count = ATOMIC_INIT(0),
+ .type = PIDTYPE_PID,
+ .task = NULL,
+};
+
+void tref_put(struct task_ref *ref)
+{
+ if (ref && (ref != &init_tref) && atomic_dec_and_test(&ref->count)) {
+ write_lock(&tasklist_lock);
+ if (ref->task)
+ ref->task->pids[ref->type].ref = &init_tref;
+ write_unlock(&tasklist_lock);
+ kfree(ref);
+ }
+}
+
+struct task_ref *tref_get(struct task_ref *ref)
+{
+ if (!ref)
+ ref = &init_tref;
+ if (ref != &init_tref)
+ atomic_inc(&ref->count);
+ return ref;
+}
+
+struct task_ref *tref_get_by_task(task_t *task, enum pid_type type)
+{
+ struct task_ref *ref;
+
+ write_lock(&tasklist_lock);
+ ref = &init_tref;
+ if (task && pid_alive(task)) {
+ ref = task->pids[type].ref;
+ if ((ref == &init_tref) && pid_alive(task)) {
+ struct task_ref *new_ref;
+ new_ref = kmalloc(sizeof(*new_ref), GFP_KERNEL);
+ if (new_ref) {
+ atomic_set(&new_ref->count, 0);
+ new_ref->type = type;
+ new_ref->task = task;
+ task->pids[type].ref = new_ref;
+ ref = new_ref;
+ }
+ }
+ }
+ ref = tref_get(ref);
+ write_unlock(&tasklist_lock);
+ return ref;
+}
+
+struct task_ref *tref_get_by_pid(int who, enum pid_type type)
+{
+ task_t *task;
+ task = find_task_by_pid_type(type, who);
+ return tref_get_by_task(task, type);
+}
+
+void tref_set(struct task_ref **dst, struct task_ref *ref)
+{
+ tref_put(*dst);
+ if (!ref)
+ ref = &init_tref;
+ *dst = ref;
+}
+
+void tref_clear(struct task_ref **dst)
+{
+ tref_put(*dst);
+ *dst = &init_tref;
+}
+
/*
* This function switches the PIDs if a non-leader thread calls
* sys_execve() - this must be done without releasing the PID.
--
1.1.5.g3480
Currently do_each_task_pid/while_each_task_pid make it
possible to iterate through each task of a group of
processes given the pid of the group.
do_each_task/while_each_task do the same thing but assume
you are starting with the first task of a process group.
This is needed so we can start with a task_ref instead
of a pid.
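For illustration, starting from a task_ref the iteration looks roughly
like this, which is essentially what __kill_tref_info in the next patch
does (the function name here is made up):

#include <linux/sched.h>
#include <linux/pid.h>

static void visit_tasks_in_ref(struct task_ref *ref)
{
	struct task_struct *p;

	read_lock(&tasklist_lock);
	p = ref->task;		/* first task of the pid/pgrp/session, or NULL */
	do_each_task(p, ref->type) {
		/* act on p here, e.g. send it a signal */
	} while_each_task(p, ref->type);
	read_unlock(&tasklist_lock);
}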
Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/pid.h | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)
34e71e62a13a78350239c18e3e3c31e79593380f
diff --git a/include/linux/pid.h b/include/linux/pid.h
index e206149..7fa9b2e 100644
--- a/include/linux/pid.h
+++ b/include/linux/pid.h
@@ -54,4 +54,16 @@ extern void switch_exec_pids(struct task
hlist_unhashed(&(task)->pids[type].pid_chain)); \
} \
+#define do_each_task(task, type) \
+ if (task) { \
+ prefetch((task)->pids[type].pid_list.next); \
+ do {
+
+#define while_each_task(task, type) \
+ } while (task = pid_task((task)->pids[type].pid_list.next, \
+ (type)), \
+ prefetch((task)->pids[type].pid_list.next), \
+ hlist_unhashed(&(task)->pids[type].pid_chain)); \
+ }
+
#endif /* _LINUX_PID_H */
--
1.1.5.g3480
Currently most functions for sending a signal to
a task or group of tasks take a pid as an argument.
kill_tref instead takes a task_ref. I would have used a
function that simply takes a task, but the tasklist_lock
must be taken to ensure that there is not a race in
dereferencing task_ref->task.
kill_tref can stand in for either kill_proc or kill_pg,
depending on the type of the reference.
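For example (a sketch; the variable and helper names are invented), a
subsystem that today stores a pid or -pgrp and calls kill_proc()/kill_pg()
would end up with something like:

#include <linux/sched.h>
#include <linux/pid.h>

static struct task_ref *owner = TASK_REF_INIT;

/* Remember a process group to signal later. */
static void set_owner_pgrp(int pgrp)
{
	tref_set(&owner, tref_get_by_pid(pgrp, PIDTYPE_PGID));
}

/* Deliver the signal; drop the reference if delivery fails,
 * mirroring what the keyboard driver patch below does. */
static void notify_owner(int sig)
{
	if (kill_tref(owner, sig, 1))
		tref_clear(&owner);
}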
Signed-off-by: Eric W. Biederman <[email protected]>
---
include/linux/sched.h | 3 +++
kernel/signal.c | 31 +++++++++++++++++++++++++++++++
2 files changed, 34 insertions(+), 0 deletions(-)
2d2627206f44ce032e521705efb5f712d440ba13
diff --git a/include/linux/sched.h b/include/linux/sched.h
index 12f3cc5..8cd8075 100644
--- a/include/linux/sched.h
+++ b/include/linux/sched.h
@@ -1109,6 +1109,9 @@ extern int send_sig_info(int, struct sig
extern int send_group_sig_info(int, struct siginfo *, struct task_struct *);
extern int force_sigsegv(int, struct task_struct *);
extern int force_sig_info(int, struct siginfo *, struct task_struct *);
+extern int __kill_tref_info(int sig, struct siginfo *info, struct task_ref *ref);
+extern int kill_tref_info(int sig, struct siginfo *info, struct task_ref *ref);
+extern int kill_tref(struct task_ref *ref, int sig, int priv);
extern int __kill_pg_info(int sig, struct siginfo *info, pid_t pgrp);
extern int kill_pg_info(int, struct siginfo *, pid_t);
extern int kill_proc_info(int, struct siginfo *, pid_t);
diff --git a/kernel/signal.c b/kernel/signal.c
index d3efafd..20a67ae 100644
--- a/kernel/signal.c
+++ b/kernel/signal.c
@@ -1146,6 +1146,32 @@ retry:
return ret;
}
+int __kill_tref_info(int sig, struct siginfo *info, struct task_ref *ref)
+{
+ struct task_struct *p = ref->task;
+ int retval, success;
+
+ success = 0;
+ retval = -ESRCH;
+ do_each_task(p, ref->type) {
+ int err = group_send_sig_info(sig, info, p);
+ success |= !err;
+ retval = err;
+ } while_each_task(p, ref->type);
+ return success ? 0 : retval;
+}
+
+int kill_tref_info(int sig, struct siginfo *info, struct task_ref *ref)
+{
+ int retval;
+
+ read_lock(&tasklist_lock);
+ retval = __kill_tref_info(sig, info, ref);
+ read_unlock(&tasklist_lock);
+
+ return retval;
+}
+
/*
* kill_pg_info() sends a signal to a process group: this is what the tty
* control characters do (^C, ^Z etc)
@@ -1365,6 +1391,11 @@ kill_proc(pid_t pid, int sig, int priv)
return kill_proc_info(sig, __si_special(priv), pid);
}
+int kill_tref(struct task_ref *ref, int sig, int priv)
+{
+ return kill_tref_info(sig, __si_special(priv), ref);
+}
+
/*
* These functions support sending signals using preallocated sigqueue
* structures. This is needed "because realtime applications cannot
--
1.1.5.g3480
This is a classic example of a random kernel subsystem
holding a pid for purposes of signalling it later.
Signed-off-by: Eric W. Biederman <[email protected]>
---
drivers/char/keyboard.c | 8 ++++----
drivers/char/vt_ioctl.c | 5 +++--
2 files changed, 7 insertions(+), 6 deletions(-)
7ed8301463a49ad03f8c9de2bbf8c41a5d9843ea
diff --git a/drivers/char/keyboard.c b/drivers/char/keyboard.c
index 8b603b2..4e1f2e0 100644
--- a/drivers/char/keyboard.c
+++ b/drivers/char/keyboard.c
@@ -109,7 +109,8 @@ struct kbd_struct kbd_table[MAX_NR_CONSO
static struct kbd_struct *kbd = kbd_table;
static struct kbd_struct kbd0;
-int spawnpid, spawnsig;
+TASK_REF(spawnpid);
+int spawnsig;
/*
* Variables exported for vt.c
@@ -567,9 +568,8 @@ static void fn_compose(struct vc_data *v
static void fn_spawn_con(struct vc_data *vc, struct pt_regs *regs)
{
- if (spawnpid)
- if (kill_proc(spawnpid, spawnsig, 1))
- spawnpid = 0;
+ if (kill_tref(spawnpid, spawnsig, 1))
+ tref_clear(&spawnpid);
}
static void fn_SAK(struct vc_data *vc, struct pt_regs *regs)
diff --git a/drivers/char/vt_ioctl.c b/drivers/char/vt_ioctl.c
index 24011e7..f8a527b 100644
--- a/drivers/char/vt_ioctl.c
+++ b/drivers/char/vt_ioctl.c
@@ -646,12 +646,13 @@ int vt_ioctl(struct tty_struct *tty, str
*/
case KDSIGACCEPT:
{
- extern int spawnpid, spawnsig;
+ extern struct task_ref *spawnpid;
+ extern int spawnsig;
if (!perm || !capable(CAP_KILL))
return -EPERM;
if (!valid_signal(arg) || arg < 1 || arg == SIGKILL)
return -EINVAL;
- spawnpid = current->pid;
+ tref_set(&spawnpid, tref_get_by_task(current, PIDTYPE_PID));
spawnsig = arg;
return 0;
}
--
1.1.5.g3480
File handles can be requested to send sigio and sigurg
to processes. This patch modifies the code to track
those processes, making the interface safe from pid
wrap-around issues.
It's not a big deal, but since it is easy to track
the processes we might as well.
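For context, the user-visible interface being made wrap-safe here is the
usual SIGIO ownership setup; an ordinary user-space example (not part of
the patch) looks like:

#define _GNU_SOURCE
#include <fcntl.h>
#include <unistd.h>

/* Ask the kernel to deliver SIGIO for this fd to the calling process.
 * A negative argument to F_SETOWN names a process group instead. */
static int request_sigio(int fd)
{
	int flags;

	if (fcntl(fd, F_SETOWN, getpid()) == -1)
		return -1;
	flags = fcntl(fd, F_GETFL);
	if (flags == -1)
		return -1;
	return fcntl(fd, F_SETFL, flags | O_ASYNC);
}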
Signed-off-by: Eric W. Biederman <[email protected]>
---
fs/fcntl.c | 45 ++++++++++++++++++++++++---------------------
fs/file_table.c | 1 +
include/linux/fs.h | 1 +
net/socket.c | 5 ++++-
4 files changed, 30 insertions(+), 22 deletions(-)
4c0475dcf7c47c1590c5d7dd06c28a457ebabe33
diff --git a/fs/fcntl.c b/fs/fcntl.c
index 5f96786..af1b02a 100644
--- a/fs/fcntl.c
+++ b/fs/fcntl.c
@@ -251,8 +251,17 @@ static int setfl(int fd, struct file * f
static void f_modown(struct file *filp, unsigned long pid,
uid_t uid, uid_t euid, int force)
{
+
+ enum pid_type type;
+ int who = pid;
+ type = PIDTYPE_PID;
+ if (who < 0) {
+ type = PIDTYPE_PGID;
+ who = -who;
+ }
write_lock_irq(&filp->f_owner.lock);
if (force || !filp->f_owner.pid) {
+ tref_set(&filp->f_owner.tref, tref_get_by_pid(who, type));
filp->f_owner.pid = pid;
filp->f_owner.uid = uid;
filp->f_owner.euid = euid;
@@ -317,7 +326,9 @@ static long do_fcntl(int fd, unsigned in
* current syscall conventions, the only way
* to fix this will be in libc.
*/
- err = filp->f_owner.pid;
+ err = 0;
+ if (filp->f_owner.tref->task)
+ err = filp->f_owner.pid;
force_successful_syscall_return();
break;
case F_SETOWN:
@@ -469,6 +480,7 @@ static void send_sigio_to_task(struct ta
void send_sigio(struct fown_struct *fown, int fd, int band)
{
struct task_struct *p;
+ enum pid_type type;
int pid;
read_lock(&fown->lock);
@@ -477,16 +489,11 @@ void send_sigio(struct fown_struct *fown
goto out_unlock_fown;
read_lock(&tasklist_lock);
- if (pid > 0) {
- p = find_task_by_pid(pid);
- if (p) {
- send_sigio_to_task(p, fown, fd, band);
- }
- } else {
- do_each_task_pid(-pid, PIDTYPE_PGID, p) {
- send_sigio_to_task(p, fown, fd, band);
- } while_each_task_pid(-pid, PIDTYPE_PGID, p);
- }
+ type = fown->tref->type;
+ p = fown->tref->task;
+ do_each_task(p, type) {
+ send_sigio_to_task(p, fown, fd, band);
+ } while_each_task(p, type);
read_unlock(&tasklist_lock);
out_unlock_fown:
read_unlock(&fown->lock);
@@ -503,6 +510,7 @@ int send_sigurg(struct fown_struct *fown
{
struct task_struct *p;
int pid, ret = 0;
+ enum pid_type type;
read_lock(&fown->lock);
pid = fown->pid;
@@ -512,16 +520,11 @@ int send_sigurg(struct fown_struct *fown
ret = 1;
read_lock(&tasklist_lock);
- if (pid > 0) {
- p = find_task_by_pid(pid);
- if (p) {
- send_sigurg_to_task(p, fown);
- }
- } else {
- do_each_task_pid(-pid, PIDTYPE_PGID, p) {
- send_sigurg_to_task(p, fown);
- } while_each_task_pid(-pid, PIDTYPE_PGID, p);
- }
+ type = fown->tref->type;
+ p = fown->tref->task;
+ do_each_task(p, type) {
+ send_sigurg_to_task(p, fown);
+ } while_each_task(p, type);
read_unlock(&tasklist_lock);
out_unlock_fown:
read_unlock(&fown->lock);
diff --git a/fs/file_table.c b/fs/file_table.c
index 768b581..fb70a30 100644
--- a/fs/file_table.c
+++ b/fs/file_table.c
@@ -97,6 +97,7 @@ struct file *get_empty_filp(void)
rwlock_init(&f->f_owner.lock);
/* f->f_version: 0 */
INIT_LIST_HEAD(&f->f_u.fu_list);
+ f->f_owner.tref = &init_tref;
return f;
over:
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 84bb449..35449c2 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -587,6 +587,7 @@ extern struct block_device *I_BDEV(struc
struct fown_struct {
rwlock_t lock; /* protects pid, uid, euid fields */
+ struct task_ref *tref; /* Reference to the task/process group */
int pid; /* pid or -pgrp where SIGIO should be sent */
uid_t uid, euid; /* uid/euid of process setting the owner */
void *security;
diff --git a/net/socket.c b/net/socket.c
index b38a263..4294a77 100644
--- a/net/socket.c
+++ b/net/socket.c
@@ -855,7 +855,10 @@ static long sock_ioctl(struct file *file
break;
case FIOGETOWN:
case SIOCGPGRP:
- err = put_user(sock->file->f_owner.pid, (int __user *)argp);
+ pid = 0;
+ if (sock->file->f_owner.tref->task)
+ pid = sock->file->f_owner.pid;
+ err = put_user(pid, (int __user *)argp);
break;
case SIOCGIFBR:
case SIOCSIFBR:
--
1.1.5.g3480
On Jan 29, 2006, at 02:19, Eric W. Biederman wrote:
> In the pid virtualization thread it was noticed that currently in
> the kernel we have a variables that hold pids of processes or
> process groups that we later intend to send signals to.
>
> It was suggest that we instead use pointers to struct task_struct
> and reference count them to get around this problem.
Very nice! I like the way you did this! I didn't have time to go
over it in much detail, but I approve of the general idea. Thanks!
Cheers,
Kyle Moffett
--
Unix was not designed to stop people from doing stupid things,
because that would also stop them from doing clever things.
-- Doug Gwyn
Eric W. Biederman wrote:
> @@ -317,7 +326,9 @@ static long do_fcntl(int fd, unsigned in
> * current syscall conventions, the only way
> * to fix this will be in libc.
> */
> - err = filp->f_owner.pid;
> + err = 0;
> + if (filp->f_owner.tref->task)
> + err = filp->f_owner.pid;
Probably not very important, but why don't you use
filp->f_owner.tref->task->pid? This way you could completely get rid of
the pid field in fown_struct.
> --- a/net/socket.c
> +++ b/net/socket.c
> @@ -855,7 +855,10 @@ static long sock_ioctl(struct file *file
> break;
> case FIOGETOWN:
> case SIOCGPGRP:
> - err = put_user(sock->file->f_owner.pid, (int __user *)argp);
> + pid = 0;
> + if (sock->file->f_owner.tref->task)
> + pid = sock->file->f_owner.pid;
> + err = put_user(pid, (int __user *)argp);
Same here.
-- Suleiman
Eric W. Biederman wrote:
Just a few typos in comments.
> @@ -17,6 +17,8 @@ struct pid
> struct hlist_node pid_chain;
> /* list of pids with the same nr, only one of them is in the hash */
> struct list_head pid_list;
> + /* Does a weak references of this type exsit to the task struct? */
> + struct task_ref *ref;
exsit -> exist
> +/* Note to read a usable value task value from struct task_ref
> + * the tasklist_lock must be held. The atomic property of single
> + * word reads will keep any vaule you read consistent but it doesn't
vaule -> value
> + * protect you from the race of the task exiting on another cpu and
> + * having it's task_struct freed or reused. Holding the tasklist_lock
> + * prevents the task from going away as you derference the task pointer.
derference -> dereference
-- Suleiman
Suleiman Souhlal <[email protected]> writes:
> Eric W. Biederman wrote:
>
>> @@ -317,7 +326,9 @@ static long do_fcntl(int fd, unsigned in
>> * current syscall conventions, the only way
>> * to fix this will be in libc.
>> */
>> - err = filp->f_owner.pid;
>> + err = 0;
>> + if (filp->f_owner.tref->task)
>> + err = filp->f_owner.pid;
>
> Probably not very important, but why don't you use
> filp->f_owner.tref->task->pid? This way you could completely get rid of the pid
> field in fown_struct.
Two reasons.
One because task->pid is not the proper value for a thread group.
And because task might be NULL.
Basically using filp->f_owner.tref->task->pid as my source is a much
more complicated expression.
Eric
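To spell out Eric's point, open-coding Suleiman's suggestion would have
to look something like the fragment below (an illustration only, not
proposed code), and it would still return the wrong value for a
process-group owner:

int pid = 0;

read_lock(&tasklist_lock);
if (filp->f_owner.tref->task)
	pid = filp->f_owner.tref->task->pid;
read_unlock(&tasklist_lock);
/* ...and for a PIDTYPE_PGID reference F_GETOWN is supposed to
 * return -pgrp, not the pid of one member of the group. */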
On Sun, Jan 29, 2006 at 12:22:34AM -0700, Eric W. Biederman wrote:
> +struct task_ref
> +{
> + atomic_t count;
Please use a struct kref here, instead of your own atomic_t, as that's
why it is in the kernel :)
> + enum pid_type type;
> + struct task_struct *task;
> +};
thanks,
greg k-h
Greg KH <[email protected]> writes:
> On Sun, Jan 29, 2006 at 12:22:34AM -0700, Eric W. Biederman wrote:
>> +struct task_ref
>> +{
>> + atomic_t count;
>
> Please use a struct kref here, instead of your own atomic_t, as that's
> why it is in the kernel :)
>
>> + enum pid_type type;
>> + struct task_struct *task;
>> +};
>
> thanks,
I would rather not. Whenever I look at struct kref it seems to be an over
abstraction, and as such I find it confusing to work with. I know
whenever I look at the sysfs code I have to actively remind myself
that the kref in the structure is not a pointer to a kref.
What does the kref abstraction buy? How does it simplify things?
We already have equivalent functions in atomic_t on which it is built.
It is a funny piece of code to come up in this thread where the
argument has been don't over abstract.... (with reference to pids).
If I can see where struct kref buys something or helps in some way
I will be willing to consider it.
Eric
On Sun, Jan 29, 2006 at 02:58:51PM -0700, Eric W. Biederman wrote:
> Greg KH <[email protected]> writes:
>
> > On Sun, Jan 29, 2006 at 12:22:34AM -0700, Eric W. Biederman wrote:
> >> +struct task_ref
> >> +{
> >> + atomic_t count;
> >
> > Please use a struct kref here, instead of your own atomic_t, as that's
> > why it is in the kernel :)
> >
> >> + enum pid_type type;
> >> + struct task_struct *task;
> >> +};
> >
> > thanks,
>
> I would rather not. Whenever I look at struct kref it seems to be an over
> abstraction, and as such I find it confusing to work with. I know
> whenever I look at the sysfs code I have to actively remind myself
> that the kref in the structure is not a pointer to a kref.
>
> What does the kref abstraction buy? How does it simplify things?
> We already have equivalent functions in atomic_t on which it is built.
It ensures that you get the logic of the reference counting correctly.
It forces you to do the logic of the get and put and release properly.
To roughly quote Andrew Morton, "When I see a kref, I know it is used
properly, otherwise I am forced to read through the code to see if the
author got the reference counting logic correct."
It costs _nothing_ to use it, and lets everyone know you got the logic
correct.
So don't feel it is an "abstraction", it's a helper for both the author
(who doesn't have to get the atomic_t calls correct), and for everyone
else who has to read the code.
thanks,
greg k-h
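For concreteness, converting struct task_ref to use a kref would look
roughly like this (a sketch of Greg's suggestion, not a patch anyone
posted; the creation path in tref_get_by_task() would then kref_init() a
new ref and hand that first reference straight back instead of starting
the count at zero):

#include <linux/kref.h>
#include <linux/sched.h>
#include <linux/slab.h>

struct task_ref {
	struct kref kref;	/* replaces the bare atomic_t count */
	enum pid_type type;
	struct task_struct *task;
};

/* The release callback kref_put() needs; this is the additional
 * function Eric refers to below. */
static void task_ref_release(struct kref *kref)
{
	struct task_ref *ref = container_of(kref, struct task_ref, kref);

	write_lock(&tasklist_lock);
	if (ref->task)
		ref->task->pids[ref->type].ref = &init_tref;
	write_unlock(&tasklist_lock);
	kfree(ref);
}

void tref_put(struct task_ref *ref)
{
	if (ref && ref != &init_tref)
		kref_put(&ref->kref, task_ref_release);
}

struct task_ref *tref_get(struct task_ref *ref)
{
	if (!ref)
		ref = &init_tref;
	if (ref != &init_tref)
		kref_get(&ref->kref);
	return ref;
}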
--- a/lib/kref.c 2006-01-30 06:11:04.000000000 +0100
+++ b/lib/kref.c 2006-01-30 06:13:32.000000000 +0100
@@ -52,7 +52,12 @@
WARN_ON(release == NULL);
WARN_ON(release == (void (*)(struct kref *))kfree);
- if (atomic_dec_and_test(&kref->refcount)) {
+ /*
+ * if current count is one, we are the last user and can release object
+ * right now, avoiding an atomic operation on 'refcount'
+ */
+ if ((atomic_read(&kref->refcount) == 1) ||
+ (atomic_dec_and_test(&kref->refcount))) {
release(kref);
return 1;
}
On Jan 30, 2006, at 00:19, Eric Dumazet wrote:
> - if (atomic_dec_and_test(&kref->refcount)) {
> + /*
> + * if current count is one, we are the last user and can release
> object
> + * right now, avoiding an atomic operation on 'refcount'
> + */
> + if ((atomic_read(&kref->refcount) == 1) ||
Uhh, I think you got this test reversed. Didn't you mean != 1?
Otherwise you only do the dec_and_test when the refcount is one,
which means that you leak everything kref-ed.
> + (atomic_dec_and_test(&kref->refcount))) {
> release(kref);
> return 1;
> }
Cheers,
Kyle Moffett
--
There are two ways of constructing a software design. One way is to
make it so simple that there are obviously no deficiencies. And the
other way is to make it so complicated that there are no obvious
deficiencies. The first method is far more difficult.
-- C.A.R. Hoare
Kyle Moffett wrote:
> On Jan 30, 2006, at 00:19, Eric Dumazet wrote:
>> - if (atomic_dec_and_test(&kref->refcount)) {
>> + /*
>> + * if current count is one, we are the last user and can release
>> object
>> + * right now, avoiding an atomic operation on 'refcount'
>> + */
>> + if ((atomic_read(&kref->refcount) == 1) ||
>
> Uhh, I think you got this test reversed. Didn't you mean != 1?
> Otherwise you only do the dec_and_test when the refcount is one, which
> means that you leak everything kref-ed.
>
Not at all :)
Your mail is just another proof why kref is a good abstraction :)
If you are the last user of a kref (refcount == 1), then
you are sure that nobody else but you is using the object, and as we are
doing kref_put() on this object, the atomic_dec_and_test *will* drop the
count to zero and you are going to release() the object.
The release() function is not going to look at kref_count again, just free the
resources and the object.
Maybe a change in the documentation is necessary to explain this point
(release() can be called while the apparent kref count is 1).
Or kref_put() could do this:

	if (atomic_read(&kref->refcount) == 1) {
		atomic_set(&kref->refcount, 0);
		release(kref);
		return 1;
	}
	if (atomic_dec_and_test(&kref->refcount)) {
		release(kref);
		return 1;
	}
Eric
On Jan 30, 2006, at 00:46, Eric Dumazet wrote:
> Kyle Moffett wrote:
>> On Jan 30, 2006, at 00:19, Eric Dumazet wrote:
>>> - if (atomic_dec_and_test(&kref->refcount)) {
>>> + /*
>>> + * if current count is one, we are the last user and can
>>> release object
>>> + * right now, avoiding an atomic operation on 'refcount'
>>> + */
>>> + if ((atomic_read(&kref->refcount) == 1) ||
>> Uhh, I think you got this test reversed. Didn't you mean != 1?
>> Otherwise you only do the dec_and_test when the refcount is one,
>> which means that you leak everything kref-ed.
>
> Not at all :)
>
> Your mail is just another proof why kref is a good abstraction :)
>
> If you are the last user of a kref (refcount == 1), then
> you are sure that nobody else but you is using the object, and as
> we are doing kref_put() on this object, the atomic_dec_and_test *will*
> drop the count to zero and you are going to release() the object.
>
> The release() function is not going to look at kref_count again,
> just free the resources and the object.
OHHH, I see where I got confused. The indentation was bad, dunno if
it was my end or yours, so I misread it as this:
if (atomic_read(...) == 1) {
atomic_dec_and_test(...);
...
}
instead of this:
if (atomic_read(...) == 1 ||
atomic_dec_and_test(...)) {
...
}
This should teach me not to reply this late at night. Sorry for the
confusion.
Cheers,
Kyle Moffett
--
They _will_ find opposing experts to say it isn't, if you push hard
enough the wrong way. Idiots with a PhD aren't hard to buy.
-- Rob Landley
On Ne 29-01-06 00:33:27, Eric W. Biederman wrote:
>
> This is a classic example of a random kernel subsystem
> holding a pid for purposes of signalling it later.
>
> Signed-off-by: Eric W. Biederman <[email protected]>
>
>
> ---
>
> drivers/char/keyboard.c | 8 ++++----
> drivers/char/vt_ioctl.c | 5 +++--
> 2 files changed, 7 insertions(+), 6 deletions(-)
>
> 7ed8301463a49ad03f8c9de2bbf8c41a5d9843ea
> diff --git a/drivers/char/keyboard.c b/drivers/char/keyboard.c
> index 8b603b2..4e1f2e0 100644
> --- a/drivers/char/keyboard.c
> +++ b/drivers/char/keyboard.c
> @@ -109,7 +109,8 @@ struct kbd_struct kbd_table[MAX_NR_CONSO
> static struct kbd_struct *kbd = kbd_table;
> static struct kbd_struct kbd0;
>
> -int spawnpid, spawnsig;
> +TASK_REF(spawnpid);
Could we get some nicer syntax of declaration? This does not look like
declaration, and looks ugly to my eyes.
Pavel
--
Thanks, Sharp!
On Mon, Jan 30, 2006 at 06:19:35AM +0100, Eric Dumazet wrote:
> Example of improvement in kref_put() :
>
> [PATCH] kref : Avoid an atomic operation in kref_put() when the last
> reference is dropped. On most platforms, atomic_read() is a plain read of
> the counter and involves no atomic at all.
No, we want to decrement and test at the same time, to protect against
any race where someone is incrementing right when we are dropping the
last reference.
So, thanks, but I'm not going to apply this.
greg k-h
Greg KH wrote:
> On Mon, Jan 30, 2006 at 06:19:35AM +0100, Eric Dumazet wrote:
>> Example of improvement in kref_put() :
>>
>> [PATCH] kref : Avoid an atomic operation in kref_put() when the last
>> reference is dropped. On most platforms, atomic_read() is a plan read of
>> the counter and involves no atomic at all.
>
> No, we wat to decrement and test at the same time, to protect against
> any race where someone is incrementing right when we are dropping the
> last reference.
Sorry ? Me confused !
What I am saying is :
If a CPUA is doing a kref_put() and refcount == 1, then another CPU *cannot*
change the refcount, because only one CPU has a valid reference on the object.
(CPUA )
Proof :
CPUB cannot kref_put() as well, because only CPUA owns a refcount. (If two
CPUs could kref_put, then the counter would become -1 !)
If CPUB is going to do a kref_get() : Not allowed by kref specs (Rule 3)
(It would mean that both CPUA/B have a reference)
'The kref_get() does not require a lock,
since we already have a valid pointer that we own a refcount for.'
So CPUA and CPUB cannot both own a refcount on an object having refcount == 1.
If you allow this, then the current implementation is buggy as well.
kref->refcount == 1
CPUA :
if (atomic_dec_and_test(&kref->refcount)) {
returns TRUE condition
[refcount is now 0]
CPUB :
atomic_inc(&kref->refcount)
[refcount is now 1]
CPUA :
release(kref); (object freed)
CPUB :
doing some work on object freed by CPUA. *kaboom*
CPUB : kref_put()
refcount back to 0 -> object freed twice *kaboom*
>
> So, thanks, but I'm not going to apply this.
>
> greg k-h
>
Me confused.
Yes, kref abstraction is good :)
Greg KH <[email protected]> writes:
> On Sun, Jan 29, 2006 at 02:58:51PM -0700, Eric W. Biederman wrote:
>>
>> What does the kref abstraction buy? How does it simplify things?
>> We already have equivalent functions in atomic_t on which it is built.
>
> It ensures that you get the logic of the reference counting correctly.
> It forces you to do the logic of the get and put and release properly.
>
> To roughly quote Andrew Morton, "When I see a kref, I know it is used
> properly, otherwise I am forced to read through the code to see if the
> author got the reference counting logic correct."
>
> It costs _nothing_ to use it, and lets everyone know you got the logic
> correct.
>
> So don't feel it is an "abstraction", it's a helper for both the author
> (who doesn't have to get the atomic_t calls correct), and for everyone
> else who has to read the code.
So looking at kref and looking at my code I have a small amount of feedback.
It looks like using kref would increase the size of my code as I would
have to add an additional function. Further I don't see that it would
simplify anything in my code.
The extra debugging checks in krefs are nice, but not something I
am lusting over.
As for code recognition I don't see how recognizing the atomic_t idiom
versus the kref idiom is much different. But I haven't had this issue
come up in a code review so I have no practical experience there.
The biggest issue I can find with kref is the name. kref is not a
reference it is a count. A reference count if you want to be long
about it. Just skimming code using it looks like kref is a pointer to
a generic object that you are getting and putting.
It also doesn't help that current krefs are used in little enough code
that I still find it jarring to see one. One more abstraction to
read. Whereas atomic_t are everywhere, so I am familiar with them.
So those are all of the nits I can pick. :) I don't know; kref seems to be
a perfectly usable piece of code but not one that seems to help my
code. Unless it becomes a mandate from the code correctness police.
Eric
Pavel Machek <[email protected]> writes:
> On Ne 29-01-06 00:33:27, Eric W. Biederman wrote:
>>
>> This is a classic example of a random kernel subsystem
>> holding a pid for purposes of signalling it later.
>>
>> Signed-off-by: Eric W. Biederman <[email protected]>
>>
>>
>> ---
>>
>> drivers/char/keyboard.c | 8 ++++----
>> drivers/char/vt_ioctl.c | 5 +++--
>> 2 files changed, 7 insertions(+), 6 deletions(-)
>>
>> 7ed8301463a49ad03f8c9de2bbf8c41a5d9843ea
>> diff --git a/drivers/char/keyboard.c b/drivers/char/keyboard.c
>> index 8b603b2..4e1f2e0 100644
>> --- a/drivers/char/keyboard.c
>> +++ b/drivers/char/keyboard.c
>> @@ -109,7 +109,8 @@ struct kbd_struct kbd_table[MAX_NR_CONSO
>> static struct kbd_struct *kbd = kbd_table;
>> static struct kbd_struct kbd0;
>>
>> -int spawnpid, spawnsig;
>> +TASK_REF(spawnpid);
>
> Could we get some nicer syntax of declaration? This does not look like
> declaration, and looks ugly to my eyes.
Any suggestions?
Does
struct task_ref *spawnpid = &init_tref;
look better? I modeled it after how this is done for lists.
Eric
Eric Dumazet <[email protected]> writes:
> Greg KH wrote:
>> On Mon, Jan 30, 2006 at 06:19:35AM +0100, Eric Dumazet wrote:
>>> Example of improvement in kref_put() :
>>>
>>> [PATCH] kref : Avoid an atomic operation in kref_put() when the last
>>> reference is dropped. On most platforms, atomic_read() is a plain read of the
>>> counter and involves no atomic at all.
>> No, we want to decrement and test at the same time, to protect against
>> any race where someone is incrementing right when we are dropping the
>> last reference.
>
> Sorry ? Me confused !
Largely I think you have the right of it, that the optimization is
correct. My biggest complaint is that the common case is going to be
several references to the data structure. Releasing the references
will always be slow. To do the read you need to get the value into
the cache line.
So it looks to me like you are optimizing the wrong case and
bloating the icache with unnecessary code.
Eric
Hi!
> >> 7ed8301463a49ad03f8c9de2bbf8c41a5d9843ea
> >> diff --git a/drivers/char/keyboard.c b/drivers/char/keyboard.c
> >> index 8b603b2..4e1f2e0 100644
> >> --- a/drivers/char/keyboard.c
> >> +++ b/drivers/char/keyboard.c
> >> @@ -109,7 +109,8 @@ struct kbd_struct kbd_table[MAX_NR_CONSO
> >> static struct kbd_struct *kbd = kbd_table;
> >> static struct kbd_struct kbd0;
> >>
> >> -int spawnpid, spawnsig;
> >> +TASK_REF(spawnpid);
> >
> > Could we get some nicer syntax of declaration? This does not look like
> > declaration, and looks ugly to my eyes.
>
> Any suggestions?
>
> Does
> struct task_ref *spawnpid = &init_tref;
Looks much better...
Pavel
--
Thanks, Sharp!
Pavel Machek <[email protected]> writes:
> Hi!
>
>> >> 7ed8301463a49ad03f8c9de2bbf8c41a5d9843ea
>> >> diff --git a/drivers/char/keyboard.c b/drivers/char/keyboard.c
>> >> index 8b603b2..4e1f2e0 100644
>> >> --- a/drivers/char/keyboard.c
>> >> +++ b/drivers/char/keyboard.c
>> >> @@ -109,7 +109,8 @@ struct kbd_struct kbd_table[MAX_NR_CONSO
>> >> static struct kbd_struct *kbd = kbd_table;
>> >> static struct kbd_struct kbd0;
>> >>
>> >> -int spawnpid, spawnsig;
>> >> +TASK_REF(spawnpid);
>> >
>> > Could we get some nicer syntax of declaration? This does not look like
>> > declaration, and looks ugly to my eyes.
>>
>> Any suggestions?
>>
>> Does
>> struct task_ref *spawnpid = &init_tref;
>
> Looks much better...
The challenge for me is that I make assumptions that a struct task_ref *
always points to something valid.
Maybe it is sufficient to spell that out in comments. Having a helper
function though feels like it encourages it even more. I guess
the code oopsing if it is a NULL pointer helps too.
Eric
Eric W. Biederman wrote:
> Eric Dumazet <[email protected]> writes:
>
>> Greg KH wrote:
>>> On Mon, Jan 30, 2006 at 06:19:35AM +0100, Eric Dumazet wrote:
>>>> Example of improvement in kref_put() :
>>>>
>>>> [PATCH] kref : Avoid an atomic operation in kref_put() when the last
>>>> reference is dropped. On most platforms, atomic_read() is a plain read of the
>>>> counter and involves no atomic at all.
>>> No, we want to decrement and test at the same time, to protect against
>>> any race where someone is incrementing right when we are dropping the
>>> last reference.
>> Sorry ? Me confused !
>
> Largely I think you have the right of it, that the optimization is
> correct. My biggest complaint is that the common case is going to be
> several references to the data structure. Releasing the references
> will always be slow. To do the read you need to get the value into
> the cache line.
>
> So it looks to me like you are optimizing the wrong case and
> bloating the icache with unnecessary code.
This function is not inlined.
Adding a test and a branch is a matter of 7 bytes.
'Bloating the icache' is a little bit off :)
Avoiding an atomic is important. This is already done elsewhere in the kernel,
in an inlined function with *many* call sites :
(See kfree_skb() in include/linux/skbuff.h )
/*
* If users == 1, we are the only owner and are can avoid redundant
* atomic change.
*/
/**
* kfree_skb - free an sk_buff
* @skb: buffer to free
*
* Drop a reference to the buffer and free it if the usage count has
* hit zero.
*/
static inline void kfree_skb(struct sk_buff *skb)
{
if (likely(atomic_read(&skb->users) == 1))
smp_rmb();
else if (likely(!atomic_dec_and_test(&skb->users)))
return;
__kfree_skb(skb);
}
This is a valid optimization : an atomic_dec_and_test() is very expensive.
Eric
Eric Dumazet <[email protected]> writes:
> This function is not inlined.
>
> Adding a test and a branch is a matter of 7 bytes.
>
> 'Bloating the icache' is a little bit off :)
The size all depends on your architecture. But I agree it
is not much of a size increase in general.
> Avoiding an atomic is important. This is already done elsewhere in the kernel,
> in an inlined function with *many* call sites :
>
> (See kfree_skb() in include/linux/skbuff.h )
>
> /*
> * If users == 1, we are the only owner and are can avoid redundant
> * atomic change.
> */
>
> /**
> * kfree_skb - free an sk_buff
> * @skb: buffer to free
> *
> * Drop a reference to the buffer and free it if the usage count has
> * hit zero.
> */
> static inline void kfree_skb(struct sk_buff *skb)
> {
> if (likely(atomic_read(&skb->users) == 1))
> smp_rmb();
> else if (likely(!atomic_dec_and_test(&skb->users)))
> return;
> __kfree_skb(skb);
> }
>
>
> This is a valid optimization : an atomic_dec_and_test() is very expensive.
It is a valid optimization if you will normally have only one user
like a skb. For other structures that frequently have many users I'm
not at all certain it is a useful optimization.
Eric
On Mon, Jan 30, 2006 at 01:13:12PM -0700, Eric W. Biederman wrote:
> Greg KH <[email protected]> writes:
>
> > On Sun, Jan 29, 2006 at 02:58:51PM -0700, Eric W. Biederman wrote:
> >>
> >> What does the kref abstraction buy? How does it simplify things?
> >> We already have equivalent functions in atomic_t on which it is built.
> >
> > It ensures that you get the logic of the reference counting correctly.
> > It forces you to do the logic of the get and put and release properly.
> >
> > To roughly quote Andrew Morton, "When I see a kref, I know it is used
> > properly, otherwise I am forced to read through the code to see if the
> > author got the reference counting logic correct."
> >
> > It costs _nothing_ to use it, and lets everyone know you got the logic
> > correct.
> >
> > So don't feel it is an "abstraction", it's a helper for both the author
> > (who doesn't have to get the atomic_t calls correct), and for everyone
> > else who has to read the code.
>
> So looking at kref and looking at my code I have a small amount of feedback.
> It looks like using kref would increase the size of my code as I would
> have to add an additional function. Further I don't see that it would
> simplify anything in my code.
What function would you have to add? A release function? You should
have that logic somewhere already, so net gain should be the same.
> The extra debugging checks in krefs are nice, but not something I
> am lusting over.
Heh, but they do catch a lot of improper usages.
> As for code recognition I don't see how recognizing the atomic_t idiom
> versus the kref idiom is much different. But I haven't had this issue
> come up in a code review so I have no practical experience there.
It's not a different "idiom", but rather, a way to implement the
reference counting logic without having to code it all by hand, and
ensure that you get it correct.
> The biggest issue I can find with kref is the name. kref is not a
> reference it is a count. A reference count if you want to be long
> about it. Just skimming code using it looks like kref is a pointer to
> a generic object that you are getting and putting.
Yes. You are "getting" and "putting" your structure, without having to
handle the reference count. I don't see the issue here...
> It also doesn't help that current krefs are used in little enough code
> that I still find it jarring to see one.
They are used all the time, and there's not that many atomic_t usages in
the kernel that do reference counting that have not been already
converted to use kref (the vfs notwithstanding, for other reasons.)
> One more abstraction to read. Whereas atomic_t are everywhere, so I
> am familiar with them.
They should not be "everywhere" as reference counts, except for the vm
and mm cores.
> So those are all of the nits I can pick. :) I don't kref seems to be
> a perfectly usable piece of code but not one that seems to help my
> code. Unless it becomes a mandate from the code correctness police.
As you are adding a new reference count, it is highly encouraged that
you not reimplement the same logic, but rather use the functions already
provided in the kernel to do that work for you. That way you don't have
to write the extra code, and we don't have to check the logic for you.
thanks,
greg k-h
Greg KH <[email protected]> writes:
> On Mon, Jan 30, 2006 at 01:13:12PM -0700, Eric W. Biederman wrote:
>
> What function would you have to add? A release function? You should
> have that logic somewhere already, so net gain should be the same.
Yes, inline in my if statement, all 3 lines of it. Adding a function
makes the code less clear.
>> The extra debugging checks in krefs are nice, but not something I
>> am lusting over.
>
> Heh, but they do catch a lot of improper usages.
Agreed.
>> As for code recognition I don't see how recognizing the atomic_t idiom
>> versus the kref idiom is much different. But I haven't had this issue
>> come up in a code review so I have no practical experience there.
>
> It's not a different "idiom", but rather, a way to implement the
> reference counting logic without having to code it all by hand, and
> ensure that you get it correct.
Yes a way to implement the reference counting logic that requires more
lines of code to implement and obscures my code in this instance.
> They are used all the time, and there's not that many atomic_t usages in
> the kernel that do reference counting that have not been already
> converted to use kref (the vfs notwithstanding, for other reasons.)
Well either that is all code that is still to be merged in the
mainline or I am looking with a very different filter than you are.
A quick grep for atomic_t in the kernel include files (filtering
out the asm headers where it is defined) got several screens full
of hits. Things like the network stack and shared process resources,
all of the core bits of the kernel, are places I have worked lately
and where atomic_t is used. A similar grep for kref barely returned
a screen full of hits.
>> So those are all of the nits I can pick. :) I don't know; kref seems to be
>> a perfectly usable piece of code but not one that seems to help my
>> code. Unless it becomes a mandate from the code correctness police.
>
> As you are adding a new reference count, it is highly encouraged that
> you not reimplement the same logic, but rather use the functions already
> provided in the kernel to do that work for you. That way you don't have
> to write the extra code, and we don't have to check the logic for you.
If there were something to implement I would agree with you. In drivers
out at the fringes of the kernel, where looking at some of the code
submissions you wonder if the implementors know C, I can totally understand
anything that makes things less error prone. But that isn't what this code is.
Given how short my code is and what the effects are I am actually
disappointed that the review only got as far as noticing I was using
an atomic_t. Oh well. Thanks for getting that far.
Eric
At the moment I am going to say thanks for the comments.
So far no one has said hey this is what I have been looking for
and pid wrap around in the kernel is a very bad thing, thanks
for solving my problem.
Currently this feels like overkill so I am going to shelve this
approach for now.
Eric
Quoting Eric W. Biederman ([email protected]):
>
> At the moment I am going to say thanks for the comments.
>
> So far no one has said hey this is what I have been looking for
> and pid wrap around in the kernel is a very bad thing, thanks
> for solving my problem.
>
> Currently this feels like overkill so I am going to shelve this
> approach for now.
Ok, then let me jump in belatedly and say I think this is a good
thing, and it will solve a problem we all (involved in this thread)
will face shortly.
And kudos for (a) a very nice, clean patchset, and (b) solving the
fowner/signal problem!
I think this would be a very good thing to go in.
thanks,
-serge