2012-02-15 14:44:00

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

> +/* The caller must have pinned the task */
> +static struct file *
> +get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> +{
> + struct fdtable *fdt;
> + struct file *file;
> +
> + spin_lock(&task->files->file_lock);

task->files can be NULL, we can race with exit_files().

> + fdt = files_fdtable(task->files);
> + if (idx < fdt->max_fds)
> + file = fdt->fd[idx];

You can probably rely on rcu instead of ->file_lock, but this is minor.

> +SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
> + unsigned long, idx1, unsigned long, idx2)
> +{
> + struct task_struct *task1, *task2;
> + int ret;
> +
> + rcu_read_lock();
> +
> + /*
> + * Tasks are looked up in caller's PID namespace only.
> + */
> + task1 = find_task_by_vpid(pid1);
> + task2 = find_task_by_vpid(pid2);
> + if (!task1 || !task2)
> + goto err_no_task;
> +
> + get_task_struct(task1);
> + get_task_struct(task2);
> +
> + rcu_read_unlock();
> +
> + /*
> + * One should have enough rights to inspect task details.
> + */
> + if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> + !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> + ret = -EACCES;

Well, probably this is fine... but may be you can add a comment.
The task can change its credentials right after ptrace_may_access()
succeeds. This _looks_ wrong, perhaps it makes sense to add the
"we do not care" note.

Oleg.


2012-02-15 15:10:16

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
>
> task->files can be NULL, we can race with exit_files().

So I need to call get_files_struct. Thanks Oleg!

>
> > + fdt = files_fdtable(task->files);
> > + if (idx < fdt->max_fds)
> > + file = fdt->fd[idx];
>
> You can probably rely on rcu instead of ->file_lock, but this is minor.
>

I think so. I'll be updating the patch anyway (on top of current
one) so I'll address this comment too. Thanks!

> > +
> > + /*
> > + * One should have enough rights to inspect task details.
> > + */
> > + if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> > + !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> > + ret = -EACCES;
>
> Well, probably this is fine... but may be you can add a comment.
> The task can change its credentials right after ptrace_may_access()
> succeeds. This _looks_ wrong, perhaps it makes sense to add the
> "we do not care" note.
>

Wait, how it's differ from other ptrace_may_access calls all over
the kernel? I suppose I'm missing something obvious?

Cyrill

2012-02-15 15:46:07

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
> >
> > > +
> > > + /*
> > > + * One should have enough rights to inspect task details.
> > > + */
> > > + if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
> > > + !ptrace_may_access(task2, PTRACE_MODE_READ)) {
> > > + ret = -EACCES;
> >
> > Well, probably this is fine... but may be you can add a comment.
> > The task can change its credentials right after ptrace_may_access()
> > succeeds. This _looks_ wrong, perhaps it makes sense to add the
> > "we do not care" note.
> >
>
> Wait, how it's differ from other ptrace_may_access calls all over
> the kernel? I suppose I'm missing something obvious?

For example? Say, mm_access() is fine because it returns ->mm
which we are going to play with.

But map_files_d_revalidate/proc_map_files_get_link looks wrong,
there are obviously racy and should use mm_access().

Probably something else is wrong too.

Once again, I am not saying that this code really has the security
problems. I simply do not know. But it looks wrong without the
comment. I do not really understand why do we need ptrace_may_access(),
but whatever reason we have how we can trust it? Say, when KCMP_VM
checks ->mm, all we know is that PTRACE_MODE_READ succeed in the
past. This looks confusing, imho.

Oleg.

2012-02-15 16:13:37

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 04:38:16PM +0100, Oleg Nesterov wrote:
...
> >
> > Wait, how it's differ from other ptrace_may_access calls all over
> > the kernel? I suppose I'm missing something obvious?
>
> For example? Say, mm_access() is fine because it returns ->mm
> which we are going to play with.

So, say we have

environ_read
mm_for_maps
mm_access
success, and first reader continue here

then say task change own credentials and all
this sequence fails because access is not granted
anymore (say for second reader), but first reader
still able to continue reading because access was
graned earlier.

So I don't understand how it's different from what
is provided in this patch. What I'm missing?

> Once again, I am not saying that this code really has the security
> problems. I simply do not know. But it looks wrong without the
> comment. I do not really understand why do we need ptrace_may_access(),
> but whatever reason we have how we can trust it? Say, when KCMP_VM
> checks ->mm, all we know is that PTRACE_MODE_READ succeed in the
> past. This looks confusing, imho.

Adding the comment is not a problem. The problem is that I
dont understand if there error in patch or not, can we stick
with ptrace_may_access or need something different here?
The idea was exactly like -- if you have enough rights to
proceed ptrace_may_access then syscall should continue
executing and return comparision result.

Cyrill

2012-02-15 16:15:48

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

Not a comment, but the question. I am just curious...

> +/*
> + * We don't expose real in-memory order of objects for security
> + * reasons, still the comparison results should be suitable for
> + * sorting. Thus, we obfuscate kernel pointers values and compare
> + * the production instead.
> + */
> +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> +
> +static long kptr_obfuscate(long v, int type)
> +{
> + return (v ^ cookies[type][0]) * cookies[type][1];
> +}

OK, but why do we need this per type? Just to add more obfuscation
or there is another reason?

> +static __init int kcmp_cookies_init(void)
> +{
> + int i;
> +
> + get_random_bytes(cookies, sizeof(cookies));
> +
> + for (i = 0; i < KCMP_TYPES; i++)
> + cookies[i][1] |= (~(~0UL >> 1) | 1);

I am puzzled, help ;) this is equal to

cookies[i][1] |= -LONG_MAX;
or
cookies[i][1] |= (LONG_MIN | 1);

for what? why do we want to set these 2 bits (MSB and LSB) ?

Oleg.

2012-02-15 16:27:59

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 05:06:52PM +0100, Oleg Nesterov wrote:
> Not a comment, but the question. I am just curious...
>
> > +/*
> > + * We don't expose real in-memory order of objects for security
> > + * reasons, still the comparison results should be suitable for
> > + * sorting. Thus, we obfuscate kernel pointers values and compare
> > + * the production instead.
> > + */
> > +static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
> > +
> > +static long kptr_obfuscate(long v, int type)
> > +{
> > + return (v ^ cookies[type][0]) * cookies[type][1];
> > +}
>
> OK, but why do we need this per type? Just to add more obfuscation
> or there is another reason?

Just to add more obfuscation.

>
> > +static __init int kcmp_cookies_init(void)
> > +{
> > + int i;
> > +
> > + get_random_bytes(cookies, sizeof(cookies));
> > +
> > + for (i = 0; i < KCMP_TYPES; i++)
> > + cookies[i][1] |= (~(~0UL >> 1) | 1);
>
> I am puzzled, help ;) this is equal to
>
> cookies[i][1] |= -LONG_MAX;
> or
> cookies[i][1] |= (LONG_MIN | 1);
>
> for what? why do we want to set these 2 bits (MSB and LSB) ?

Letme quote hpa@ here :)

| This code is wrong. You will have a zero cookie, legitimately, once in
| 2^32 or 2^64 attempts, depending on the bitness.
|
| The other thing is that for the multiplicative cookie you should OR in
| the value (~(~0UL >> 1) | 1) in order to make sure that the value is (a)
| large and (b) odd.

Cyrill

2012-02-15 16:30:21

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 04:38:16PM +0100, Oleg Nesterov wrote:
> ...
> > >
> > > Wait, how it's differ from other ptrace_may_access calls all over
> > > the kernel? I suppose I'm missing something obvious?
> >
> > For example? Say, mm_access() is fine because it returns ->mm
> > which we are going to play with.
>
> So, say we have
>
> environ_read
> mm_for_maps
> mm_access
> success, and first reader continue here
>
> then say task change own credentials and all
> this sequence fails because access is not granted
> anymore (say for second reader), but first reader
> still able to continue reading because access was
> graned earlier.

Can't understand... Yes, environ_read() can succeed for the
first reader, and then later it can fail for the same/another
reader. And?

> So I don't understand how it's different from what
> is provided in this patch. What I'm missing?

environ_read() does

mm = mm_access(task);
if (mm)
do_something(mm);

even if it races with, say, execve(setuid_app) we can't read the
new ->mm.

while your code (very roughly) does something like

mm = mm_access(task);

if (mm)
do_something(task->mm);

while it is quite possible that mm != task->mm.

> > Once again, I am not saying that this code really has the security
> > problems. I simply do not know. But it looks wrong without the
> > comment. I do not really understand why do we need ptrace_may_access(),
> > but whatever reason we have how we can trust it? Say, when KCMP_VM
> > checks ->mm, all we know is that PTRACE_MODE_READ succeed in the
> > past. This looks confusing, imho.
>
> Adding the comment is not a problem. The problem is that I
> dont understand if there error in patch or not, can we stick
> with ptrace_may_access or need something different here?
> The idea was exactly like -- if you have enough rights to
> proceed ptrace_may_access then syscall should continue
> executing and return comparision result.

My only point is: this check is obviously racy, and thus it looks
confusing. Whether this is fine or not, I do not know. Personally
I see no reason for ptrace_may_access(), but I am not security
expert.

Oleg.

2012-02-15 17:53:26

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 05:22:22PM +0100, Oleg Nesterov wrote:
>
> > So I don't understand how it's different from what
> > is provided in this patch. What I'm missing?
>
> environ_read() does
>
> mm = mm_access(task);
> if (mm)
> do_something(mm);
>
> even if it races with, say, execve(setuid_app) we can't read the
> new ->mm.

Wait, I'm confused

process 1 (reader) process 2 ("task" itself)
mm = mm_access(task);
task changes own credentials
so reader can't access on next
read if it would try, but since
access already granted... it
continues do_something(mm)
if (mm)
do_something(mm);

So in the patch I tried the same, once access is granted it
belongs to a caller.

>
> while your code (very roughly) does something like
>
> mm = mm_access(task);
> if (mm)
> do_something(task->mm);
>
> while it is quite possible that mm != task->mm.

Oleg, could you please explain me where it happens
that task->mm (I've got access to) will be changed
to some new -mm while I'm inspecting it.

If permission changed while the caller inside syscall,
it's the same situation as with mm_access above. No?

>
> My only point is: this check is obviously racy, and thus it looks
> confusing. Whether this is fine or not, I do not know. Personally
> I see no reason for ptrace_may_access(), but I am not security
> expert.

The idea was -- non-privilege caller should not have access
to privileged tasks.

Cyrill

2012-02-15 18:32:10

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 07:10:08PM +0400, Cyrill Gorcunov wrote:
> On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
> >
> > task->files can be NULL, we can race with exit_files().
>
> So I need to call get_files_struct. Thanks Oleg!
>

Something like below I think, right? (it seems this
patc doesn't hit lenux-next yet, so here is interdiff
output)

Cyrill
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -44,16 +44,25 @@
static struct file *
get_file_raw_ptr(struct task_struct *task, unsigned int idx)
{
+ struct files_struct *files;
struct fdtable *fdt;
struct file *file;

- spin_lock(&task->files->file_lock);
- fdt = files_fdtable(task->files);
+ files = get_files_struct(task);
+ if (!files)
+ return NULL;
+
+ rcu_read_lock();
+
+ fdt = files_fdtable(files);
if (idx < fdt->max_fds)
file = fdt->fd[idx];
else
file = NULL;
- spin_unlock(&task->files->file_lock);
+
+ rcu_read_unlock();
+
+ put_files_struct(files);

return file;
}

2012-02-15 18:51:27

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 05:22:22PM +0100, Oleg Nesterov wrote:
> >
> > > So I don't understand how it's different from what
> > > is provided in this patch. What I'm missing?
> >
> > environ_read() does
> >
> > mm = mm_access(task);
> > if (mm)
> > do_something(mm);
> >
> > even if it races with, say, execve(setuid_app) we can't read the
> > new ->mm.
>
> Wait, I'm confused
>
> process 1 (reader) process 2 ("task" itself)
> mm = mm_access(task);
> task changes own credentials
> so reader can't access on next
> read if it would try, but since
> access already granted... it
> continues do_something(mm)
> if (mm)
> do_something(mm);
>
> So in the patch I tried the same, once access is granted it
> belongs to a caller.

See the "execve(setuid_app)", this is what I meant. Even if we
race with execve() and the task raises its privileges we can't
read the new ->mm, we will read the old mm for which we have
(had) the rights to access.

> > while your code (very roughly) does something like
> >
> > mm = mm_access(task);
> > if (mm)
> > do_something(task->mm);
> >
> > while it is quite possible that mm != task->mm.
>
> Oleg, could you please explain me where it happens
> that task->mm (I've got access to) will be changed
> to some new -mm while I'm inspecting it.

Cough... this is question I am trying to ask ;)

Let me try again. To simplify, lets discuss the KCMP_VM case
only.

I do not really understand why do we need ptrace_may_access().
I do not see any security problems with kcmp_ptr(task->mm), but
I am not expert.

However, you added this check so I assume you have some reason.
But this can race with execve(setuid_app) and KCMP_VM can play
with task->mm after this task raises its caps. If this is fine,
then why do we need ptrace_may_access?

OK, please ignore. I sent the initial email just becase KCMP_FILE
is buggy.

> > > + for (i = 0; i < KCMP_TYPES; i++)
> > > + cookies[i][1] |= (~(~0UL >> 1) | 1);
> >
> > I am puzzled, help ;) this is equal to
> >
> > cookies[i][1] |= -LONG_MAX;
> > or
> > cookies[i][1] |= (LONG_MIN | 1);
> >
> > for what? why do we want to set these 2 bits (MSB and LSB) ?
>
> Letme quote hpa@ here :)
>
> | This code is wrong. You will have a zero cookie, legitimately, once in
> | 2^32 or 2^64 attempts, depending on the bitness.
> |
> | The other thing is that for the multiplicative cookie you should OR in
> | the value (~(~0UL >> 1) | 1) in order to make sure that the value is (a)
> | large and (b) odd.

OK, thanks.

Oleg.

2012-02-15 19:14:13

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/15, Cyrill Gorcunov wrote:
>
> On Wed, Feb 15, 2012 at 07:10:08PM +0400, Cyrill Gorcunov wrote:
> > On Wed, Feb 15, 2012 at 03:36:06PM +0100, Oleg Nesterov wrote:
> > >
> > > task->files can be NULL, we can race with exit_files().
> >
> > So I need to call get_files_struct. Thanks Oleg!
> >
>
> Something like below I think, right? (it seems this
> patc doesn't hit lenux-next yet, so here is interdiff
> output)
>
> Cyrill
> ---
> diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
> --- linux-2.6.git/kernel/kcmp.c
> +++ linux-2.6.git/kernel/kcmp.c
> @@ -44,16 +44,25 @@
> static struct file *
> get_file_raw_ptr(struct task_struct *task, unsigned int idx)
> {
> + struct files_struct *files;
> struct fdtable *fdt;
> struct file *file;
>
> - spin_lock(&task->files->file_lock);
> - fdt = files_fdtable(task->files);
> + files = get_files_struct(task);
> + if (!files)
> + return NULL;
> +
> + rcu_read_lock();
> +
> + fdt = files_fdtable(files);
> if (idx < fdt->max_fds)
> file = fdt->fd[idx];
> else
> file = NULL;
> - spin_unlock(&task->files->file_lock);
> +
> + rcu_read_unlock();
> +
> + put_files_struct(files);

I think this should work. Or you can simply do

struct file *file = NULL;

task_lock(task);
rcu_read_lock();
if (task->files)
file = fcheck_files(task->files, idx);
rcu_read_unlock();
task_unlock(task);

Oleg.

2012-02-15 19:18:59

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 08:06:22PM +0100, Oleg Nesterov wrote:
>
> I think this should work. Or you can simply do
>
> struct file *file = NULL;
>
> task_lock(task);
> rcu_read_lock();
> if (task->files)
> file = fcheck_files(task->files, idx);
> rcu_read_unlock();
> task_unlock(task);
>

Yeah, thanks, this one is better.

Cyrill

2012-02-15 19:56:19

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 07:43:36PM +0100, Oleg Nesterov wrote:
...
>
> Cough... this is question I am trying to ask ;)
>
> Let me try again. To simplify, lets discuss the KCMP_VM case
> only.
>
> I do not really understand why do we need ptrace_may_access().
> I do not see any security problems with kcmp_ptr(task->mm), but
> I am not expert.
>
> However, you added this check so I assume you have some reason.
> But this can race with execve(setuid_app) and KCMP_VM can play
> with task->mm after this task raises its caps. If this is fine,
> then why do we need ptrace_may_access?
>

This makes me scratch the head ;) I think ptrace_may_access (or
some other security test) should remain since it's somehow weird
if non-root task will be able to find objects order from privileged
task. Thus I need to find a way how to handle execve(setuid_app).
Need to think...

Cyrill

2012-02-15 20:02:26

by Vasily Kulikov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 23:56 +0400, Cyrill Gorcunov wrote:
> On Wed, Feb 15, 2012 at 07:43:36PM +0100, Oleg Nesterov wrote:
> ...
> >
> > Cough... this is question I am trying to ask ;)
> >
> > Let me try again. To simplify, lets discuss the KCMP_VM case
> > only.
> >
> > I do not really understand why do we need ptrace_may_access().
> > I do not see any security problems with kcmp_ptr(task->mm), but
> > I am not expert.
> >
> > However, you added this check so I assume you have some reason.
> > But this can race with execve(setuid_app) and KCMP_VM can play
> > with task->mm after this task raises its caps. If this is fine,
> > then why do we need ptrace_may_access?
> >
>
> This makes me scratch the head ;) I think ptrace_may_access (or
> some other security test) should remain since it's somehow weird
> if non-root task will be able to find objects order from privileged
> task. Thus I need to find a way how to handle execve(setuid_app).
> Need to think...

Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex
for the whole period of time when it uses a resource.

--
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

2012-02-15 20:05:39

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Wed, Feb 15, 2012 at 11:57:33PM +0400, Vasiliy Kulikov wrote:
> >
> > This makes me scratch the head ;) I think ptrace_may_access (or
> > some other security test) should remain since it's somehow weird
> > if non-root task will be able to find objects order from privileged
> > task. Thus I need to find a way how to handle execve(setuid_app).
> > Need to think...
>
> Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex
> for the whole period of time when it uses a resource.

Yup, thanks Vasiliy! I've just found cred_guard_mutex in
install_exec_creds. Now I'm thinking if this is what we
need here ;)

Cyrill

2012-02-15 20:25:44

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 12:05:33AM +0400, Cyrill Gorcunov wrote:
> On Wed, Feb 15, 2012 at 11:57:33PM +0400, Vasiliy Kulikov wrote:
> > >
> > > This makes me scratch the head ;) I think ptrace_may_access (or
> > > some other security test) should remain since it's somehow weird
> > > if non-root task will be able to find objects order from privileged
> > > task. Thus I need to find a way how to handle execve(setuid_app).
> > > Need to think...
> >
> > Look at fs/proc/base.c:lock_trace() - it locks ->cred_guard_mutex
> > for the whole period of time when it uses a resource.
>
> Yup, thanks Vasiliy! I've just found cred_guard_mutex in
> install_exec_creds. Now I'm thinking if this is what we
> need here ;)
>

Something like below I think (not yet tested, overall update).

Cyrill
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -44,20 +44,34 @@
static struct file *
get_file_raw_ptr(struct task_struct *task, unsigned int idx)
{
- struct fdtable *fdt;
- struct file *file;
+ struct file *file = NULL;

- spin_lock(&task->files->file_lock);
- fdt = files_fdtable(task->files);
- if (idx < fdt->max_fds)
- file = fdt->fd[idx];
- else
- file = NULL;
- spin_unlock(&task->files->file_lock);
+ task_lock(task);
+ rcu_read_lock();
+
+ if (task->files)
+ file = fcheck_files(task->files, idx);
+
+ rcu_read_unlock();
+ task_unlock(task);

return file;
}

+static int may_access(struct task_struct *task)
+{
+ int err = mutex_lock_killable(&task->signal->cred_guard_mutex);
+ if (err)
+ return err;
+
+ if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+ mutex_unlock(&task->signal->cred_guard_mutex);
+ return -EPERM;
+ }
+
+ return 0;
+}
+
SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
unsigned long, idx1, unsigned long, idx2)
{
@@ -82,11 +96,12 @@
/*
* One should have enough rights to inspect task details.
*/
- if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
- !ptrace_may_access(task2, PTRACE_MODE_READ)) {
- ret = -EACCES;
+ ret = may_access(task1);
+ if (ret)
goto err;
- }
+ ret = may_access(task2);
+ if (ret)
+ goto err_unlock;

switch (type) {
case KCMP_FILE: {
@@ -130,6 +145,9 @@
break;
}

+ mutex_unlock(&task2->signal->cred_guard_mutex);
+err_unlock:
+ mutex_unlock(&task1->signal->cred_guard_mutex);
err:
put_task_struct(task1);
put_task_struct(task2);

2012-02-15 21:09:40

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 12:25:38AM +0400, Cyrill Gorcunov wrote:
>
> Something like below I think (not yet tested, overall update).
>

OK, this one I've just tested. Please review, so I won't miss
something obvious (I'm fetching linux-next now, if the patch
is already there I'll cook one on top).

Cyrill
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -44,20 +44,38 @@
static struct file *
get_file_raw_ptr(struct task_struct *task, unsigned int idx)
{
- struct fdtable *fdt;
- struct file *file;
+ struct file *file = NULL;

- spin_lock(&task->files->file_lock);
- fdt = files_fdtable(task->files);
- if (idx < fdt->max_fds)
- file = fdt->fd[idx];
- else
- file = NULL;
- spin_unlock(&task->files->file_lock);
+ task_lock(task);
+ rcu_read_lock();
+
+ if (task->files)
+ file = fcheck_files(task->files, idx);
+
+ rcu_read_unlock();
+ task_unlock(task);

return file;
}

+static void access_unlock(struct task_struct *task)
+{
+ mutex_unlock(&task->signal->cred_guard_mutex);
+}
+
+static int access_trylock(struct task_struct *task)
+{
+ if (!mutex_trylock(&task->signal->cred_guard_mutex))
+ return -EBUSY;
+
+ if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
+ mutex_unlock(&task->signal->cred_guard_mutex);
+ return -EPERM;
+ }
+
+ return 0;
+}
+
SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
unsigned long, idx1, unsigned long, idx2)
{
@@ -82,11 +100,12 @@
/*
* One should have enough rights to inspect task details.
*/
- if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
- !ptrace_may_access(task2, PTRACE_MODE_READ)) {
- ret = -EACCES;
+ ret = access_trylock(task1);
+ if (ret)
goto err;
- }
+ ret = access_trylock(task2);
+ if (ret)
+ goto err_unlock;

switch (type) {
case KCMP_FILE: {
@@ -130,6 +149,9 @@
break;
}

+ access_unlock(task2);
+err_unlock:
+ access_unlock(task1);
err:
put_task_struct(task1);
put_task_struct(task2);

2012-02-15 21:58:13

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 01:09:34AM +0400, Cyrill Gorcunov wrote:
> On Thu, Feb 16, 2012 at 12:25:38AM +0400, Cyrill Gorcunov wrote:
> >
> > Something like below I think (not yet tested, overall update).
> >
>
> OK, this one I've just tested. Please review, so I won't miss
> something obvious (I'm fetching linux-next now, if the patch
> is already there I'll cook one on top).
> ---

Unfortunately I dont see this patch in linux-next, so I cooked
it as interdiff'ed. Please review.

Maybe (if this patch is fine) we could drop v8 and I would
squash this changed into v9 (together with update to self
test)?

Cyrill


Attachments:
(No filename) (619.00 B)
sys-kcmp-interdiff (2.15 kB)
Download all attachments

2012-02-16 14:58:06

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/16, Cyrill Gorcunov wrote:
>
> +static int access_trylock(struct task_struct *task)
> +{
> + if (!mutex_trylock(&task->signal->cred_guard_mutex))
> + return -EBUSY;
> +
> + if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
> + mutex_unlock(&task->signal->cred_guard_mutex);
> + return -EPERM;
> + }
> +
> + return 0;
> +}

OK, this looks correct, but I don't really understand _trylock.
This means the caller should always retry if -EBUSY, and
kcmp(pid, pid) can never succeed. Sure, kcmp() doesn't make
a lot of sense if pid1 == pid2, but this looks a bit strange.

You could simply do

int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2)
{
int err;

if (m2 > m1)
swap(m1, m2);

err = mutex_lock_killable(m1);

if (!err && likely(m1 != m2)) {
err = mutex_lock_killable_nested(m2);
if (err)
mutex_unlock(m1);
}

return err;
}

but I won't unsist.

Oleg.

2012-02-16 15:13:47

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 03:49:54PM +0100, Oleg Nesterov wrote:
> On 02/16, Cyrill Gorcunov wrote:
> >
> > +static int access_trylock(struct task_struct *task)
> > +{
> > + if (!mutex_trylock(&task->signal->cred_guard_mutex))
> > + return -EBUSY;
> > +
> > + if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
> > + mutex_unlock(&task->signal->cred_guard_mutex);
> > + return -EPERM;
> > + }
> > +
> > + return 0;
> > +}
>
> OK, this looks correct, but I don't really understand _trylock.
> This means the caller should always retry if -EBUSY, and
> kcmp(pid, pid) can never succeed. Sure, kcmp() doesn't make
> a lot of sense if pid1 == pid2, but this looks a bit strange.
>

Hi Oleg, sure I can make it this way, also I think if pid1 == pid2
and idx1 == idx2 I can return 0 immediately.

> You could simply do
> int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2)
> {
> int err;
>
> if (m2 > m1)
> swap(m1, m2);
>
> err = mutex_lock_killable(m1);
>
> if (!err && likely(m1 != m2)) {
> err = mutex_lock_killable_nested(m2);
> if (err)
> mutex_unlock(m1);
> }
>
> return err;
> }
>
> but I won't insist.

Initially I wanted kcmp would be brining minimum impact
and if mutex is already taken by someone, we would not sleep
but return immediately with -EBUSY and it would be up to caller
to deside if to try again or make some delay first. I simply
not sure what is better here.

Cyrill

2012-02-16 16:49:10

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 07:13:40PM +0400, Cyrill Gorcunov wrote:
...
>
> > You could simply do
> > int mutex_double_lock_killable(struct mutex *m1, struct mutex *m2)
> > {
> > int err;
> >
> > if (m2 > m1)
> > swap(m1, m2);
> >
> > err = mutex_lock_killable(m1);
> >
> > if (!err && likely(m1 != m2)) {
> > err = mutex_lock_killable_nested(m2);
> > if (err)
> > mutex_unlock(m1);
> > }
> >
> > return err;
> > }
> >
> > but I won't insist.
>
> Initially I wanted kcmp would be brining minimum impact
> and if mutex is already taken by someone, we would not sleep
> but return immediately with -EBUSY and it would be up to caller
> to deside if to try again or make some delay first. I simply
> not sure what is better here.
>
This one should do the trick.

Cyrill
---
From: Cyrill Gorcunov <[email protected]>
Subject: syscalls, x86: Make __NR_kcmp to work with equivalent pids

In case if pid1 is equal to pid2 the kcmp will return -EBUSY,
which makes no sence. Make it able to work with equivalent pids.
Selftest is extended as well.

Repored-by: Oleg Nesterov <[email protected]>
Signed-off-by: Cyrill Gorcunov <[email protected]>
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -58,22 +58,31 @@
return file;
}

-static void access_unlock(struct task_struct *task)
+static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
{
- mutex_unlock(&task->signal->cred_guard_mutex);
+ if (m2 > m1)
+ swap(m1, m2);
+
+ if (likely(m2 != m1))
+ mutex_unlock(m2);
+ mutex_unlock(m1);
}

-static int access_trylock(struct task_struct *task)
+static int kcmp_lock(struct mutex *m1, struct mutex *m2)
{
- if (!mutex_trylock(&task->signal->cred_guard_mutex))
- return -EBUSY;
+ int err;
+
+ if (m2 > m1)
+ swap(m1, m2);

- if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
- mutex_unlock(&task->signal->cred_guard_mutex);
- return -EPERM;
+ err = mutex_lock_killable(m1);
+ if (!err && likely(m1 != m2)) {
+ err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
+ if (err)
+ mutex_unlock(m1);
}

- return 0;
+ return err;
}

SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
@@ -100,12 +109,15 @@
/*
* One should have enough rights to inspect task details.
*/
- ret = access_trylock(task1);
+ ret = kcmp_lock(&task1->signal->cred_guard_mutex,
+ &task2->signal->cred_guard_mutex);
if (ret)
goto err;
- ret = access_trylock(task2);
- if (ret)
+ if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+ !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+ ret = -EPERM;
goto err_unlock;
+ }

switch (type) {
case KCMP_FILE: {
@@ -149,9 +161,9 @@
break;
}

- access_unlock(task2);
err_unlock:
- access_unlock(task1);
+ kcmp_unlock(&task1->signal->cred_guard_mutex,
+ &task2->signal->cred_guard_mutex);
err:
put_task_struct(task1);
put_task_struct(task2);
diff -u linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
--- linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
+++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
@@ -74,6 +74,15 @@
ret = -1;
} else
printf("PASS: 0 returned as expected\n");
+
+ /* Compare with self */
+ ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
+ if (ret) {
+ printf("FAIL: 0 expected but %li returned\n", ret);
+ ret = -1;
+ } else
+ printf("PASS: 0 returned as expected\n");
+
exit(ret);
}

2012-02-16 17:49:14

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/16, Cyrill Gorcunov wrote:
>
> -static void access_unlock(struct task_struct *task)
> +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> {
> - mutex_unlock(&task->signal->cred_guard_mutex);
> + if (m2 > m1)
> + swap(m1, m2);

Well, the order doesn't matter in case of _unlock, you can remove
this part. Not that it really hurts though, I won't argue.

Oleg.

2012-02-16 17:58:16

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote:
> On 02/16, Cyrill Gorcunov wrote:
> >
> > -static void access_unlock(struct task_struct *task)
> > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> > {
> > - mutex_unlock(&task->signal->cred_guard_mutex);
> > + if (m2 > m1)
> > + swap(m1, m2);
>
> Well, the order doesn't matter in case of _unlock, you can remove
> this part. Not that it really hurts though, I won't argue.
>

It drops some instructions so I think it worth removing (still
unlocking not in reverse order is something which always make
me nervious ;)

Cyrill

2012-02-16 18:26:00

by Vasily Kulikov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> + err = mutex_lock_killable(m1);
> + if (!err && likely(m1 != m2)) {
> + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);

Doesn't it lead to a deadlock?

mutex_lock_killable(task1)
| mutex_lock_killable(task2)
mutex_lock_killable_nested(task2) |
(locked) |
mutex_lock_killable_nested(task1)
(locked)

I suppose you should use some global lock (kcmp_lock) before both locks.

Thanks,

--
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

2012-02-16 18:35:01

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 10:21:06PM +0400, Vasiliy Kulikov wrote:
> On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> > + err = mutex_lock_killable(m1);
> > + if (!err && likely(m1 != m2)) {
> > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
>
> Doesn't it lead to a deadlock?
>
> mutex_lock_killable(task1)
> | mutex_lock_killable(task2)
> mutex_lock_killable_nested(task2) |
> (locked) |
> mutex_lock_killable_nested(task1)
> (locked)
>
> I suppose you should use some global lock (kcmp_lock) before both locks.

but here is if (m1 > m2) and swap() do take place.

Cyrill

2012-02-16 18:38:09

by Vasily Kulikov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 22:34 +0400, Cyrill Gorcunov wrote:
> On Thu, Feb 16, 2012 at 10:21:06PM +0400, Vasiliy Kulikov wrote:
> > On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> > > + err = mutex_lock_killable(m1);
> > > + if (!err && likely(m1 != m2)) {
> > > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
> >
> > Doesn't it lead to a deadlock?
> >
> > mutex_lock_killable(task1)
> > | mutex_lock_killable(task2)
> > mutex_lock_killable_nested(task2) |
> > (locked) |
> > mutex_lock_killable_nested(task1)
> > (locked)
> >
> > I suppose you should use some global lock (kcmp_lock) before both locks.
>
> but here is if (m1 > m2) and swap() do take place.

Ah, ok. Then this deadlock scenario is impossible, sorry.

--
Vasiliy Kulikov
http://www.openwall.com - bringing security into open computing environments

2012-02-16 19:11:35

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/16, Cyrill Gorcunov wrote:
>
> On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote:
> > On 02/16, Cyrill Gorcunov wrote:
> > >
> > > -static void access_unlock(struct task_struct *task)
> > > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> > > {
> > > - mutex_unlock(&task->signal->cred_guard_mutex);
> > > + if (m2 > m1)
> > > + swap(m1, m2);
> >
> > Well, the order doesn't matter in case of _unlock, you can remove
> > this part. Not that it really hurts though, I won't argue.
>
> It drops some instructions so I think it worth removing

Yes.

> (still
> unlocking not in reverse order is something which always make
> me nervious ;)

Yes ;)

so let me repeat, I am not arguing. But IMHO every piece of code
should be understandable. Personally I do not mind at all, just
I _personally_ think this code _can_ confuse the reader, "damn why
we can't simply unlock in any order???".

If you add the "not really needed" comment above this swap - I agree.
If you simply remove this swap - I agree as well.

But. I won't argue if you prefer to keep this patch as is. You are the
author. If it looks better to _you_ - OK, this is correct (afaics).

Oleg.

2012-02-16 19:11:46

by Oleg Nesterov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/16, Vasiliy Kulikov wrote:
>
> On Thu, Feb 16, 2012 at 20:49 +0400, Cyrill Gorcunov wrote:
> > + err = mutex_lock_killable(m1);
> > + if (!err && likely(m1 != m2)) {
> > + err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
>
> Doesn't it lead to a deadlock?
>
> mutex_lock_killable(task1)
> | mutex_lock_killable(task2)
> mutex_lock_killable_nested(task2) |
> (locked) |
> mutex_lock_killable_nested(task1)
> (locked)

Please note the

if (m1 >= m2)
swap(m1. m2)

at the start.

Oleg.

2012-02-16 19:23:40

by H. Peter Anvin

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On 02/16/2012 11:03 AM, Oleg Nesterov wrote:
>
> so let me repeat, I am not arguing. But IMHO every piece of code
> should be understandable.

Amen!

-hpa


--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2012-02-16 19:29:58

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 08:03:21PM +0100, Oleg Nesterov wrote:
> On 02/16, Cyrill Gorcunov wrote:
> >
> > On Thu, Feb 16, 2012 at 06:40:47PM +0100, Oleg Nesterov wrote:
> > > On 02/16, Cyrill Gorcunov wrote:
> > > >
> > > > -static void access_unlock(struct task_struct *task)
> > > > +static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
> > > > {
> > > > - mutex_unlock(&task->signal->cred_guard_mutex);
> > > > + if (m2 > m1)
> > > > + swap(m1, m2);
> > >
> > > Well, the order doesn't matter in case of _unlock, you can remove
> > > this part. Not that it really hurts though, I won't argue.
> >
> > It drops some instructions so I think it worth removing
>
> Yes.
>

Final one ;) I agreed on every line of your comment, thanks a lot Oleg!
---
From: Cyrill Gorcunov <[email protected]>
Subject: syscalls, x86: Make __NR_kcmp to work with equivalent pids

In case if pid1 is equal to pid2 the kcmp will return -EBUSY,
which makes no sence. Make it able to work with equivalent pids.
Selftest is extended as well.

Repored-by: Oleg Nesterov <[email protected]>
Signed-off-by: Cyrill Gorcunov <[email protected]>
---
diff -u linux-2.6.git/kernel/kcmp.c linux-2.6.git/kernel/kcmp.c
--- linux-2.6.git/kernel/kcmp.c
+++ linux-2.6.git/kernel/kcmp.c
@@ -58,22 +58,28 @@
return file;
}

-static void access_unlock(struct task_struct *task)
+static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
{
- mutex_unlock(&task->signal->cred_guard_mutex);
+ if (likely(m2 != m1))
+ mutex_unlock(m2);
+ mutex_unlock(m1);
}

-static int access_trylock(struct task_struct *task)
+static int kcmp_lock(struct mutex *m1, struct mutex *m2)
{
- if (!mutex_trylock(&task->signal->cred_guard_mutex))
- return -EBUSY;
+ int err;

- if (!ptrace_may_access(task, PTRACE_MODE_READ)) {
- mutex_unlock(&task->signal->cred_guard_mutex);
- return -EPERM;
+ if (m2 > m1)
+ swap(m1, m2);
+
+ err = mutex_lock_killable(m1);
+ if (!err && likely(m1 != m2)) {
+ err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
+ if (err)
+ mutex_unlock(m1);
}

- return 0;
+ return err;
}

SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
@@ -100,12 +106,15 @@
/*
* One should have enough rights to inspect task details.
*/
- ret = access_trylock(task1);
+ ret = kcmp_lock(&task1->signal->cred_guard_mutex,
+ &task2->signal->cred_guard_mutex);
if (ret)
goto err;
- ret = access_trylock(task2);
- if (ret)
+ if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+ !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+ ret = -EPERM;
goto err_unlock;
+ }

switch (type) {
case KCMP_FILE: {
@@ -149,9 +158,9 @@
break;
}

- access_unlock(task2);
err_unlock:
- access_unlock(task1);
+ kcmp_unlock(&task1->signal->cred_guard_mutex,
+ &task2->signal->cred_guard_mutex);
err:
put_task_struct(task1);
put_task_struct(task2);
diff -u linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
--- linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
+++ linux-2.6.git/tools/testing/selftests/kcmp/kcmp_test.c
@@ -74,6 +74,15 @@
ret = -1;
} else
printf("PASS: 0 returned as expected\n");
+
+ /* Compare with self */
+ ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
+ if (ret) {
+ printf("FAIL: 0 expected but %li returned\n", ret);
+ ret = -1;
+ } else
+ printf("PASS: 0 returned as expected\n");
+
exit(ret);
}

2012-02-16 19:52:12

by Andrew Morton

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, 16 Feb 2012 23:29:52 +0400
Cyrill Gorcunov <[email protected]> wrote:

> Final one ;)

syscalls-x86-add-__nr_kcmp-syscall-v8-fix-fix-fix-fix-fix.patch
is a tie for the world record!

I folded everything back into a single patch. Here's what we currently
have:


From: Cyrill Gorcunov <[email protected]>
Subject: syscalls, x86: add __NR_kcmp syscall

While doing the checkpoint-restore in the user space one need to determine
whether various kernel objects (like mm_struct-s of file_struct-s) are
shared between tasks and restore this state.

The 2nd step can be solved by using appropriate CLONE_ flags and the
unshare syscall, while there's currently no ways for solving the 1st one.

One of the ways for checking whether two tasks share e.g. mm_struct is to
provide some mm_struct ID of a task to its proc file, but showing such
info considered to be not that good for security reasons.

Thus after some debates we end up in conclusion that using that named
'comparison' syscall might be the best candidate. So here is it --
__NR_kcmp.

It takes up to 5 arguments - the pids of the two tasks (which
characteristics should be compared), the comparison type and (in case of
comparison of files) two file descriptors.

Lookups for pids are done in the caller's PID namespace only.

At moment only x86 is supported and tested.

[[email protected]: fix up selftests, warnings]
Signed-off-by: Cyrill Gorcunov <[email protected]>
Cc: "Eric W. Biederman" <[email protected]>
Cc: Pavel Emelyanov <[email protected]>
Cc: Andrey Vagin <[email protected]>
Cc: KOSAKI Motohiro <[email protected]>
Cc: Ingo Molnar <[email protected]>
Cc: H. Peter Anvin <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Glauber Costa <[email protected]>
Cc: Andi Kleen <[email protected]>
Cc: Tejun Heo <[email protected]>
Cc: Matt Helsley <[email protected]>
Cc: Pekka Enberg <[email protected]>
Cc: Eric Dumazet <[email protected]>
Cc: Vasiliy Kulikov <[email protected]>
Cc: Alexey Dobriyan <[email protected]>
Cc: [email protected]
Cc: Michal Marek <[email protected]>
Cc: Frederic Weisbecker <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---

arch/x86/syscalls/syscall_32.tbl | 1
arch/x86/syscalls/syscall_64.tbl | 1
include/linux/kcmp.h | 17 +
include/linux/syscalls.h | 2
kernel/Makefile | 3
kernel/kcmp.c | 186 +++++++++++++++++++++
kernel/sys_ni.c | 3
tools/testing/selftests/Makefile | 2
tools/testing/selftests/kcmp/Makefile | 29 +++
tools/testing/selftests/kcmp/kcmp_test.c | 94 ++++++++++
10 files changed, 337 insertions(+), 1 deletion(-)

diff -puN arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_32.tbl
--- a/arch/x86/syscalls/syscall_32.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_32.tbl
@@ -355,3 +355,4 @@
346 i386 setns sys_setns
347 i386 process_vm_readv sys_process_vm_readv compat_sys_process_vm_readv
348 i386 process_vm_writev sys_process_vm_writev compat_sys_process_vm_writev
+349 i386 kcmp sys_kcmp
diff -puN arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8 arch/x86/syscalls/syscall_64.tbl
--- a/arch/x86/syscalls/syscall_64.tbl~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/arch/x86/syscalls/syscall_64.tbl
@@ -318,3 +318,4 @@
309 64 getcpu sys_getcpu
310 64 process_vm_readv sys_process_vm_readv
311 64 process_vm_writev sys_process_vm_writev
+312 64 kcmp sys_kcmp
diff -puN /dev/null include/linux/kcmp.h
--- /dev/null
+++ a/include/linux/kcmp.h
@@ -0,0 +1,17 @@
+#ifndef _LINUX_KCMP_H
+#define _LINUX_KCMP_H
+
+/* Comparison type */
+enum kcmp_type {
+ KCMP_FILE,
+ KCMP_VM,
+ KCMP_FILES,
+ KCMP_FS,
+ KCMP_SIGHAND,
+ KCMP_IO,
+ KCMP_SYSVSEM,
+
+ KCMP_TYPES,
+};
+
+#endif /* _LINUX_KCMP_H */
diff -puN include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8 include/linux/syscalls.h
--- a/include/linux/syscalls.h~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/include/linux/syscalls.h
@@ -857,4 +857,6 @@ asmlinkage long sys_process_vm_writev(pi
unsigned long riovcnt,
unsigned long flags);

+asmlinkage long sys_kcmp(pid_t pid1, pid_t pid2, int type,
+ unsigned long idx1, unsigned long idx2);
#endif
diff -puN kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/Makefile
--- a/kernel/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/Makefile
@@ -25,6 +25,9 @@ endif
obj-y += sched/
obj-y += power/

+ifeq ($(CONFIG_CHECKPOINT_RESTORE),y)
+obj-$(CONFIG_X86) += kcmp.o
+endif
obj-$(CONFIG_FREEZER) += freezer.o
obj-$(CONFIG_PROFILING) += profile.o
obj-$(CONFIG_STACKTRACE) += stacktrace.o
diff -puN /dev/null kernel/kcmp.c
--- /dev/null
+++ a/kernel/kcmp.c
@@ -0,0 +1,186 @@
+#include <linux/kernel.h>
+#include <linux/syscalls.h>
+#include <linux/fdtable.h>
+#include <linux/string.h>
+#include <linux/random.h>
+#include <linux/module.h>
+#include <linux/init.h>
+#include <linux/cache.h>
+#include <linux/bug.h>
+#include <linux/err.h>
+#include <linux/kcmp.h>
+
+#include <asm/unistd.h>
+
+/*
+ * We don't expose real in-memory order of objects for security
+ * reasons, still the comparison results should be suitable for
+ * sorting. Thus, we obfuscate kernel pointers values and compare
+ * the production instead.
+ */
+static unsigned long cookies[KCMP_TYPES][2] __read_mostly;
+
+static long kptr_obfuscate(long v, int type)
+{
+ return (v ^ cookies[type][0]) * cookies[type][1];
+}
+
+/*
+ * 0 - equal, i.e. v1 = v2
+ * 1 - less than, i.e. v1 < v2
+ * 2 - greater than, i.e. v1 > v2
+ * 3 - not equal but ordering unavailable (reserved for future)
+ */
+static int kcmp_ptr(void *v1, void *v2, enum kcmp_type type)
+{
+ long ret;
+
+ ret = kptr_obfuscate((long)v1, type) - kptr_obfuscate((long)v2, type);
+
+ return (ret < 0) | ((ret > 0) << 1);
+}
+
+/* The caller must have pinned the task */
+static struct file *
+get_file_raw_ptr(struct task_struct *task, unsigned int idx)
+{
+ struct file *file = NULL;
+
+ task_lock(task);
+ rcu_read_lock();
+
+ if (task->files)
+ file = fcheck_files(task->files, idx);
+
+ rcu_read_unlock();
+ task_unlock(task);
+
+ return file;
+}
+
+static void kcmp_unlock(struct mutex *m1, struct mutex *m2)
+{
+ if (likely(m2 != m1))
+ mutex_unlock(m2);
+ mutex_unlock(m1);
+}
+
+static int kcmp_lock(struct mutex *m1, struct mutex *m2)
+{
+ int err;
+
+ if (m2 > m1)
+ swap(m1, m2);
+
+ err = mutex_lock_killable(m1);
+ if (!err && likely(m1 != m2)) {
+ err = mutex_lock_killable_nested(m2, SINGLE_DEPTH_NESTING);
+ if (err)
+ mutex_unlock(m1);
+ }
+
+ return err;
+}
+
+SYSCALL_DEFINE5(kcmp, pid_t, pid1, pid_t, pid2, int, type,
+ unsigned long, idx1, unsigned long, idx2)
+{
+ struct task_struct *task1, *task2;
+ int ret;
+
+ rcu_read_lock();
+
+ /*
+ * Tasks are looked up in caller's PID namespace only.
+ */
+ task1 = find_task_by_vpid(pid1);
+ task2 = find_task_by_vpid(pid2);
+ if (!task1 || !task2)
+ goto err_no_task;
+
+ get_task_struct(task1);
+ get_task_struct(task2);
+
+ rcu_read_unlock();
+
+ /*
+ * One should have enough rights to inspect task details.
+ */
+ ret = kcmp_lock(&task1->signal->cred_guard_mutex,
+ &task2->signal->cred_guard_mutex);
+ if (ret)
+ goto err;
+ if (!ptrace_may_access(task1, PTRACE_MODE_READ) ||
+ !ptrace_may_access(task2, PTRACE_MODE_READ)) {
+ ret = -EPERM;
+ goto err_unlock;
+ }
+
+ switch (type) {
+ case KCMP_FILE: {
+ struct file *filp1, *filp2;
+
+ filp1 = get_file_raw_ptr(task1, idx1);
+ filp2 = get_file_raw_ptr(task2, idx2);
+
+ if (filp1 && filp2)
+ ret = kcmp_ptr(filp1, filp2, KCMP_FILE);
+ else
+ ret = -EBADF;
+ break;
+ }
+ case KCMP_VM:
+ ret = kcmp_ptr(task1->mm, task2->mm, KCMP_VM);
+ break;
+ case KCMP_FILES:
+ ret = kcmp_ptr(task1->files, task2->files, KCMP_FILES);
+ break;
+ case KCMP_FS:
+ ret = kcmp_ptr(task1->fs, task2->fs, KCMP_FS);
+ break;
+ case KCMP_SIGHAND:
+ ret = kcmp_ptr(task1->sighand, task2->sighand, KCMP_SIGHAND);
+ break;
+ case KCMP_IO:
+ ret = kcmp_ptr(task1->io_context, task2->io_context, KCMP_IO);
+ break;
+ case KCMP_SYSVSEM:
+#ifdef CONFIG_SYSVIPC
+ ret = kcmp_ptr(task1->sysvsem.undo_list,
+ task2->sysvsem.undo_list,
+ KCMP_SYSVSEM);
+#else
+ ret = -EOPNOTSUP;
+#endif
+ break;
+ default:
+ ret = -EINVAL;
+ break;
+ }
+
+err_unlock:
+ kcmp_unlock(&task1->signal->cred_guard_mutex,
+ &task2->signal->cred_guard_mutex);
+err:
+ put_task_struct(task1);
+ put_task_struct(task2);
+
+ return ret;
+
+err_no_task:
+ rcu_read_unlock();
+ return -ESRCH;
+}
+
+static __init int kcmp_cookies_init(void)
+{
+ int i;
+
+ get_random_bytes(cookies, sizeof(cookies));
+
+ for (i = 0; i < KCMP_TYPES; i++)
+ cookies[i][1] |= (~(~0UL >> 1) | 1);
+
+ return 0;
+}
+arch_initcall(kcmp_cookies_init);
diff -puN kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8 kernel/sys_ni.c
--- a/kernel/sys_ni.c~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/kernel/sys_ni.c
@@ -203,3 +203,6 @@ cond_syscall(sys_fanotify_mark);
cond_syscall(sys_name_to_handle_at);
cond_syscall(sys_open_by_handle_at);
cond_syscall(compat_sys_open_by_handle_at);
+
+/* compare kernel pointers */
+cond_syscall(sys_kcmp);
diff -puN /dev/null tools/testing/selftests/kcmp/Makefile
--- /dev/null
+++ a/tools/testing/selftests/kcmp/Makefile
@@ -0,0 +1,29 @@
+uname_M := $(shell uname -m 2>/dev/null || echo not)
+ARCH ?= $(shell echo $(uname_M) | sed -e s/i.86/i386/)
+ifeq ($(ARCH),i386)
+ ARCH := X86
+ CFLAGS := -DCONFIG_X86_32 -D__i386__
+endif
+ifeq ($(ARCH),x86_64)
+ ARCH := X86
+ CFLAGS := -DCONFIG_X86_64 -D__x86_64__
+endif
+
+CFLAGS += -I../../../../arch/x86/include/generated/
+CFLAGS += -I../../../../include/
+CFLAGS += -I../../../../usr/include/
+CFLAGS += -I../../../../arch/x86/include/
+
+all:
+ifeq ($(ARCH),X86)
+ gcc $(CFLAGS) kcmp_test.c -o run_test
+else
+ echo "Not an x86 target, can't build kcmp selftest"
+endif
+
+run-tests: all
+ ./kcmp_test
+
+clean:
+ rm -fr ./run_test
+ rm -fr ./test-file
diff -puN /dev/null tools/testing/selftests/kcmp/kcmp_test.c
--- /dev/null
+++ a/tools/testing/selftests/kcmp/kcmp_test.c
@@ -0,0 +1,94 @@
+#define _GNU_SOURCE
+
+#include <stdio.h>
+#include <stdlib.h>
+#include <signal.h>
+#include <limits.h>
+#include <unistd.h>
+#include <errno.h>
+#include <string.h>
+#include <fcntl.h>
+
+#include <linux/unistd.h>
+#include <linux/kcmp.h>
+
+#include <sys/syscall.h>
+#include <sys/types.h>
+#include <sys/stat.h>
+#include <sys/wait.h>
+
+static long sys_kcmp(int pid1, int pid2, int type, int fd1, int fd2)
+{
+ return syscall(__NR_kcmp, pid1, pid2, type, fd1, fd2);
+}
+
+int main(int argc, char **argv)
+{
+ const char kpath[] = "kcmp-test-file";
+ int pid1, pid2;
+ int fd1, fd2;
+ int status;
+
+ fd1 = open(kpath, O_RDWR | O_CREAT | O_TRUNC, 0644);
+ pid1 = getpid();
+
+ if (fd1 < 0) {
+ perror("Can't create file");
+ exit(1);
+ }
+
+ pid2 = fork();
+ if (pid2 < 0) {
+ perror("fork failed");
+ exit(1);
+ }
+
+ if (!pid2) {
+ int pid2 = getpid();
+ int ret;
+
+ fd2 = open(kpath, O_RDWR, 0644);
+ if (fd2 < 0) {
+ perror("Can't open file");
+ exit(1);
+ }
+
+ /* An example of output and arguments */
+ printf("pid1: %6d pid2: %6d FD: %2ld FILES: %2ld VM: %2ld "
+ "FS: %2ld SIGHAND: %2ld IO: %2ld SYSVSEM: %2ld "
+ "INV: %2ld\n",
+ pid1, pid2,
+ sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd2),
+ sys_kcmp(pid1, pid2, KCMP_FILES, 0, 0),
+ sys_kcmp(pid1, pid2, KCMP_VM, 0, 0),
+ sys_kcmp(pid1, pid2, KCMP_FS, 0, 0),
+ sys_kcmp(pid1, pid2, KCMP_SIGHAND, 0, 0),
+ sys_kcmp(pid1, pid2, KCMP_IO, 0, 0),
+ sys_kcmp(pid1, pid2, KCMP_SYSVSEM, 0, 0),
+
+ /* This one should fail */
+ sys_kcmp(pid1, pid2, KCMP_TYPES + 1, 0, 0));
+
+ /* This one should return same fd */
+ ret = sys_kcmp(pid1, pid2, KCMP_FILE, fd1, fd1);
+ if (ret) {
+ printf("FAIL: 0 expected but %d returned\n", ret);
+ ret = -1;
+ } else
+ printf("PASS: 0 returned as expected\n");
+
+ /* Compare with self */
+ ret = sys_kcmp(pid1, pid1, KCMP_VM, 0, 0);
+ if (ret) {
+ printf("FAIL: 0 expected but %li returned\n", ret);
+ ret = -1;
+ } else
+ printf("PASS: 0 returned as expected\n");
+
+ exit(ret);
+ }
+
+ waitpid(pid2, &status, P_ALL);
+
+ return 0;
+}
diff -puN tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8 tools/testing/selftests/Makefile
--- a/tools/testing/selftests/Makefile~syscalls-x86-add-__nr_kcmp-syscall-v8
+++ a/tools/testing/selftests/Makefile
@@ -1,4 +1,4 @@
-TARGETS = breakpoints vm
+TARGETS = breakpoints vm kcmp

all:
for TARGET in $(TARGETS); do \
_

2012-02-16 20:01:57

by Cyrill Gorcunov

[permalink] [raw]
Subject: Re: + syscalls-x86-add-__nr_kcmp-syscall-v8.patch added to -mm tree

On Thu, Feb 16, 2012 at 11:52:08AM -0800, Andrew Morton wrote:
> On Thu, 16 Feb 2012 23:29:52 +0400
> Cyrill Gorcunov <[email protected]> wrote:
>
> > Final one ;)
>
> syscalls-x86-add-__nr_kcmp-syscall-v8-fix-fix-fix-fix-fix.patch
> is a tie for the world record!

It was close to v13 ;)

>
> I folded everything back into a single patch. Here's what we currently
> have:
>

Thanks a lot, Andrew!

Cyrill