2012-08-13 01:53:56

by Al Viro

[permalink] [raw]
Subject: [flame^Wreview] net: netprio_cgroup: rework update socket logic

Ladies and gentlemen, who the devil had reviewed that little gem?

commit 406a3c638ce8b17d9704052c07955490f732c2b8
Author: John Fastabend <[email protected]>
Date: Fri Jul 20 10:39:25 2012 +0000

is a bleeding bogosity that doesn't pass even the most cursory
inspection. It iterates through descriptor tables of a bunch
of processes, doing this:
file = fcheck_files(files, fd);
if (!file)
continue;

path = d_path(&file->f_path, tmp, PAGE_SIZE);
rv = sscanf(path, "socket:[%lu]", &s);
if (rv <= 0)
continue;

sock = sock_from_file(file, &err);
if (!err)
sock_update_netprioidx(sock->sk, p);
Note the charming use of sscanf() for pattern-matching. 's' (inode
number of socket) is completely unused afterwards; what happens here
is a very badly written attempt to skip non-sockets. Why, will
sock_from_file() blow up on non-sockets? And isn't there some less
obnoxious way to check that the file is a sockfs one? Let's see:
struct socket *sock_from_file(struct file *file, int *err)
{
if (file->f_op == &socket_file_ops)
return file->private_data; /* set in sock_map_fd */

*err = -ENOTSOCK;
return NULL;
}
... and the first line is exactly that - a check that we are on sockfs.
_Far_ less expensive one, at that, so it's not even that we are avoiding
a costly test. In other words, all masturbation with d_path() is absolutely
pointless.

Furthermore, it's racy; had been even more so before the delayed fput series
went in, but even now it's not safe. fcheck_files() does *NOT* guarantee
that file is not getting closed right now. rcu_read_lock() prevents only
freeing and potential reuse of struct file we'd got; it might go all the
way through final fput() just as we look at it. So file->f_path is not
protected by anything. Worse yet, neither is struct socket itself - we
might be going through sock_release() at the same time, so sock->sk might
very well be NULL, leaving us a oops even after we dump d_path() idiocy.

To make it even funnier, there's such thing as SCM_RIGHTS datagrams and
descriptor passing. In other words, it's *not* going to catch all sockets
that would be caught by the earlier variant.

What the hell it is about cgroups that turns otherwise sane people into
gibbering idiots? Other than the Old Ones' apparent involvement in
the mental processes that had lead to the entire concept, that is...

Let's take a closer look at the entire net/core/netprio_cgroup.c, shall we?

Right at the beginning of that Fine Piece of Software we find this:
static atomic_t max_prioidx = ATOMIC_INIT(0);
Why is it atomic? We have all of four accesses to it; one writer and three
readers. The writer and one of the readers are under the same spinlock;
the rest of readers are under rtnl_lock(). Moreover, the sole writer is
in get_prioidx(), which is called in one place and is immediately followed
by grabbing rtnl_lock(). So shifting it down there would've put *all*
accesses of that sucker under the same mutex...

Next to that one we have
static DEFINE_SPINLOCK(prioidx_map_lock);
What is that one for? Two places touching that sucker; get_prioidx() and
put_prioidx(). spin_lock_irqsave()/spin_unlock_irqrestore() in both.
Why irqsave? Somebody calling that from interrupt context? Looks odd...
OK, there's not a lot of callers - cgrp_create(), cgrp_destroy(). And
neither looks like something that could be called from an interrupt.
The former has GFP_KERNEL allocation right in the beginning. Oh, lookie -
both do rtnl_lock(). So not only is all this wanking with irqs pure
cargo-culting, the whole spinlock is pointless; we can simply shift all
this stuff under rtnl_lock().

After get_prioidx()/put_priodix() we come to the following gem:
for (i = 0;
old_priomap && (i < old_priomap->priomap_len);
i++)
new_priomap->priomap[i] = old_priomap->priomap[i];
Why are we checking old_priomap on every step? Sure, gcc will take
that out of loop, but why obfuscate the damn thing?
if (old_priomap)
memcpy(new_priomap->priomap, old_priomap,
sizeof(u32) * old_priomap->priomap_len);
would be far more idiomatic... Anyway, we are extending the damn array,
copying contents of the old one to replacement. Understandable. Where
do we modify the contents of that array, anyway? Aha:
rtnl_lock();
for_each_netdev(&init_net, dev) {
map = rtnl_dereference(dev->priomap);
if (map && cs->prioidx < map->priomap_len)
map->priomap[cs->prioidx] = 0;
}
rtnl_unlock();
and
rcu_read_lock();
map = rcu_dereference(dev->priomap);
if (map)
map->priomap[prioidx] = priority;
rcu_read_unlock();
The first one is serialized with the reallocate-and-copy by rtnl_lock().
The second one, though, is immediately suspicious - rcu_read_lock() in
updater. It's in write_priomap(), which is ->write_string() in some object
deep in the bloated bowels of cgroup. Only one caller: cgroup_write_string().
No locks grabbed. Called from cgroup_file_write(), again without any locks.
Which is ->write() of file_operations. And that is callable without any
locks whatsoever. OK, so we definitely have a race here.

Incidentally, we'd dropped rtnl_lock() just before that point. IOW, it's
trivially fixed by moving that thing into write_update_netdev_table()...

BTW, speaking of highly non-idiomatic code: strstr(devname, " "). Why
not introduce strcasestr(), if we are going for maximal obfuscation here?
Sigh...


2012-08-13 02:31:00

by Al Viro

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

On Mon, Aug 13, 2012 at 02:53:48AM +0100, Al Viro wrote:
> if (old_priomap)
> memcpy(new_priomap->priomap, old_priomap,
^^^^^^^^^^^
old_priomap->priomap,
that is.

2012-08-13 05:55:34

by John Fastabend

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

On 8/12/2012 6:53 PM, Al Viro wrote:
> Ladies and gentlemen, who the devil had reviewed that little gem?
>
> commit 406a3c638ce8b17d9704052c07955490f732c2b8
> Author: John Fastabend <[email protected]>
> Date: Fri Jul 20 10:39:25 2012 +0000
>
> is a bleeding bogosity that doesn't pass even the most cursory
> inspection. It iterates through descriptor tables of a bunch
> of processes, doing this:
> file = fcheck_files(files, fd);
> if (!file)
> continue;
>
> path = d_path(&file->f_path, tmp, PAGE_SIZE);
> rv = sscanf(path, "socket:[%lu]", &s);
> if (rv <= 0)
> continue;
>
> sock = sock_from_file(file, &err);
> if (!err)
> sock_update_netprioidx(sock->sk, p);
> Note the charming use of sscanf() for pattern-matching. 's' (inode
> number of socket) is completely unused afterwards; what happens here
> is a very badly written attempt to skip non-sockets. Why, will
> sock_from_file() blow up on non-sockets? And isn't there some less
> obnoxious way to check that the file is a sockfs one? Let's see:
> struct socket *sock_from_file(struct file *file, int *err)
> {
> if (file->f_op == &socket_file_ops)
> return file->private_data; /* set in sock_map_fd */
>
> *err = -ENOTSOCK;
> return NULL;
> }
> ... and the first line is exactly that - a check that we are on sockfs.
> _Far_ less expensive one, at that, so it's not even that we are avoiding
> a costly test. In other words, all masturbation with d_path() is absolutely
> pointless.
>
> Furthermore, it's racy; had been even more so before the delayed fput series
> went in, but even now it's not safe. fcheck_files() does *NOT* guarantee
> that file is not getting closed right now. rcu_read_lock() prevents only
> freeing and potential reuse of struct file we'd got; it might go all the
> way through final fput() just as we look at it. So file->f_path is not
> protected by anything. Worse yet, neither is struct socket itself - we
> might be going through sock_release() at the same time, so sock->sk might
> very well be NULL, leaving us a oops even after we dump d_path() idiocy.
>
> To make it even funnier, there's such thing as SCM_RIGHTS datagrams and
> descriptor passing. In other words, it's *not* going to catch all sockets
> that would be caught by the earlier variant.
>

OK clearly I screwed it up thanks for reviewing Al. How about this.

fdt = files_fdtable(files);
for (fd = 0; fd < fdt->max_fds; fd++) {
struct socket *sock;
int err = 0;

sock = sockfd_lookup(fd, &err);
if (!sock) {
lock_sock(sock->sk);
sock_update_netprioidx(sock->sk, p);
release_sock(sock->sk);
sockfd_put(sock);
}
}

sockfd_lookup will call fget() and also test file->f_op. testing this
now.

2012-08-13 06:24:11

by John Fastabend

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

> OK clearly I screwed it up thanks for reviewing Al. How about this.
>
> fdt = files_fdtable(files);
> for (fd = 0; fd < fdt->max_fds; fd++) {
> struct socket *sock;
> int err = 0;
>
> sock = sockfd_lookup(fd, &err);
> if (!sock) {
typo ^^^^^^
if (sock) {

to be honest I can't see why I didn't use sockfd_lookup in the first
place...

2012-08-13 09:08:33

by Joe Perches

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

On Sun, 2012-08-12 at 22:55 -0700, John Fastabend wrote:
> On 8/12/2012 6:53 PM, Al Viro wrote:
> > Ladies and gentlemen, who the devil had reviewed that little gem?
> >
> > commit 406a3c638ce8b17d9704052c07955490f732c2b8
> > Author: John Fastabend <[email protected]>
> > Date: Fri Jul 20 10:39:25 2012 +0000
[]
> OK clearly I screwed it up thanks for reviewing Al. How about this.
>
> fdt = files_fdtable(files);
> for (fd = 0; fd < fdt->max_fds; fd++) {
> struct socket *sock;
> int err = 0;

Don't need to initialize err if you're not using it.

> sock = sockfd_lookup(fd, &err);
> if (!sock) {

Of course you mean
if (sock)

> lock_sock(sock->sk);
> sock_update_netprioidx(sock->sk, p);
> release_sock(sock->sk);
> sockfd_put(sock);
> }
> }

2012-08-13 11:23:08

by Neil Horman

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

On Sun, Aug 12, 2012 at 10:55:17PM -0700, John Fastabend wrote:
> On 8/12/2012 6:53 PM, Al Viro wrote:
> > Ladies and gentlemen, who the devil had reviewed that little gem?
> >
> >commit 406a3c638ce8b17d9704052c07955490f732c2b8
> >Author: John Fastabend <[email protected]>
> >Date: Fri Jul 20 10:39:25 2012 +0000
> >
> >is a bleeding bogosity that doesn't pass even the most cursory
> >inspection. It iterates through descriptor tables of a bunch
> >of processes, doing this:
> > file = fcheck_files(files, fd);
> > if (!file)
> > continue;
> >
> > path = d_path(&file->f_path, tmp, PAGE_SIZE);
> > rv = sscanf(path, "socket:[%lu]", &s);
> > if (rv <= 0)
> > continue;
> >
> > sock = sock_from_file(file, &err);
> > if (!err)
> > sock_update_netprioidx(sock->sk, p);
> >Note the charming use of sscanf() for pattern-matching. 's' (inode
> >number of socket) is completely unused afterwards; what happens here
> >is a very badly written attempt to skip non-sockets. Why, will
> >sock_from_file() blow up on non-sockets? And isn't there some less
> >obnoxious way to check that the file is a sockfs one? Let's see:
> >struct socket *sock_from_file(struct file *file, int *err)
> >{
> > if (file->f_op == &socket_file_ops)
> > return file->private_data; /* set in sock_map_fd */
> >
> > *err = -ENOTSOCK;
> > return NULL;
> >}
> >... and the first line is exactly that - a check that we are on sockfs.
> >_Far_ less expensive one, at that, so it's not even that we are avoiding
> >a costly test. In other words, all masturbation with d_path() is absolutely
> >pointless.
> >
> >Furthermore, it's racy; had been even more so before the delayed fput series
> >went in, but even now it's not safe. fcheck_files() does *NOT* guarantee
> >that file is not getting closed right now. rcu_read_lock() prevents only
> >freeing and potential reuse of struct file we'd got; it might go all the
> >way through final fput() just as we look at it. So file->f_path is not
> >protected by anything. Worse yet, neither is struct socket itself - we
> >might be going through sock_release() at the same time, so sock->sk might
> >very well be NULL, leaving us a oops even after we dump d_path() idiocy.
> >
> >To make it even funnier, there's such thing as SCM_RIGHTS datagrams and
> >descriptor passing. In other words, it's *not* going to catch all sockets
> >that would be caught by the earlier variant.
> >
>
Yes, thank you Al, I should have reviewed this more throughly.

> OK clearly I screwed it up thanks for reviewing Al. How about this.
>
> fdt = files_fdtable(files);
> for (fd = 0; fd < fdt->max_fds; fd++) {
> struct socket *sock;
> int err = 0;
>
> sock = sockfd_lookup(fd, &err);
> if (!sock) {
> lock_sock(sock->sk);
What are you protecting with lock_sock here? The other call sites don't make
use of the socket lock. Arguagbly they could, but I don't think they need to.
As long as the fd doesn't go away we should be able to update the
sk_cgrp_prioidx safely.

Regards
Neil

2012-08-13 12:18:33

by Al Viro

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

On Sun, Aug 12, 2012 at 11:23:59PM -0700, John Fastabend wrote:
> >OK clearly I screwed it up thanks for reviewing Al. How about this.
> >
> > fdt = files_fdtable(files);
> > for (fd = 0; fd < fdt->max_fds; fd++) {
> > struct socket *sock;
> > int err = 0;
> >
> > sock = sockfd_lookup(fd, &err);
> > if (!sock) {
> typo ^^^^^^
> if (sock) {
>
> to be honest I can't see why I didn't use sockfd_lookup in the first
> place...

Look at sockfd_lookup() arguments and ask yourself - "how could it possibly
guess which descriptor table I want it to look into?" If you want to
kinda-sorta deal with the close() races here, the original loop would
be salvagable by replacing rcu_read_lock() with spin_lock(&files->file_lock);

HOWEVER, it still doesn't address more fundamental problem - somebody
creating a socket and passing it to you in SCM_RIGHTS datagram will
leave you with a socket you can do IO on, still tagged according to who
had created it.

AFAICS, the whole point of that exercise was to allow third-party changing
the priorities of traffic on sockets already created by a process we now
move to a different cgroup. Consider e.g. this:
process A spawns a dozen of children. All children are reading
from the same AF_UNIX socket. Parent listens for requests of some kind
(e.g. it's an httpd, etc.). Once request arrives, it hands it off to
a child, by stuffing some information *and* established connection to
client into SCM_RIGHTS datagram, sends it to shared AF_UNIX socket, closes
the descriptor it got from accept(2) (one it has passed to worker in
SCM_RIGHTS) and moves on. The next child to become free will recvmsg()
on that socket. That will get the thing passed by parent installed into
its descriptor table; resulting descriptor will be found in ancillary
data (see unix(7) and cmsg(3)) and the child goes on to handle request,
using that descriptor to talk to client.
I'm not saying that it's a particulary smart way to implement
worker pools, but it's legitimate and your whole point was to avoid
the need to modify userland code, wasn't it? Now think what'll happen
to that model if you take the whole busily working bunch and move it
to a different cgroup. Nevermind the races, assume that everyone gets
moved very quickly. Both the parent and all children got moved and
their descriptor tables had been walked through by your code. All sockets
present got relabeled... Which does nothing to opened sockets sitting
in the datagrams currently in AF_UNIX socket's queue. They are already
not present in descriptor table of the parent. And they are yet to be
picked by the children...

2012-08-13 16:58:15

by John Fastabend

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

[...]

> HOWEVER, it still doesn't address more fundamental problem - somebody
> creating a socket and passing it to you in SCM_RIGHTS datagram will
> leave you with a socket you can do IO on, still tagged according to who
> had created it.
>
> AFAICS, the whole point of that exercise was to allow third-party changing
> the priorities of traffic on sockets already created by a process we now
> move to a different cgroup. Consider e.g. this:

Correct that is the point of the exercise.

To fix this specific case we could add a call to sock_update_netprioidx
in scm_recv to set the sk_cgrp_prioidx value.

2012-08-13 17:01:15

by Al Viro

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

On Mon, Aug 13, 2012 at 09:58:12AM -0700, John Fastabend wrote:
> [...]
>
> >HOWEVER, it still doesn't address more fundamental problem - somebody
> >creating a socket and passing it to you in SCM_RIGHTS datagram will
> >leave you with a socket you can do IO on, still tagged according to who
> >had created it.
> >
> >AFAICS, the whole point of that exercise was to allow third-party changing
> >the priorities of traffic on sockets already created by a process we now
> >move to a different cgroup. Consider e.g. this:
>
> Correct that is the point of the exercise.
>
> To fix this specific case we could add a call to sock_update_netprioidx
> in scm_recv to set the sk_cgrp_prioidx value.

On every received descriptor, that is? Eeek...

2012-08-13 17:31:31

by John Fastabend

[permalink] [raw]
Subject: Re: [flame^Wreview] net: netprio_cgroup: rework update socket logic

On 8/13/2012 10:01 AM, Al Viro wrote:
> On Mon, Aug 13, 2012 at 09:58:12AM -0700, John Fastabend wrote:
>> [...]
>>
>>> HOWEVER, it still doesn't address more fundamental problem - somebody
>>> creating a socket and passing it to you in SCM_RIGHTS datagram will
>>> leave you with a socket you can do IO on, still tagged according to who
>>> had created it.
>>>
>>> AFAICS, the whole point of that exercise was to allow third-party changing
>>> the priorities of traffic on sockets already created by a process we now
>>> move to a different cgroup. Consider e.g. this:
>>
>> Correct that is the point of the exercise.
>>
>> To fix this specific case we could add a call to sock_update_netprioidx
>> in scm_recv to set the sk_cgrp_prioidx value.
>
> On every received descriptor, that is? Eeek...
>

We are already iterating through the files in scm_detach_fds called from
scm_recv(). This would be an extra (file->f_op == &socket_file_ops)
check here and then the sock update.