LinuxLists.cc - [RFC] revoke(2) and generic handling of things like remove_proc

2013-04-05 04:29:36

Subject: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

After some digging in procfs logics for revoking further file IO after
remove_proc_entry() (and figuring out what to do for debugfs - it also
needs something similar), I think I've got something that has potential
to become a working revoke(2) with very low overhead. It will require
some preparations and there are several interesting questions regarding
the semantics, but it looks like there's a decent chance of making it
work:
* procfs and debugfs - definitely can use it
* sysfs - probably can switch to that from its home-grown mechanism
* tty hangup handling - might be possible to switch to that
mechanism.
* ALSA analog of hangup - ditto.
* revoke(2) - would be nice; might be doable, depending on the
degree of generality we want.

Below is an outline of implementation; discussion of the fun spots of
semantics is in the very end.

Two new objects are introduced -

struct revokable { // freeing of whatever contains it is RCU'd
atomic_t in_use; // number of threads in methods,
// negative => going away
spinlock_t lock;
hlist_head list; // protected by ->lock, goes through
// struct revoke->node.
struct completion *c;
void (*kick)(struct revokable *); // discussion in the end; NULL
// for procfs and friends.
};

struct revoke { // per-file, lives until __fput()
struct file *file; // never changes
struct revokable *revokable; // RCU protected
struct hlist_node list; // protected by ->revokable->lock
bool closing; // ditto
struct completion *c; // ditto
};

struct file gets a new field - struct revoke *f_revoke (next to ->f_op, never
changes after open, usually NULL).

New helpers:
int make_revokable(file, revokable) - to be used during ->open();
it'll allocate file->f_revoke and associate it with revokable. Should be
the last thing done by ->open() (see below for discussion - that's
one of the most inconvenient areas of this API)
inline bool start_using(file) - called before file method calls
except for ->release(); if it returns false, don't make a call, the
file has been revoked. Attempt to revoke will make sure no new
users can appear (i.e. that start_using() will fail from now on) and
wait until all of them are done. What to do on start_using() failure
is up to the caller - depends on the method (e.g. POLLERR is probably
the right one for ->poll(), -EIO - for ->write(), 0 might be right for
->read(), etc.)
inline void stop_using(file) - called after the method call.
void release_revoke(revoke) - __fput() calls that instead of calling
->release() if file->f_revoke is non-NULL.
void revoke_it(revokable) - revoke all files associated with this
revokable; ->release() will be called on each of those files and no method
calls will be issued after it returns. No references to revokable will
outlive the grace period.

static inline bool start_using(struct file *f)
{
struct revoke *revoke = f->f_revoke;
if (likely(!revoke))
return true; /* non-revokable file */
return __start_using(revoke);
}

static inline void stop_using(struct file *f)
{
struct revoke *revoke = f->f_revoke;
if (unlikely(revoke))
__stop_using(revoke);
}
}

bool __start_using(struct revoke *revoke)
{
struct revokable *r;
rcu_read_lock();
r = rcu_dereference(revoke->revokable);
if (unlikely(!r)) {
rcu_read_unlock();
return false; /* revoked */
}
if (likely(atomic_inc_unless_negative(&r->in_use))) {
rcu_read_unlock();
return true; /* we are using it now */
}
rcu_read_unlock();
return false; /* it's being revoked right now */
}

#define BIAS <large negative constant>

void __stop_using(struct revoke *revoke)
{
struct revokable *r;
r = rcu_dereference_protected(revoke->revokable, 1);
BUG_ON(!r);
if (atomic_dec_return(&r->in_use) == BIAS)
complete(r->c);
}

/* called with r->lock held by caller, unlocks it */
static void __release_revoke(struct revokable *r, struct revoke *revoke)
{
if (revoke->closing) {
DECLARE_COMPLETION_ONSTACK(c);
revoke->c = &c;
spin_unlock(&r->lock);
wait_for_completion(&c);
} else {
struct file *file;
revoke->closing = 1;
spin_unlock(&r->lock);
file = revoke->file;
if (file->f_op->release)
file->f_op->release(file_inode(file), file);
spin_lock(&r->lock);
hlist_del_init(&revoke->list);
rcu_assign_pointer(revoke->revokable, NULL);
rcu_read_lock(); /* prevent freeing of r */
if (revoke->c)
complete(revoke->c);
spin_unlock(&r->lock);
rcu_read_unlock();
}
}

void release_revoke(struct revoke *revoke)
{
struct revokable *r;
rcu_read_lock();
r = rcu_dereference(revoke->revokable);
if (!r) {
/* already closed by revokation */
rcu_read_unlock();
return;
}
spin_lock(&r->lock);
if (unlikely(hlist_unhashed(&revoke->node))) {
/* just been revoked */
spin_unlock(&r->lock);
rcu_read_unlock();
return;
}
/*
* OK, revoke_it() couldn't have been finished yet
* it'll have to get r->lock before it's through, so
* we can drop rcu_read_lock
*/
rcu_read_unlock();
__release_revoke(r, revoke);
kfree(revoke);
}

void revoke_it(struct revokable *r)
{
DECLARE_COMPLETION_ONSTACK(c);
r->c = &c;
if (atomic_add_return(BIAS, &r->in_use) != BIAS) {
if (r->kick)
r->kick(r);
wait_for_completion(c);
}
while (1) {
struct hlist_node *p;
spin_lock(&r->lock);
p = r->list.first;
if (!p)
break;
__release_revoke(r, hlist_entry(p, struct revoke, list));
}
spin_unlock(&r->lock);
}

int make_revokable(struct file *f, struct revokable *r)
{
struct revoke *revoke = kzalloc(sizeof(struct revoke), GFP_KERNEL);
if (!revoke)
return -ENOMEM;
if (!atomic_inc_unless_negative(&r->in_use)) {
kfree(revoke);
return -ENOENT;
}
revoke->file = f;
f->f_revoke = revoke;
spin_lock(&r->lock);
hlist_add_head(&revoke->list, &r->list);
spin_unlock(&r->lock);
__stop_using(revoke);
return 0;
}

procfs would have struct revokable embedded into proc_dir_entry, with
freeing of those guys RCUd. It would set ->f_op to ->proc_fops and
call make_revokable(file, &pde->revokable) in proc_reg_open(); no wrappers
for other methods needed anymore. All file_operations instances fed to
proc_create() et.al. would lose ->owner - it's already not needed for those,
actually. remove_proc_entry()/remove_proc_subtree() would call revoke_it()
on everything we are removing.

Ugly spots and questions re semantics:

1) ->open() instance calling make_revokable() would bloody better touch
nothing after make_revokable() succeeds, since at that point struct file
is visible to revoke_it() and ->release() may be called by it.

2) thou shalt not call revoke_it() from any methods of any file affected
by it. Deadlock for obvious reasons. Do it asynchronously if you really
need it.

3) that ->kick() thing: for something like procfs it's a non-issue, but
for anything like hangup/snd_card_disconnect/sys_revoke we'll need to
supply that method. It should terminate any indefinite waits for IO of
the stuff sitting in ->read()/->write()/etc.

4) nasty semantics issue - mmap() vs. revoke (of any sort, including
remove_proc_entry(), etc.). Suppose a revokable file had been mmapped;
now it's going away. What should we do to its VMAs? Right now sysfs
and procfs get away with that, but only because there's only one thing
that has ->mmap() there - /proc/bus/pci and sysfs equivalents. I've
no idea how does pci_mmap_page_range() interact with PCI hotplug (and
I'm not at all sure that whatever it does isn't racy wrt device removal),
but I suspect that it strongly depends on lack of ->fault() for those
VMAs, which makes killing all PTEs pointing to pages in question enough.
How generic do we want to make it? Anybody wanting to add more files
that could be mmapped in procfs/sysfs/debugfs deserves to be hurt, but
if we start playing with revoke(2), restriction might become inconvenient.
I'm not sure what kind of behaviour do we want there - *BSD at least
used to have revoke(2) only for character devices that had no mmap()...

5) another fun semantics question - what should revoke(2) do to open(2)
happening before it completes? In terms of implementation above, should
we put a new struct revokable in place before we'd finished revoke_it()
on the old one?

6) how do we get from revoke(2) to call of revoke_it() on the right object?
Note that revoke(2) is done by pathname; we might want an ...at() variant,
but all we'll have to play with will be inode, not an opened file. We probably
ought to put it into file_operations and use ->i_fop->revoke() to get there
(with cdev_revoke() looking for relevant struct cdev, etc.). Not sure; we'll
probably need to play with different variants once the generic part of
infrastructure is there.

7) we need to reduce the number of places where we do ->f_op->something();
it's not too horrible, but it's definitely more than I like. ->read() and
->write() are the main offenders in that respect and many of those are in
really ugly parts of tree...

8) if we convert vhangup(2) to that mechanism, serial consoles are going to
be a thing to watch out for. Extra ->release() and...

2013-04-05 19:56:15

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

On Fri, Apr 05, 2013 at 05:29:32AM +0100, Al Viro wrote:
> After some digging in procfs logics for revoking further file IO after
> remove_proc_entry() (and figuring out what to do for debugfs - it also
> needs something similar), I think I've got something that has potential
> to become a working revoke(2) with very low overhead. It will require
> some preparations and there are several interesting questions regarding
> the semantics, but it looks like there's a decent chance of making it
> work:
> * procfs and debugfs - definitely can use it
> * sysfs - probably can switch to that from its home-grown mechanism

I agree.

> * tty hangup handling - might be possible to switch to that
> mechanism.

That would be very good. Especially as isn't this "traditionally" the
only thing that revoke(2) would work on?

> * ALSA analog of hangup - ditto.
> * revoke(2) - would be nice; might be doable, depending on the
> degree of generality we want.

All character devices would be great to have from my end, as well as the
sysfs/procfs/debugfs/libfs items above.

> Below is an outline of implementation; discussion of the fun spots of
> semantics is in the very end.
>
> Two new objects are introduced -
>
> struct revokable { // freeing of whatever contains it is RCU'd
> atomic_t in_use; // number of threads in methods,
> // negative => going away

Which methods do you mean here?

> spinlock_t lock;
> hlist_head list; // protected by ->lock, goes through
> // struct revoke->node.
> struct completion *c;
> void (*kick)(struct revokable *); // discussion in the end; NULL
> // for procfs and friends.
> };
>
> struct revoke { // per-file, lives until __fput()
> struct file *file; // never changes
> struct revokable *revokable; // RCU protected
> struct hlist_node list; // protected by ->revokable->lock
> bool closing; // ditto
> struct completion *c; // ditto
> };
>
> struct file gets a new field - struct revoke *f_revoke (next to ->f_op, never
> changes after open, usually NULL).

This looks reasonable.

> New helpers:
> int make_revokable(file, revokable) - to be used during ->open();
> it'll allocate file->f_revoke and associate it with revokable. Should be
> the last thing done by ->open() (see below for discussion - that's
> one of the most inconvenient areas of this API)
> inline bool start_using(file) - called before file method calls
> except for ->release(); if it returns false, don't make a call, the
> file has been revoked. Attempt to revoke will make sure no new
> users can appear (i.e. that start_using() will fail from now on) and
> wait until all of them are done. What to do on start_using() failure
> is up to the caller - depends on the method (e.g. POLLERR is probably
> the right one for ->poll(), -EIO - for ->write(), 0 might be right for
> ->read(), etc.)

The vfs core would call start_using(), or would filesystems / drivers
need to do this?

> inline void stop_using(file) - called after the method call.

Which method? The one that called start_using() first? Makes sense.

> void release_revoke(revoke) - __fput() calls that instead of calling
> ->release() if file->f_revoke is non-NULL.
> void revoke_it(revokable) - revoke all files associated with this
> revokable; ->release() will be called on each of those files and no method
> calls will be issued after it returns. No references to revokable will
> outlive the grace period.
>
> static inline bool start_using(struct file *f)
> {
> struct revoke *revoke = f->f_revoke;
> if (likely(!revoke))
> return true; /* non-revokable file */
> return __start_using(revoke);
> }
>
> static inline void stop_using(struct file *f)
> {
> struct revoke *revoke = f->f_revoke;
> if (unlikely(revoke))
> __stop_using(revoke);
> }
> }
>
> bool __start_using(struct revoke *revoke)
> {
> struct revokable *r;
> rcu_read_lock();
> r = rcu_dereference(revoke->revokable);
> if (unlikely(!r)) {
> rcu_read_unlock();
> return false; /* revoked */
> }
> if (likely(atomic_inc_unless_negative(&r->in_use))) {
> rcu_read_unlock();
> return true; /* we are using it now */
> }
> rcu_read_unlock();
> return false; /* it's being revoked right now */
> }
>
> #define BIAS <large negative constant>
>
> void __stop_using(struct revoke *revoke)
> {
> struct revokable *r;
> r = rcu_dereference_protected(revoke->revokable, 1);
> BUG_ON(!r);
> if (atomic_dec_return(&r->in_use) == BIAS)
> complete(r->c);
> }
>
> /* called with r->lock held by caller, unlocks it */
> static void __release_revoke(struct revokable *r, struct revoke *revoke)
> {
> if (revoke->closing) {
> DECLARE_COMPLETION_ONSTACK(c);
> revoke->c = &c;
> spin_unlock(&r->lock);
> wait_for_completion(&c);
> } else {
> struct file *file;
> revoke->closing = 1;
> spin_unlock(&r->lock);
> file = revoke->file;
> if (file->f_op->release)
> file->f_op->release(file_inode(file), file);
> spin_lock(&r->lock);
> hlist_del_init(&revoke->list);
> rcu_assign_pointer(revoke->revokable, NULL);
> rcu_read_lock(); /* prevent freeing of r */
> if (revoke->c)
> complete(revoke->c);
> spin_unlock(&r->lock);
> rcu_read_unlock();
> }
> }
>
> void release_revoke(struct revoke *revoke)
> {
> struct revokable *r;
> rcu_read_lock();
> r = rcu_dereference(revoke->revokable);
> if (!r) {
> /* already closed by revokation */
> rcu_read_unlock();
> return;
> }
> spin_lock(&r->lock);
> if (unlikely(hlist_unhashed(&revoke->node))) {
> /* just been revoked */
> spin_unlock(&r->lock);
> rcu_read_unlock();
> return;
> }
> /*
> * OK, revoke_it() couldn't have been finished yet
> * it'll have to get r->lock before it's through, so
> * we can drop rcu_read_lock
> */
> rcu_read_unlock();
> __release_revoke(r, revoke);
> kfree(revoke);
> }
>
> void revoke_it(struct revokable *r)
> {
> DECLARE_COMPLETION_ONSTACK(c);
> r->c = &c;
> if (atomic_add_return(BIAS, &r->in_use) != BIAS) {
> if (r->kick)
> r->kick(r);
> wait_for_completion(c);
> }
> while (1) {
> struct hlist_node *p;
> spin_lock(&r->lock);
> p = r->list.first;
> if (!p)
> break;
> __release_revoke(r, hlist_entry(p, struct revoke, list));
> }
> spin_unlock(&r->lock);
> }
>
> int make_revokable(struct file *f, struct revokable *r)
> {
> struct revoke *revoke = kzalloc(sizeof(struct revoke), GFP_KERNEL);
> if (!revoke)
> return -ENOMEM;
> if (!atomic_inc_unless_negative(&r->in_use)) {
> kfree(revoke);
> return -ENOENT;
> }
> revoke->file = f;
> f->f_revoke = revoke;
> spin_lock(&r->lock);
> hlist_add_head(&revoke->list, &r->list);
> spin_unlock(&r->lock);
> __stop_using(revoke);
> return 0;
> }
>
> procfs would have struct revokable embedded into proc_dir_entry, with
> freeing of those guys RCUd. It would set ->f_op to ->proc_fops and
> call make_revokable(file, &pde->revokable) in proc_reg_open(); no wrappers
> for other methods needed anymore. All file_operations instances fed to
> proc_create() et.al. would lose ->owner - it's already not needed for those,
> actually. remove_proc_entry()/remove_proc_subtree() would call revoke_it()
> on everything we are removing.

debugfs could do the same, that sounds sane.

> Ugly spots and questions re semantics:
>
> 1) ->open() instance calling make_revokable() would bloody better touch
> nothing after make_revokable() succeeds, since at that point struct file
> is visible to revoke_it() and ->release() may be called by it.

Reasonable constraint.

> 2) thou shalt not call revoke_it() from any methods of any file affected
> by it. Deadlock for obvious reasons. Do it asynchronously if you really
> need it.

Reasonable.

> 3) that ->kick() thing: for something like procfs it's a non-issue, but
> for anything like hangup/snd_card_disconnect/sys_revoke we'll need to
> supply that method. It should terminate any indefinite waits for IO of
> the stuff sitting in ->read()/->write()/etc.

The tty layer should be able to handle this.

> 4) nasty semantics issue - mmap() vs. revoke (of any sort, including
> remove_proc_entry(), etc.). Suppose a revokable file had been mmapped;
> now it's going away. What should we do to its VMAs? Right now sysfs
> and procfs get away with that, but only because there's only one thing
> that has ->mmap() there - /proc/bus/pci and sysfs equivalents. I've
> no idea how does pci_mmap_page_range() interact with PCI hotplug (and
> I'm not at all sure that whatever it does isn't racy wrt device removal),

The page range should just start returning 0xff all over the place, the
BIOS should have kept the mapping around, as it can't really assign it
anywhere else, so all _should_ be fine here.

> but I suspect that it strongly depends on lack of ->fault() for those
> VMAs, which makes killing all PTEs pointing to pages in question enough.
> How generic do we want to make it? Anybody wanting to add more files
> that could be mmapped in procfs/sysfs/debugfs deserves to be hurt, but
> if we start playing with revoke(2), restriction might become inconvenient.
> I'm not sure what kind of behaviour do we want there - *BSD at least
> used to have revoke(2) only for character devices that had no mmap()...

I think that's a reasonable constraint, although tearing down the VMAs
might be possible if we just invalidate the file handle "forcefully"
(i.e. manually tear them down and then further accesses should through a
SIGSEV fail, or am I missing something more basic here?)

> 5) another fun semantics question - what should revoke(2) do to open(2)
> happening before it completes? In terms of implementation above, should
> we put a new struct revokable in place before we'd finished revoke_it()
> on the old one?

If we can do so race-free, yes, that makes sense.

> 6) how do we get from revoke(2) to call of revoke_it() on the right object?
> Note that revoke(2) is done by pathname; we might want an ...at() variant,
> but all we'll have to play with will be inode, not an opened file.

Can we make revoke(2) require a valid file handle? Is there a POSIX
spec for revoke(2) that we have to follow here, or given that we haven't
had one yet, are we free to define whatever we want without people
getting that upset?

> We probably ought to put it into file_operations and use
> ->i_fop->revoke() to get there (with cdev_revoke() looking for
> relevant struct cdev, etc.). Not sure; we'll probably need to play
> with different variants once the generic part of infrastructure is
> there.

Yeah, that's going to be tricky, but for cdevs we should be ok, I hope.

> 7) we need to reduce the number of places where we do ->f_op->something();
> it's not too horrible, but it's definitely more than I like. ->read() and
> ->write() are the main offenders in that respect and many of those are in
> really ugly parts of tree...
>
> 8) if we convert vhangup(2) to that mechanism, serial consoles are going to
> be a thing to watch out for. Extra ->release() and...

I don't know if we can convert vhangup to revoke, but internally, it
might be close to the same thing, that will take some testing.

Anyway, I'm all for this, I've wanted a revoke() for a long time, and
have tried half-heartedly a number of times in the past, failing
everytime. As this looks to solve a number of other issues we are
currently having elsewhere in some filesystems, that makes it even
better.

thanks,

greg k-h

2013-04-05 20:51:42

by Al Viro

[permalink] [raw]

Subject: Re: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

On Fri, Apr 05, 2013 at 12:56:09PM -0700, Greg Kroah-Hartman wrote:

> Which methods do you mean here?

file->f_op->some_method()

> The vfs core would call start_using(), or would filesystems / drivers
> need to do this?

The former; we have relatively few places that call file_operations
members directly and we'd turn each of those into
if (likely(start_using(file)) {
res = file->f_op->foo(....);
stop_using(file);
} else {
res = error_value_appropriate_for_foo;
}

> > 4) nasty semantics issue - mmap() vs. revoke (of any sort, including
> > remove_proc_entry(), etc.). Suppose a revokable file had been mmapped;
> > now it's going away. What should we do to its VMAs? Right now sysfs
> > and procfs get away with that, but only because there's only one thing
> > that has ->mmap() there - /proc/bus/pci and sysfs equivalents. I've
> > no idea how does pci_mmap_page_range() interact with PCI hotplug (and
> > I'm not at all sure that whatever it does isn't racy wrt device removal),
>
> The page range should just start returning 0xff all over the place, the
> BIOS should have kept the mapping around, as it can't really assign it
> anywhere else, so all _should_ be fine here.

Umm... 0xff or SIGSEGV?

> I think that's a reasonable constraint, although tearing down the VMAs
> might be possible if we just invalidate the file handle "forcefully"
> (i.e. manually tear them down and then further accesses should through a
> SIGSEV fail, or am I missing something more basic here?)

The question is how to do that in a reasonably clean way; we would've done
as part of ->kick(), I suppose, or right next to it.

> > 6) how do we get from revoke(2) to call of revoke_it() on the right object?
> > Note that revoke(2) is done by pathname; we might want an ...at() variant,
> > but all we'll have to play with will be inode, not an opened file.
>
> Can we make revoke(2) require a valid file handle? Is there a POSIX
> spec for revoke(2) that we have to follow here, or given that we haven't
> had one yet, are we free to define whatever we want without people
> getting that upset?

BSD one takes a pathname and so do all derived ones...

2013-04-05 22:46:20

by Greg Kroah-Hartman

[permalink] [raw]

Subject: Re: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

On Fri, Apr 05, 2013 at 09:51:37PM +0100, Al Viro wrote:
> On Fri, Apr 05, 2013 at 12:56:09PM -0700, Greg Kroah-Hartman wrote:
> > > 4) nasty semantics issue - mmap() vs. revoke (of any sort, including
> > > remove_proc_entry(), etc.). Suppose a revokable file had been mmapped;
> > > now it's going away. What should we do to its VMAs? Right now sysfs
> > > and procfs get away with that, but only because there's only one thing
> > > that has ->mmap() there - /proc/bus/pci and sysfs equivalents. I've
> > > no idea how does pci_mmap_page_range() interact with PCI hotplug (and
> > > I'm not at all sure that whatever it does isn't racy wrt device removal),
> >
> > The page range should just start returning 0xff all over the place, the
> > BIOS should have kept the mapping around, as it can't really assign it
> > anywhere else, so all _should_ be fine here.
>
> Umm... 0xff or SIGSEGV?

I think, at first glance, 0xff, as the area is still "mapped" to the
device, and that never gets invaldated from what I can tell, despite the
device now being gone.

> > I think that's a reasonable constraint, although tearing down the VMAs
> > might be possible if we just invalidate the file handle "forcefully"
> > (i.e. manually tear them down and then further accesses should through a
> > SIGSEV fail, or am I missing something more basic here?)
>
> The question is how to do that in a reasonably clean way; we would've done
> as part of ->kick(), I suppose, or right next to it.

I don't really know, sorry.

> > > 6) how do we get from revoke(2) to call of revoke_it() on the right object?
> > > Note that revoke(2) is done by pathname; we might want an ...at() variant,
> > > but all we'll have to play with will be inode, not an opened file.
> >
> > Can we make revoke(2) require a valid file handle? Is there a POSIX
> > spec for revoke(2) that we have to follow here, or given that we haven't
> > had one yet, are we free to define whatever we want without people
> > getting that upset?
>
> BSD one takes a pathname and so do all derived ones...

Ugh, ok, they were there first, fair enough.

Hm, how do they solve this type of race condition? Last time I looked
(middle of last year) at one of the revoke BSD implementations, I don't
recall anything special to try to prevent this. Is it that they just
don't care as almost no one uses it, and it's only for tty devices? Or
did I miss something?

thanks,

greg k-h

2013-04-06 03:01:17

by Hannes Frederic Sowa

[permalink] [raw]

Subject: Re: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

2013-04-06 05:00:54

by Al Viro

[permalink] [raw]

Subject: Re: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

On Fri, Apr 05, 2013 at 05:29:32AM +0100, Al Viro wrote:
> 4) nasty semantics issue - mmap() vs. revoke (of any sort, including
> remove_proc_entry(), etc.). Suppose a revokable file had been mmapped;
> now it's going away. What should we do to its VMAs? Right now sysfs
> and procfs get away with that, but only because there's only one thing
> that has ->mmap() there - /proc/bus/pci and sysfs equivalents. I've
> no idea how does pci_mmap_page_range() interact with PCI hotplug (and
> I'm not at all sure that whatever it does isn't racy wrt device removal),
> but I suspect that it strongly depends on lack of ->fault() for those
> VMAs, which makes killing all PTEs pointing to pages in question enough.
> How generic do we want to make it? Anybody wanting to add more files
> that could be mmapped in procfs/sysfs/debugfs deserves to be hurt, but
> if we start playing with revoke(2), restriction might become inconvenient.
> I'm not sure what kind of behaviour do we want there - *BSD at least
> used to have revoke(2) only for character devices that had no mmap()...

Actually, after looking at what sysfs does... We might get away with
the following
* new vma flag - VM_REVOKABLE; set by mmap() if ->f_revoke is
non-NULL. We are short on spare bits there, but there still are some...
* start_using_vma(vma) that checks the presence of that flag,
returns true if it's absent and __start_using(vma->vm_file->f_revoke)
otherwise; a matching stop_using_vma(vma) as well.
* surround vma method calls with start_using_vma/stop_using_vma,
similar to file ones. Do what fs/sysfs/bin.c wrappers do for revoked
ones - VM_FAULT_SIGBUS for ->fault() and ->page_mkwrite(), -EINVAL for
->access() and ->set_policy(), vma->vm_policy for ->get_policy(),
0 for ->migrate(), "do nothing" for ->open() (and I'm not at all sure that
this one is correct), hell knows what for ->close(). Note that the *only*
instance with ->open and without ->close is sysfs pile of wrappers itself...

Hell knows... We have few enough call sites for ->vm_op->foo() to make
it feasible and overhead would be trivial. OTOH, I'm not sure what's the
right behaviour for mmap of something like drm after revoke(2) - leaving
writable pages there looks wrong...

BTW, snd_card_disconnect() doesn't do anything to existing mappings; smells
like a bug, and there we do have ones with non-trivial ->mmap(). Could
ALSA folks comment?

One note about the mockup implementation upthread - __release_revoke() should
suck in a bit more than just ->release() - turning fasync off should also go
there.

2013-04-11 20:48:43

by Eric W. Biederman

[permalink] [raw]

Subject: Re: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

Al Viro <[email protected]> writes:

> On Fri, Apr 05, 2013 at 05:29:32AM +0100, Al Viro wrote:
>> 4) nasty semantics issue - mmap() vs. revoke (of any sort, including
>> remove_proc_entry(), etc.). Suppose a revokable file had been mmapped;
>> now it's going away. What should we do to its VMAs? Right now sysfs
>> and procfs get away with that, but only because there's only one thing
>> that has ->mmap() there - /proc/bus/pci and sysfs equivalents. I've
>> no idea how does pci_mmap_page_range() interact with PCI hotplug (and
>> I'm not at all sure that whatever it does isn't racy wrt device removal),
>> but I suspect that it strongly depends on lack of ->fault() for those
>> VMAs, which makes killing all PTEs pointing to pages in question enough.
>> How generic do we want to make it? Anybody wanting to add more files
>> that could be mmapped in procfs/sysfs/debugfs deserves to be hurt, but
>> if we start playing with revoke(2), restriction might become inconvenient.
>> I'm not sure what kind of behaviour do we want there - *BSD at least
>> used to have revoke(2) only for character devices that had no mmap()...
>
> Actually, after looking at what sysfs does... We might get away with
> the following
> * new vma flag - VM_REVOKABLE; set by mmap() if ->f_revoke is
> non-NULL. We are short on spare bits there, but there still are some...
> * start_using_vma(vma) that checks the presence of that flag,
> returns true if it's absent and __start_using(vma->vm_file->f_revoke)
> otherwise; a matching stop_using_vma(vma) as well.
> * surround vma method calls with start_using_vma/stop_using_vma,
> similar to file ones. Do what fs/sysfs/bin.c wrappers do for revoked
> ones - VM_FAULT_SIGBUS for ->fault() and ->page_mkwrite(), -EINVAL for
> ->access() and ->set_policy(), vma->vm_policy for ->get_policy(),
> 0 for ->migrate(), "do nothing" for ->open() (and I'm not at all sure that
> this one is correct), hell knows what for ->close(). Note that the *only*
> instance with ->open and without ->close is sysfs pile of wrappers
> itself...

sysfs used to wrap close but nothing used it, and on review it turned to
be impossible to properly call close on all of the vmas on revoke with
all of the proper locks held without help from the mm subsystem. So now
sysfs refuses a mapping if ->close is implemented. In practice that
means having an implementation of ->open in sysfs is likely a waste of
space because open and closed are always paired.

My memory is that the sysfs wrappers work for a pretty narrow range of
cases, and are things only work correctly because memory mapping pci
uses none of the sophisticated facilities.

Last time I was looking at this I was noticing that there is a lock
(mmap_sem?) that is held over every ->vm_op->foo() call. If that is
true today it should be possible to just grab that lock and change
vm_ops. That makes for a very cheap and easy implementation, except for
the covolutions needed for taking the lock.

> Hell knows... We have few enough call sites for ->vm_op->foo() to make
> it feasible and overhead would be trivial. OTOH, I'm not sure what's the
> right behaviour for mmap of something like drm after revoke(2) - leaving
> writable pages there looks wrong...

I'm not certain of the peculiarities of drm. In general the correct
semantics of a revoked memory area is that of what we do with a mmap
area on truncate. Retaining the vma with no pages mapped and generating
SIGBUS on an access.

> BTW, snd_card_disconnect() doesn't do anything to existing mappings; smells
> like a bug, and there we do have ones with non-trivial ->mmap(). Could
> ALSA folks comment?

uio might be worth considering as well. As some uio devices are
actually hotpluggable..

> One note about the mockup implementation upthread - __release_revoke() should
> suck in a bit more than just ->release() - turning fasync off should also go
> there.

If we can do add useful support at the fs and mm layers without
affecting performance I am all for it. I remember that tends to make
things easier. As an alternative let me suggest what I had intended to
do if/when I ever got back to working on revoke.

Make a library like libfs that can be used for files that want to
implement revoke support.

In that library implement what can be implemented reliably and correctly
and error on the sophisticated cases we can't support.

With the semantics and the basic users figured out move what bits we can
into the vfs or the mm subsystem to make things easier.

With a library at the very least we have one implementation that we can
debug and work with instead of a different implementation of revoke for
each different kind of file.

Eric

2013-04-11 23:29:19

by Al Viro

[permalink] [raw]

Subject: Re: [RFC] revoke(2) and generic handling of things like remove_proc_entry()

On Thu, Apr 11, 2013 at 01:48:26PM -0700, Eric W. Biederman wrote:

> Last time I was looking at this I was noticing that there is a lock
> (mmap_sem?) that is held over every ->vm_op->foo() call. If that is
> true today it should be possible to just grab that lock and change
> vm_ops. That makes for a very cheap and easy implementation, except for
> the covolutions needed for taking the lock.

3-rd party down_write(&mm->mmap_sem) is a Bloody Bad Idea(tm). VM locking is
complicated enough as it is and making it cope with such things would make it
even more convoluted.

> If we can do add useful support at the fs and mm layers without
> affecting performance I am all for it. I remember that tends to make
> things easier. As an alternative let me suggest what I had intended to
> do if/when I ever got back to working on revoke.
>
> Make a library like libfs that can be used for files that want to
> implement revoke support.
>
> In that library implement what can be implemented reliably and correctly
> and error on the sophisticated cases we can't support.
>
> With the semantics and the basic users figured out move what bits we can
> into the vfs or the mm subsystem to make things easier.
>
> With a library at the very least we have one implementation that we can
> debug and work with instead of a different implementation of revoke for
> each different kind of file.

Yecchh... revoke() as a syscall or revoke as something that happens when
kernel decides that file has gone away? The latter includes
procfs/debugfs/sysfs at the very least. Do we want to require all of those
to use that library? I would rather try to avoid a need for wrappers, TBH...

You have a very good point re ->close() - the locking conditions for it are
such that making revoke do it is extremely inconvenient. IMO it means that
mmap should check for attempts to set ->vm_op on vma with non-NULL
->vm_file->f_revoke and fail if it runs into such.