2011-06-19 23:52:17

by Al Viro

[permalink] [raw]
Subject: [RFC] get_write_access()/deny_write_access() without inode->i_lock

I'm seriously tempted to throw away i_lock uses in
{get,deny}_write_access(), as in the patch below. The question is, how
badly will it suck on various architectures? I'd expect it to be not
worse than the current version, but...
BTW, I wonder if we need barriers in {put,allow}_write_access (in
either version).

Related question: would it make sense to turn that into
atomic_inc_unless_negative/atomic_dec_unless_positive? I don't
remember any code doing that kind of stuff - no idea if there are
any potential users for that.

diff --git a/fs/namei.c b/fs/namei.c
index 26bef77..7dffe2e 100644
--- a/fs/namei.c
+++ b/fs/namei.c
@@ -341,52 +341,6 @@ ok:
return security_inode_exec_permission(inode, flags);
}

-/*
- * get_write_access() gets write permission for a file.
- * put_write_access() releases this write permission.
- * This is used for regular files.
- * We cannot support write (and maybe mmap read-write shared) accesses and
- * MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
- * can have the following values:
- * 0: no writers, no VM_DENYWRITE mappings
- * < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
- * > 0: (i_writecount) users are writing to the file.
- *
- * Normally we operate on that counter with atomic_{inc,dec} and it's safe
- * except for the cases where we don't hold i_writecount yet. Then we need to
- * use {get,deny}_write_access() - these functions check the sign and refuse
- * to do the change if sign is wrong. Exclusion between them is provided by
- * the inode->i_lock spinlock.
- */
-
-int get_write_access(struct inode * inode)
-{
- spin_lock(&inode->i_lock);
- if (atomic_read(&inode->i_writecount) < 0) {
- spin_unlock(&inode->i_lock);
- return -ETXTBSY;
- }
- atomic_inc(&inode->i_writecount);
- spin_unlock(&inode->i_lock);
-
- return 0;
-}
-
-int deny_write_access(struct file * file)
-{
- struct inode *inode = file->f_path.dentry->d_inode;
-
- spin_lock(&inode->i_lock);
- if (atomic_read(&inode->i_writecount) > 0) {
- spin_unlock(&inode->i_lock);
- return -ETXTBSY;
- }
- atomic_dec(&inode->i_writecount);
- spin_unlock(&inode->i_lock);
-
- return 0;
-}
-
/**
* path_get - get a reference to a path
* @path: path to get the reference to
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 7302e44..ab89aa3 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -2194,8 +2194,43 @@ static inline bool execute_ok(struct inode *inode)
return (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode);
}

-extern int get_write_access(struct inode *);
-extern int deny_write_access(struct file *);
+/*
+ * get_write_access() gets write permission for a file.
+ * put_write_access() releases this write permission.
+ * This is used for regular files.
+ * We cannot support write (and maybe mmap read-write shared) accesses and
+ * MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
+ * can have the following values:
+ * 0: no writers, no VM_DENYWRITE mappings
+ * < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
+ * > 0: (i_writecount) users are writing to the file.
+ *
+ * Normally we operate on that counter with atomic_{inc,dec} and it's safe
+ * except for the cases where we don't hold i_writecount yet. Then we need to
+ * use {get,deny}_write_access() - these functions check the sign and refuse
+ * to do the change if sign is wrong.
+ */
+static inline int get_write_access(struct inode *inode)
+{
+ int v, v1;
+ for (v = atomic_read(&inode->i_writecount); v >= 0; v = v1) {
+ v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
+ if (likely(v1 == v))
+ return 0;
+ }
+ return -ETXTBSY;
+}
+static inline int deny_write_access(struct file *file)
+{
+ struct inode *inode = file->f_path.dentry->d_inode;
+ int v, v1;
+ for (v = atomic_read(&inode->i_writecount); v <= 0; v = v1) {
+ v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
+ if (likely(v1 == v))
+ return 0;
+ }
+ return -ETXTBSY;
+}
static inline void put_write_access(struct inode * inode)
{
atomic_dec(&inode->i_writecount);


2011-06-20 12:47:24

by David Howells

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

Al Viro <[email protected]> wrote:

> + for (v = atomic_read(&inode->i_writecount); v >= 0; v = v1) {
> + v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
> + if (likely(v1 == v))
> + return 0;
> + }

You don't need to reissue the atomic_read(). atomic_cmpxchg() returns the
current value of the memory location. Just set v to v1 before going round the
loop again.

David

2011-06-20 13:18:24

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Mon, Jun 20, 2011 at 01:47:17PM +0100, David Howells wrote:
> Al Viro <[email protected]> wrote:
>
> > + for (v = atomic_read(&inode->i_writecount); v >= 0; v = v1) {
> > + v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
> > + if (likely(v1 == v))
> > + return 0;
> > + }
>
> You don't need to reissue the atomic_read(). atomic_cmpxchg() returns the
> current value of the memory location. Just set v to v1 before going round the
> loop again.

That's precisely what that loop is doing...

2011-06-20 13:21:13

by David Howells

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

David Howells <[email protected]> wrote:

> > + for (v = atomic_read(&inode->i_writecount); v >= 0; v = v1) {
> > + v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
> > + if (likely(v1 == v))
> > + return 0;
> > + }
>
> You don't need to reissue the atomic_read(). atomic_cmpxchg() returns the
> current value of the memory location. Just set v to v1 before going round the
> loop again.

Ah... Never mind - you did that anyway.

David

2011-06-20 13:22:19

by Frantisek Hrbata

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Mon, Jun 20, 2011 at 12:51:47AM +0100, Al Viro wrote:
> I'm seriously tempted to throw away i_lock uses in
> {get,deny}_write_access(), as in the patch below. The question is, how
> badly will it suck on various architectures? I'd expect it to be not
> worse than the current version, but...
> BTW, I wonder if we need barriers in {put,allow}_write_access (in
> either version).
>
> Related question: would it make sense to turn that into
> atomic_inc_unless_negative/atomic_dec_unless_positive? I don't
> remember any code doing that kind of stuff - no idea if there are
> any potential users for that.
>
> diff --git a/fs/namei.c b/fs/namei.c
> index 26bef77..7dffe2e 100644
> --- a/fs/namei.c
> +++ b/fs/namei.c
> @@ -341,52 +341,6 @@ ok:
> return security_inode_exec_permission(inode, flags);
> }
>
> -/*
> - * get_write_access() gets write permission for a file.
> - * put_write_access() releases this write permission.
> - * This is used for regular files.
> - * We cannot support write (and maybe mmap read-write shared) accesses and
> - * MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
> - * can have the following values:
> - * 0: no writers, no VM_DENYWRITE mappings
> - * < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
> - * > 0: (i_writecount) users are writing to the file.
> - *
> - * Normally we operate on that counter with atomic_{inc,dec} and it's safe
> - * except for the cases where we don't hold i_writecount yet. Then we need to
> - * use {get,deny}_write_access() - these functions check the sign and refuse
> - * to do the change if sign is wrong. Exclusion between them is provided by
> - * the inode->i_lock spinlock.
> - */
> -
> -int get_write_access(struct inode * inode)
> -{
> - spin_lock(&inode->i_lock);
> - if (atomic_read(&inode->i_writecount) < 0) {
> - spin_unlock(&inode->i_lock);
> - return -ETXTBSY;
> - }
> - atomic_inc(&inode->i_writecount);
> - spin_unlock(&inode->i_lock);
> -
> - return 0;
> -}
> -
> -int deny_write_access(struct file * file)
> -{
> - struct inode *inode = file->f_path.dentry->d_inode;
> -
> - spin_lock(&inode->i_lock);
> - if (atomic_read(&inode->i_writecount) > 0) {
> - spin_unlock(&inode->i_lock);
> - return -ETXTBSY;
> - }
> - atomic_dec(&inode->i_writecount);
> - spin_unlock(&inode->i_lock);
> -
> - return 0;
> -}
> -
> /**
> * path_get - get a reference to a path
> * @path: path to get the reference to
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 7302e44..ab89aa3 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -2194,8 +2194,43 @@ static inline bool execute_ok(struct inode *inode)
> return (inode->i_mode & S_IXUGO) || S_ISDIR(inode->i_mode);
> }
>
> -extern int get_write_access(struct inode *);
> -extern int deny_write_access(struct file *);
> +/*
> + * get_write_access() gets write permission for a file.
> + * put_write_access() releases this write permission.
> + * This is used for regular files.
> + * We cannot support write (and maybe mmap read-write shared) accesses and
> + * MAP_DENYWRITE mmappings simultaneously. The i_writecount field of an inode
> + * can have the following values:
> + * 0: no writers, no VM_DENYWRITE mappings
> + * < 0: (-i_writecount) vm_area_structs with VM_DENYWRITE set exist
> + * > 0: (i_writecount) users are writing to the file.
> + *
> + * Normally we operate on that counter with atomic_{inc,dec} and it's safe
> + * except for the cases where we don't hold i_writecount yet. Then we need to
> + * use {get,deny}_write_access() - these functions check the sign and refuse
> + * to do the change if sign is wrong.
> + */
> +static inline int get_write_access(struct inode *inode)
> +{
> + int v, v1;
> + for (v = atomic_read(&inode->i_writecount); v >= 0; v = v1) {
> + v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
> + if (likely(v1 == v))
> + return 0;
> + }
> + return -ETXTBSY;
> +}
> +static inline int deny_write_access(struct file *file)
> +{
> + struct inode *inode = file->f_path.dentry->d_inode;
> + int v, v1;
> + for (v = atomic_read(&inode->i_writecount); v <= 0; v = v1) {
> + v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
^^^^^
Shouldn't i_writecount be decreased here. Looks like cut & paste problem to
me. Please ignore if I'm wrong.

> + if (likely(v1 == v))
> + return 0;
> + }
> + return -ETXTBSY;
> +}
> static inline void put_write_access(struct inode * inode)
> {
> atomic_dec(&inode->i_writecount);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Frantisek Hrbata

2011-06-20 14:15:51

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Mon, Jun 20, 2011 at 03:21:44PM +0200, Frantisek Hrbata wrote:
> > +static inline int deny_write_access(struct file *file)
> > +{
> > + struct inode *inode = file->f_path.dentry->d_inode;
> > + int v, v1;
> > + for (v = atomic_read(&inode->i_writecount); v <= 0; v = v1) {
> > + v1 = atomic_cmpxchg(&inode->i_writecount, v, v + 1);
> ^^^^^
> Shouldn't i_writecount be decreased here. Looks like cut & paste problem to
> me.

Yes, it should and yes, it is. Thanks for spotting...

2011-06-20 15:56:32

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Sun, Jun 19, 2011 at 4:51 PM, Al Viro <[email protected]> wrote:
> ? ? ? ?I'm seriously tempted to throw away i_lock uses in
> {get,deny}_write_access(), as in the patch below. ?The question is, how
> badly will it suck on various architectures? ?I'd expect it to be not
> worse than the current version, but...

It might be worse, because doing a read-before-write can turn a single
cache operation ("get for write") into multiple cache operations ("get
for read" followed by "make exclusive").

We had that exact issue with some other users of the "read + cmpxchg" model.

The way we fixed it before was to simply omit the read, and turn that
into a "guess".

In other words, I'd suggest you get rid of the "atomic_read()"
entirely, and just assume that the write counter was zero to begin
with. Even if that is a wrong assumption (and it probably isn't all
that wrong), it can actually be more efficient to essentiall go
through the loop twice: the first time yoou use the cmpxchg as just an
odd way to do a read. It basically bcomes a read-with-write-intent,
and solves the cacheline issue.

At that point, it's likely faster in pretty much all cases except for
UP (where the spinlocks just go away, and "cmpxchg" is slower than
normal code).

Linus

2011-06-20 16:13:57

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Mon, Jun 20, 2011 at 08:55:38AM -0700, Linus Torvalds wrote:
> On Sun, Jun 19, 2011 at 4:51 PM, Al Viro <[email protected]> wrote:
> > ? ? ? ?I'm seriously tempted to throw away i_lock uses in
> > {get,deny}_write_access(), as in the patch below. ?The question is, how
> > badly will it suck on various architectures? ?I'd expect it to be not
> > worse than the current version, but...
>
> It might be worse, because doing a read-before-write can turn a single
> cache operation ("get for write") into multiple cache operations ("get
> for read" followed by "make exclusive").

Er... The current mainline does atomic_read() followed by atomic_inc(),
so we get the same thing (plus the spin_lock()/spin_unlock()), don't we?

> We had that exact issue with some other users of the "read + cmpxchg" model.
>
> The way we fixed it before was to simply omit the read, and turn that
> into a "guess".
>
> In other words, I'd suggest you get rid of the "atomic_read()"
> entirely, and just assume that the write counter was zero to begin
> with. Even if that is a wrong assumption (and it probably isn't all
> that wrong), it can actually be more efficient to essentiall go
> through the loop twice: the first time yoou use the cmpxchg as just an
> odd way to do a read. It basically bcomes a read-with-write-intent,
> and solves the cacheline issue.

For get_write_access() it's probably the right assumption for everything but
/dev/tty*; for deny_write_access() it's not - a lot of binaries are run by
more than one process...

FWIW, I wonder what will the things look like on ll/sc architectures;
maybe it's really better to turn that into atomic_inc_unless_negative()
and let the architectures override the default...

2011-06-20 16:23:17

by Linus Torvalds

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Mon, Jun 20, 2011 at 9:13 AM, Al Viro <[email protected]> wrote:
>
> Er... ?The current mainline does atomic_read() followed by atomic_inc(),
> so we get the same thing (plus the spin_lock()/spin_unlock()), don't we?

Yes. Unless the spinlock is in the same cacheline. No reason not to
fix that, though.

Of course, if the "ETXTBUSY" case is the common case (which I doubt),
then not doing the write at all would be the optimal case. But I doubt
that case is even worth really worrying about ;)

> For get_write_access() it's probably the right assumption for everything but
> /dev/tty*; for deny_write_access() it's not - a lot of binaries are run by
> more than one process...

Note the fact that EVEN IF WE GUESS INCORRECTLY, performance is likely
better by guessing rather than reading, unless you know the thing is
already in the local CPU cache.

Doing the loop twice instead of once is still *much* faster than an
extra cache transaction that goes to the bus (or L3 or whatever).

> FWIW, I wonder what will the things look like on ll/sc architectures;

There are no ll/sc architectures worth worrying about, so I don't
think that's the primary concern. That said, I don't disagree with
creating a "atomic_inc_unless_negative()" helper.

Linus

2011-06-20 16:42:18

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Mon, Jun 20, 2011 at 09:22:32AM -0700, Linus Torvalds wrote:
> On Mon, Jun 20, 2011 at 9:13 AM, Al Viro <[email protected]> wrote:
> >
> > Er... ?The current mainline does atomic_read() followed by atomic_inc(),
> > so we get the same thing (plus the spin_lock()/spin_unlock()), don't we?
>
> Yes. Unless the spinlock is in the same cacheline. No reason not to
> fix that, though.
>
> Of course, if the "ETXTBUSY" case is the common case (which I doubt),
> then not doing the write at all would be the optimal case. But I doubt
> that case is even worth really worrying about ;)

It isn't, unless your box is spinning in attempts to do something like
opening /bin/sh for write. In which case you've got worse problems ;-)
>
> > For get_write_access() it's probably the right assumption for everything but
> > /dev/tty*; for deny_write_access() it's not - a lot of binaries are run by
> > more than one process...
>
> Note the fact that EVEN IF WE GUESS INCORRECTLY, performance is likely
> better by guessing rather than reading, unless you know the thing is
> already in the local CPU cache.
>
> Doing the loop twice instead of once is still *much* faster than an
> extra cache transaction that goes to the bus (or L3 or whatever).
>
> > FWIW, I wonder what will the things look like on ll/sc architectures;
>
> There are no ll/sc architectures worth worrying about, so I don't
> think that's the primary concern. That said, I don't disagree with
> creating a "atomic_inc_unless_negative()" helper.

OK... Let me see if I got it right:
static inline int atomic_inc_unless_negative(atomic_t *p)
{
int v, v1;
for (v = 0; v >= 0; v = v1) {
v1 = atomic_cmpxchg(p, v, v + 1);
if (v == v1)
return 1;
}
return 0;
}
with get_write_access(inode) becoming
return atomic_inc_unless_negative(&inode->i_writecount) ? 0 : -ETXTBUSY;
and similar for atomic_dec_unless_positive()/deny_write_access()?

BTW, atomic_add_unless/atomic_inc_not_zero is done as read/cmpxchg on
everything I've checked, including alpha and sparc. I suspect that
it's seriously suboptimal there...

2011-06-20 17:03:10

by Al Viro

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

On Mon, Jun 20, 2011 at 05:42:11PM +0100, Al Viro wrote:

> BTW, atomic_add_unless/atomic_inc_not_zero is done as read/cmpxchg on
> everything I've checked, including alpha and sparc. I suspect that
> it's seriously suboptimal there...

Actually, on sparc it's probably fine. alpha, OTOH, probably ought to switch
to something similar to what ppc is doing...

2011-06-20 19:49:24

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC] get_write_access()/deny_write_access() without inode->i_lock

Al Viro <[email protected]> writes:

> I'm seriously tempted to throw away i_lock uses in
> {get,deny}_write_access(), as in the patch below. The question is, how

Are there any known workload where the spinlock contends badly here?
Or what's the motivation for it?

Thanks,

-Andi

--
[email protected] -- Speaking for myself only