2014-10-01 11:28:26

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On 10/01/2014 12:44 AM, Linus Torvalds wrote:
> On Tue, Sep 30, 2014 at 12:19 PM, Miklos Szeredi <[email protected]> wrote:
>> What about flock(2), FL_SETLEASE, etc semantics (which are the sane ones,
>> compared to the POSIX locks shit which mandates release of lock on each close(2)
>> instead of "when all [duplicate] descriptors have been closed")?
>>
>> You have to do that from ->release(), there's no question about that.
> We do locks_remove_file() independently on ->release, but yes, it's
> basically done just before the last release.
>
> But it has the *exact* same semantics as release, including very much
> having nothing what-so-ever to do with "last close()".
>
> If the file descriptor is opened for other reasons (ie mmap, /proc
> accesses, whatever), then that delays locks_remove_file() the same way
> it delays release.
>
> None of that has *anothing* to do with "synchronous". Thinking it does is wrong.
>
> And none of this has *anything* to do with the issue that Maxim
> pointed to in the mailing list web page, which was about write caches,
> and how you cannot (and MUST NOT) delay them until release time.

I apologise for mentioning that mailing list web page in my title
message. This was really misleading, I had to think about it in advance.
Of course, write caches must be flushed in scope of ->flush(), not
->release(). Let me please set forth an use-case that led me to those
patches.

We implemented a FUSE-based distributed storage solution intended for
keeping images of VMs (virtual machines) and their configuration files.
The way how VMs use images makes exclusive-open()er semantics very
attractive: while a VM is using its image on a node, the concurrent
access from other nodes to that image is neither desirable nor
necessary. So, we acquire an exclusive lease on FUSE_OPEN and release
it on FUSE_RELEASE. This is quite natural and has obviously nothing to
do with FUSE_FLUSH.

Following such semantics, there are two choices for handling open() if
the file is currently exclusively locked by a remote node: (a) return
EBUSY; (b) block until the remote node release the file. We decided for
(a), because (b) is very inconvenient in practice: most applications
handle failed open(2) properly, but very few are clever enough to spawn
a separate thread with open() and kill it if the open() has not
succeeded in a reasonable time.

The patches I sent make essentially one thing: they make FUSE
->release() wait for ACK from userspace before return. Without these
patches, any attempt to test or use our storage in valid use-cases led
to spurious EBUSY. For example, while migrating a VM from one node to
another, we firstly close the image file on source node, then try to
open it on destination node, but fail because FUSE_RELEASE is not
processed by userspace on source node yet.

Given those patches must die, do you have any ideas how to resolve that
"spurious EBUSY" problem?

Thanks,
Maxim


2014-10-09 08:14:25

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Wed, Oct 1, 2014 at 1:28 PM, Maxim Patlasov <[email protected]> wrote:
> Given those patches must die, do you have any ideas how to resolve that
> "spurious EBUSY" problem?

Check the "sync_release" branch of fuse:

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git sync_release

And same branch name for libfuse:

git://git.code.sf.net/p/fuse/fuse sync_release

What it does is send RELEASE from ->flush() after checking the
refcount of file (being careful about RCU accesses).

Lightly tested, more testing, as well as review, is welcome.

Thanks,
Miklos

2014-10-16 10:31:28

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

Hi Miklos,

On 10/09/2014 12:14 PM, Miklos Szeredi wrote:
> On Wed, Oct 1, 2014 at 1:28 PM, Maxim Patlasov <[email protected]> wrote:
>> Given those patches must die, do you have any ideas how to resolve that
>> "spurious EBUSY" problem?
> Check the "sync_release" branch of fuse:
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git sync_release
>
> And same branch name for libfuse:
>
> git://git.code.sf.net/p/fuse/fuse sync_release
>
> What it does is send RELEASE from ->flush() after checking the
> refcount of file (being careful about RCU accesses).
>
> Lightly tested, more testing, as well as review, is welcome.

Thank you very much for efforts, highly appreciated! I've had a close
look at your patches and found a few issues. Most of them can be easily
fixed, but one puzzles me: the way how you detect last flush is not race
free. Something as simple as:

int main(int argc, char *argv[])
{
int fd = open(argv[1], O_RDWR);
fork();
}

may easily dive into fuse_try_sync_release() concurrently and both
observe file->f_count == 2. Then both return falling back to sending the
release asynchronously. This makes sync/async behaviour unpredictable
even for well-behaved applications which don't do any esoteric things
like racing i/o with close or exiting while a descriptor is in-flight in
a unix domain socket.

I cannot see any way to recognise last flush without help of VFS layer,
can you?

Thanks,
Maxim

2014-10-16 13:43:55

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Thu, Oct 16, 2014 at 12:31 PM, Maxim Patlasov
<[email protected]> wrote:

> Something as simple as:
>
> int main(int argc, char *argv[])
> {
> int fd = open(argv[1], O_RDWR);
> fork();
> }
>
> may easily dive into fuse_try_sync_release() concurrently and both observe
> file->f_count == 2. Then both return falling back to sending the release
> asynchronously. This makes sync/async behaviour unpredictable even for
> well-behaved applications which don't do any esoteric things like racing i/o
> with close or exiting while a descriptor is in-flight in a unix domain
> socket.
>
> I cannot see any way to recognise last flush without help of VFS layer, can
> you?

No.

One idea is to change ->flush() so it's responsible for fput()-ing the
file. That way we could take control of the actual refcount
decrement. There are only 20 flush instances in the tree, so it
wouldn't be a huge change.

Thanks,
Miklos

2014-10-16 13:54:45

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Thu, Oct 16, 2014 at 3:43 PM, Miklos Szeredi <[email protected]> wrote:
>
> One idea is to change ->flush() so it's responsible for fput()-ing the
> file. That way we could take control of the actual refcount
> decrement. There are only 20 flush instances in the tree, so it
> wouldn't be a huge change.

Since that *still* wouldn't fix the problem with the whole "count
elevated by other things" issue, I really don't want to hear about
these random broken hacks that are fundamentally broken crap.

Really. Stop cc'ing me with "let's implement this hack that cannot
work in general". I'm not interested. There's a reason we don't do
this. We don't make up random hacks that we know cannot work in the
general case.

Linus

2014-10-17 08:55:20

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Thu, Oct 16, 2014 at 03:54:42PM +0200, Linus Torvalds wrote:
> On Thu, Oct 16, 2014 at 3:43 PM, Miklos Szeredi <[email protected]> wrote:
> >
> > One idea is to change ->flush() so it's responsible for fput()-ing the
> > file. That way we could take control of the actual refcount
> > decrement. There are only 20 flush instances in the tree, so it
> > wouldn't be a huge change.
>
> Since that *still* wouldn't fix the problem with the whole "count
> elevated by other things" issue, I really don't want to hear about
> these random broken hacks that are fundamentally broken crap.

The problem with those "count elevated by other things" is that they are
actually bugs. Take the following test case (this is not made up, I really got
bug reports agains something like this).

mount("/dev/foo", "/mnt");
fd = creat("/mnt/bar");
close(fd);
umount("/mnt")

If that umount fails with EBUSY, that's a bug, since we've released the only
known resource we've opened on that mount.

Now, if some completely unrelated "lsof" instance goes fishing in /proc, then
that will be able to hold that release up, making the test fail.

And it's actually trivially fixable. See attached patch.

Thanks,
Miklos

diff --git a/fs/proc/fd.c b/fs/proc/fd.c
index e11d7c590bb0..f8a67dafb04f 100644
--- a/fs/proc/fd.c
+++ b/fs/proc/fd.c
@@ -19,6 +19,8 @@ static int seq_show(struct seq_file *m, void *v)
{
struct files_struct *files = NULL;
int f_flags = 0, ret = -ENOENT;
+ loff_t f_pos;
+ int mnt_id;
struct file *file = NULL;
struct task_struct *task;

@@ -41,7 +43,13 @@ static int seq_show(struct seq_file *m, void *v)
if (close_on_exec(fd, fdt))
f_flags |= O_CLOEXEC;

- get_file(file);
+ f_pos = file->f_pos;
+ mnt_id = real_mount(file->f_path.mnt)->mnt_id;
+
+ if (file->f_op->show_fdinfo)
+ get_file(file);
+ else
+ file = NULL;
ret = 0;
}
spin_unlock(&files->file_lock);
@@ -50,11 +58,11 @@ static int seq_show(struct seq_file *m, void *v)

if (!ret) {
seq_printf(m, "pos:\t%lli\nflags:\t0%o\nmnt_id:\t%i\n",
- (long long)file->f_pos, f_flags,
- real_mount(file->f_path.mnt)->mnt_id);
- if (file->f_op->show_fdinfo)
+ (long long) f_pos, f_flags, mnt_id);
+ if (file) {
ret = file->f_op->show_fdinfo(m, file);
- fput(file);
+ fput(file);
+ }
}

return ret;

2014-10-18 15:35:46

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Fri, Oct 17, 2014 at 1:55 AM, Miklos Szeredi <[email protected]> wrote:
>
> The problem with those "count elevated by other things" is that they are
> actually bugs.

No they aren't. You think they are, and then you find one case, and
ignore all the others.

Look around for AIO. Look around for the loop driver. Look around for
a number of things that do "fget()" and that you completely ignored.

So no, your patch doesn't change the fundamental issue in any way,
shape, or form.

I asked you to stop emailing me with these broken "let's fix one
special case, because I can't be bothered to understand the big
picture" patches. This was another one of those hacky sh*t patches
that doesn't actually change the deeper issue. Stop it. Seriously.
This idiotic thread has been going on for too long.

Linus

2014-10-18 15:40:12

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Sat, Oct 18, 2014 at 8:35 AM, Linus Torvalds
<[email protected]> wrote:
>
> Look around for AIO. Look around for the loop driver. Look around for
> a number of things that do "fget()" and that you completely ignored.

.. actually, there are more instances of "get_file()" than of
"fget()", the aio one just happened to be the latter form. Lots and
lots of ways to get ahold of a file descriptor that keeps it open past
the "last close".

Linus

2014-10-18 18:01:16

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Sat, Oct 18, 2014 at 5:40 PM, Linus Torvalds
<[email protected]> wrote:
> On Sat, Oct 18, 2014 at 8:35 AM, Linus Torvalds
> <[email protected]> wrote:
>>
>> Look around for AIO. Look around for the loop driver. Look around for
>> a number of things that do "fget()" and that you completely ignored.
>
> .. actually, there are more instances of "get_file()" than of
> "fget()", the aio one just happened to be the latter form. Lots and
> lots of ways to get ahold of a file descriptor that keeps it open past
> the "last close".

And what you don't get is that there's a deep difference between those
and the /proc file access case.

And the difference is that one is done because of an explicit action
by the holder of the open file. And the other is done by some random
process doing non-invasive examination of the holder of the open-file.

So basically: we simply don't care if last close does not happen to
release the file *iff* it was because of some explicit action that
obviously has or could have such a side effect. Is that so hard to
understand?

In other words, we care about doing that last release synchronously if
it provably is the last release of that file and happens to be done
from close() (or munmap()). And then all your examples of loop driver
and aio are pointless, because we *know* they will be holding onto
that descriptor, the same as we know, that after dup(), close() will
not release the file and the (non-IDIOTIX) locks together with the
file.

Thanks,
Miklos

2014-10-18 18:22:51

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Sat, Oct 18, 2014 at 08:40:05AM -0700, Linus Torvalds wrote:
> On Sat, Oct 18, 2014 at 8:35 AM, Linus Torvalds
> <[email protected]> wrote:
> >
> > Look around for AIO. Look around for the loop driver. Look around for
> > a number of things that do "fget()" and that you completely ignored.
>
> .. actually, there are more instances of "get_file()" than of
> "fget()", the aio one just happened to be the latter form. Lots and
> lots of ways to get ahold of a file descriptor that keeps it open past
> the "last close".

FWIW, procfs patch touches a very annoying issue: ->show_fdinfo() being
blocking. I would really like to get rid of that particular get_file()
and even more so - of get_files_struct() in there.

I certainly agree that anyone who expects that close() means the end of IO
is completely misguided. Mappings don't disappear on close(), neither does
a descriptor returned by dup(), or one that child got over fork(),
or something sent over in SCM_RIGHTS datagram, or, as you suggested, made
backing store for /dev/loop, etc.

What's more, in the example given upthread, somebody might've spotted that
file in /proc/<pid>/fd/* and *opened* it. At which point umount would
have to fail with EBUSY. And the same lsof(8) might've done just that.

It's not a matter of correctness or security, especially since somebody who
could do that, could've stopped your process, PTRACE_POKEd a fairly short
series of syscalls that would connect to AF_UNIX socket, send the file
over to them and clean after itself, then single-stepped through all of that,
restored the original state and resumed your process.

It is a QoI matter, though. And get_files_struct() in there is a lot more
annoying than get_file()/fput(). Suppose you catch the process during
exit(). All of a sudden, read from /proc/<pid>/fdinfo/<n> ends up doing
shitloads of filp_close(). It would be nice to avoid that.

Folks, how much pain would it be to make ->show_fdinfo() non-blocking?

2014-10-18 18:24:58

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Sat, Oct 18, 2014 at 08:01:13PM +0200, Miklos Szeredi wrote:

> And what you don't get is that there's a deep difference between those
> and the /proc file access case.
>
> And the difference is that one is done because of an explicit action
> by the holder of the open file. And the other is done by some random
> process doing non-invasive examination of the holder of the open-file.

Such as ls -l /proc/*/fd/*? That will give you the same transient EBUSY
from umount, if it hits the right moment...

2014-10-18 18:38:30

by Linus Torvalds

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Sat, Oct 18, 2014 at 11:01 AM, Miklos Szeredi <[email protected]> wrote:
>
> And what you don't get is that there's a deep difference between those
> and the /proc file access case.

No there isn't.

Your "action by the holder" argument is pure and utter garbage, for a
very simple and core reason: the *filesystem* doesn't know or care. So
from a fuse standpoint, the difference is totally and entirely
irrelevant.

Your "synchronous fput" fails. It fails totally regardless of that
"who incremented the file user count" issue. Face it, your patch is
broken. And it's *fundamentally* broken, which is why I'm so tired of
your stupid ad-hoc hacks that cannot possibly work.

Your umount EBUSY case is somewhat relevant to the "who holds the file
open", but quite frankly, that's not a filesystem issue, it's more of
a system management issue. So umount might get EBUSY. That's not
something the filesystem should/could care about, that's a MIS issue
that is entirely irrelevant, and the answer is "if somebody does lsof,
that might keep the filesystem busy". Tough.

Linus

2014-10-18 18:45:34

by Al Viro

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

On Sat, Oct 18, 2014 at 07:24:54PM +0100, Al Viro wrote:
> On Sat, Oct 18, 2014 at 08:01:13PM +0200, Miklos Szeredi wrote:
>
> > And what you don't get is that there's a deep difference between those
> > and the /proc file access case.
> >
> > And the difference is that one is done because of an explicit action
> > by the holder of the open file. And the other is done by some random
> > process doing non-invasive examination of the holder of the open-file.
>
> Such as ls -l /proc/*/fd/*? That will give you the same transient EBUSY
> from umount, if it hits the right moment...

stat -L /proc/*/fd/*, that is...

2014-10-18 22:45:00

by Eric W. Biederman

[permalink] [raw]
Subject: Re: [PATCH 0/5] fuse: handle release synchronously (v4)

Al Viro <[email protected]> writes:

> On Sat, Oct 18, 2014 at 08:40:05AM -0700, Linus Torvalds wrote:
>> On Sat, Oct 18, 2014 at 8:35 AM, Linus Torvalds
>> <[email protected]> wrote:
>> >
>> > Look around for AIO. Look around for the loop driver. Look around for
>> > a number of things that do "fget()" and that you completely ignored.
>>
>> .. actually, there are more instances of "get_file()" than of
>> "fget()", the aio one just happened to be the latter form. Lots and
>> lots of ways to get ahold of a file descriptor that keeps it open past
>> the "last close".
>
> FWIW, procfs patch touches a very annoying issue: ->show_fdinfo() being
> blocking. I would really like to get rid of that particular get_file()
> and even more so - of get_files_struct() in there.
>
> I certainly agree that anyone who expects that close() means the end of IO
> is completely misguided. Mappings don't disappear on close(), neither does
> a descriptor returned by dup(), or one that child got over fork(),
> or something sent over in SCM_RIGHTS datagram, or, as you suggested, made
> backing store for /dev/loop, etc.
>
> What's more, in the example given upthread, somebody might've spotted that
> file in /proc/<pid>/fd/* and *opened* it. At which point umount would
> have to fail with EBUSY. And the same lsof(8) might've done just that.
>
> It's not a matter of correctness or security, especially since somebody who
> could do that, could've stopped your process, PTRACE_POKEd a fairly short
> series of syscalls that would connect to AF_UNIX socket, send the file
> over to them and clean after itself, then single-stepped through all of that,
> restored the original state and resumed your process.
>
> It is a QoI matter, though. And get_files_struct() in there is a lot more
> annoying than get_file()/fput(). Suppose you catch the process during
> exit(). All of a sudden, read from /proc/<pid>/fdinfo/<n> ends up doing
> shitloads of filp_close(). It would be nice to avoid that.
>
> Folks, how much pain would it be to make ->show_fdinfo() non-blocking?

I took a quick look and there are a couple of instances in tun,
eventpoll, and fanotify/inotify that take a spinlock while traversing
the data that needs to be printed.

So it would take a good hard stare at those pieces of code to understand
the locking, and potentially rewrite those routines.

The only one I am particularly familiar with tun did not look
fundamentally hard to change but it also isn't something I would
casually do either, as it would be easy to introduce nasty races by
accident.

Eric