2019-03-27 10:46:01

by Kirill Smelkov

[permalink] [raw]
Subject: [RESEND4, PATCH 0/2] fuse: don't stuck clients on retrieve_notify with size > max_write

Miklos,

On Thu, Mar 14, 2019 at 01:45:20PM +0300, Kirill Smelkov wrote:
> Miklos,
>
> On Thu, Feb 28, 2019 at 02:47:57PM +0300, Kirill Smelkov wrote:
> > On Thu, Feb 28, 2019 at 09:10:15AM +0100, Miklos Szeredi wrote:
> > > On Wed, Feb 27, 2019 at 9:39 PM Kirill Smelkov <[email protected]> wrote:
> > >
> > > > I more or less agree with this statement. However can we please make the
> > > > breakage to be explicitly visible with an error instead of exhibiting it
> > > > via harder to debug stucks/deadlocks? For example sys_read < max_write
> > > > -> error instead of getting stuck. And if notify_retrieve requests
> > > > buffer larger than max_write -> error or cut to max_write, but don't
> > > > return OK when we know we will never send what was requested to
> > > > filesystem even if it uses max_write sized reads. What is the point of
> > > > breaking in hard to diagnose way when we can make the breakage showing
> > > > itself explicitly? Would a patch for such behaviour accepted?
> > >
> > > Sure, if it's only adds a couple of lines. Adding more than say ten
> > > lines for such a non-bug fix is definitely excessive.
> >
> > Ok, thanks. Please consider applying the following patch. (It's a bit
> > pity to hear the problem is not considered to be a bug, but anyway).
> >
> > I will also send the second patch as another mail, since I could not
> > made `git am --scissors` to apply several patched extracted from one
> > mail successfully.
>
> [...]
>
> On Thu, Mar 07, 2019 at 12:34:21PM +0300, Kirill Smelkov wrote:
> > Ping. Miklos, is there anything wrong with this patch and its
> > second counterpart?
>
> As we were talking here are those patches. The first one cuts notify_retrieve
> request to max_write and is one line only. The second one returns error to
> filesystem server if it is buggy and does sys_read with buffer size <
> max_write. It is 2 lines of code and 7 lines of comments.
>
> I still think that the patches fix real bugs. It is a bug if server behaviour
> is a bit non-confirming or simply on an edge of being correct or questionable,
> and instead of properly getting plain error from kernel, the whole system gets
> stuck. It is a bug because bug amplification factor here is at least one order
> of magnitude instead of staying ~1x.
>
> I'm sending the patches for the third time already, but did not get any
> feedback. Could you please have a look?

It's been ~ 1 month already since we agreed on the approach and initial
postings of the patches that follow the agreed way:

https://lwn.net/ml/linux-fsdevel/[email protected]/

Since then the patches were resent several times but without getting any
feedback from you.

Is there anything wrong with the patches? Could you please have a look?
I understand everyone is busy but 1 month seems to be too much and I'm
wondering whether maybe my mails got classified as spam or something
else on your side.

Thanks beforehand,
Kirill

Kirill Smelkov (2):
fuse: retrieve: cap requested size to negotiated max_write
fuse: require /dev/fuse reads to have enough buffer capacity as negotiated

fs/fuse/dev.c | 12 +++++++++++-
1 file changed, 11 insertions(+), 1 deletion(-)

--
2.21.0.392.gf8f6787159


2019-03-27 10:32:37

by Kirill Smelkov

[permalink] [raw]
Subject: [RESEND4, PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated

A FUSE filesystem server queues /dev/fuse sys_read calls to get
filesystem requests to handle. It does not know in advance what would be
that request as it can be anything that client issues - LOOKUP, READ,
WRITE, ... Many requests are short and retrieve data from the
filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.

Before getting into operation phase, FUSE filesystem server and kernel
client negotiate what should be the maximum write size the client will
ever issue. After negotiation the contract in between server/client is
that the filesystem server then should queue /dev/fuse sys_read calls with
enough buffer capacity to receive any client request - WRITE in
particular, while FUSE client should not, in particular, send WRITE
requests with > negotiated max_write payload. FUSE client in kernel and
libfuse historically reserve 4K for request header. This way the
contract is that filesystem server should queue sys_reads with
4K+max_write buffer.

If the filesystem server does not follow this contract, what can happen
is that fuse_dev_do_read will see that request size is > buffer size,
and then it will return EIO to client who issued the request but won't
indicate in any way that there is a problem to filesystem server.
This can be hard to diagnose because for some requests, e.g. for
NOTIFY_REPLY which mimics WRITE, there is no client thread that is
waiting for request completion and that EIO goes nowhere, while on
filesystem server side things look like the kernel is not replying back
after successful NOTIFY_RETRIEVE request made by the server.

-> We can make the problem easy to diagnose if we indicate via error
return to filesystem server when it is violating the contract.
This should not practically cause problems because if a filesystem
server is using shorter buffer, writes to it were already very likely to
cause EIO, and if the filesystem is read-only it should be too following
8K minimum buffer size (= either FUSE_MIN_READ_BUFFER, see 1d3d752b47,
or = 4K + min(max_write)=4k cared to be so by process_init_reply).

Please see [1] for context where the problem of stuck filesystem was hit
for real (because kernel client was incorrectly sending more than
max_write data with NOTIFY_REPLY; see also previous patch), how the
situation was traced and for more involving patch that did not make it
into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Signed-off-by: Kirill Smelkov <[email protected]>
Cc: Han-Wen Nienhuys <[email protected]>
Cc: Jakob Unterwurzacher <[email protected]>
---
fs/fuse/dev.c | 10 ++++++++++
1 file changed, 10 insertions(+)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 38e94bc43053..8fdfbafed037 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1317,6 +1317,16 @@ static ssize_t fuse_dev_do_read(struct fuse_dev *fud, struct file *file,
unsigned reqsize;
unsigned int hash;

+ /*
+ * Require sane minimum read buffer - that has capacity for fixed part
+ * of any request header + negotated max_write room for data. If the
+ * requirement is not satisfied return EINVAL to the filesystem server
+ * to indicate that it is not following FUSE server/client contract.
+ * Don't dequeue / abort any request.
+ */
+ if (nbytes < (fc->conn_init ? 4096 + fc->max_write : FUSE_MIN_READ_BUFFER))
+ return -EINVAL;
+
restart:
spin_lock(&fiq->waitq.lock);
err = -EAGAIN;
--
2.21.0.392.gf8f6787159

2019-03-27 10:46:37

by Kirill Smelkov

[permalink] [raw]
Subject: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header
+ that max_write bytes. A filesystem server is free to set its max_write
in anywhere in the range between [1·page, fc->max_pages·page]. In
particular go-fuse[2] sets max_write by default as 64K, wheres default
fc->max_pages corresponds to 128K. Libfuse also allows users to
configure max_write, but by default presets it to possible maximum.

If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract,
will be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server
request buffer. In turn the filesystem server could get stuck waiting
indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
which is understood by clients as that NOTIFY_REPLY was queued and will
be sent back.

-> Cap requested size to negotiate max_write to avoid the problem.
This aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages.
This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
data than was originally requested.

Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for more involving patch that
did not make it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse

Signed-off-by: Kirill Smelkov <[email protected]>
Cc: Han-Wen Nienhuys <[email protected]>
Cc: Jakob Unterwurzacher <[email protected]>
Cc: <[email protected]> # v2.6.36+
---
fs/fuse/dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 8a63e52785e9..38e94bc43053 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
offset = outarg->offset & ~PAGE_MASK;
file_size = i_size_read(inode);

- num = outarg->size;
+ num = min(outarg->size, fc->max_write);
if (outarg->offset > file_size)
num = 0;
else if (outarg->offset + num > file_size)
--
2.21.0.392.gf8f6787159


2019-04-24 10:47:30

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Mar 27, 2019 at 11:15 AM Kirill Smelkov <[email protected]> wrote:
>
> FUSE filesystem server and kernel client negotiate during initialization
> phase, what should be the maximum write size the client will ever issue.
> Correspondingly the filesystem server then queues sys_read calls to read
> requests with buffer capacity large enough to carry request header
> + that max_write bytes. A filesystem server is free to set its max_write
> in anywhere in the range between [1·page, fc->max_pages·page]. In
> particular go-fuse[2] sets max_write by default as 64K, wheres default
> fc->max_pages corresponds to 128K. Libfuse also allows users to
> configure max_write, but by default presets it to possible maximum.
>
> If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
> allow to retrieve more than max_write bytes, corresponding prepared
> NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
> filesystem server, in full correspondence with server/client contract,
> will be only queuing sys_read with ~max_write buffer capacity, and
> fuse_dev_do_read throws away requests that cannot fit into server
> request buffer. In turn the filesystem server could get stuck waiting
> indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
> which is understood by clients as that NOTIFY_REPLY was queued and will
> be sent back.
>
> -> Cap requested size to negotiate max_write to avoid the problem.
> This aligns with the way NOTIFY_RETRIEVE handler works, which already
> unconditionally caps requested retrieve size to fuse_conn->max_pages.
> This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
> data than was originally requested.
>
> Please see [1] for context where the problem of stuck filesystem was hit
> for real, how the situation was traced and for more involving patch that
> did not make it into the tree.
>
> [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
> [2] https://github.com/hanwen/go-fuse
>
> Signed-off-by: Kirill Smelkov <[email protected]>
> Cc: Han-Wen Nienhuys <[email protected]>
> Cc: Jakob Unterwurzacher <[email protected]>
> Cc: <[email protected]> # v2.6.36+
> ---
> fs/fuse/dev.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> index 8a63e52785e9..38e94bc43053 100644
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
> offset = outarg->offset & ~PAGE_MASK;
> file_size = i_size_read(inode);
>
> - num = outarg->size;
> + num = min(outarg->size, fc->max_write);

This is wrong: the max_size limited num is overwritten if constrained
by file size.

Also the patch is whitespace damaged.

Thanks,
Miklos

2019-04-24 10:51:15

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated

On Wed, Mar 27, 2019 at 11:44 AM Kirill Smelkov <[email protected]> wrote:
>
> A FUSE filesystem server queues /dev/fuse sys_read calls to get
> filesystem requests to handle. It does not know in advance what would be
> that request as it can be anything that client issues - LOOKUP, READ,
> WRITE, ... Many requests are short and retrieve data from the
> filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.
>
> Before getting into operation phase, FUSE filesystem server and kernel
> client negotiate what should be the maximum write size the client will
> ever issue. After negotiation the contract in between server/client is
> that the filesystem server then should queue /dev/fuse sys_read calls with
> enough buffer capacity to receive any client request - WRITE in
> particular, while FUSE client should not, in particular, send WRITE
> requests with > negotiated max_write payload. FUSE client in kernel and
> libfuse historically reserve 4K for request header. This way the
> contract is that filesystem server should queue sys_reads with
> 4K+max_write buffer.
>
> If the filesystem server does not follow this contract, what can happen
> is that fuse_dev_do_read will see that request size is > buffer size,
> and then it will return EIO to client who issued the request but won't
> indicate in any way that there is a problem to filesystem server.
> This can be hard to diagnose because for some requests, e.g. for
> NOTIFY_REPLY which mimics WRITE, there is no client thread that is
> waiting for request completion and that EIO goes nowhere, while on
> filesystem server side things look like the kernel is not replying back
> after successful NOTIFY_RETRIEVE request made by the server.
>
> -> We can make the problem easy to diagnose if we indicate via error
> return to filesystem server when it is violating the contract.
> This should not practically cause problems because if a filesystem
> server is using shorter buffer, writes to it were already very likely to
> cause EIO, and if the filesystem is read-only it should be too following
> 8K minimum buffer size (= either FUSE_MIN_READ_BUFFER, see 1d3d752b47,
> or = 4K + min(max_write)=4k cared to be so by process_init_reply).
>
> Please see [1] for context where the problem of stuck filesystem was hit
> for real (because kernel client was incorrectly sending more than
> max_write data with NOTIFY_REPLY; see also previous patch), how the
> situation was traced and for more involving patch that did not make it
> into the tree.
>
> [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2

Applied.

Thanks,
Miklos

2019-04-24 12:13:16

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 12:44:50PM +0200, Miklos Szeredi wrote:
> On Wed, Mar 27, 2019 at 11:15 AM Kirill Smelkov <[email protected]> wrote:
> >
> > FUSE filesystem server and kernel client negotiate during initialization
> > phase, what should be the maximum write size the client will ever issue.
> > Correspondingly the filesystem server then queues sys_read calls to read
> > requests with buffer capacity large enough to carry request header
> > + that max_write bytes. A filesystem server is free to set its max_write
> > in anywhere in the range between [1·page, fc->max_pages·page]. In
> > particular go-fuse[2] sets max_write by default as 64K, wheres default
> > fc->max_pages corresponds to 128K. Libfuse also allows users to
> > configure max_write, but by default presets it to possible maximum.
> >
> > If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
> > allow to retrieve more than max_write bytes, corresponding prepared
> > NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
> > filesystem server, in full correspondence with server/client contract,
> > will be only queuing sys_read with ~max_write buffer capacity, and
> > fuse_dev_do_read throws away requests that cannot fit into server
> > request buffer. In turn the filesystem server could get stuck waiting
> > indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
> > which is understood by clients as that NOTIFY_REPLY was queued and will
> > be sent back.
> >
> > -> Cap requested size to negotiate max_write to avoid the problem.
> > This aligns with the way NOTIFY_RETRIEVE handler works, which already
> > unconditionally caps requested retrieve size to fuse_conn->max_pages.
> > This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
> > data than was originally requested.
> >
> > Please see [1] for context where the problem of stuck filesystem was hit
> > for real, how the situation was traced and for more involving patch that
> > did not make it into the tree.
> >
> > [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
> > [2] https://github.com/hanwen/go-fuse
> >
> > Signed-off-by: Kirill Smelkov <[email protected]>
> > Cc: Han-Wen Nienhuys <[email protected]>
> > Cc: Jakob Unterwurzacher <[email protected]>
> > Cc: <[email protected]> # v2.6.36+
> > ---
> > fs/fuse/dev.c | 2 +-
> > 1 file changed, 1 insertion(+), 1 deletion(-)
> >
> > diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
> > index 8a63e52785e9..38e94bc43053 100644
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
> > offset = outarg->offset & ~PAGE_MASK;
> > file_size = i_size_read(inode);
> >
> > - num = outarg->size;
> > + num = min(outarg->size, fc->max_write);
>
> This is wrong: the max_size limited num is overwritten if constrained
> by file size.

I assume you are meaning this:

--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1745,15 +1745,15 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
unsigned int offset;
size_t total_len = 0;
unsigned int num_pages;

offset = outarg->offset & ~PAGE_MASK;
file_size = i_size_read(inode);

- num = outarg->size;
+ num = min(outarg->size, fc->max_write);
if (outarg->offset > file_size)
num = 0;
else if (outarg->offset + num > file_size)
num = file_size - outarg->offset; <-- THIS

num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
num_pages = min(num_pages, fc->max_pages);

and then in this case (offset + num > file_size) num overwrite

num = file_size - offset

can make num only smaller, right? And then the patch is not wrong because there
is no other num overwriting in this function except when num is being further
decremented in loop that prepares pages to retrieve.

Or am I missing something? Would it be more clear to cap num to
max_write after all calculations? But then - if we are not sure that
file_size check can only lower num - we have a problem: we are no longer
sure that num is <= outarg->size.


> Also the patch is whitespace damaged.

I've tried to do the following in my mutt on "RESEND4, PATCH 1/2"
message:

|(cd ~/src/linux/linux && git am -)

and the patch applied successfully. So could you please clarify what
"whitespace damaged" means?

Attaching the patch once again just in case.

Kirill

---- 8< ----
From 000a5a6c91992f2a361a846cd050444964920f85 Mon Sep 17 00:00:00 2001
From: Kirill Smelkov <[email protected]>
Date: Wed, 27 Mar 2019 13:00:56 +0300
Subject: [PATCH] fuse: retrieve: cap requested size to negotiated max_write
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

FUSE filesystem server and kernel client negotiate during initialization
phase, what should be the maximum write size the client will ever issue.
Correspondingly the filesystem server then queues sys_read calls to read
requests with buffer capacity large enough to carry request header
+ that max_write bytes. A filesystem server is free to set its max_write
in anywhere in the range between [1·page, fc->max_pages·page]. In
particular go-fuse[2] sets max_write by default as 64K, wheres default
fc->max_pages corresponds to 128K. Libfuse also allows users to
configure max_write, but by default presets it to possible maximum.

If max_write is < fc->max_pages·page, and in NOTIFY_RETRIEVE handler we
allow to retrieve more than max_write bytes, corresponding prepared
NOTIFY_REPLY will be thrown away by fuse_dev_do_read, because the
filesystem server, in full correspondence with server/client contract,
will be only queuing sys_read with ~max_write buffer capacity, and
fuse_dev_do_read throws away requests that cannot fit into server
request buffer. In turn the filesystem server could get stuck waiting
indefinitely for NOTIFY_REPLY since NOTIFY_RETRIEVE handler returned OK
which is understood by clients as that NOTIFY_REPLY was queued and will
be sent back.

-> Cap requested size to negotiate max_write to avoid the problem.
This aligns with the way NOTIFY_RETRIEVE handler works, which already
unconditionally caps requested retrieve size to fuse_conn->max_pages.
This way it should not hurt NOTIFY_RETRIEVE semantic if we return less
data than was originally requested.

Please see [1] for context where the problem of stuck filesystem was hit
for real, how the situation was traced and for more involving patch that
did not make it into the tree.

[1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
[2] https://github.com/hanwen/go-fuse

Signed-off-by: Kirill Smelkov <[email protected]>
Cc: Han-Wen Nienhuys <[email protected]>
Cc: Jakob Unterwurzacher <[email protected]>
Cc: <[email protected]> # v2.6.36+
---
fs/fuse/dev.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/fs/fuse/dev.c b/fs/fuse/dev.c
index 9971a35cf1ef..1e2efd873201 100644
--- a/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1749,7 +1749,7 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
offset = outarg->offset & ~PAGE_MASK;
file_size = i_size_read(inode);

- num = outarg->size;
+ num = min(outarg->size, fc->max_write);
if (outarg->offset > file_size)
num = 0;
else if (outarg->offset + num > file_size)
--
2.21.0.765.geec228f530

2019-04-24 12:19:07

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 1:56 PM Kirill Smelkov <[email protected]> wrote:

> I assume you are meaning this:
>
> --- a/fs/fuse/dev.c
> +++ b/fs/fuse/dev.c
> @@ -1745,15 +1745,15 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
> unsigned int offset;
> size_t total_len = 0;
> unsigned int num_pages;
>
> offset = outarg->offset & ~PAGE_MASK;
> file_size = i_size_read(inode);
>
> - num = outarg->size;
> + num = min(outarg->size, fc->max_write);
> if (outarg->offset > file_size)
> num = 0;
> else if (outarg->offset + num > file_size)
> num = file_size - outarg->offset; <-- THIS
>
> num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
> num_pages = min(num_pages, fc->max_pages);
>
> and then in this case (offset + num > file_size) num overwrite
>
> num = file_size - offset
>
> can make num only smaller, right? And then the patch is not wrong because there
> is no other num overwriting in this function except when num is being further
> decremented in loop that prepares pages to retrieve.

You're right, of course.


> > Also the patch is whitespace damaged.
>
> I've tried to do the following in my mutt on "RESEND4, PATCH 1/2"
> message:
>
> |(cd ~/src/linux/linux && git am -)
>
> and the patch applied successfully. So could you please clarify what
> "whitespace damaged" means?

Hmm, apparently this (and only this) message is "quoted-printable"
encoded. git-am seems to handle it fine, but my script doesn't.
Anyway, I'll do it manually.

Thanks,
Miklos

2019-04-24 12:27:55

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 2/2] fuse: require /dev/fuse reads to have enough buffer capacity as negotiated

On Wed, Apr 24, 2019 at 12:48:36PM +0200, Miklos Szeredi wrote:
> On Wed, Mar 27, 2019 at 11:44 AM Kirill Smelkov <[email protected]> wrote:
> >
> > A FUSE filesystem server queues /dev/fuse sys_read calls to get
> > filesystem requests to handle. It does not know in advance what would be
> > that request as it can be anything that client issues - LOOKUP, READ,
> > WRITE, ... Many requests are short and retrieve data from the
> > filesystem. However WRITE and NOTIFY_REPLY write data into filesystem.
> >
> > Before getting into operation phase, FUSE filesystem server and kernel
> > client negotiate what should be the maximum write size the client will
> > ever issue. After negotiation the contract in between server/client is
> > that the filesystem server then should queue /dev/fuse sys_read calls with
> > enough buffer capacity to receive any client request - WRITE in
> > particular, while FUSE client should not, in particular, send WRITE
> > requests with > negotiated max_write payload. FUSE client in kernel and
> > libfuse historically reserve 4K for request header. This way the
> > contract is that filesystem server should queue sys_reads with
> > 4K+max_write buffer.
> >
> > If the filesystem server does not follow this contract, what can happen
> > is that fuse_dev_do_read will see that request size is > buffer size,
> > and then it will return EIO to client who issued the request but won't
> > indicate in any way that there is a problem to filesystem server.
> > This can be hard to diagnose because for some requests, e.g. for
> > NOTIFY_REPLY which mimics WRITE, there is no client thread that is
> > waiting for request completion and that EIO goes nowhere, while on
> > filesystem server side things look like the kernel is not replying back
> > after successful NOTIFY_RETRIEVE request made by the server.
> >
> > -> We can make the problem easy to diagnose if we indicate via error
> > return to filesystem server when it is violating the contract.
> > This should not practically cause problems because if a filesystem
> > server is using shorter buffer, writes to it were already very likely to
> > cause EIO, and if the filesystem is read-only it should be too following
> > 8K minimum buffer size (= either FUSE_MIN_READ_BUFFER, see 1d3d752b47,
> > or = 4K + min(max_write)=4k cared to be so by process_init_reply).
> >
> > Please see [1] for context where the problem of stuck filesystem was hit
> > for real (because kernel client was incorrectly sending more than
> > max_write data with NOTIFY_REPLY; see also previous patch), how the
> > situation was traced and for more involving patch that did not make it
> > into the tree.
> >
> > [1] https://marc.info/?l=linux-fsdevel&m=155057023600853&w=2
>
> Applied.

Thanks. Looking forward for it to appear in fuse.git#for-next

Kirill

2019-04-24 13:04:51

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 02:17:27PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 24, 2019 at 1:56 PM Kirill Smelkov <[email protected]> wrote:
>
> > I assume you are meaning this:
> >
> > --- a/fs/fuse/dev.c
> > +++ b/fs/fuse/dev.c
> > @@ -1745,15 +1745,15 @@ static int fuse_retrieve(struct fuse_conn *fc, struct inode *inode,
> > unsigned int offset;
> > size_t total_len = 0;
> > unsigned int num_pages;
> >
> > offset = outarg->offset & ~PAGE_MASK;
> > file_size = i_size_read(inode);
> >
> > - num = outarg->size;
> > + num = min(outarg->size, fc->max_write);
> > if (outarg->offset > file_size)
> > num = 0;
> > else if (outarg->offset + num > file_size)
> > num = file_size - outarg->offset; <-- THIS
> >
> > num_pages = (num + offset + PAGE_SIZE - 1) >> PAGE_SHIFT;
> > num_pages = min(num_pages, fc->max_pages);
> >
> > and then in this case (offset + num > file_size) num overwrite
> >
> > num = file_size - offset
> >
> > can make num only smaller, right? And then the patch is not wrong because there
> > is no other num overwriting in this function except when num is being further
> > decremented in loop that prepares pages to retrieve.
>
> You're right, of course.

Thanks. Does it mean that the patch is ok? Do I need to rework
something?


> > > Also the patch is whitespace damaged.
> >
> > I've tried to do the following in my mutt on "RESEND4, PATCH 1/2"
> > message:
> >
> > |(cd ~/src/linux/linux && git am -)
> >
> > and the patch applied successfully. So could you please clarify what
> > "whitespace damaged" means?
>
> Hmm, apparently this (and only this) message is "quoted-printable"
> encoded. git-am seems to handle it fine, but my script doesn't.
> Anyway, I'll do it manually.

I see. Probably it is not "quoted-printable" as

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

suggests and it is maybe due to UTF-8 characters (I used "·" several
times in patch description). Anyway if it helps you can pull the patch
from here

https://lab.nexedi.com/kirr/linux.git y/fuse-retrieve-cap-max_write

and then cherry-pick it (git cherry-pick fd482f96537a) to where needed.

Thanks again for feedback,

Kirill

2019-04-24 13:21:14

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 2:31 PM Kirill Smelkov <[email protected]> wrote:

> Thanks. Does it mean that the patch is ok? Do I need to rework
> something?

Pushed to #for-next with all the rest. Made some changes to the
patches, so please verify.

> I see. Probably it is not "quoted-printable" as
>
> Content-Type: text/plain; charset=utf-8
> Content-Transfer-Encoding: 8bit

Well, google converts it to quoted-printable then:

Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable

> suggests and it is maybe due to UTF-8 characters (I used "·" several
> times in patch description).

Please refrain from gratuitous use of non-ascii. That middle-dot is
written as "*" in C (fixed the patch description).

Thanks,
Miklos

2019-04-24 15:06:12

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 4:22 PM Kirill Smelkov <[email protected]> wrote:
> - FUSE_PRECISE_INVAL_DATA:
>
> --- b/include/uapi/linux/fuse.h
> +++ b/include/uapi/linux/fuse.h
> @@ -266,7 +266,7 @@
> * FUSE_MAX_PAGES: init_out.max_pages contains the max number of req pages
> * FUSE_CACHE_SYMLINKS: cache READLINK responses
> * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
> - * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for data cache invalidation
> + * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for invalidation
> */
> #define FUSE_ASYNC_READ (1 << 0)
> #define FUSE_POSIX_LOCKS (1 << 1)
>
> the "data cache" in "for data cache invalidation" has particular meaning
> and semantic: the filesystem promises to explicitly invalidate data of

Right; better name: FUSE_EXPLICIT_INVAL_DATA. Will push fixed version.

> Your amendment for FOPEN_STREAM in uapi/linux/fuse.h (see above) also
> suggests that it is better to be more explicit in that file.
>
> --- b/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -913,13 +913,8 @@
> fc->dont_mask = 1;
> if (arg->flags & FUSE_AUTO_INVAL_DATA)
> fc->auto_inval_data = 1;
> - if (arg->flags & FUSE_PRECISE_INVAL_DATA)
> + else if (arg->flags & FUSE_PRECISE_INVAL_DATA)
> fc->precise_inval_data = 1;
> - if (fc->auto_inval_data && fc->precise_inval_data) {
> - pr_warn("filesystem requested both auto and "
> - "precise cache control - using auto\n");
> - fc->precise_inval_data = 0;
> - }
> if (arg->flags & FUSE_DO_READDIRPLUS) {
> fc->do_readdirplus = 1;
> if (arg->flags & FUSE_READDIRPLUS_AUTO)
>
> Even though it is ok for me personally (I could be careful and use only
> FUSE_PRECISE_INVAL_DATA) I still think usage of both "auto" and "precise"
> invalidation modes deserves a warning. It is only at filesystem init time. What
> is the reason not to print it?

The warning makes no sense. It should either be failure or silent override.

> - "fuse: retrieve: cap requested size to negotiated max_write"
>
> Signed-off-by: Kirill Smelkov <[email protected]>
> Cc: Han-Wen Nienhuys <[email protected]>
> Cc: Jakob Unterwurzacher <[email protected]>
> -Cc: <[email protected]> # v2.6.36+
>
> what is the reason not to include this patch into stable series?

This doens't fix any bugs out there, but there is a slight chance of
regression (so it might possibly have to be reverted in the future) so
it absolutely makes no sense to backport it to stable.

Thanks,
Miklos

2019-04-24 18:11:54

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 05:02:42PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 24, 2019 at 4:22 PM Kirill Smelkov <[email protected]> wrote:
> > - FUSE_PRECISE_INVAL_DATA:
> >
> > --- b/include/uapi/linux/fuse.h
> > +++ b/include/uapi/linux/fuse.h
> > @@ -266,7 +266,7 @@
> > * FUSE_MAX_PAGES: init_out.max_pages contains the max number of req pages
> > * FUSE_CACHE_SYMLINKS: cache READLINK responses
> > * FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
> > - * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for data cache invalidation
> > + * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for invalidation
> > */
> > #define FUSE_ASYNC_READ (1 << 0)
> > #define FUSE_POSIX_LOCKS (1 << 1)
> >
> > the "data cache" in "for data cache invalidation" has particular meaning
> > and semantic: the filesystem promises to explicitly invalidate data of
>
> Right; better name: FUSE_EXPLICIT_INVAL_DATA. Will push fixed version.

- * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for invalidation
+ * FUSE_EXPLICIT_INVAL_DATA: only invalidate cached pages on explicit request

...

/** Filesystem is fully reponsible for page cache invalidation. */
- unsigned precise_inval_data:1;
+ unsigned explicit_inval_data:1;

Ok, let it be this way.


> > Your amendment for FOPEN_STREAM in uapi/linux/fuse.h (see above) also
> > suggests that it is better to be more explicit in that file.
> >
> > --- b/fs/fuse/inode.c
> > +++ b/fs/fuse/inode.c
> > @@ -913,13 +913,8 @@
> > fc->dont_mask = 1;
> > if (arg->flags & FUSE_AUTO_INVAL_DATA)
> > fc->auto_inval_data = 1;
> > - if (arg->flags & FUSE_PRECISE_INVAL_DATA)
> > + else if (arg->flags & FUSE_PRECISE_INVAL_DATA)
> > fc->precise_inval_data = 1;
> > - if (fc->auto_inval_data && fc->precise_inval_data) {
> > - pr_warn("filesystem requested both auto and "
> > - "precise cache control - using auto\n");
> > - fc->precise_inval_data = 0;
> > - }
> > if (arg->flags & FUSE_DO_READDIRPLUS) {
> > fc->do_readdirplus = 1;
> > if (arg->flags & FUSE_READDIRPLUS_AUTO)
> >
> > Even though it is ok for me personally (I could be careful and use only
> > FUSE_PRECISE_INVAL_DATA) I still think usage of both "auto" and "precise"
> > invalidation modes deserves a warning. It is only at filesystem init time. What
> > is the reason not to print it?
>
> The warning makes no sense. It should either be failure or silent override.

Ok.


> > - "fuse: retrieve: cap requested size to negotiated max_write"
> >
> > Signed-off-by: Kirill Smelkov <[email protected]>
> > Cc: Han-Wen Nienhuys <[email protected]>
> > Cc: Jakob Unterwurzacher <[email protected]>
> > -Cc: <[email protected]> # v2.6.36+
> >
> > what is the reason not to include this patch into stable series?
>
> This doens't fix any bugs out there, but there is a slight chance of
> regression (so it might possibly have to be reverted in the future) so
> it absolutely makes no sense to backport it to stable.

Ok.


Thanks again for tossing the patches,

Kirill

2019-04-24 20:18:28

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 03:19:11PM +0200, Miklos Szeredi wrote:
> On Wed, Apr 24, 2019 at 2:31 PM Kirill Smelkov <[email protected]> wrote:
>
> > Thanks. Does it mean that the patch is ok? Do I need to rework
> > something?
>
> Pushed to #for-next with all the rest. Made some changes to the
> patches, so please verify.

Thanks a lot. I've verified all changes and it is indeed better to amend
something:

- FOPEN_STREAM:

--- b/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -232,7 +232,7 @@
* FOPEN_KEEP_CACHE: don't invalidate the data cache on open
* FOPEN_NONSEEKABLE: the file is not seekable
* FOPEN_CACHE_DIR: allow caching this directory
- * FOPEN_STREAM: the file is stream-like
+ * FOPEN_STREAM: the file is stream-like (no file position at all)
*/
#define FOPEN_DIRECT_IO (1 << 0)
#define FOPEN_KEEP_CACHE (1 << 1)

I agree, it is better this way (no amendment needed here).


- FUSE_PRECISE_INVAL_DATA:

--- b/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -266,7 +266,7 @@
* FUSE_MAX_PAGES: init_out.max_pages contains the max number of req pages
* FUSE_CACHE_SYMLINKS: cache READLINK responses
* FUSE_NO_OPENDIR_SUPPORT: kernel supports zero-message opendir
- * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for data cache invalidation
+ * FUSE_PRECISE_INVAL_DATA: filesystem is fully responsible for invalidation
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)

the "data cache" in "for data cache invalidation" has particular meaning
and semantic: the filesystem promises to explicitly invalidate data of
the file, but it does not promise to explicitly invalidate attributes. I
understand it is a long line, and I myself tried to remove extra words,
but "data cache" here is semantically needed, so I left it. The
particular behaviour of FUSE_PRECISE_INVAL_DATA also covers only data
cache invalidations. For example file attributes, if not explicitly
invalidated by filesystem, are still invalidated by kernel by its
heuristic and due to negotiated attributes timeout, which is not
"precise".

If it is "precise invalidation for everything: data and attributes" we
should probably rename it to FUSE_PRECISE_INVAL and change the patch to
also cover attributes and not invalidate them automatically. However I
suggest to keep two things separate - data and attributes and not to
change the patch. By the way, description for "auto" invalidation mode is

FUSE_AUTO_INVAL_DATA: automatically invalidate cached pages

which tells both from mode name and from its description that it is
about data.

uapi/linux/fuse.h is often used by userspace people as a document to
understand FUSE protocol (at least I used it this way). I thus suggest
to restore "data cache" since it makes semantic difference.

Your amendment for FOPEN_STREAM in uapi/linux/fuse.h (see above) also
suggests that it is better to be more explicit in that file.

--- b/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -913,13 +913,8 @@
fc->dont_mask = 1;
if (arg->flags & FUSE_AUTO_INVAL_DATA)
fc->auto_inval_data = 1;
- if (arg->flags & FUSE_PRECISE_INVAL_DATA)
+ else if (arg->flags & FUSE_PRECISE_INVAL_DATA)
fc->precise_inval_data = 1;
- if (fc->auto_inval_data && fc->precise_inval_data) {
- pr_warn("filesystem requested both auto and "
- "precise cache control - using auto\n");
- fc->precise_inval_data = 0;
- }
if (arg->flags & FUSE_DO_READDIRPLUS) {
fc->do_readdirplus = 1;
if (arg->flags & FUSE_READDIRPLUS_AUTO)

Even though it is ok for me personally (I could be careful and use only
FUSE_PRECISE_INVAL_DATA) I still think usage of both "auto" and "precise"
invalidation modes deserves a warning. It is only at filesystem init time. What
is the reason not to print it?

- "fuse: retrieve: cap requested size to negotiated max_write"

Signed-off-by: Kirill Smelkov <[email protected]>
Cc: Han-Wen Nienhuys <[email protected]>
Cc: Jakob Unterwurzacher <[email protected]>
-Cc: <[email protected]> # v2.6.36+

what is the reason not to include this patch into stable series?


- "fuse: require /dev/fuse reads to have enough buffer capacity"

--- b/fs/fuse/dev.c
+++ b/fs/fuse/dev.c
@@ -1324,7 +1324,7 @@
* to indicate that it is not following FUSE server/client contract.
* Don't dequeue / abort any request.
*/
- if (nbytes < (fc->conn_init ? 4096 + fc->max_write : FUSE_MIN_READ_BUFFER))
+ if (nbytes < max_t(size_t, FUSE_MIN_READ_BUFFER, 4096 + fc->max_write))
return -EINVAL;

ok, this seems to be correct, since fc, including fc->max_write, is
initialized as all zeros by fuse_conn_init (no amendment needed here).


> > I see. Probably it is not "quoted-printable" as
> >
> > Content-Type: text/plain; charset=utf-8
> > Content-Transfer-Encoding: 8bit
>
> Well, google converts it to quoted-printable then:
>
> Content-Type: text/plain; charset=utf-8
> Content-Transfer-Encoding: quoted-printable

I see.

> > suggests and it is maybe due to UTF-8 characters (I used "·" several
> > times in patch description).
>
> Please refrain from gratuitous use of non-ascii. That middle-dot is
> written as "*" in C (fixed the patch description).

Ok, I will try not to use unicode for my kernel fuse bits.

Thanks, again, a lot for merging the patches.

Kirill

2019-04-25 06:00:44

by Kirill Smelkov

[permalink] [raw]
Subject: Re: [RESEND4, PATCH 1/2] fuse: retrieve: cap requested size to negotiated max_write

On Wed, Apr 24, 2019 at 09:09:58PM +0300, Kirill Smelkov wrote:
> Thanks again for tossing the patches,

I have to appologize here: I used the word "tossing" here as if it would
mean "processing" or "applying", but a friend of mine just said that it
means something like "throwing away" which completely changes the
message. Sorry, I did not meant that - it's just my english for which I
beg you pardon.

All I wanted to say is that I'm grateful that you reviewed and applied
the patches. So thanks and appologize for the inconveninece.

Kirill