Hi,
The patchset fixes a few subtle races stemmed from incorrect expectation
of what fuse_set_nowrite() guarantees. The fact that it makes fi->writectr
negative and waits for fi->writectr == FUSE_NOWRITE ensures only two things:
1) If there are any in-flight writeback requests right now, let's wait for
them being completed.
2) Suspend processing new writeback requests until fuse_release_nowrite().
Both are related to communication between in-kernel fuse and userspace
fuse daemon. But fuse_set_nowrite() does not prevent generic kernel code
from sending dirty pages to writeback resulting in fuse_writepage being
called. I.e. fi->queued_writes may grow independently on fuse_set_nowrite()
machinery.
As soon as fuse_writepage_locked() called end_page_writeback() generic
kernel code may do with the page virtually anything w/o notifying fuse. See
per-patch descriptions for details of some races.
Thanks,
Maxim
---
Maxim Patlasov (2):
fuse: postpone end_page_writeback() in fuse_writepage_locked()
fuse: wait for writeback in fuse_file_fallocate()
fs/fuse/file.c | 56 ++++++++++++++++++++++++++++++++++++++++++++++----------
1 files changed, 46 insertions(+), 10 deletions(-)
--
Signature
The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
1) An user makes a page dirty via mmap-ed write.
2) The user performs shrinking truncate(2) intended to purge the page.
3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
the original page by end_page_writeback.
4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex
is free.
5) Ordinary write(2) extends i_size back to cover the page. Note that
fuse_send_write_pages do wait for fuse writeback, but for another
page->index.
6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
fuse_send_writepage is supposed to crop inarg->size of the request,
but it doesn't because i_size has already been extended back.
Moving end_page_writeback to the end of fuse_writepage_locked fixes the race
because now the fact that truncate_pagecache is successfully returned infers
that fuse_writepage_locked has already called end_page_writeback. And this,
in turn, infers that fuse_flush_writepages has already called
fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
could not extend it because of i_mutex held by ftruncate(2).
Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 3 ++-
1 files changed, 2 insertions(+), 1 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 5c121fe..d1715b3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1529,7 +1529,6 @@ static int fuse_writepage_locked(struct page *page)
inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
- end_page_writeback(page);
spin_lock(&fc->lock);
list_add(&req->writepages_entry, &fi->writepages);
@@ -1537,6 +1536,8 @@ static int fuse_writepage_locked(struct page *page)
fuse_flush_writepages(inode);
spin_unlock(&fc->lock);
+ end_page_writeback(page);
+
return 0;
err_free:
The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
1) An user makes a page dirty via mmap-ed write.
2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
and <offset, size> covering the page.
3) Before truncate_pagecache_range call from fuse_file_fallocate,
the page goes to write-back. The page is fully processed by fuse_writepage
(including end_page_writeback on the page), but fuse_flush_writepages did
nothing because fi->writectr < 0.
4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
by calling fuse_release_nowrite. The latter triggers processing queued
write-back request which will write stale date to the hole soon.
Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 53 ++++++++++++++++++++++++++++++++++++++++++++---------
1 files changed, 44 insertions(+), 9 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d1715b3..2b18c4b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -344,6 +344,31 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
return found;
}
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+ pgoff_t idx_to)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_req *req;
+ bool found = false;
+
+ spin_lock(&fc->lock);
+ list_for_each_entry(req, &fi->writepages, writepages_entry) {
+ pgoff_t curr_index;
+
+ BUG_ON(req->inode != inode);
+ curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+ if (!(idx_from >= curr_index + req->num_pages ||
+ idx_to < curr_index)) {
+ found = true;
+ break;
+ }
+ }
+ spin_unlock(&fc->lock);
+
+ return found;
+}
+
/*
* Wait for page writeback to be completed.
*
@@ -358,6 +383,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
return 0;
}
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+ size_t bytes)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ pgoff_t idx_from, idx_to;
+
+ idx_from = start >> PAGE_CACHE_SHIFT;
+ idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+ wait_event(fi->page_waitq,
+ !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
@@ -2478,8 +2516,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (lock_inode) {
mutex_lock(&inode->i_mutex);
- if (mode & FALLOC_FL_PUNCH_HOLE)
- fuse_set_nowrite(inode);
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ truncate_pagecache_range(inode, offset,
+ offset + length - 1);
+ fuse_wait_on_writeback(inode, offset, length);
+ }
}
req = fuse_get_req_nopages(fc);
@@ -2508,17 +2549,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (!(mode & FALLOC_FL_KEEP_SIZE))
fuse_write_update_size(inode, offset + length);
- if (mode & FALLOC_FL_PUNCH_HOLE)
- truncate_pagecache_range(inode, offset, offset + length - 1);
-
fuse_invalidate_attr(inode);
out:
- if (lock_inode) {
- if (mode & FALLOC_FL_PUNCH_HOLE)
- fuse_release_nowrite(inode);
+ if (lock_inode)
mutex_unlock(&inode->i_mutex);
- }
return err;
}
On 08/12/2013 12:39 PM, Maxim Patlasov wrote:
> The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
>
> 1) An user makes a page dirty via mmap-ed write.
> 2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
> and <offset, size> covering the page.
> 3) Before truncate_pagecache_range call from fuse_file_fallocate,
> the page goes to write-back. The page is fully processed by fuse_writepage
> (including end_page_writeback on the page), but fuse_flush_writepages did
> nothing because fi->writectr < 0.
> 4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
> by calling fuse_release_nowrite. The latter triggers processing queued
> write-back request which will write stale date to the hole soon.
>
> Signed-off-by: Maxim Patlasov <[email protected]>
> ---
Hi Maxim,
Nice catch and description, one minor concern...
> fs/fuse/file.c | 53 ++++++++++++++++++++++++++++++++++++++++++++---------
> 1 files changed, 44 insertions(+), 9 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index d1715b3..2b18c4b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -344,6 +344,31 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
> return found;
> }
>
> +static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
> + pgoff_t idx_to)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_req *req;
> + bool found = false;
> +
> + spin_lock(&fc->lock);
> + list_for_each_entry(req, &fi->writepages, writepages_entry) {
> + pgoff_t curr_index;
> +
> + BUG_ON(req->inode != inode);
> + curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
> + if (!(idx_from >= curr_index + req->num_pages ||
> + idx_to < curr_index)) {
> + found = true;
> + break;
> + }
> + }
> + spin_unlock(&fc->lock);
> +
> + return found;
> +}
> +
> /*
> * Wait for page writeback to be completed.
> *
> @@ -358,6 +383,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
> return 0;
> }
>
> +static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
> + size_t bytes)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + pgoff_t idx_from, idx_to;
> +
> + idx_from = start >> PAGE_CACHE_SHIFT;
> + idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
> +
> + wait_event(fi->page_waitq,
> + !fuse_range_is_writeback(inode, idx_from, idx_to));
> +}
> +
> static int fuse_flush(struct file *file, fl_owner_t id)
> {
> struct inode *inode = file_inode(file);
> @@ -2478,8 +2516,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>
> if (lock_inode) {
> mutex_lock(&inode->i_mutex);
> - if (mode & FALLOC_FL_PUNCH_HOLE)
> - fuse_set_nowrite(inode);
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + truncate_pagecache_range(inode, offset,
> + offset + length - 1);
> + fuse_wait_on_writeback(inode, offset, length);
> + }
If this happens to be the first attempt on an fs that doesn't support
fallocate, we'll return -EOPNOTSUPP after having already punched out the
data in the pagecache. What about replacing the nowrite logic with a
flush (and still followed by your new writeback wait logic) rather than
moving the pagecache truncate?
Brian
> }
>
> req = fuse_get_req_nopages(fc);
> @@ -2508,17 +2549,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> if (!(mode & FALLOC_FL_KEEP_SIZE))
> fuse_write_update_size(inode, offset + length);
>
> - if (mode & FALLOC_FL_PUNCH_HOLE)
> - truncate_pagecache_range(inode, offset, offset + length - 1);
> -
> fuse_invalidate_attr(inode);
>
> out:
> - if (lock_inode) {
> - if (mode & FALLOC_FL_PUNCH_HOLE)
> - fuse_release_nowrite(inode);
> + if (lock_inode)
> mutex_unlock(&inode->i_mutex);
> - }
>
> return err;
> }
>
Hi,
08/13/2013 04:05 PM, Brian Foster пишет:
> ...
> @@ -2478,8 +2516,11 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>
> if (lock_inode) {
> mutex_lock(&inode->i_mutex);
> - if (mode & FALLOC_FL_PUNCH_HOLE)
> - fuse_set_nowrite(inode);
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + truncate_pagecache_range(inode, offset,
> + offset + length - 1);
> + fuse_wait_on_writeback(inode, offset, length);
> + }
> If this happens to be the first attempt on an fs that doesn't support
> fallocate, we'll return -EOPNOTSUPP after having already punched out the
> data in the pagecache.
Yes, this is unpleasant, but it's not critical, imo. We're returning an
error code (even though equal to -EOPNOTSUPP) and a sane application
should not make any assumption about current state of the punched
region. Also, the application intended to discard given region of the
file, so why should it pay care for its content afterwards?
> What about replacing the nowrite logic with a
> flush (and still followed by your new writeback wait logic) rather than
> moving the pagecache truncate?
The "flush" you mentioned should firstly flush page cache.
invalidate_inode_pages2_range() seems to be a candidate. We definitely
cannot ignore error code from it because it can be fuse_launder_page()
who got -ENOMEM from fuse_writepage_locked(). In case of err == -ENOMEM,
we could safely fail fallocate, but what should we do if it's -EBUSY?
Any ideas?
Thanks,
Maxim
On 08/13/2013 08:56 AM, Maxim Patlasov wrote:
> Hi,
>
> 08/13/2013 04:05 PM, Brian Foster пишет:
>> ...
>> @@ -2478,8 +2516,11 @@ static long fuse_file_fallocate(struct file
>> *file, int mode, loff_t offset,
>> if (lock_inode) {
>> mutex_lock(&inode->i_mutex);
>> - if (mode & FALLOC_FL_PUNCH_HOLE)
>> - fuse_set_nowrite(inode);
>> + if (mode & FALLOC_FL_PUNCH_HOLE) {
>> + truncate_pagecache_range(inode, offset,
>> + offset + length - 1);
>> + fuse_wait_on_writeback(inode, offset, length);
>> + }
>> If this happens to be the first attempt on an fs that doesn't support
>> fallocate, we'll return -EOPNOTSUPP after having already punched out the
>> data in the pagecache.
>
> Yes, this is unpleasant, but it's not critical, imo. We're returning an
> error code (even though equal to -EOPNOTSUPP) and a sane application
> should not make any assumption about current state of the punched
> region. Also, the application intended to discard given region of the
> file, so why should it pay care for its content afterwards?
>
I agree, though most users probably wouldn't expect that a blatant error
like EOPNOTSUPP leave the range in a weird state. What's more, it only
does so if it's the first attempt and behaves more appropriately after
that.
>> What about replacing the nowrite logic with a
>> flush (and still followed by your new writeback wait logic) rather than
>> moving the pagecache truncate?
>
> The "flush" you mentioned should firstly flush page cache.
> invalidate_inode_pages2_range() seems to be a candidate. We definitely
> cannot ignore error code from it because it can be fuse_launder_page()
> who got -ENOMEM from fuse_writepage_locked(). In case of err == -ENOMEM,
> we could safely fail fallocate, but what should we do if it's -EBUSY?
> Any ideas?
>
I was referring to something like filemap_write_and_wait_range(), for
example. Then continue to use truncate_pagecache_range() as we do today.
Thoughts?
Brian
> Thanks,
> Maxim
08/13/2013 05:23 PM, Brian Foster пишет:
> On 08/13/2013 08:56 AM, Maxim Patlasov wrote:
>> Hi,
>>
>> 08/13/2013 04:05 PM, Brian Foster пишет:
>>> ...
>>> @@ -2478,8 +2516,11 @@ static long fuse_file_fallocate(struct file
>>> *file, int mode, loff_t offset,
>>> if (lock_inode) {
>>> mutex_lock(&inode->i_mutex);
>>> - if (mode & FALLOC_FL_PUNCH_HOLE)
>>> - fuse_set_nowrite(inode);
>>> + if (mode & FALLOC_FL_PUNCH_HOLE) {
>>> + truncate_pagecache_range(inode, offset,
>>> + offset + length - 1);
>>> + fuse_wait_on_writeback(inode, offset, length);
>>> + }
>>> If this happens to be the first attempt on an fs that doesn't support
>>> fallocate, we'll return -EOPNOTSUPP after having already punched out the
>>> data in the pagecache.
>> Yes, this is unpleasant, but it's not critical, imo. We're returning an
>> error code (even though equal to -EOPNOTSUPP) and a sane application
>> should not make any assumption about current state of the punched
>> region. Also, the application intended to discard given region of the
>> file, so why should it pay care for its content afterwards?
>>
> I agree, though most users probably wouldn't expect that a blatant error
> like EOPNOTSUPP leave the range in a weird state. What's more, it only
> does so if it's the first attempt and behaves more appropriately after
> that.
>
>>> What about replacing the nowrite logic with a
>>> flush (and still followed by your new writeback wait logic) rather than
>>> moving the pagecache truncate?
>> The "flush" you mentioned should firstly flush page cache.
>> invalidate_inode_pages2_range() seems to be a candidate. We definitely
>> cannot ignore error code from it because it can be fuse_launder_page()
>> who got -ENOMEM from fuse_writepage_locked(). In case of err == -ENOMEM,
>> we could safely fail fallocate, but what should we do if it's -EBUSY?
>> Any ideas?
>>
> I was referring to something like filemap_write_and_wait_range(), for
> example. Then continue to use truncate_pagecache_range() as we do today.
> Thoughts
Looks nice. I'll send updated patch after some testing. Thanks a lot for
the suggestion!
Maxim
The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
1) An user makes a page dirty via mmap-ed write.
2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
and <offset, size> covering the page.
3) Before truncate_pagecache_range call from fuse_file_fallocate,
the page goes to write-back. The page is fully processed by fuse_writepage
(including end_page_writeback on the page), but fuse_flush_writepages did
nothing because fi->writectr < 0.
4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
by calling fuse_release_nowrite. The latter triggers processing queued
write-back request which will write stale data to the hole soon.
Changed in v2 (thanks to Brian for suggestion):
- Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
we can end up in returning -ENOTSUPP while user data is already punched
from page cache. Use filemap_write_and_wait_range() instead.
Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++------
1 files changed, 48 insertions(+), 6 deletions(-)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index d1715b3..e88ba8b 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -344,6 +344,31 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
return found;
}
+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+ pgoff_t idx_to)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_req *req;
+ bool found = false;
+
+ spin_lock(&fc->lock);
+ list_for_each_entry(req, &fi->writepages, writepages_entry) {
+ pgoff_t curr_index;
+
+ BUG_ON(req->inode != inode);
+ curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+ if (!(idx_from >= curr_index + req->num_pages ||
+ idx_to < curr_index)) {
+ found = true;
+ break;
+ }
+ }
+ spin_unlock(&fc->lock);
+
+ return found;
+}
+
/*
* Wait for page writeback to be completed.
*
@@ -358,6 +383,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
return 0;
}
+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+ size_t bytes)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ pgoff_t idx_from, idx_to;
+
+ idx_from = start >> PAGE_CACHE_SHIFT;
+ idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+ wait_event(fi->page_waitq,
+ !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
@@ -2478,8 +2516,15 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
if (lock_inode) {
mutex_lock(&inode->i_mutex);
- if (mode & FALLOC_FL_PUNCH_HOLE)
- fuse_set_nowrite(inode);
+ if (mode & FALLOC_FL_PUNCH_HOLE) {
+ loff_t endbyte = offset + length - 1;
+ err = filemap_write_and_wait_range(inode->i_mapping,
+ offset, endbyte);
+ if (err)
+ goto out;
+
+ fuse_wait_on_writeback(inode, offset, length);
+ }
}
req = fuse_get_req_nopages(fc);
@@ -2514,11 +2559,8 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
fuse_invalidate_attr(inode);
out:
- if (lock_inode) {
- if (mode & FALLOC_FL_PUNCH_HOLE)
- fuse_release_nowrite(inode);
+ if (lock_inode)
mutex_unlock(&inode->i_mutex);
- }
return err;
}
On 08/16/2013 07:30 AM, Maxim Patlasov wrote:
> The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
>
> 1) An user makes a page dirty via mmap-ed write.
> 2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
> and <offset, size> covering the page.
> 3) Before truncate_pagecache_range call from fuse_file_fallocate,
> the page goes to write-back. The page is fully processed by fuse_writepage
> (including end_page_writeback on the page), but fuse_flush_writepages did
> nothing because fi->writectr < 0.
> 4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
> by calling fuse_release_nowrite. The latter triggers processing queued
> write-back request which will write stale data to the hole soon.
>
> Changed in v2 (thanks to Brian for suggestion):
> - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
> we can end up in returning -ENOTSUPP while user data is already punched
> from page cache. Use filemap_write_and_wait_range() instead.
>
> Signed-off-by: Maxim Patlasov <[email protected]>
> ---
Looks good to me, thanks Maxim.
Reviewed-by: Brian Foster <[email protected]>
> fs/fuse/file.c | 54 ++++++++++++++++++++++++++++++++++++++++++++++++------
> 1 files changed, 48 insertions(+), 6 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index d1715b3..e88ba8b 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -344,6 +344,31 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)
> return found;
> }
>
> +static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
> + pgoff_t idx_to)
> +{
> + struct fuse_conn *fc = get_fuse_conn(inode);
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + struct fuse_req *req;
> + bool found = false;
> +
> + spin_lock(&fc->lock);
> + list_for_each_entry(req, &fi->writepages, writepages_entry) {
> + pgoff_t curr_index;
> +
> + BUG_ON(req->inode != inode);
> + curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
> + if (!(idx_from >= curr_index + req->num_pages ||
> + idx_to < curr_index)) {
> + found = true;
> + break;
> + }
> + }
> + spin_unlock(&fc->lock);
> +
> + return found;
> +}
> +
> /*
> * Wait for page writeback to be completed.
> *
> @@ -358,6 +383,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
> return 0;
> }
>
> +static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
> + size_t bytes)
> +{
> + struct fuse_inode *fi = get_fuse_inode(inode);
> + pgoff_t idx_from, idx_to;
> +
> + idx_from = start >> PAGE_CACHE_SHIFT;
> + idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
> +
> + wait_event(fi->page_waitq,
> + !fuse_range_is_writeback(inode, idx_from, idx_to));
> +}
> +
> static int fuse_flush(struct file *file, fl_owner_t id)
> {
> struct inode *inode = file_inode(file);
> @@ -2478,8 +2516,15 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
>
> if (lock_inode) {
> mutex_lock(&inode->i_mutex);
> - if (mode & FALLOC_FL_PUNCH_HOLE)
> - fuse_set_nowrite(inode);
> + if (mode & FALLOC_FL_PUNCH_HOLE) {
> + loff_t endbyte = offset + length - 1;
> + err = filemap_write_and_wait_range(inode->i_mapping,
> + offset, endbyte);
> + if (err)
> + goto out;
> +
> + fuse_wait_on_writeback(inode, offset, length);
> + }
> }
>
> req = fuse_get_req_nopages(fc);
> @@ -2514,11 +2559,8 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
> fuse_invalidate_attr(inode);
>
> out:
> - if (lock_inode) {
> - if (mode & FALLOC_FL_PUNCH_HOLE)
> - fuse_release_nowrite(inode);
> + if (lock_inode)
> mutex_unlock(&inode->i_mutex);
> - }
>
> return err;
> }
>
On Fri, Aug 16, 2013 at 03:30:27PM +0400, Maxim Patlasov wrote:
> The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
>
> 1) An user makes a page dirty via mmap-ed write.
> 2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
> and <offset, size> covering the page.
> 3) Before truncate_pagecache_range call from fuse_file_fallocate,
> the page goes to write-back. The page is fully processed by fuse_writepage
> (including end_page_writeback on the page), but fuse_flush_writepages did
> nothing because fi->writectr < 0.
> 4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
> by calling fuse_release_nowrite. The latter triggers processing queued
> write-back request which will write stale data to the hole soon.
>
> Changed in v2 (thanks to Brian for suggestion):
> - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
> we can end up in returning -ENOTSUPP while user data is already punched
> from page cache. Use filemap_write_and_wait_range() instead.
The problem with fuse_wait_on_writeback() is starvation. You could have the
page range continually being dirtied and written back and fallocate() livelocked
in fuse_wait_on_writeback() for ever AFAICS.
So having a barrier like FUSE_NOWRITE is good but then we need to take care of
throwing away the truncated part of the queue. But that should be doable by
passing the truncated range explicitly to fuse_release_nowrite().
Thanks,
Miklos
On Mon, Aug 12, 2013 at 6:39 PM, Maxim Patlasov <[email protected]> wrote:
> The patch fixes a race between ftruncate(2), mmap-ed write and write(2):
>
> 1) An user makes a page dirty via mmap-ed write.
> 2) The user performs shrinking truncate(2) intended to purge the page.
> 3) Before fuse_do_setattr calls truncate_pagecache, the page goes to
> writeback. fuse_writepage_locked fills FUSE_WRITE request and releases
> the original page by end_page_writeback.
> 4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex
> is free.
> 5) Ordinary write(2) extends i_size back to cover the page. Note that
> fuse_send_write_pages do wait for fuse writeback, but for another
> page->index.
> 6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request.
> fuse_send_writepage is supposed to crop inarg->size of the request,
> but it doesn't because i_size has already been extended back.
>
> Moving end_page_writeback to the end of fuse_writepage_locked fixes the race
> because now the fact that truncate_pagecache is successfully returned infers
> that fuse_writepage_locked has already called end_page_writeback. And this,
> in turn, infers that fuse_flush_writepages has already called
> fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2)
> could not extend it because of i_mutex held by ftruncate(2).
>
> Signed-off-by: Maxim Patlasov <[email protected]>
Applied this, too.
Thanks,
Miklos
> ---
> fs/fuse/file.c | 3 ++-
> 1 files changed, 2 insertions(+), 1 deletions(-)
>
> diff --git a/fs/fuse/file.c b/fs/fuse/file.c
> index 5c121fe..d1715b3 100644
> --- a/fs/fuse/file.c
> +++ b/fs/fuse/file.c
> @@ -1529,7 +1529,6 @@ static int fuse_writepage_locked(struct page *page)
>
> inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
> inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
> - end_page_writeback(page);
>
> spin_lock(&fc->lock);
> list_add(&req->writepages_entry, &fi->writepages);
> @@ -1537,6 +1536,8 @@ static int fuse_writepage_locked(struct page *page)
> fuse_flush_writepages(inode);
> spin_unlock(&fc->lock);
>
> + end_page_writeback(page);
> +
> return 0;
>
> err_free:
>
Hi,
08/29/2013 07:41 PM, Miklos Szeredi пишет:
> On Fri, Aug 16, 2013 at 03:30:27PM +0400, Maxim Patlasov wrote:
>> The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE):
>>
>> 1) An user makes a page dirty via mmap-ed write.
>> 2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE
>> and <offset, size> covering the page.
>> 3) Before truncate_pagecache_range call from fuse_file_fallocate,
>> the page goes to write-back. The page is fully processed by fuse_writepage
>> (including end_page_writeback on the page), but fuse_flush_writepages did
>> nothing because fi->writectr < 0.
>> 4) truncate_pagecache_range is called and fuse_file_fallocate is finishing
>> by calling fuse_release_nowrite. The latter triggers processing queued
>> write-back request which will write stale data to the hole soon.
>>
>> Changed in v2 (thanks to Brian for suggestion):
>> - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise,
>> we can end up in returning -ENOTSUPP while user data is already punched
>> from page cache. Use filemap_write_and_wait_range() instead.
> The problem with fuse_wait_on_writeback() is starvation. You could have the
> page range continually being dirtied and written back and fallocate() livelocked
> in fuse_wait_on_writeback() for ever AFAICS.
Yes, I agree. I thought being infinitely dirtied is impossible if
i_mutex is held, but now I understand it's not true for mmap-ed writes.
I need to think more about it (livelock avoidance).
>
> So having a barrier like FUSE_NOWRITE is good but then we need to take care of
> throwing away the truncated part of the queue. But that should be doable by
> passing the truncated range explicitly to fuse_release_nowrite().
Yes, I considered this approach, but splitting a fuse request into two
in fuse_send_writepage() made me sick. What if allocation fails?
Thanks,
Maxim
On Thu, Aug 29, 2013 at 08:27:30PM +0400, Maxim Patlasov wrote:
> >So having a barrier like FUSE_NOWRITE is good but then we need to take care
> >of throwing away the truncated part of the queue. But that should be doable
> >by passing the truncated range explicitly to fuse_release_nowrite().
>
> Yes, I considered this approach, but splitting a fuse request into
> two in fuse_send_writepage() made me sick. What if allocation fails?
Heh, yeah.
I can think of a hundred ways this could be solved without needing an
allocation. Probably none of them worth the hassle.
Or if the hole fits inside the write we could just zero out the affected pages.
Which is cheating a bit, but no one will notice ;)
Thanks,
Miklos
BTW, isn't it enough to do the filemap_write_and_wait() *plus* the
fuse_set_nowrite()?
Thanks,
Miklos
On Thu, Aug 29, 2013 at 6:41 PM, Miklos Szeredi <[email protected]> wrote:
> BTW, isn't it enough to do the filemap_write_and_wait() *plus* the
> fuse_set_nowrite()?
Thought about it a bit and I think this should do fine.
Any writes before the fallocate will go trough before the fallocate.
i_mutex guarantees that only one instance of fuse_set_nowrite() is
running. Any mmaped writes during the fallocate() will go after the
fallocate request and the page cache truncation and that's fine too.
Page cache is consistent since it doens't contain pages for those
writes to the hole. Subsequent reads to that area will fill them in.
Any other concerns?
Thanks,
Miklos
08/30/2013 01:13 PM, Miklos Szeredi пишет:
> On Thu, Aug 29, 2013 at 6:41 PM, Miklos Szeredi <[email protected]> wrote:
>> BTW, isn't it enough to do the filemap_write_and_wait() *plus* the
>> fuse_set_nowrite()?
> Thought about it a bit and I think this should do fine.
>
> Any writes before the fallocate will go trough before the fallocate.
> i_mutex guarantees that only one instance of fuse_set_nowrite() is
> running. Any mmaped writes during the fallocate() will go after the
> fallocate request and the page cache truncation and that's fine too.
> Page cache is consistent since it doens't contain pages for those
> writes to the hole. Subsequent reads to that area will fill them in.
>
> Any other concerns?
No. What you suggest looks as a neat and correct solution. I'll resend
the updated patch after some testing (since now till Monday).
As for proof-of-correctness, all you wrote above is correct, but the
first point had been boiling my mind for a while. I came to the
following reasoning (hopefully it is what you meant):
The fact that filemap_write_and_wait() returned infers that
end_page_writeback() was called for all relevant pages. And fuse doesn't
call it before adding request to fi->queued_writes and calling
fuse_flush_writepages(). And the latter, in turn, guarantees proper
accounting of request in fi->writectr. Here, of course, it's crucial
that we can't have concurrent fuse_set_nowrite(), as you explained.
Hence, so far as fi->writectr was bumped, fuse_set_nowrite() we call
after filemap_write_and_wait() would wait until all changes have gone to
the server.
Thanks,
Maxim
On Fri, Aug 30, 2013 at 1:33 PM, Maxim Patlasov <[email protected]> wrote:
> 08/30/2013 01:13 PM, Miklos Szeredi пишет:
>
>> On Thu, Aug 29, 2013 at 6:41 PM, Miklos Szeredi <[email protected]> wrote:
>>>
>>> BTW, isn't it enough to do the filemap_write_and_wait() *plus* the
>>> fuse_set_nowrite()?
>>
>> Thought about it a bit and I think this should do fine.
>>
>> Any writes before the fallocate will go trough before the fallocate.
>> i_mutex guarantees that only one instance of fuse_set_nowrite() is
>> running. Any mmaped writes during the fallocate() will go after the
>> fallocate request and the page cache truncation and that's fine too.
>> Page cache is consistent since it doens't contain pages for those
>> writes to the hole. Subsequent reads to that area will fill them in.
>>
>> Any other concerns?
>
>
> No. What you suggest looks as a neat and correct solution. I'll resend the
> updated patch after some testing (since now till Monday).
>
> As for proof-of-correctness, all you wrote above is correct, but the first
> point had been boiling my mind for a while. I came to the following
> reasoning (hopefully it is what you meant):
>
> The fact that filemap_write_and_wait() returned infers that
> end_page_writeback() was called for all relevant pages. And fuse doesn't
> call it before adding request to fi->queued_writes and calling
> fuse_flush_writepages(). And the latter, in turn, guarantees proper
> accounting of request in fi->writectr. Here, of course, it's crucial that we
> can't have concurrent fuse_set_nowrite(), as you explained. Hence, so far as
> fi->writectr was bumped, fuse_set_nowrite() we call after
> filemap_write_and_wait() would wait until all changes have gone to the
> server.
Any news about this?
Thanks,
Miklos
On 09/11/2013 02:12 PM, Miklos Szeredi wrote:
> On Fri, Aug 30, 2013 at 1:33 PM, Maxim Patlasov <[email protected]> wrote:
>> 08/30/2013 01:13 PM, Miklos Szeredi пишет:
>>
>>> On Thu, Aug 29, 2013 at 6:41 PM, Miklos Szeredi <[email protected]> wrote:
>>>> BTW, isn't it enough to do the filemap_write_and_wait() *plus* the
>>>> fuse_set_nowrite()?
>>> Thought about it a bit and I think this should do fine.
>>>
>>> Any writes before the fallocate will go trough before the fallocate.
>>> i_mutex guarantees that only one instance of fuse_set_nowrite() is
>>> running. Any mmaped writes during the fallocate() will go after the
>>> fallocate request and the page cache truncation and that's fine too.
>>> Page cache is consistent since it doens't contain pages for those
>>> writes to the hole. Subsequent reads to that area will fill them in.
>>>
>>> Any other concerns?
>>
>> No. What you suggest looks as a neat and correct solution. I'll resend the
>> updated patch after some testing (since now till Monday).
>>
>> As for proof-of-correctness, all you wrote above is correct, but the first
>> point had been boiling my mind for a while. I came to the following
>> reasoning (hopefully it is what you meant):
>>
>> The fact that filemap_write_and_wait() returned infers that
>> end_page_writeback() was called for all relevant pages. And fuse doesn't
>> call it before adding request to fi->queued_writes and calling
>> fuse_flush_writepages(). And the latter, in turn, guarantees proper
>> accounting of request in fi->writectr. Here, of course, it's crucial that we
>> can't have concurrent fuse_set_nowrite(), as you explained. Hence, so far as
>> fi->writectr was bumped, fuse_set_nowrite() we call after
>> filemap_write_and_wait() would wait until all changes have gone to the
>> server.
> Any news about this?
Testing updated patch revealed a problem (fsx caught data corruption).
Then I instrumented debug version to get a cue. The debug version
survived several days of testing, but now I discovered that that test
setup was not fully correct. I'll re-run it now.
Thanks,
Maxim