2013-06-29 17:42:04

by Maxim Patlasov

Subject: [PATCH v5 00/16] fuse: An attempt to implement a write-back cache policy

Hi,

This is the fifth iteration of Pavel Emelyanov's patch-set implementing a
write-back policy for the FUSE page cache. The initial patch-set description
was the following:

One of the problems with the existing FUSE implementation is that it uses a
write-through cache policy, which results in performance problems on certain
workloads. E.g. when copying a big file into a FUSE file, cp pushes every
128k to userspace synchronously. This becomes a problem when the userspace
back-end uses networking for storing the data.

A good solution for this is switching the FUSE page cache to a write-back policy.
With it, file data are pushed to userspace in big chunks (depending on the dirty
memory limits, but much more than 128k), which lets the FUSE daemons handle the
size updates in a more efficient manner.

The writeback feature is per-connection and is explicitly configurable at the
init stage (is it worth making it CAP_SOMETHING protected?); a protocol-level
sketch of the negotiation follows the list below. When the writeback is
turned ON:

* still copy writeback pages to a temporary buffer when sending a writeback request
and finish the page writeback immediately

* make the kernel maintain the inode's i_size to avoid frequent i_size synchronization
with the user space

* take NR_WRITEBACK_TEMP into account when making the balance_dirty_pages decision.
This protects us from having too many dirty pages on FUSE
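
As an illustrative aside (not part of the patch-set): below is a minimal sketch
of how a daemon speaking the raw FUSE protocol could opt in to the write-back
cache at INIT time. The /dev/fuse descriptor 'fd' and the parsing of the INIT
request into 'ih'/'ii' are assumed; FUSE_WRITEBACK_CACHE is the uapi flag added
by patch #15, and the kernel side additionally requires the 'allow_wbcache'
mount option:

#include <linux/fuse.h>
#include <string.h>
#include <sys/uio.h>

/* Reply to FUSE_INIT, requesting the write-back cache if offered. */
static int reply_init(int fd, const struct fuse_in_header *ih,
		      const struct fuse_init_in *ii)
{
	struct fuse_init_out initout;
	struct fuse_out_header oh;
	struct iovec iov[2];

	memset(&initout, 0, sizeof(initout));
	initout.major = FUSE_KERNEL_VERSION;
	initout.minor = FUSE_KERNEL_MINOR_VERSION;
	initout.max_readahead = ii->max_readahead;
	initout.max_write = 128 * 1024;	/* example value */
	/* opt in only if the kernel advertised the capability */
	if (ii->flags & FUSE_WRITEBACK_CACHE)
		initout.flags |= FUSE_WRITEBACK_CACHE;

	memset(&oh, 0, sizeof(oh));
	oh.len = sizeof(oh) + sizeof(initout);
	oh.unique = ih->unique;

	iov[0].iov_base = &oh;
	iov[0].iov_len = sizeof(oh);
	iov[1].iov_base = &initout;
	iov[1].iov_len = sizeof(initout);

	return writev(fd, iov, 2) == (ssize_t)oh.len ? 0 : -1;
}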

The provided patchset survives the fsx test. Performance measurements are not yet
all finished, but the mentioned copying of a huge file becomes noticeably faster
even on machines with little RAM, and doesn't make the system get stuck (the dirty
pages balancer does its work OK). Applies on top of v3.5-rc4.

We are currently exploring this with our own distributed storage implementation,
which is heavily oriented towards storing big blobs of data with extremely rare
meta-data updates (virtual machines' and containers' disk images). With the existing
cache policy a typical usage scenario -- copying a big VM disk into a cloud -- takes
way too much time, much longer than if it were simply scp-ed over the same network.
The write-back policy (as I mentioned) noticeably improves this scenario.
Kirill (in Cc) can share more details about the performance and the storage
concepts if required.

Changed in v2:
- numerous bugfixes:
- fuse_write_begin, fuse_writepages_fill and fuse_writepage_locked must wait
on page writeback, because page writeback can extend beyond the lifetime of
the page-cache page
- fuse_send_writepages can end_page_writeback on original page only after adding
request to fi->writepages list; otherwise another writeback may happen inside
the gap between end_page_writeback and adding to the list
- fuse_direct_io must wait on page writeback; otherwise data corruption is possible
due to reordering requests
- fuse_flush must flush dirty memory and wait for all writeback on given inode
before sending FUSE_FLUSH to userspace; otherwise FUSE_FLUSH is not reliable
- fuse_file_fallocate must hold i_mutex around FUSE_FALLOCATE and i_size update;
otherwise a race with a writer extending i_size is possible
- fix handling errors in fuse_writepages and fuse_send_writepages
- handle i_mtime intelligently if writeback cache is on (see patch #7 (update i_mtime
on buffered writes) for details)
- put enabling writeback cache under fusermount control; (see mount option
'allow_wbcache' introduced by patch #13 (turn writeback cache on))
- rebased on v3.7-rc5

Changed in v3:
- rebased on for-next branch of the fuse tree (fb05f41f5f96f7423c53da4d87913fb44fd0565d)

Changed in v4:
- rebased on for-next branch of the fuse tree (634734b63ac39e137a1c623ba74f3e062b6577db)
- fixed fuse_fillattr() for the non-writeback_cache case
- added comments explaining why we cannot trust size from server
- rewrote patch handling i_mtime; it's titled Trust-kernel-i_mtime-only now
- simplified patch titled Flush-files-on-wb-close
- eliminated code duplications from fuse_readpage() and fuse_prepare_write()
- added comment about "disk full" errors to fuse_write_begin()

Changed in v5:
- rebased on for-next branch of the fuse tree (e5c5f05dca0cf90f0f3bb1aea85dcf658baff185)
with "hold i_mutex in fuse_file_fallocate() - v2" patch applied manually
- updated patch descriptions according to Miklos' demand (using From: for
patches initially developed by someone else)
- moved ->writepages() to a separate patch and enabled it unconditionally
for mmaped writeback
- moved restructuring fuse_readpage to a separate patch
- avoided use of union for fuse_fill_data
- grabbed a ref to the request in fuse_send_writepages

Thanks,
Maxim

---

Maxim Patlasov (10):
fuse: Connection bit for enabling writeback
fuse: Trust kernel i_mtime only
fuse: Flush files on wb close
fuse: restructure fuse_readpage()
fuse: Implement write_begin/write_end callbacks
fuse: fuse_writepage_locked() should wait on writeback
fuse: fuse_flush() should wait on writeback
fuse: Fix O_DIRECT operations vs cached writeback misorder - v2
fuse: Turn writeback cache on
mm: strictlimit feature

Pavel Emelyanov (6):
fuse: Linking file to inode helper
fuse: Getting file for writeback helper
fuse: Prepare to handle short reads
fuse: Prepare to handle multiple pages in writeback
fuse: Trust kernel i_size only - v4
fuse: Implement writepages callback


fs/fs-writeback.c | 2
fs/fuse/cuse.c | 5
fs/fuse/dir.c | 127 +++++++++-
fs/fuse/file.c | 572 ++++++++++++++++++++++++++++++++++++++-----
fs/fuse/fuse_i.h | 26 ++
fs/fuse/inode.c | 38 ++-
include/linux/backing-dev.h | 5
include/linux/pagemap.h | 1
include/linux/writeback.h | 3
include/uapi/linux/fuse.h | 2
mm/backing-dev.c | 3
mm/page-writeback.c | 131 ++++++++--
12 files changed, 794 insertions(+), 121 deletions(-)

--
Signature


2013-06-29 17:42:40

by Maxim Patlasov

Subject: [PATCH 02/16] fuse: Getting file for writeback helper

From: Pavel Emelyanov <[email protected]>

There will be a .writepageS callback implementation which will need to
get a fuse_file out of a fuse_inode, thus make a helper for this.

Signed-off-by: Maxim Patlasov <[email protected]>
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/fuse/file.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index cc858ee..6b6d307 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1505,6 +1505,20 @@ static void fuse_writepage_end(struct fuse_conn *fc, struct fuse_req *req)
fuse_writepage_free(fc, req);
}

+static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
+ struct fuse_inode *fi)
+{
+ struct fuse_file *ff;
+
+ spin_lock(&fc->lock);
+ BUG_ON(list_empty(&fi->write_files));
+ ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
+ fuse_file_get(ff);
+ spin_unlock(&fc->lock);
+
+ return ff;
+}
+
static int fuse_writepage_locked(struct page *page)
{
struct address_space *mapping = page->mapping;
@@ -1512,7 +1526,6 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
struct fuse_req *req;
- struct fuse_file *ff;
struct page *tmp_page;

set_page_writeback(page);
@@ -1526,13 +1539,8 @@ static int fuse_writepage_locked(struct page *page)
if (!tmp_page)
goto err_free;

- spin_lock(&fc->lock);
- BUG_ON(list_empty(&fi->write_files));
- ff = list_entry(fi->write_files.next, struct fuse_file, write_entry);
- req->ff = fuse_file_get(ff);
- spin_unlock(&fc->lock);
-
- fuse_write_fill(req, ff, page_offset(page), 0);
+ req->ff = fuse_write_file(fc, fi);
+ fuse_write_fill(req, req->ff, page_offset(page), 0);

copy_highpage(tmp_page, page);
req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;

2013-06-29 17:42:47

by Maxim Patlasov

Subject: [PATCH 03/16] fuse: Prepare to handle short reads

From: Pavel Emelyanov <[email protected]>

A helper which gets called when read reports fewer bytes than were requested.
See patch #6 (trust kernel i_size only) for details.
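
For example, if a 4096-byte read of the page at index 0 returns only 3000 bytes,
the kernel may conclude the file ends at offset 3000 (page_offset + num_read)
and shrink the cached i_size accordingly, provided the attribute version has
not changed in between (the attr_ver check).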

Signed-off-by: Maxim Patlasov <[email protected]>
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/fuse/file.c | 21 +++++++++++++--------
1 files changed, 13 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6b6d307..ea70814 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -653,6 +653,15 @@ static void fuse_read_update_size(struct inode *inode, loff_t size,
spin_unlock(&fc->lock);
}

+static void fuse_short_read(struct fuse_req *req, struct inode *inode,
+ u64 attr_ver)
+{
+ size_t num_read = req->out.args[0].size;
+
+ loff_t pos = page_offset(req->pages[0]) + num_read;
+ fuse_read_update_size(inode, pos, attr_ver);
+}
+
static int fuse_readpage(struct file *file, struct page *page)
{
struct fuse_io_priv io = { .async = 0, .file = file };
@@ -690,18 +699,18 @@ static int fuse_readpage(struct file *file, struct page *page)
req->page_descs[0].length = count;
num_read = fuse_send_read(req, &io, pos, count, NULL);
err = req->out.h.error;
- fuse_put_request(fc, req);

if (!err) {
/*
* Short read means EOF. If file size is larger, truncate it
*/
if (num_read < count)
- fuse_read_update_size(inode, pos + num_read, attr_ver);
+ fuse_short_read(req, inode, attr_ver);

SetPageUptodate(page);
}

+ fuse_put_request(fc, req);
fuse_invalidate_attr(inode); /* atime changed */
out:
unlock_page(page);
@@ -724,13 +733,9 @@ static void fuse_readpages_end(struct fuse_conn *fc, struct fuse_req *req)
/*
* Short read means EOF. If file size is larger, truncate it
*/
- if (!req->out.h.error && num_read < count) {
- loff_t pos;
+ if (!req->out.h.error && num_read < count)
+ fuse_short_read(req, inode, req->misc.read.attr_ver);

- pos = page_offset(req->pages[0]) + num_read;
- fuse_read_update_size(inode, pos,
- req->misc.read.attr_ver);
- }
fuse_invalidate_attr(inode); /* atime changed */
}

2013-06-29 17:43:00

by Maxim Patlasov

Subject: [PATCH 04/16] fuse: Prepare to handle multiple pages in writeback

From: Pavel Emelyanov <[email protected]>

The .writepages callback will issue writeback requests with more than one
page aboard. Make the existing end/check code aware of this.
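
For example, a request carrying 3 pages starting at page index 5 covers bytes
[20480, 32768) with 4k pages; if i_size is 30000, the request goes out with
inarg->size = 30000 - 20480 = 9520 instead of the full 3 * PAGE_CACHE_SIZE.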

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 24 ++++++++++++++++--------
1 files changed, 16 insertions(+), 8 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ea70814..90fb235 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -350,7 +350,8 @@ static bool fuse_page_is_writeback(struct inode *inode, pgoff_t index)

BUG_ON(req->inode != inode);
curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
- if (curr_index == index) {
+ if (curr_index <= index &&
+ index < curr_index + req->num_pages) {
found = true;
break;
}
@@ -1425,7 +1426,10 @@ static ssize_t fuse_direct_write(struct file *file, const char __user *buf,

static void fuse_writepage_free(struct fuse_conn *fc, struct fuse_req *req)
{
- __free_page(req->pages[0]);
+ int i;
+
+ for (i = 0; i < req->num_pages; i++)
+ __free_page(req->pages[i]);
fuse_file_put(req->ff, false);
}

@@ -1434,11 +1438,14 @@ static void fuse_writepage_finish(struct fuse_conn *fc, struct fuse_req *req)
struct inode *inode = req->inode;
struct fuse_inode *fi = get_fuse_inode(inode);
struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+ int i;

list_del(&req->writepages_entry);
- dec_bdi_stat(bdi, BDI_WRITEBACK);
- dec_zone_page_state(req->pages[0], NR_WRITEBACK_TEMP);
- bdi_writeout_inc(bdi);
+ for (i = 0; i < req->num_pages; i++) {
+ dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_zone_page_state(req->pages[i], NR_WRITEBACK_TEMP);
+ bdi_writeout_inc(bdi);
+ }
wake_up(&fi->page_waitq);
}

@@ -1450,14 +1457,15 @@ __acquires(fc->lock)
struct fuse_inode *fi = get_fuse_inode(req->inode);
loff_t size = i_size_read(req->inode);
struct fuse_write_in *inarg = &req->misc.write.in;
+ __u64 data_size = req->num_pages * PAGE_CACHE_SIZE;

if (!fc->connected)
goto out_free;

- if (inarg->offset + PAGE_CACHE_SIZE <= size) {
- inarg->size = PAGE_CACHE_SIZE;
+ if (inarg->offset + data_size <= size) {
+ inarg->size = data_size;
} else if (inarg->offset < size) {
- inarg->size = size & (PAGE_CACHE_SIZE - 1);
+ inarg->size = size - inarg->offset;
} else {
/* Got truncated off completely */
goto out_free;

2013-06-29 17:43:16

by Maxim Patlasov

Subject: [PATCH 05/16] fuse: Connection bit for enabling writeback

From: Pavel Emelyanov <[email protected]>

Off (0) by default. Will be used in the next patches and will be turned
on at the very end.

Signed-off-by: Maxim Patlasov <[email protected]>
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/fuse/fuse_i.h | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index fde7249..a0062a0 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -476,6 +476,9 @@ struct fuse_conn {
/** Set if bdi is valid */
unsigned bdi_initialized:1;

+ /** write-back cache policy (default is write-through) */
+ unsigned writeback_cache:1;
+
/*
* The following bitfields are only for optimization purposes
* and hence races in setting them will not cause malfunction

2013-06-29 17:42:37

by Maxim Patlasov

Subject: [PATCH 01/16] fuse: Linking file to inode helper

From: Pavel Emelyanov <[email protected]>

When writeback is ON, every writeable file should be in the per-inode write list,
not only mmap-ed ones. Thus introduce a helper for this linkage.

Signed-off-by: Maxim Patlasov <[email protected]>
Signed-off-by: Pavel Emelyanov <[email protected]>
---
fs/fuse/file.c | 33 +++++++++++++++++++--------------
1 files changed, 19 insertions(+), 14 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 35f2810..cc858ee 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -171,6 +171,22 @@ int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
}
EXPORT_SYMBOL_GPL(fuse_do_open);

+static void fuse_link_write_file(struct file *file)
+{
+ struct inode *inode = file_inode(file);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_file *ff = file->private_data;
+ /*
+ * file may be written through mmap, so chain it onto the
+ * inode's write_files list
+ */
+ spin_lock(&fc->lock);
+ if (list_empty(&ff->write_entry))
+ list_add(&ff->write_entry, &fi->write_files);
+ spin_unlock(&fc->lock);
+}
+
void fuse_finish_open(struct inode *inode, struct file *file)
{
struct fuse_file *ff = file->private_data;
@@ -1615,20 +1631,9 @@ static const struct vm_operations_struct fuse_file_vm_ops = {

static int fuse_file_mmap(struct file *file, struct vm_area_struct *vma)
{
- if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE)) {
- struct inode *inode = file_inode(file);
- struct fuse_conn *fc = get_fuse_conn(inode);
- struct fuse_inode *fi = get_fuse_inode(inode);
- struct fuse_file *ff = file->private_data;
- /*
- * file may be written through mmap, so chain it onto the
- * inodes's write_file list
- */
- spin_lock(&fc->lock);
- if (list_empty(&ff->write_entry))
- list_add(&ff->write_entry, &fi->write_files);
- spin_unlock(&fc->lock);
- }
+ if ((vma->vm_flags & VM_SHARED) && (vma->vm_flags & VM_MAYWRITE))
+ fuse_link_write_file(file);
+
file_accessed(file);
vma->vm_ops = &fuse_file_vm_ops;
return 0;

2013-06-29 17:44:53

by Maxim Patlasov

Subject: [PATCH 06/16] fuse: Trust kernel i_size only - v4

From: Pavel Emelyanov <[email protected]>

Make fuse think that when writeback is on, the inode's i_size is always
up-to-date and not update it with the value received from userspace.
This is done because the page cache code may update i_size without letting
the FS know.

This assumption implies fixing the previously introduced short-read helper --
when a short read occurs the 'hole' is filled with zeroes.

fuse_file_fallocate() is also fixed because now we should keep i_size up to
date: it must be updated if the FUSE_FALLOCATE request succeeded.
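
To see why the server's size cannot be trusted: e.g. a cached write extends the
file to 8192 bytes locally but has not been flushed yet; a GETATTR reply from
the server still reports the old 4096, and blindly applying it would shrink
i_size and effectively lose the not-yet-written tail.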

Changed in v2:
- improved comment in fuse_short_read()
- fixed fuse_file_fallocate() for KEEP_SIZE mode

Changed in v3:
- fixed fuse_fillattr() not to use local i_size if writeback-cache is off
- added a comment explaining why we cannot trust attr.size from server

Changed in v4:
- do not change fuse_file_fallocate() because what we need is already
implemented in commits a200a2d and 14c1441

Signed-off-by: Maxim V. Patlasov <[email protected]>
---
fs/fuse/dir.c | 13 +++++++++++--
fs/fuse/file.c | 26 ++++++++++++++++++++++++--
fs/fuse/inode.c | 11 +++++++++--
3 files changed, 44 insertions(+), 6 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index f3f783d..b755884 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -851,6 +851,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
struct kstat *stat)
{
unsigned int blkbits;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ /* see the comment in fuse_change_attributes() */
+ if (fc->writeback_cache && S_ISREG(inode->i_mode))
+ attr->size = i_size_read(inode);

stat->dev = inode->i_sb->s_dev;
stat->ino = attr->ino;
@@ -1576,6 +1581,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
struct fuse_setattr_in inarg;
struct fuse_attr_out outarg;
bool is_truncate = false;
+ bool is_wb = fc->writeback_cache;
loff_t oldsize;
int err;

@@ -1645,7 +1651,9 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
fuse_change_attributes_common(inode, &outarg.attr,
attr_timeout(&outarg));
oldsize = inode->i_size;
- i_size_write(inode, outarg.attr.size);
+ /* see the comment in fuse_change_attributes() */
+ if (!is_wb || is_truncate || !S_ISREG(inode->i_mode))
+ i_size_write(inode, outarg.attr.size);

if (is_truncate) {
/* NOTE: this may release/reacquire fc->lock */
@@ -1657,7 +1665,8 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
* Only call invalidate_inode_pages2() after removing
* FUSE_NOWRITE, otherwise fuse_launder_page() would deadlock.
*/
- if (S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
+ if ((is_truncate || !is_wb) &&
+ S_ISREG(inode->i_mode) && oldsize != outarg.attr.size) {
truncate_pagecache(inode, oldsize, outarg.attr.size);
invalidate_inode_pages2(inode->i_mapping);
}
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 90fb235..98bc0d0 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -658,9 +658,31 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
u64 attr_ver)
{
size_t num_read = req->out.args[0].size;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ if (fc->writeback_cache) {
+ /*
+ * A hole in a file. Some data after the hole are in page cache,
+ * but have not reached the client fs yet. So, the hole is not
+ * present there.
+ */
+ int i;
+ int start_idx = num_read >> PAGE_CACHE_SHIFT;
+ size_t off = num_read & (PAGE_CACHE_SIZE - 1);
+
+ for (i = start_idx; i < req->num_pages; i++) {
+ struct page *page = req->pages[i];
+ void *mapaddr = kmap_atomic(page);

- loff_t pos = page_offset(req->pages[0]) + num_read;
- fuse_read_update_size(inode, pos, attr_ver);
+ memset(mapaddr + off, 0, PAGE_CACHE_SIZE - off);
+
+ kunmap_atomic(mapaddr);
+ off = 0;
+ }
+ } else {
+ loff_t pos = page_offset(req->pages[0]) + num_read;
+ fuse_read_update_size(inode, pos, attr_ver);
+ }
}

static int fuse_readpage(struct file *file, struct page *page)
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 9a0cdde..121638d 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -197,6 +197,7 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
+ bool is_wb = fc->writeback_cache;
loff_t oldsize;
struct timespec old_mtime;

@@ -210,10 +211,16 @@ void fuse_change_attributes(struct inode *inode, struct fuse_attr *attr,
fuse_change_attributes_common(inode, attr, attr_valid);

oldsize = inode->i_size;
- i_size_write(inode, attr->size);
+ /*
+ * In case of writeback_cache enabled, the cached writes beyond EOF
+ * extend local i_size without keeping userspace server in sync. So,
+ * attr->size coming from server can be stale. We cannot trust it.
+ */
+ if (!is_wb || !S_ISREG(inode->i_mode))
+ i_size_write(inode, attr->size);
spin_unlock(&fc->lock);

- if (S_ISREG(inode->i_mode)) {
+ if (!is_wb && S_ISREG(inode->i_mode)) {
bool inval = false;

if (oldsize != attr->size) {

2013-06-29 17:45:08

by Maxim Patlasov

Subject: [PATCH 07/16] fuse: Trust kernel i_mtime only

Let the kernel maintain i_mtime locally:
- clear S_NOCMTIME
- implement i_op->update_time()
- flush mtime on fsync and last close
- update i_mtime explicitly on truncate and fallocate

The fuse inode flag FUSE_I_MTIME_UPDATED serves as an indication that the local
i_mtime should eventually be flushed to the server. Some operations (direct
write, truncate, fallocate and setattr) lead to updating mtime on the server,
so we can clear FUSE_I_MTIME_UPDATED when such an operation completes. This is
safe because these operations (as well as i_op->update_time and fsync) are
protected by i_mutex.
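
For example: a buffered write makes the VFS call i_op->update_time(), which
sets i_mtime and the FUSE_I_MTIME_UPDATED bit; a later fsync() (or the last
close) then sends a SETATTR with only FATTR_MTIME set via fuse_flush_mtime()
and clears the bit on success.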

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/dir.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++------
fs/fuse/file.c | 38 +++++++++++++++---
fs/fuse/fuse_i.h | 6 ++-
fs/fuse/inode.c | 13 +++++-
4 files changed, 151 insertions(+), 22 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b755884..cd67a0c 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -854,8 +854,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
struct fuse_conn *fc = get_fuse_conn(inode);

/* see the comment in fuse_change_attributes() */
- if (fc->writeback_cache && S_ISREG(inode->i_mode))
+ if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
attr->size = i_size_read(inode);
+ attr->mtime = inode->i_mtime.tv_sec;
+ attr->mtimensec = inode->i_mtime.tv_nsec;
+ }

stat->dev = inode->i_sb->s_dev;
stat->ino = attr->ino;
@@ -1565,6 +1568,89 @@ void fuse_release_nowrite(struct inode *inode)
spin_unlock(&fc->lock);
}

+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+ struct inode *inode,
+ struct fuse_setattr_in *inarg_p,
+ struct fuse_attr_out *outarg_p)
+{
+ req->in.h.opcode = FUSE_SETATTR;
+ req->in.h.nodeid = get_node_id(inode);
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(*inarg_p);
+ req->in.args[0].value = inarg_p;
+ req->out.numargs = 1;
+ if (fc->minor < 9)
+ req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+ else
+ req->out.args[0].size = sizeof(*outarg_p);
+ req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_req *req = NULL;
+ struct fuse_setattr_in inarg;
+ struct fuse_attr_out outarg;
+ int err;
+
+ if (nofail) {
+ req = fuse_get_req_nofail_nopages(fc, file);
+ } else {
+ req = fuse_get_req_nopages(fc);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+ }
+
+ memset(&inarg, 0, sizeof(inarg));
+ memset(&outarg, 0, sizeof(outarg));
+
+ inarg.valid |= FATTR_MTIME;
+ inarg.mtime = inode->i_mtime.tv_sec;
+ inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+ fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+ fuse_request_send(fc, req);
+ err = req->out.h.error;
+ fuse_put_request(fc, req);
+
+ if (!err)
+ clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+
+ return err;
+}
+
+static inline void set_mtime_helper(struct inode *inode, struct timespec mtime)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ inode->i_mtime = mtime;
+ clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually. But
+ * we can also clear FUSE_I_MTIME_UPDATED if FUSE_SETATTR has just changed
+ * mtime on server.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+ unsigned ivalid = iattr->ia_valid;
+
+ if ((ivalid & ATTR_MTIME) && update_mtime(ivalid)) {
+ if (ivalid & ATTR_MTIME_SET)
+ set_mtime_helper(inode, iattr->ia_mtime);
+ else
+ set_mtime_helper(inode, current_fs_time(inode->i_sb));
+ } else if (ivalid & ATTR_SIZE)
+ set_mtime_helper(inode, current_fs_time(inode->i_sb));
+}
+
/*
* Set attributes, and at the same time refresh them.
*
@@ -1621,17 +1707,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
inarg.valid |= FATTR_LOCKOWNER;
inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
}
- req->in.h.opcode = FUSE_SETATTR;
- req->in.h.nodeid = get_node_id(inode);
- req->in.numargs = 1;
- req->in.args[0].size = sizeof(inarg);
- req->in.args[0].value = &inarg;
- req->out.numargs = 1;
- if (fc->minor < 9)
- req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
- else
- req->out.args[0].size = sizeof(outarg);
- req->out.args[0].value = &outarg;
+ fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
fuse_request_send(fc, req);
err = req->out.h.error;
fuse_put_request(fc, req);
@@ -1648,6 +1724,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
}

spin_lock(&fc->lock);
+ /* the kernel maintains i_mtime locally */
+ if (fc->writeback_cache && S_ISREG(inode->i_mode))
+ fuse_set_mtime_local(attr, inode);
+
fuse_change_attributes_common(inode, &outarg.attr,
attr_timeout(&outarg));
oldsize = inode->i_size;
@@ -1872,6 +1952,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
return err;
}

+static int fuse_update_time(struct inode *inode, struct timespec *now,
+ int flags)
+{
+ if (flags & S_MTIME) {
+ inode->i_mtime = *now;
+ set_bit(FUSE_I_MTIME_UPDATED, &get_fuse_inode(inode)->state);
+ BUG_ON(!S_ISREG(inode->i_mode));
+ }
+ return 0;
+}
+
static const struct inode_operations fuse_dir_inode_operations = {
.lookup = fuse_lookup,
.mkdir = fuse_mkdir,
@@ -1911,6 +2002,7 @@ static const struct inode_operations fuse_common_inode_operations = {
.getxattr = fuse_getxattr,
.listxattr = fuse_listxattr,
.removexattr = fuse_removexattr,
+ .update_time = fuse_update_time,
};

static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 98bc0d0..f53697c 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,10 @@ static int fuse_open(struct inode *inode, struct file *file)

static int fuse_release(struct inode *inode, struct file *file)
{
+ if (test_bit(FUSE_I_MTIME_UPDATED,
+ &get_fuse_inode(inode)->state))
+ fuse_flush_mtime(file, true);
+
fuse_release_common(file, FUSE_RELEASE);

/* return value is ignored by VFS */
@@ -458,6 +462,13 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,

fuse_sync_writes(inode);

+ if (test_bit(FUSE_I_MTIME_UPDATED,
+ &get_fuse_inode(inode)->state)) {
+ int err = fuse_flush_mtime(file, false);
+ if (err)
+ goto out;
+ }
+
req = fuse_get_req_nopages(fc);
if (IS_ERR(req)) {
err = PTR_ERR(req);
@@ -948,16 +959,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
return req->misc.write.out.size;
}

-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
+ bool ret = false;

spin_lock(&fc->lock);
fi->attr_version = ++fc->attr_version;
- if (pos > inode->i_size)
+ if (pos > inode->i_size) {
i_size_write(inode, pos);
+ ret = true;
+ }
spin_unlock(&fc->lock);
+
+ return ret;
}

static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -2495,9 +2511,11 @@ fuse_direct_IO(int rw, struct kiocb *iocb, const struct iovec *iov,
}

if (rw == WRITE) {
- if (ret > 0)
+ if (ret > 0) {
+ struct fuse_inode *fi = get_fuse_inode(inode);
fuse_write_update_size(inode, pos);
- else if (ret < 0 && offset + count > i_size)
+ clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+ } else if (ret < 0 && offset + count > i_size)
fuse_do_truncate(file);
}

@@ -2553,8 +2571,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
goto out;

/* we could have extended the file */
- if (!(mode & FALLOC_FL_KEEP_SIZE))
- fuse_write_update_size(inode, offset + length);
+ if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+ bool changed = fuse_write_update_size(inode, offset + length);
+
+ if (changed && fc->writeback_cache) {
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ inode->i_mtime = current_fs_time(inode->i_sb);
+ clear_bit(FUSE_I_MTIME_UPDATED, &fi->state);
+ }
+ }

if (mode & FALLOC_FL_PUNCH_HOLE)
truncate_pagecache_range(inode, offset, offset + length - 1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a0062a0..82abb82 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -115,6 +115,8 @@ struct fuse_inode {
enum {
/** Advise readdirplus */
FUSE_I_ADVISE_RDPLUS,
+ /** i_mtime has been updated locally; a flush to userspace is needed */
+ FUSE_I_MTIME_UPDATED,
};

struct fuse_conn;
@@ -867,7 +869,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
unsigned fuse_file_poll(struct file *file, poll_table *wait);
int fuse_dev_release(struct inode *inode, struct file *file);

-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);

int fuse_do_setattr(struct inode *inode, struct iattr *attr,
struct file *file);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 121638d..591b46c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
- inode->i_mtime.tv_sec = attr->mtime;
- inode->i_mtime.tv_nsec = attr->mtimensec;
+ /* mtime from server may be stale due to local buffered write */
+ if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+ inode->i_mtime.tv_sec = attr->mtime;
+ inode->i_mtime.tv_nsec = attr->mtimensec;
+ }
inode->i_ctime.tv_sec = attr->ctime;
inode->i_ctime.tv_nsec = attr->ctimensec;

@@ -249,6 +252,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
{
inode->i_mode = attr->mode & S_IFMT;
inode->i_size = attr->size;
+ inode->i_mtime.tv_sec = attr->mtime;
+ inode->i_mtime.tv_nsec = attr->mtimensec;
if (S_ISREG(inode->i_mode)) {
fuse_init_common(inode);
fuse_init_file_inode(inode);
@@ -295,7 +300,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
return NULL;

if ((inode->i_state & I_NEW)) {
- inode->i_flags |= S_NOATIME|S_NOCMTIME;
+ inode->i_flags |= S_NOATIME;
+ if (!fc->writeback_cache)
+ inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
inode->i_data.backing_dev_info = &fc->bdi;
fuse_init_inode(inode, attr);

2013-06-29 17:45:17

by Maxim Patlasov

Subject: [PATCH 08/16] fuse: Flush files on wb close

From: Pavel Emelyanov <[email protected]>

Any write request requires a file handle to report to userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays the actual close until all fuse
writeback on the inode has completed.

In case the "write cache" is turned off, the flush is ensured by fuse_vma_close().

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f53697c..799bf46 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file)

static int fuse_release(struct inode *inode, struct file *file)
{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ /* see fuse_vma_close() for !writeback_cache case */
+ if (fc->writeback_cache)
+ filemap_write_and_wait(file->f_mapping);
+
if (test_bit(FUSE_I_MTIME_UPDATED,
&get_fuse_inode(inode)->state))
fuse_flush_mtime(file, true);

2013-06-29 17:45:33

by Maxim Patlasov

Subject: [PATCH 09/16] fuse: restructure fuse_readpage()

Move the code that fills and sends the read request into a separate function.
Future patches will use it for .write_begin -- partial modification of a page
requires reading the page from storage very similarly to what fuse_readpage
does.

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 55 +++++++++++++++++++++++++++++++++++++------------------
1 files changed, 37 insertions(+), 18 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 799bf46..e693e35 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -702,21 +702,14 @@ static void fuse_short_read(struct fuse_req *req, struct inode *inode,
}
}

-static int fuse_readpage(struct file *file, struct page *page)
+static int __fuse_readpage(struct file *file, struct page *page, size_t count,
+ int *err, struct fuse_req **req_pp, u64 *attr_ver_p)
{
struct fuse_io_priv io = { .async = 0, .file = file };
struct inode *inode = page->mapping->host;
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_req *req;
size_t num_read;
- loff_t pos = page_offset(page);
- size_t count = PAGE_CACHE_SIZE;
- u64 attr_ver;
- int err;
-
- err = -EIO;
- if (is_bad_inode(inode))
- goto out;

/*
* Page writeback can extend beyond the lifetime of the
@@ -726,20 +719,45 @@ static int fuse_readpage(struct file *file, struct page *page)
fuse_wait_on_page_writeback(inode, page->index);

req = fuse_get_req(fc, 1);
- err = PTR_ERR(req);
+ *err = PTR_ERR(req);
if (IS_ERR(req))
- goto out;
+ return 0;

- attr_ver = fuse_get_attr_version(fc);
+ if (attr_ver_p)
+ *attr_ver_p = fuse_get_attr_version(fc);

req->out.page_zeroing = 1;
req->out.argpages = 1;
req->num_pages = 1;
req->pages[0] = page;
req->page_descs[0].length = count;
- num_read = fuse_send_read(req, &io, pos, count, NULL);
- err = req->out.h.error;

+ num_read = fuse_send_read(req, &io, page_offset(page), count, NULL);
+ *err = req->out.h.error;
+
+ if (*err)
+ fuse_put_request(fc, req);
+ else
+ *req_pp = req;
+
+ return num_read;
+}
+
+static int fuse_readpage(struct file *file, struct page *page)
+{
+ struct inode *inode = page->mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_req *req = NULL;
+ size_t num_read;
+ size_t count = PAGE_CACHE_SIZE;
+ u64 attr_ver = 0;
+ int err;
+
+ err = -EIO;
+ if (is_bad_inode(inode))
+ goto out;
+
+ num_read = __fuse_readpage(file, page, count, &err, &req, &attr_ver);
if (!err) {
/*
* Short read means EOF. If file size is larger, truncate it
@@ -749,10 +767,11 @@ static int fuse_readpage(struct file *file, struct page *page)

SetPageUptodate(page);
}
-
- fuse_put_request(fc, req);
- fuse_invalidate_attr(inode); /* atime changed */
- out:
+ if (req) {
+ fuse_put_request(fc, req);
+ fuse_invalidate_attr(inode); /* atime changed */
+ }
+out:
unlock_page(page);
return err;
}

2013-06-29 17:45:43

by Maxim Patlasov

Subject: [PATCH 10/16] fuse: Implement writepages callback

From: Pavel Emelyanov <[email protected]>

The .writepages one is required to make each writeback request carry more than
one page. The patch enables the optimized behaviour unconditionally,
i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache == 0.
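
As the code below shows, fuse_writepages_fill() keeps appending pages with
contiguous indices to the current request and flushes it once the request
reaches FUSE_MAX_PAGES_PER_REQ pages, would exceed fc->max_write, or the next
page is not contiguous.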

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 170 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 170 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index e693e35..ad3bb8a 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1657,6 +1657,175 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
return err;
}

+struct fuse_fill_wb_data {
+ struct fuse_req *req;
+ struct fuse_file *ff;
+ struct inode *inode;
+ unsigned nr_pages;
+};
+
+static int fuse_send_writepages(struct fuse_fill_wb_data *data)
+{
+ int i, all_ok = 1;
+ struct fuse_req *req = data->req;
+ struct inode *inode = data->inode;
+ struct backing_dev_info *bdi = inode->i_mapping->backing_dev_info;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ loff_t off = -1;
+
+ if (!data->ff)
+ data->ff = fuse_write_file(fc, fi);
+
+ if (!data->ff) {
+ for (i = 0; i < req->num_pages; i++)
+ end_page_writeback(req->pages[i]);
+ return -EIO;
+ }
+
+ req->inode = inode;
+ req->misc.write.in.offset = page_offset(req->pages[0]);
+
+ spin_lock(&fc->lock);
+ list_add(&req->writepages_entry, &fi->writepages);
+ spin_unlock(&fc->lock);
+
+ for (i = 0; i < req->num_pages; i++) {
+ struct page *page = req->pages[i];
+ struct page *tmp_page;
+
+ tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+ if (tmp_page) {
+ copy_highpage(tmp_page, page);
+ inc_bdi_stat(bdi, BDI_WRITEBACK);
+ inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+ } else
+ all_ok = 0;
+ req->pages[i] = tmp_page;
+ if (i == 0)
+ off = page_offset(page);
+
+ end_page_writeback(page);
+ }
+
+ if (!all_ok) {
+ for (i = 0; i < req->num_pages; i++) {
+ struct page *page = req->pages[i];
+ if (page) {
+ dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_zone_page_state(page, NR_WRITEBACK_TEMP);
+ __free_page(page);
+ req->pages[i] = NULL;
+ }
+ }
+
+ spin_lock(&fc->lock);
+ list_del(&req->writepages_entry);
+ wake_up(&fi->page_waitq);
+ spin_unlock(&fc->lock);
+ return -ENOMEM;
+ }
+
+ __fuse_get_request(req);
+ req->ff = fuse_file_get(data->ff);
+ fuse_write_fill(req, data->ff, off, 0);
+
+ req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+ req->in.argpages = 1;
+ fuse_page_descs_length_init(req, 0, req->num_pages);
+ req->end = fuse_writepage_end;
+ req->background = 1;
+
+ spin_lock(&fc->lock);
+ list_add_tail(&req->list, &fi->queued_writes);
+ fuse_flush_writepages(data->inode);
+ spin_unlock(&fc->lock);
+
+ return 0;
+}
+
+static int fuse_writepages_fill(struct page *page,
+ struct writeback_control *wbc, void *_data)
+{
+ struct fuse_fill_wb_data *data = _data;
+ struct fuse_req *req = data->req;
+ struct inode *inode = data->inode;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ if (fuse_page_is_writeback(inode, page->index)) {
+ if (wbc->sync_mode != WB_SYNC_ALL) {
+ redirty_page_for_writepage(wbc, page);
+ unlock_page(page);
+ return 0;
+ }
+ fuse_wait_on_page_writeback(inode, page->index);
+ }
+
+ if (req->num_pages &&
+ (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+ (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+ req->pages[req->num_pages - 1]->index + 1 != page->index)) {
+ int err;
+
+ err = fuse_send_writepages(data);
+ if (err) {
+ unlock_page(page);
+ return err;
+ }
+ fuse_put_request(fc, req);
+
+ data->req = req =
+ fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+ if (!req) {
+ unlock_page(page);
+ return -ENOMEM;
+ }
+ }
+
+ req->pages[req->num_pages] = page;
+ req->num_pages++;
+
+ if (test_set_page_writeback(page))
+ BUG();
+
+ unlock_page(page);
+
+ return 0;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_fill_wb_data data;
+ int err;
+
+ err = -EIO;
+ if (is_bad_inode(inode))
+ goto out;
+
+ data.ff = NULL;
+ data.inode = inode;
+ data.req = fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+ err = -ENOMEM;
+ if (!data.req)
+ goto out_put;
+
+ err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+ if (data.req) {
+ if (!err && data.req->num_pages)
+ err = fuse_send_writepages(&data);
+
+ fuse_put_request(fc, data.req);
+ }
+out_put:
+ if (data.ff)
+ fuse_file_put(data.ff, false);
+out:
+ return err;
+}
+
static int fuse_launder_page(struct page *page)
{
int err = 0;
@@ -2663,6 +2832,7 @@ static const struct file_operations fuse_direct_io_file_operations = {
static const struct address_space_operations fuse_file_aops = {
.readpage = fuse_readpage,
.writepage = fuse_writepage,
+ .writepages = fuse_writepages,
.launder_page = fuse_launder_page,
.readpages = fuse_readpages,
.set_page_dirty = __set_page_dirty_nobuffers,

2013-06-29 17:46:01

by Maxim Patlasov

Subject: [PATCH 11/16] fuse: Implement write_begin/write_end callbacks

From: Pavel Emelyanov <[email protected]>

The .write_begin and .write_end are required to use the generic routines
(generic_file_aio_write --> ... --> generic_perform_write) for buffered
writes.
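
Roughly, the buffered write path these callbacks plug into (on the v3.x
kernels this series targets) looks like:

  sys_write()
    -> fuse_file_aio_write()              /* writeback_cache case, patch #15 */
      -> generic_file_aio_write()
        -> generic_perform_write()
          -> a_ops->write_begin()         = fuse_write_begin()
          -> iov_iter_copy_from_user_atomic()  /* copy user data into the page */
          -> a_ops->write_end()           = fuse_write_end()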

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 97 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 files changed, 97 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ad3bb8a..89b08d6 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1826,6 +1826,101 @@ out:
return err;
}

+/*
+ * Determine the number of bytes of data the page contains
+ */
+static inline unsigned fuse_page_length(struct page *page)
+{
+ loff_t i_size = i_size_read(page_file_mapping(page)->host);
+
+ if (i_size > 0) {
+ pgoff_t page_index = page_file_index(page);
+ pgoff_t end_index = (i_size - 1) >> PAGE_CACHE_SHIFT;
+ if (page_index < end_index)
+ return PAGE_CACHE_SIZE;
+ if (page_index == end_index)
+ return ((i_size - 1) & ~PAGE_CACHE_MASK) + 1;
+ }
+ return 0;
+}
+
+static int fuse_prepare_write(struct fuse_conn *fc, struct file *file,
+ struct page *page, loff_t pos, unsigned len)
+{
+ struct fuse_req *req = NULL;
+ unsigned num_read;
+ unsigned page_len;
+ int err;
+
+ if (PageUptodate(page) || (len == PAGE_CACHE_SIZE))
+ return 0;
+
+ page_len = fuse_page_length(page);
+ if (!page_len) {
+ zero_user(page, 0, PAGE_CACHE_SIZE);
+ return 0;
+ }
+
+ num_read = __fuse_readpage(file, page, page_len, &err, &req, NULL);
+ if (req)
+ fuse_put_request(fc, req);
+ if (err) {
+ unlock_page(page);
+ page_cache_release(page);
+ } else if (num_read != PAGE_CACHE_SIZE) {
+ zero_user_segment(page, num_read, PAGE_CACHE_SIZE);
+ }
+ return err;
+}
+
+/*
+ * It would be worth making sure that space is reserved on disk for the write,
+ * but how to implement it without killing performance needs more thinking.
+ */
+static int fuse_write_begin(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned flags,
+ struct page **pagep, void **fsdata)
+{
+ pgoff_t index = pos >> PAGE_CACHE_SHIFT;
+ struct fuse_conn *fc = get_fuse_conn(file->f_dentry->d_inode);
+
+ BUG_ON(!fc->writeback_cache);
+
+ *pagep = grab_cache_page_write_begin(mapping, index, flags);
+ if (!*pagep)
+ return -ENOMEM;
+
+ return fuse_prepare_write(fc, file, *pagep, pos, len);
+}
+
+static int fuse_commit_write(struct file *file, struct page *page,
+ unsigned from, unsigned to)
+{
+ struct inode *inode = page->mapping->host;
+ loff_t pos = ((loff_t)page->index << PAGE_CACHE_SHIFT) + to;
+
+ if (!PageUptodate(page))
+ SetPageUptodate(page);
+
+ fuse_write_update_size(inode, pos);
+ set_page_dirty(page);
+ return 0;
+}
+
+static int fuse_write_end(struct file *file, struct address_space *mapping,
+ loff_t pos, unsigned len, unsigned copied,
+ struct page *page, void *fsdata)
+{
+ unsigned from = pos & (PAGE_CACHE_SIZE - 1);
+
+ fuse_commit_write(file, page, from, from+copied);
+
+ unlock_page(page);
+ page_cache_release(page);
+
+ return copied;
+}
+
static int fuse_launder_page(struct page *page)
{
int err = 0;
@@ -2838,6 +2933,8 @@ static const struct address_space_operations fuse_file_aops = {
.set_page_dirty = __set_page_dirty_nobuffers,
.bmap = fuse_bmap,
.direct_IO = fuse_direct_IO,
+ .write_begin = fuse_write_begin,
+ .write_end = fuse_write_end,
};

void fuse_init_file_inode(struct inode *inode)

2013-06-29 17:46:21

by Maxim Patlasov

Subject: [PATCH 12/16] fuse: fuse_writepage_locked() should wait on writeback

fuse_writepage_locked() should never submit new i/o for a given page->index
if there is another one 'in progress' already. In most cases it's safe to
wait on the page writeback. But if the call came due to memory shortage
(WB_SYNC_NONE), we should redirty the page rather than block the caller.

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 18 +++++++++++++++---
1 files changed, 15 insertions(+), 3 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 89b08d6..f48f796 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1595,7 +1595,8 @@ static struct fuse_file *fuse_write_file(struct fuse_conn *fc,
return ff;
}

-static int fuse_writepage_locked(struct page *page)
+static int fuse_writepage_locked(struct page *page,
+ struct writeback_control *wbc)
{
struct address_space *mapping = page->mapping;
struct inode *inode = mapping->host;
@@ -1604,6 +1605,14 @@ static int fuse_writepage_locked(struct page *page)
struct fuse_req *req;
struct page *tmp_page;

+ if (fuse_page_is_writeback(inode, page->index)) {
+ if (wbc->sync_mode != WB_SYNC_ALL) {
+ redirty_page_for_writepage(wbc, page);
+ return 0;
+ }
+ fuse_wait_on_page_writeback(inode, page->index);
+ }
+
set_page_writeback(page);

req = fuse_request_alloc_nofs(1);
@@ -1651,7 +1660,7 @@ static int fuse_writepage(struct page *page, struct writeback_control *wbc)
{
int err;

- err = fuse_writepage_locked(page);
+ err = fuse_writepage_locked(page, wbc);
unlock_page(page);

return err;
@@ -1926,7 +1935,10 @@ static int fuse_launder_page(struct page *page)
int err = 0;
if (clear_page_dirty_for_io(page)) {
struct inode *inode = page->mapping->host;
- err = fuse_writepage_locked(page);
+ struct writeback_control wbc = {
+ .sync_mode = WB_SYNC_ALL,
+ };
+ err = fuse_writepage_locked(page, &wbc);
if (!err)
fuse_wait_on_page_writeback(inode, page->index);
}

2013-06-29 17:46:42

by Maxim Patlasov

Subject: [PATCH 13/16] fuse: fuse_flush() should wait on writeback

The aim of the .flush fop is to hint the file-system that flushing its state or
caches or any other important data to reliable storage would be desirable now.
fuse_flush() passes this hint on by sending a FUSE_FLUSH request to userspace.
However, dirty pages and pages under writeback may not be visible to userspace
yet unless we ensure it explicitly.
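
Concretely, fuse_flush() now first pushes dirty pages to userspace via
filemap_write_and_wait(), then waits (under i_mutex) in fuse_sync_writes()
for the FUSE-level temporary-page writeback to complete, and only then sends
FUSE_FLUSH, so the server sees all prior writes when it receives the hint.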

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 9 +++++++++
1 files changed, 9 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index f48f796..86ef60f 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -19,6 +19,7 @@
#include <linux/falloc.h>

static const struct file_operations fuse_direct_io_file_operations;
+static void fuse_sync_writes(struct inode *inode);

static int fuse_send_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
int opcode, struct fuse_open_out *outargp)
@@ -400,6 +401,14 @@ static int fuse_flush(struct file *file, fl_owner_t id)
if (fc->no_flush)
return 0;

+ err = filemap_write_and_wait(file->f_mapping);
+ if (err)
+ return err;
+
+ mutex_lock(&inode->i_mutex);
+ fuse_sync_writes(inode);
+ mutex_unlock(&inode->i_mutex);
+
req = fuse_get_req_nofail_nopages(fc, file);
memset(&inarg, 0, sizeof(inarg));
inarg.fh = ff->fh;

2013-06-29 17:47:05

by Maxim Patlasov

Subject: [PATCH 14/16] fuse: Fix O_DIRECT operations vs cached writeback misorder - v2

From: Pavel Emelyanov <[email protected]>

The problem is:

1. write cached data to a file
2. read directly from the same file (via another fd)

The 2nd operation may read stale data, i.e. the data that was in the file
before the 1st op. The problem is in how fuse manages writeback.

When a direct op occurs, the core kernel code calls filemap_write_and_wait
to flush all the cached ops in flight. But fuse acks the writeback right
after the ->writepages callback exits, w/o waiting for the real write to
happen. Thus the subsequent direct op proceeds while the real writeback
is still in flight. This is a problem for backends that reorder operations.

Fix this by making the fuse direct IO callback explicitly wait for the
in-flight writeback to finish.
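
With the fix, fuse_direct_io() calls fuse_wait_on_writeback() on the byte
range of each chunk; it sleeps on fi->page_waitq until
fuse_range_is_writeback() finds no in-flight request overlapping that range.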

Changed in v2:
- do not wait on writeback if fuse_direct_io() call came from
CUSE (because it doesn't use fuse inodes)

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/cuse.c | 5 +++--
fs/fuse/file.c | 49 +++++++++++++++++++++++++++++++++++++++++++++++--
fs/fuse/fuse_i.h | 13 ++++++++++++-
3 files changed, 62 insertions(+), 5 deletions(-)

diff --git a/fs/fuse/cuse.c b/fs/fuse/cuse.c
index aef34b1..b57b5ee 100644
--- a/fs/fuse/cuse.c
+++ b/fs/fuse/cuse.c
@@ -95,7 +95,7 @@ static ssize_t cuse_read(struct file *file, char __user *buf, size_t count,
struct iovec iov = { .iov_base = buf, .iov_len = count };
struct fuse_io_priv io = { .async = 0, .file = file };

- return fuse_direct_io(&io, &iov, 1, count, &pos, 0);
+ return fuse_direct_io(&io, &iov, 1, count, &pos, FUSE_DIO_CUSE);
}

static ssize_t cuse_write(struct file *file, const char __user *buf,
@@ -109,7 +109,8 @@ static ssize_t cuse_write(struct file *file, const char __user *buf,
* No locking or generic_write_checks(), the server is
* responsible for locking and sanity checks.
*/
- return fuse_direct_io(&io, &iov, 1, count, &pos, 1);
+ return fuse_direct_io(&io, &iov, 1, count, &pos,
+ FUSE_DIO_WRITE | FUSE_DIO_CUSE);
}

static int cuse_open(struct inode *inode, struct file *file)
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 86ef60f..ea45cf3 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -342,6 +342,31 @@ u64 fuse_lock_owner_id(struct fuse_conn *fc, fl_owner_t id)
return (u64) v0 + ((u64) v1 << 32);
}

+static bool fuse_range_is_writeback(struct inode *inode, pgoff_t idx_from,
+ pgoff_t idx_to)
+{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_req *req;
+ bool found = false;
+
+ spin_lock(&fc->lock);
+ list_for_each_entry(req, &fi->writepages, writepages_entry) {
+ pgoff_t curr_index;
+
+ BUG_ON(req->inode != inode);
+ curr_index = req->misc.write.in.offset >> PAGE_CACHE_SHIFT;
+ if (!(idx_from >= curr_index + req->num_pages ||
+ idx_to < curr_index)) {
+ found = true;
+ break;
+ }
+ }
+ spin_unlock(&fc->lock);
+
+ return found;
+}
+
/*
* Check if page is under writeback
*
@@ -386,6 +411,19 @@ static int fuse_wait_on_page_writeback(struct inode *inode, pgoff_t index)
return 0;
}

+static void fuse_wait_on_writeback(struct inode *inode, pgoff_t start,
+ size_t bytes)
+{
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ pgoff_t idx_from, idx_to;
+
+ idx_from = start >> PAGE_CACHE_SHIFT;
+ idx_to = (start + bytes - 1) >> PAGE_CACHE_SHIFT;
+
+ wait_event(fi->page_waitq,
+ !fuse_range_is_writeback(inode, idx_from, idx_to));
+}
+
static int fuse_flush(struct file *file, fl_owner_t id)
{
struct inode *inode = file_inode(file);
@@ -1360,8 +1398,10 @@ static inline int fuse_iter_npages(const struct iov_iter *ii_p)

ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
unsigned long nr_segs, size_t count, loff_t *ppos,
- int write)
+ int flags)
{
+ int write = flags & FUSE_DIO_WRITE;
+ int cuse = flags & FUSE_DIO_CUSE;
struct file *file = io->file;
struct fuse_file *ff = file->private_data;
struct fuse_conn *fc = ff->fc;
@@ -1390,6 +1430,10 @@ ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
break;
}

+ if (!cuse)
+ fuse_wait_on_writeback(file->f_mapping->host, pos,
+ nbytes);
+
if (write)
nres = fuse_send_write(req, io, pos, nbytes, owner);
else
@@ -1468,7 +1512,8 @@ static ssize_t __fuse_direct_write(struct fuse_io_priv *io,

res = generic_write_checks(file, ppos, &count, 0);
if (!res)
- res = fuse_direct_io(io, iov, nr_segs, count, ppos, 1);
+ res = fuse_direct_io(io, iov, nr_segs, count, ppos,
+ FUSE_DIO_WRITE);

fuse_invalidate_attr(inode);

diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index 82abb82..f969b22 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -859,9 +859,20 @@ int fuse_reverse_inval_entry(struct super_block *sb, u64 parent_nodeid,

int fuse_do_open(struct fuse_conn *fc, u64 nodeid, struct file *file,
bool isdir);
+
+/**
+ * fuse_direct_io() flags
+ */
+
+/** If set, it is WRITE; otherwise - READ */
+#define FUSE_DIO_WRITE (1 << 0)
+
+/** CUSE passes fuse_direct_io() a file whose f_mapping->host is not from FUSE */
+#define FUSE_DIO_CUSE (1 << 1)
+
ssize_t fuse_direct_io(struct fuse_io_priv *io, const struct iovec *iov,
unsigned long nr_segs, size_t count, loff_t *ppos,
- int write);
+ int flags);
long fuse_do_ioctl(struct file *file, unsigned int cmd, unsigned long arg,
unsigned int flags);
long fuse_ioctl_common(struct file *file, unsigned int cmd,

2013-06-29 17:47:15

by Maxim Patlasov

Subject: [PATCH 15/16] fuse: Turn writeback cache on

From: Pavel Emelyanov <[email protected]>

Introduce a bit that kernel and userspace exchange with each other at
the init stage, and turn writeback on if userspace wants it and the
mount option 'allow_wbcache' is present (controlled by fusermount).

Also add each writable file into the per-inode write list and call
generic_file_aio_write to make use of the Linux page cache engine.
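
In other words, writeback is enabled only when both sides agree: the daemon
sets FUSE_WRITEBACK_CACHE in its INIT reply and the mount was done with
'allow_wbcache' (see the protocol sketch in the cover letter).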

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 5 +++++
fs/fuse/fuse_i.h | 4 ++++
fs/fuse/inode.c | 13 +++++++++++++
include/uapi/linux/fuse.h | 2 ++
4 files changed, 24 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index ea45cf3..6d30db1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -208,6 +208,8 @@ void fuse_finish_open(struct inode *inode, struct file *file)
spin_unlock(&fc->lock);
fuse_invalidate_attr(inode);
}
+ if ((file->f_mode & FMODE_WRITE) && fc->writeback_cache)
+ fuse_link_write_file(file);
}

int fuse_open_common(struct inode *inode, struct file *file, bool isdir)
@@ -1225,6 +1227,9 @@ static ssize_t fuse_file_aio_write(struct kiocb *iocb, const struct iovec *iov,
struct iov_iter i;
loff_t endbyte = 0;

+ if (get_fuse_conn(inode)->writeback_cache)
+ return generic_file_aio_write(iocb, iov, nr_segs, pos);
+
WARN_ON(iocb->ki_pos != pos);

ocount = 0;
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index f969b22..3494457 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -44,6 +44,10 @@
doing the mount will be allowed to access the filesystem */
#define FUSE_ALLOW_OTHER (1 << 1)

+/** If the FUSE_ALLOW_WBCACHE flag is given, the filesystem
+ module will enable support of writeback cache */
+#define FUSE_ALLOW_WBCACHE (1 << 2)
+
/** Number of page pointers embedded in fuse_req */
#define FUSE_REQ_INLINE_PAGES 1

diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 591b46c..4beb8e3 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -459,6 +459,7 @@ enum {
OPT_ALLOW_OTHER,
OPT_MAX_READ,
OPT_BLKSIZE,
+ OPT_ALLOW_WBCACHE,
OPT_ERR
};

@@ -471,6 +472,7 @@ static const match_table_t tokens = {
{OPT_ALLOW_OTHER, "allow_other"},
{OPT_MAX_READ, "max_read=%u"},
{OPT_BLKSIZE, "blksize=%u"},
+ {OPT_ALLOW_WBCACHE, "allow_wbcache"},
{OPT_ERR, NULL}
};

@@ -544,6 +546,10 @@ static int parse_fuse_opt(char *opt, struct fuse_mount_data *d, int is_bdev)
d->blksize = value;
break;

+ case OPT_ALLOW_WBCACHE:
+ d->flags |= FUSE_ALLOW_WBCACHE;
+ break;
+
default:
return 0;
}
@@ -571,6 +577,8 @@ static int fuse_show_options(struct seq_file *m, struct dentry *root)
seq_printf(m, ",max_read=%u", fc->max_read);
if (sb->s_bdev && sb->s_blocksize != FUSE_DEFAULT_BLKSIZE)
seq_printf(m, ",blksize=%lu", sb->s_blocksize);
+ if (fc->flags & FUSE_ALLOW_WBCACHE)
+ seq_puts(m, ",allow_wbcache");
return 0;
}

@@ -888,6 +896,9 @@ static void process_init_reply(struct fuse_conn *fc, struct fuse_req *req)
}
if (arg->flags & FUSE_ASYNC_DIO)
fc->async_dio = 1;
+ if (arg->flags & FUSE_WRITEBACK_CACHE &&
+ fc->flags & FUSE_ALLOW_WBCACHE)
+ fc->writeback_cache = 1;
} else {
ra_pages = fc->max_read / PAGE_CACHE_SIZE;
fc->no_lock = 1;
@@ -916,6 +927,8 @@ static void fuse_send_init(struct fuse_conn *fc, struct fuse_req *req)
FUSE_SPLICE_WRITE | FUSE_SPLICE_MOVE | FUSE_SPLICE_READ |
FUSE_FLOCK_LOCKS | FUSE_IOCTL_DIR | FUSE_AUTO_INVAL_DATA |
FUSE_DO_READDIRPLUS | FUSE_READDIRPLUS_AUTO | FUSE_ASYNC_DIO;
+ if (fc->flags & FUSE_ALLOW_WBCACHE)
+ arg->flags |= FUSE_WRITEBACK_CACHE;
req->in.h.opcode = FUSE_INIT;
req->in.numargs = 1;
req->in.args[0].size = sizeof(*arg);
diff --git a/include/uapi/linux/fuse.h b/include/uapi/linux/fuse.h
index 60bb2f9..5c98dd1 100644
--- a/include/uapi/linux/fuse.h
+++ b/include/uapi/linux/fuse.h
@@ -219,6 +219,7 @@ struct fuse_file_lock {
* FUSE_DO_READDIRPLUS: do READDIRPLUS (READDIR+LOOKUP in one)
* FUSE_READDIRPLUS_AUTO: adaptive readdirplus
* FUSE_ASYNC_DIO: asynchronous direct I/O submission
+ * FUSE_WRITEBACK_CACHE: use writeback cache for buffered writes
*/
#define FUSE_ASYNC_READ (1 << 0)
#define FUSE_POSIX_LOCKS (1 << 1)
@@ -236,6 +237,7 @@ struct fuse_file_lock {
#define FUSE_DO_READDIRPLUS (1 << 13)
#define FUSE_READDIRPLUS_AUTO (1 << 14)
#define FUSE_ASYNC_DIO (1 << 15)
+#define FUSE_WRITEBACK_CACHE (1 << 16)

/**
* CUSE INIT request/reply flags

2013-06-29 17:49:09

by Maxim Patlasov

[permalink] [raw]
Subject: [PATCH 16/16] mm: strictlimit feature

From: Miklos Szeredi <[email protected]>

The feature prevents mistrusted filesystems from growing a large number of dirty
pages before throttling. For such filesystems balance_dirty_pages always
checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip the bdi checks. The only use case for now is
fuse: it sets bdi max_ratio to 1% by default and system administrators are
supposed to expect that this limit won't be exceeded.

The feature is on if the address space is marked with the AS_STRICTLIMIT flag.
A filesystem may set the flag when it initializes a new inode.

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fs-writeback.c | 2 -
fs/fuse/file.c | 3 +
fs/fuse/inode.c | 1
include/linux/backing-dev.h | 5 ++
include/linux/pagemap.h | 1
include/linux/writeback.h | 3 +
mm/backing-dev.c | 3 +
mm/page-writeback.c | 131 +++++++++++++++++++++++++++++++++----------
8 files changed, 116 insertions(+), 33 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..8c75277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -762,7 +762,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
static void wb_update_bandwidth(struct bdi_writeback *wb,
unsigned long start_time)
{
- __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
+ __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time, false);
}

/*
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6d30db1..99b5c97 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1697,6 +1697,7 @@ static int fuse_writepage_locked(struct page *page,
req->inode = inode;

inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_bdi_stat(mapping->backing_dev_info, BDI_WRITTEN_BACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
end_page_writeback(page);

@@ -1766,6 +1767,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
if (tmp_page) {
copy_highpage(tmp_page, page);
inc_bdi_stat(bdi, BDI_WRITEBACK);
+ inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
} else
all_ok = 0;
@@ -1781,6 +1783,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
struct page *page = req->pages[i];
if (page) {
dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_bdi_stat(bdi, BDI_WRITTEN_BACK);
dec_zone_page_state(page, NR_WRITEBACK_TEMP);
__free_page(page);
req->pages[i] = NULL;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4beb8e3..00a28af 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
inode->i_data.backing_dev_info = &fc->bdi;
+ set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..9c368af 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -33,6 +33,8 @@ enum bdi_state {
BDI_sync_congested, /* The sync queue is getting full */
BDI_registered, /* bdi_register() was done */
BDI_writeback_running, /* Writeback is in progress */
+ BDI_idle, /* No pages under writeback at the moment of
+ * last update of write bw */
BDI_unused, /* Available bits start here */
};

@@ -43,6 +45,7 @@ enum bdi_stat_item {
BDI_WRITEBACK,
BDI_DIRTIED,
BDI_WRITTEN,
+ BDI_WRITTEN_BACK,
NR_BDI_STAT_ITEMS
};

@@ -76,6 +79,8 @@ struct backing_dev_info {
unsigned long bw_time_stamp; /* last time write bw is updated */
unsigned long dirtied_stamp;
unsigned long written_stamp; /* pages written at bw_time_stamp */
+ unsigned long writeback_stamp; /* pages sent to writeback at
+ * bw_time_stamp */
unsigned long write_bandwidth; /* the estimated write bandwidth */
unsigned long avg_write_bandwidth; /* further smoothed write bw */

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..baac702 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+ AS_STRICTLIMIT = __GFP_BITS_SHIFT + 5, /* strict dirty limit */
};

static inline void mapping_set_error(struct address_space *mapping, int error)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a500..5144051 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -159,7 +159,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time);
+ unsigned long start_time,
+ bool strictlimit);

void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..71413d6 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"BackgroundThresh: %10lu kB\n"
"BdiDirtied: %10lu kB\n"
"BdiWritten: %10lu kB\n"
+ "BdiWrittenBack: %10lu kB\n"
"BdiWriteBandwidth: %10lu kBps\n"
"b_dirty: %10lu\n"
"b_io: %10lu\n"
@@ -107,6 +108,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
K(background_thresh),
(unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN_BACK)),
(unsigned long) K(bdi->write_bandwidth),
nr_dirty,
nr_io,
@@ -455,6 +457,7 @@ int bdi_init(struct backing_dev_info *bdi)

bdi->bw_time_stamp = jiffies;
bdi->written_stamp = 0;
+ bdi->writeback_stamp = 0;

bdi->balanced_dirty_ratelimit = INIT_BW;
bdi->dirty_ratelimit = INIT_BW;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..77449b0 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -585,6 +585,37 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
}

/*
+ * setpoint - dirty 3
+ * f(dirty) := 1.0 + (----------------)
+ * limit - setpoint
+ *
+ * it's a 3rd order polynomial that subjects to
+ *
+ * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
+ * (2) f(setpoint) = 1.0 => the balance point
+ * (3) f(limit) = 0 => the hard limit
+ * (4) df/dx <= 0 => negative feedback control
+ * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+ * => fast response on large errors; small oscillation near setpoint
+ */
+static inline long long pos_ratio_polynom(unsigned long setpoint,
+ unsigned long dirty,
+ unsigned long limit)
+{
+ long long pos_ratio;
+ long x;
+
+ x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
+ limit - setpoint + 1);
+ pos_ratio = x;
+ pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+ pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+ pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+ return pos_ratio;
+}
+
+/*
* Dirty position control.
*
* (o) global/bdi setpoints
@@ -664,7 +695,8 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
unsigned long bg_thresh,
unsigned long dirty,
unsigned long bdi_thresh,
- unsigned long bdi_dirty)
+ unsigned long bdi_dirty,
+ bool strictlimit)
{
unsigned long write_bw = bdi->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
@@ -679,29 +711,31 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
if (unlikely(dirty >= limit))
return 0;

+ if (unlikely(strictlimit)) {
+ if (bdi_dirty < 8)
+ return 2 << RATELIMIT_CALC_SHIFT;
+
+ if (bdi_dirty >= bdi_thresh)
+ return 0;
+
+ bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
+ bdi_setpoint /= 2;
+
+ if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
+ return 0;
+
+ pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
+ bdi_thresh);
+ return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
+ }
+
/*
* global setpoint
*
- * setpoint - dirty 3
- * f(dirty) := 1.0 + (----------------)
- * limit - setpoint
- *
- * it's a 3rd order polynomial that subjects to
- *
- * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
- * (2) f(setpoint) = 1.0 => the balance point
- * (3) f(limit) = 0 => the hard limit
- * (4) df/dx <= 0 => negative feedback control
- * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
- * => fast response on large errors; small oscillation near setpoint
+ * See comment for pos_ratio_polynom().
*/
setpoint = (freerun + limit) / 2;
- x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
- limit - setpoint + 1);
- pos_ratio = x;
- pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
- pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
- pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+ pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);

/*
* We have computed basic pos_ratio above based on global situation. If
@@ -892,7 +926,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
unsigned long dirtied,
- unsigned long elapsed)
+ unsigned long elapsed,
+ bool strictlimit)
{
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
@@ -913,7 +948,7 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;

pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty);
+ bdi_thresh, bdi_dirty, strictlimit);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
@@ -994,6 +1029,16 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
* keep that period small to reduce time lags).
*/
step = 0;
+
+ if (unlikely(strictlimit)) {
+ dirty = bdi_dirty;
+ if (bdi_dirty < 8)
+ setpoint = bdi_dirty + 1;
+ else
+ setpoint = (bdi_thresh +
+ bdi_dirty_limit(bdi, bg_thresh)) / 2;
+ }
+
if (dirty < setpoint) {
x = min(bdi->balanced_dirty_ratelimit,
min(balanced_dirty_ratelimit, task_ratelimit));
@@ -1034,12 +1079,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time)
+ unsigned long start_time,
+ bool strictlimit)
{
unsigned long now = jiffies;
unsigned long elapsed = now - bdi->bw_time_stamp;
unsigned long dirtied;
unsigned long written;
+ unsigned long writeback;

/*
* rate-limit, only update once every 200ms.
@@ -1049,6 +1096,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,

dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+ writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);

/*
* Skip quiet periods when disk bandwidth is under-utilized.
@@ -1057,11 +1105,15 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
goto snapshot;

+ if (test_bit(BDI_idle, &bdi->state) &&
+ bdi->writeback_stamp == writeback)
+ goto snapshot;
+
if (thresh) {
global_update_bandwidth(thresh, dirty, now);
bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
bdi_thresh, bdi_dirty,
- dirtied, elapsed);
+ dirtied, elapsed, strictlimit);
}
bdi_update_write_bandwidth(bdi, elapsed, written);

@@ -1069,6 +1121,12 @@ snapshot:
bdi->dirtied_stamp = dirtied;
bdi->written_stamp = written;
bdi->bw_time_stamp = now;
+
+ bdi->writeback_stamp = writeback;
+ if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
+ set_bit(BDI_idle, &bdi->state);
+ else
+ clear_bit(BDI_idle, &bdi->state);
}

static void bdi_update_bandwidth(struct backing_dev_info *bdi,
@@ -1077,13 +1135,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time)
+ unsigned long start_time,
+ bool strictlimit)
{
if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
return;
spin_lock(&bdi->wb.list_lock);
__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty, start_time);
+ bdi_thresh, bdi_dirty, start_time, strictlimit);
spin_unlock(&bdi->wb.list_lock);
}

@@ -1226,6 +1285,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
unsigned long start_time = jiffies;

for (;;) {
@@ -1250,7 +1310,7 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
freerun = dirty_freerun_ceiling(dirty_thresh,
background_thresh);
- if (nr_dirty <= freerun) {
+ if (nr_dirty <= freerun && !strictlimit) {
current->dirty_paused_when = now;
current->nr_dirtied = 0;
current->nr_dirtied_pause =
@@ -1258,7 +1318,7 @@ static void balance_dirty_pages(struct address_space *mapping,
break;
}

- if (unlikely(!writeback_in_progress(bdi)))
+ if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
bdi_start_background_writeback(bdi);

/*
@@ -1296,19 +1356,24 @@ static void balance_dirty_pages(struct address_space *mapping,
bdi_stat(bdi, BDI_WRITEBACK);
}

+ if (unlikely(!writeback_in_progress(bdi)) &&
+ bdi_dirty > bdi_thresh / 4)
+ bdi_start_background_writeback(bdi);
+
dirty_exceeded = (bdi_dirty > bdi_thresh) &&
- (nr_dirty > dirty_thresh);
+ ((nr_dirty > dirty_thresh) || strictlimit);
if (dirty_exceeded && !bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;

bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
nr_dirty, bdi_thresh, bdi_dirty,
- start_time);
+ start_time, strictlimit);

dirty_ratelimit = bdi->dirty_ratelimit;
pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
background_thresh, nr_dirty,
- bdi_thresh, bdi_dirty);
+ bdi_thresh, bdi_dirty,
+ strictlimit);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
max_pause = bdi_max_pause(bdi, bdi_dirty);
@@ -1362,6 +1427,8 @@ static void balance_dirty_pages(struct address_space *mapping,
}

pause:
+ if (unlikely(!writeback_in_progress(bdi)))
+ bdi_start_background_writeback(bdi);
trace_balance_dirty_pages(bdi,
dirty_thresh,
background_thresh,
@@ -2265,8 +2332,10 @@ int test_set_page_writeback(struct page *page)
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (bdi_cap_account_writeback(bdi))
+ if (bdi_cap_account_writeback(bdi)) {
__inc_bdi_stat(bdi, BDI_WRITEBACK);
+ __inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
+ }
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,

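As a sanity check of the anchor values promised in the pos_ratio_polynom()
comment, the fixed-point math can be reproduced standalone. A sketch, assuming
RATELIMIT_CALC_SHIFT is 10 (its value in mm/page-writeback.c) and arithmetic
right shift of negative values, which the kernel code relies on as well:

	#include <stdio.h>

	#define SHIFT 10	/* stands in for RATELIMIT_CALC_SHIFT */

	static long long poly(long setpoint, long dirty, long limit)
	{
		long long x = ((long long)(setpoint - dirty) << SHIFT) /
			      (limit - setpoint + 1);
		long long p = x;

		p = p * x >> SHIFT;		/* x^2, fixed point */
		p = p * x >> SHIFT;		/* x^3, fixed point */
		return p + (1 << SHIFT);	/* 1.0 + x^3 */
	}

	int main(void)
	{
		long freerun = 1000, limit = 3000;
		long setpoint = (freerun + limit) / 2;

		/* 1024 represents 1.0; expect ~2048, 1024 and ~0. */
		printf("f(freerun)  = %lld\n", poly(setpoint, freerun, limit));
		printf("f(setpoint) = %lld\n", poly(setpoint, setpoint, limit));
		printf("f(limit)    = %lld\n", poly(setpoint, limit, limit));
		return 0;
	}

With setpoint halfway between freerun and limit, the integer division leaves
the results a few counts off the ideal 2.0/1.0/0 values; the kernel function
rounds the same way.
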
2013-07-01 21:16:15

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH 16/16] mm: strictlimit feature

On Sat, 29 Jun 2013 21:48:54 +0400 Maxim Patlasov <[email protected]> wrote:

> From: Miklos Szeredi <[email protected]>
>
> The feature prevents mistrusted filesystems from growing a large number of dirty
> pages before throttling. For such filesystems balance_dirty_pages always
> checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
> "freerun", it's not allowed to skip the bdi checks. The only use case for now is
> fuse: it sets bdi max_ratio to 1% by default and system administrators are
> supposed to expect that this limit won't be exceeded.
>
> The feature is on if the address space is marked with the AS_STRICTLIMIT flag.
> A filesystem may set the flag when it initializes a new inode.
>

Fengguang, could you please review this patch?

I suggest you await the next version, which hopefully will be more
reviewable...

>
> ...
>
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -33,6 +33,8 @@ enum bdi_state {
> BDI_sync_congested, /* The sync queue is getting full */
> BDI_registered, /* bdi_register() was done */
> BDI_writeback_running, /* Writeback is in progress */
> + BDI_idle, /* No pages under writeback at the moment of
> + * last update of write bw */

Why does BDI_idle exist?

> BDI_unused, /* Available bits start here */
> };
>
> @@ -43,6 +45,7 @@ enum bdi_stat_item {
> BDI_WRITEBACK,
> BDI_DIRTIED,
> BDI_WRITTEN,
> + BDI_WRITTEN_BACK,
> NR_BDI_STAT_ITEMS
> };
>
> @@ -76,6 +79,8 @@ struct backing_dev_info {
> unsigned long bw_time_stamp; /* last time write bw is updated */
> unsigned long dirtied_stamp;
> unsigned long written_stamp; /* pages written at bw_time_stamp */
> + unsigned long writeback_stamp; /* pages sent to writeback at
> + * bw_time_stamp */

Well this sucks. Some of the "foo_stamp" fields are in units of time
(jiffies? We aren't told) and some of the "foo_stamp" fields are in
units of number-of-pages. It would be good to fix the naming here.

> unsigned long write_bandwidth; /* the estimated write bandwidth */
> unsigned long avg_write_bandwidth; /* further smoothed write bw */
>
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index e3dea75..baac702 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -25,6 +25,7 @@ enum mapping_flags {
> AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
> AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
> AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
> + AS_STRICTLIMIT = __GFP_BITS_SHIFT + 5, /* strict dirty limit */

Thing is, "strict dirty limit" isn't documented anywhere, so this
reference is left dangling.

>
> ...
>
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
> "BackgroundThresh: %10lu kB\n"
> "BdiDirtied: %10lu kB\n"
> "BdiWritten: %10lu kB\n"
> + "BdiWrittenBack: %10lu kB\n"
> "BdiWriteBandwidth: %10lu kBps\n"
> "b_dirty: %10lu\n"
> "b_io: %10lu\n"

I can't imagine what the difference is between BdiWritten and
BdiWrittenBack.

I suggest you document this at the BDI_WRITTEN_BACK definition site in
enum bdi_stat_item. BDI_WRITTEN (at least) will also need
documentation so people can understand the difference.

>
> ...
>
> @@ -679,29 +711,31 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> if (unlikely(dirty >= limit))
> return 0;
>
> + if (unlikely(strictlimit)) {
> + if (bdi_dirty < 8)
> + return 2 << RATELIMIT_CALC_SHIFT;
> +
> + if (bdi_dirty >= bdi_thresh)
> + return 0;
> +
> + bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
> + bdi_setpoint /= 2;
> +
> + if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
> + return 0;
> +
> + pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
> + bdi_thresh);
> + return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
> + }

This would be a suitable site at which to document the strictlimit
feature. What it is, how it works and most importantly, why it exists.

>
> ...
>
> @@ -994,6 +1029,16 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> * keep that period small to reduce time lags).
> */
> step = 0;
> +
> + if (unlikely(strictlimit)) {
> + dirty = bdi_dirty;
> + if (bdi_dirty < 8)
> + setpoint = bdi_dirty + 1;
> + else
> + setpoint = (bdi_thresh +
> + bdi_dirty_limit(bdi, bg_thresh)) / 2;
> + }

Explain this to the reader, please.

>
> ...
>

2013-07-02 08:34:11

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH 16/16] mm: strictlimit feature

Hi Andrew,

07/02/2013 01:16 AM, Andrew Morton wrote:
> On Sat, 29 Jun 2013 21:48:54 +0400 Maxim Patlasov <[email protected]> wrote:
>
>> From: Miklos Szeredi <[email protected]>
>>
>> The feature prevents mistrusted filesystems from growing a large number of dirty
>> pages before throttling. For such filesystems balance_dirty_pages always
>> checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
>> "freerun", it's not allowed to skip the bdi checks. The only use case for now is
>> fuse: it sets bdi max_ratio to 1% by default and system administrators are
>> supposed to expect that this limit won't be exceeded.
>>
>> The feature is on if the address space is marked with the AS_STRICTLIMIT flag.
>> A filesystem may set the flag when it initializes a new inode.
>>
> Fengguang, could you please review this patch?
>
> I suggest you await the next version, which hopefully will be more
> reviewable...

Thanks a lot for the quick review, I'll update the patch according to your
comments soon.

I'm answering the question about BDI_idle below inline, but I'll also add
a comment about it where BDI_idle is actually used.

Thanks,
Maxim

>
>> ...
>>
>> --- a/include/linux/backing-dev.h
>> +++ b/include/linux/backing-dev.h
>> @@ -33,6 +33,8 @@ enum bdi_state {
>> BDI_sync_congested, /* The sync queue is getting full */
>> BDI_registered, /* bdi_register() was done */
>> BDI_writeback_running, /* Writeback is in progress */
>> + BDI_idle, /* No pages under writeback at the moment of
>> + * last update of write bw */
> Why does BDI_idle exist?

BDI_idle along with BDI_WRITTEN_BACK exists to distinguish two cases:

1st. BDI_WRITTEN has not been incremented since we looked at it last
time because the backing dev is unresponsive. I.e. it had some pages under
writeback but has not made any progress for some reason.

2nd. BDI_WRITTEN has not been incremented since we looked at it last
time because the backing dev had nothing to do. I.e. there are some dirty
pages on the bdi, but they have not been passed to the backing dev yet. This is
the case when bdi_dirty is under the bdi background threshold and the flusher
refrains from flushing even though we woke it up explicitly by
bdi_start_background_writeback.

We have to skip bdi_update_write_bandwidth() in the 2nd case because
otherwise bdi_update_write_bandwidth() will see written==0 and
mistakenly decrease write_bandwidth. The criterion to skip is the
following: BDI_idle is set (i.e. there were no pages under writeback
when we looked at the bdi last time) && the BDI_WRITTEN_BACK counter has not
changed (i.e. no new pages have been sent to writeback since we looked at
the bdi last time).
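
In code, this is precisely the two-line test the patch adds to
__bdi_update_bandwidth(), shown here with the two cases annotated:

	/* BDI_idle set: no pages were under writeback at the last sample,
	 * which rules out the 1st case (a stuck device). writeback_stamp
	 * unchanged: no new pages entered writeback since then, i.e. the
	 * 2nd case (nothing to do). Skip the update so that written==0 is
	 * not misread as a drop in bandwidth. */
	if (test_bit(BDI_idle, &bdi->state) &&
	    bdi->writeback_stamp == writeback)
		goto snapshot;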

Thanks,
Maxim

>
>> BDI_unused, /* Available bits start here */
>> };
>>
>> @@ -43,6 +45,7 @@ enum bdi_stat_item {
>> BDI_WRITEBACK,
>> BDI_DIRTIED,
>> BDI_WRITTEN,
>> + BDI_WRITTEN_BACK,
>> NR_BDI_STAT_ITEMS
>> };
>>
>> @@ -76,6 +79,8 @@ struct backing_dev_info {
>> unsigned long bw_time_stamp; /* last time write bw is updated */
>> unsigned long dirtied_stamp;
>> unsigned long written_stamp; /* pages written at bw_time_stamp */
>> + unsigned long writeback_stamp; /* pages sent to writeback at
>> + * bw_time_stamp */
> Well this sucks. Some of the "foo_stamp" fields are in units of time
> (jiffies? We aren't told) and some of the "foo_stamp" fields are in
> units of number-of-pages. It would be good to fix the naming here.
>
>> unsigned long write_bandwidth; /* the estimated write bandwidth */
>> unsigned long avg_write_bandwidth; /* further smoothed write bw */
>>
>> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
>> index e3dea75..baac702 100644
>> --- a/include/linux/pagemap.h
>> +++ b/include/linux/pagemap.h
>> @@ -25,6 +25,7 @@ enum mapping_flags {
>> AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
>> AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
>> AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
>> + AS_STRICTLIMIT = __GFP_BITS_SHIFT + 5, /* strict dirty limit */
> Thing is, "strict dirty limit" isn't documented anywhere, so this
> reference is left dangling.
>
>> ...
>>
>> --- a/mm/backing-dev.c
>> +++ b/mm/backing-dev.c
>> @@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
>> "BackgroundThresh: %10lu kB\n"
>> "BdiDirtied: %10lu kB\n"
>> "BdiWritten: %10lu kB\n"
>> + "BdiWrittenBack: %10lu kB\n"
>> "BdiWriteBandwidth: %10lu kBps\n"
>> "b_dirty: %10lu\n"
>> "b_io: %10lu\n"
> I can't imagine what the difference is between BdiWritten and
> BdiWrittenBack.
>
> I suggest you document this at the BDI_WRITTEN_BACK definition site in
> enum bdi_stat_item. BDI_WRITTEN (at least) will also need
> documentation so people can understand the difference.
>
>> ...
>>
>> @@ -679,29 +711,31 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>> if (unlikely(dirty >= limit))
>> return 0;
>>
>> + if (unlikely(strictlimit)) {
>> + if (bdi_dirty < 8)
>> + return 2 << RATELIMIT_CALC_SHIFT;
>> +
>> + if (bdi_dirty >= bdi_thresh)
>> + return 0;
>> +
>> + bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
>> + bdi_setpoint /= 2;
>> +
>> + if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
>> + return 0;
>> +
>> + pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
>> + bdi_thresh);
>> + return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
>> + }
> This would be a suitable site at which to document the strictlimit
> feature. What it is, how it works and most importantly, why it exists.
>
>> ...
>>
>> @@ -994,6 +1029,16 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>> * keep that period small to reduce time lags).
>> */
>> step = 0;
>> +
>> + if (unlikely(strictlimit)) {
>> + dirty = bdi_dirty;
>> + if (bdi_dirty < 8)
>> + setpoint = bdi_dirty + 1;
>> + else
>> + setpoint = (bdi_thresh +
>> + bdi_dirty_limit(bdi, bg_thresh)) / 2;
>> + }
> Explain this to the reader, please.
>
>> ...
>>
>
>

2013-07-02 17:45:20

by Maxim Patlasov

[permalink] [raw]
Subject: [PATCH] mm: strictlimit feature -v2

From: Miklos Szeredi <[email protected]>

The feature prevents mistrusted filesystems from growing a large number of dirty
pages before throttling. For such filesystems balance_dirty_pages always
checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
"freerun", it's not allowed to skip the bdi checks. The only use case for now is
fuse: it sets bdi max_ratio to 1% by default and system administrators are
supposed to expect that this limit won't be exceeded.

The feature is on if the address space is marked with the AS_STRICTLIMIT flag.
A filesystem may set the flag when it initializes a new inode.

Changed in v2 (thanks to Andrew Morton):
- added a few explanatory comments
- cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
fields to reflect that they are in units of number-of-pages.

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fs-writeback.c | 2
fs/fuse/file.c | 3 +
fs/fuse/inode.c | 1
include/linux/backing-dev.h | 22 ++++-
include/linux/pagemap.h | 1
include/linux/writeback.h | 3 -
mm/backing-dev.c | 5 +
mm/page-writeback.c | 176 +++++++++++++++++++++++++++++++++++--------
8 files changed, 171 insertions(+), 42 deletions(-)

diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index 3be5718..8c75277 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -762,7 +762,7 @@ static bool over_bground_thresh(struct backing_dev_info *bdi)
static void wb_update_bandwidth(struct bdi_writeback *wb,
unsigned long start_time)
{
- __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time);
+ __bdi_update_bandwidth(wb->bdi, 0, 0, 0, 0, 0, start_time, false);
}

/*
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 6d30db1..99b5c97 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1697,6 +1697,7 @@ static int fuse_writepage_locked(struct page *page,
req->inode = inode;

inc_bdi_stat(mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_bdi_stat(mapping->backing_dev_info, BDI_WRITTEN_BACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
end_page_writeback(page);

@@ -1766,6 +1767,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
if (tmp_page) {
copy_highpage(tmp_page, page);
inc_bdi_stat(bdi, BDI_WRITEBACK);
+ inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
} else
all_ok = 0;
@@ -1781,6 +1783,7 @@ static int fuse_send_writepages(struct fuse_fill_wb_data *data)
struct page *page = req->pages[i];
if (page) {
dec_bdi_stat(bdi, BDI_WRITEBACK);
+ dec_bdi_stat(bdi, BDI_WRITTEN_BACK);
dec_zone_page_state(page, NR_WRITEBACK_TEMP);
__free_page(page);
req->pages[i] = NULL;
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 4beb8e3..00a28af 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
inode->i_data.backing_dev_info = &fc->bdi;
+ set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
fuse_init_inode(inode, attr);
unlock_new_inode(inode);
} else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index c388155..6b12d01 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -33,6 +33,8 @@ enum bdi_state {
BDI_sync_congested, /* The sync queue is getting full */
BDI_registered, /* bdi_register() was done */
BDI_writeback_running, /* Writeback is in progress */
+ BDI_idle, /* No pages under writeback at the moment of
+ * last update of write bw */
BDI_unused, /* Available bits start here */
};

@@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
enum bdi_stat_item {
BDI_RECLAIMABLE,
BDI_WRITEBACK,
- BDI_DIRTIED,
- BDI_WRITTEN,
+
+ /*
+ * The three counters below reflects number of events of specific type
+ * happened since bdi_init(). The type is defined in comments below:
+ */
+ BDI_DIRTIED, /* a page was dirtied */
+ BDI_WRITTEN, /* writeout completed for a page */
+ BDI_WRITTEN_BACK, /* a page went to writeback */
+
NR_BDI_STAT_ITEMS
};

@@ -73,9 +82,12 @@ struct backing_dev_info {

struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];

- unsigned long bw_time_stamp; /* last time write bw is updated */
- unsigned long dirtied_stamp;
- unsigned long written_stamp; /* pages written at bw_time_stamp */
+ unsigned long bw_time_stamp; /* last time (in jiffies) write bw
+ * is updated */
+ unsigned long dirtied_nr_stamp;
+ unsigned long written_nr_stamp; /* pages written at bw_time_stamp */
+ unsigned long writeback_nr_stamp; /* pages sent to writeback at
+ * bw_time_stamp */
unsigned long write_bandwidth; /* the estimated write bandwidth */
unsigned long avg_write_bandwidth; /* further smoothed write bw */

diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e3dea75..baac702 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -25,6 +25,7 @@ enum mapping_flags {
AS_MM_ALL_LOCKS = __GFP_BITS_SHIFT + 2, /* under mm_take_all_locks() */
AS_UNEVICTABLE = __GFP_BITS_SHIFT + 3, /* e.g., ramdisk, SHM_LOCK */
AS_BALLOON_MAP = __GFP_BITS_SHIFT + 4, /* balloon page special map */
+ AS_STRICTLIMIT = __GFP_BITS_SHIFT + 5, /* strict dirty limit */
};

static inline void mapping_set_error(struct address_space *mapping, int error)
diff --git a/include/linux/writeback.h b/include/linux/writeback.h
index 579a500..5144051 100644
--- a/include/linux/writeback.h
+++ b/include/linux/writeback.h
@@ -159,7 +159,8 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time);
+ unsigned long start_time,
+ bool strictlimit);

void page_writeback_init(void);
void balance_dirty_pages_ratelimited(struct address_space *mapping);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index 5025174..acfeec7 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -94,6 +94,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
"BackgroundThresh: %10lu kB\n"
"BdiDirtied: %10lu kB\n"
"BdiWritten: %10lu kB\n"
+ "BdiWrittenBack: %10lu kB\n"
"BdiWriteBandwidth: %10lu kBps\n"
"b_dirty: %10lu\n"
"b_io: %10lu\n"
@@ -107,6 +108,7 @@ static int bdi_debug_stats_show(struct seq_file *m, void *v)
K(background_thresh),
(unsigned long) K(bdi_stat(bdi, BDI_DIRTIED)),
(unsigned long) K(bdi_stat(bdi, BDI_WRITTEN)),
+ (unsigned long) K(bdi_stat(bdi, BDI_WRITTEN_BACK)),
(unsigned long) K(bdi->write_bandwidth),
nr_dirty,
nr_io,
@@ -454,7 +456,8 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->dirty_exceeded = 0;

bdi->bw_time_stamp = jiffies;
- bdi->written_stamp = 0;
+ bdi->written_nr_stamp = 0;
+ bdi->writeback_nr_stamp = 0;

bdi->balanced_dirty_ratelimit = INIT_BW;
bdi->dirty_ratelimit = INIT_BW;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 4514ad7..83c7434 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -585,6 +585,37 @@ unsigned long bdi_dirty_limit(struct backing_dev_info *bdi, unsigned long dirty)
}

/*
+ * setpoint - dirty 3
+ * f(dirty) := 1.0 + (----------------)
+ * limit - setpoint
+ *
+ * it's a 3rd order polynomial that subjects to
+ *
+ * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
+ * (2) f(setpoint) = 1.0 => the balance point
+ * (3) f(limit) = 0 => the hard limit
+ * (4) df/dx <= 0 => negative feedback control
+ * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
+ * => fast response on large errors; small oscillation near setpoint
+ */
+static inline long long pos_ratio_polynom(unsigned long setpoint,
+ unsigned long dirty,
+ unsigned long limit)
+{
+ long long pos_ratio;
+ long x;
+
+ x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
+ limit - setpoint + 1);
+ pos_ratio = x;
+ pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+ pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
+ pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+
+ return pos_ratio;
+}
+
+/*
* Dirty position control.
*
* (o) global/bdi setpoints
@@ -664,7 +695,8 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
unsigned long bg_thresh,
unsigned long dirty,
unsigned long bdi_thresh,
- unsigned long bdi_dirty)
+ unsigned long bdi_dirty,
+ bool strictlimit)
{
unsigned long write_bw = bdi->avg_write_bandwidth;
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
@@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
return 0;

/*
- * global setpoint
+ * The strictlimit feature is a tool preventing mistrusted filesystems
+ * to grow a large number of dirty pages before throttling. For such
+ * filesystems balance_dirty_pages always checks bdi counters against
+ * bdi limits. Even if global "nr_dirty" is under "freerun". This is
+ * especially important for fuse who sets bdi->max_ratio to 1% by
+ * default. Without strictlimit feature, fuse writeback may consume
+ * arbitrary amount of RAM because it is accounted in
+ * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
*
- * setpoint - dirty 3
- * f(dirty) := 1.0 + (----------------)
- * limit - setpoint
+ * Here, in bdi_position_ratio(), we calculate pos_ratio based on
+ * two values: bdi_dirty and bdi_thresh. Let's consider an example:
+ * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
+ * limits are set by default to 10% and 20% (background and throttle).
+ * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
+ * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
+ * about ~6K pages (as the average of background and throttle bdi
+ * limits). The 3rd order polynomial will provide positive feedback if
+ * bdi_dirty is under bdi_setpoint and vice versa.
*
- * it's a 3rd order polynomial that subjects to
+ * Note, that we cannot use global counters in these calculations
+ * because we want to throttle process writing to strictlimit address
+ * space much earlier than global "freerun" is reached (~23MB vs.
+ * ~2.3GB in the example above).
+ */
+ if (unlikely(strictlimit)) {
+ if (bdi_dirty < 8)
+ return 2 << RATELIMIT_CALC_SHIFT;
+
+ if (bdi_dirty >= bdi_thresh)
+ return 0;
+
+ bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
+ bdi_setpoint /= 2;
+
+ if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
+ return 0;
+
+ pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
+ bdi_thresh);
+ return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
+ }
+
+ /*
+ * global setpoint
*
- * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
- * (2) f(setpoint) = 1.0 => the balance point
- * (3) f(limit) = 0 => the hard limit
- * (4) df/dx <= 0 => negative feedback control
- * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
- * => fast response on large errors; small oscillation near setpoint
+ * See comment for pos_ratio_polynom().
*/
setpoint = (freerun + limit) / 2;
- x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
- limit - setpoint + 1);
- pos_ratio = x;
- pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
- pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
- pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
+ pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);

/*
* We have computed basic pos_ratio above based on global situation. If
@@ -799,7 +858,7 @@ static void bdi_update_write_bandwidth(struct backing_dev_info *bdi,
* write_bandwidth = ---------------------------------------------------
* period
*/
- bw = written - bdi->written_stamp;
+ bw = written - bdi->written_nr_stamp;
bw *= HZ;
if (unlikely(elapsed > period)) {
do_div(bw, elapsed);
@@ -892,7 +951,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
unsigned long dirtied,
- unsigned long elapsed)
+ unsigned long elapsed,
+ bool strictlimit)
{
unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
unsigned long limit = hard_dirty_limit(thresh);
@@ -910,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
* The dirty rate will match the writeout rate in long term, except
* when dirty pages are truncated by userspace or re-dirtied by FS.
*/
- dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
+ dirty_rate = (dirtied - bdi->dirtied_nr_stamp) * HZ / elapsed;

pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty);
+ bdi_thresh, bdi_dirty, strictlimit);
/*
* task_ratelimit reflects each dd's dirty rate for the past 200ms.
*/
@@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
* keep that period small to reduce time lags).
*/
step = 0;
+
+ /*
+ * For strictlimit case, balanced_dirty_ratelimit was calculated
+ * above based on bdi counters and limits (see bdi_position_ratio()).
+ * Hence, to calculate "step" properly, we have to use bdi_dirty as
+ * "dirty" and bdi_setpoint as "setpoint".
+ *
+ * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
+ * it's possible that bdi_thresh is close to zero due to inactivity
+ * of backing device (see the implementation of bdi_dirty_limit()).
+ */
+ if (unlikely(strictlimit)) {
+ dirty = bdi_dirty;
+ if (bdi_dirty < 8)
+ setpoint = bdi_dirty + 1;
+ else
+ setpoint = (bdi_thresh +
+ bdi_dirty_limit(bdi, bg_thresh)) / 2;
+ }
+
if (dirty < setpoint) {
x = min(bdi->balanced_dirty_ratelimit,
min(balanced_dirty_ratelimit, task_ratelimit));
@@ -1034,12 +1114,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time)
+ unsigned long start_time,
+ bool strictlimit)
{
unsigned long now = jiffies;
unsigned long elapsed = now - bdi->bw_time_stamp;
unsigned long dirtied;
unsigned long written;
+ unsigned long writeback;

/*
* rate-limit, only update once every 200ms.
@@ -1049,6 +1131,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,

dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
+ writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);

/*
* Skip quiet periods when disk bandwidth is under-utilized.
@@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
goto snapshot;

+ /*
+ * Skip periods when backing dev was idle due to abscence of pages
+ * under writeback (when over_bground_thresh() returns false)
+ */
+ if (test_bit(BDI_idle, &bdi->state) &&
+ bdi->writeback_nr_stamp == writeback)
+ goto snapshot;
+
if (thresh) {
global_update_bandwidth(thresh, dirty, now);
bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
bdi_thresh, bdi_dirty,
- dirtied, elapsed);
+ dirtied, elapsed, strictlimit);
}
bdi_update_write_bandwidth(bdi, elapsed, written);

snapshot:
- bdi->dirtied_stamp = dirtied;
- bdi->written_stamp = written;
+ bdi->dirtied_nr_stamp = dirtied;
+ bdi->written_nr_stamp = written;
bdi->bw_time_stamp = now;
+
+ bdi->writeback_nr_stamp = writeback;
+ if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
+ set_bit(BDI_idle, &bdi->state);
+ else
+ clear_bit(BDI_idle, &bdi->state);
}

static void bdi_update_bandwidth(struct backing_dev_info *bdi,
@@ -1077,13 +1174,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
unsigned long dirty,
unsigned long bdi_thresh,
unsigned long bdi_dirty,
- unsigned long start_time)
+ unsigned long start_time,
+ bool strictlimit)
{
if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
return;
spin_lock(&bdi->wb.list_lock);
__bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
- bdi_thresh, bdi_dirty, start_time);
+ bdi_thresh, bdi_dirty, start_time, strictlimit);
spin_unlock(&bdi->wb.list_lock);
}

@@ -1226,6 +1324,7 @@ static void balance_dirty_pages(struct address_space *mapping,
unsigned long dirty_ratelimit;
unsigned long pos_ratio;
struct backing_dev_info *bdi = mapping->backing_dev_info;
+ bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
unsigned long start_time = jiffies;

for (;;) {
@@ -1250,7 +1349,7 @@ static void balance_dirty_pages(struct address_space *mapping,
*/
freerun = dirty_freerun_ceiling(dirty_thresh,
background_thresh);
- if (nr_dirty <= freerun) {
+ if (nr_dirty <= freerun && !strictlimit) {
current->dirty_paused_when = now;
current->nr_dirtied = 0;
current->nr_dirtied_pause =
@@ -1258,7 +1357,7 @@ static void balance_dirty_pages(struct address_space *mapping,
break;
}

- if (unlikely(!writeback_in_progress(bdi)))
+ if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
bdi_start_background_writeback(bdi);

/*
@@ -1296,19 +1395,24 @@ static void balance_dirty_pages(struct address_space *mapping,
bdi_stat(bdi, BDI_WRITEBACK);
}

+ if (unlikely(!writeback_in_progress(bdi)) &&
+ bdi_dirty > bdi_thresh / 4)
+ bdi_start_background_writeback(bdi);
+
dirty_exceeded = (bdi_dirty > bdi_thresh) &&
- (nr_dirty > dirty_thresh);
+ ((nr_dirty > dirty_thresh) || strictlimit);
if (dirty_exceeded && !bdi->dirty_exceeded)
bdi->dirty_exceeded = 1;

bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
nr_dirty, bdi_thresh, bdi_dirty,
- start_time);
+ start_time, strictlimit);

dirty_ratelimit = bdi->dirty_ratelimit;
pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
background_thresh, nr_dirty,
- bdi_thresh, bdi_dirty);
+ bdi_thresh, bdi_dirty,
+ strictlimit);
task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
RATELIMIT_CALC_SHIFT;
max_pause = bdi_max_pause(bdi, bdi_dirty);
@@ -1362,6 +1466,8 @@ static void balance_dirty_pages(struct address_space *mapping,
}

pause:
+ if (unlikely(!writeback_in_progress(bdi)))
+ bdi_start_background_writeback(bdi);
trace_balance_dirty_pages(bdi,
dirty_thresh,
background_thresh,
@@ -2265,8 +2371,10 @@ int test_set_page_writeback(struct page *page)
radix_tree_tag_set(&mapping->page_tree,
page_index(page),
PAGECACHE_TAG_WRITEBACK);
- if (bdi_cap_account_writeback(bdi))
+ if (bdi_cap_account_writeback(bdi)) {
__inc_bdi_stat(bdi, BDI_WRITEBACK);
+ __inc_bdi_stat(bdi, BDI_WRITTEN_BACK);
+ }
}
if (!PageDirty(page))
radix_tree_tag_clear(&mapping->page_tree,

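Spelling out the arithmetic of the 16GB example in the new
bdi_position_ratio() comment above (a sketch assuming 4KiB pages and, like the
comment itself, approximating bdi_dirty_limit() by the 1% max_ratio cap):

	#include <stdio.h>

	int main(void)
	{
		unsigned long long pages = (16ULL << 30) >> 12; /* 16GB, 4KiB pages */
		unsigned long long thresh = pages * 20 / 100;	/* throttle, 20% */
		unsigned long long bg = pages * 10 / 100;	/* background, 10% */
		unsigned long long bdi_thresh = thresh / 100;	/* max_ratio = 1% */
		unsigned long long bdi_bg = bg / 100;		/* ~bdi_dirty_limit() */

		printf("bdi_thresh   = %llu pages (~8K)\n", bdi_thresh);
		printf("bdi bg limit = %llu pages (~4K)\n", bdi_bg);
		printf("bdi_setpoint = %llu pages (~6K)\n",
		       (bdi_thresh + bdi_bg) / 2);
		printf("freerun      = %llu pages\n", (thresh + bg) / 2);
		return 0;
	}

The strictlimit setpoint of ~6K pages is a couple dozen MB, while the global
freerun of ~630K pages is a couple of GB -- the two-orders-of-magnitude gap
the comment refers to.
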
2013-07-02 19:38:09

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] mm: strictlimit feature -v2

On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <[email protected]> wrote:

> From: Miklos Szeredi <[email protected]>
>
> The feature prevents mistrusted filesystems from growing a large number of dirty
> pages before throttling. For such filesystems balance_dirty_pages always
> checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
> "freerun", it's not allowed to skip the bdi checks. The only use case for now is
> fuse: it sets bdi max_ratio to 1% by default and system administrators are
> supposed to expect that this limit won't be exceeded.
>
> The feature is on if the address space is marked with the AS_STRICTLIMIT flag.
> A filesystem may set the flag when it initializes a new inode.
>
> Changed in v2 (thanks to Andrew Morton):
> - added a few explanatory comments
> - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
> stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
> fields to reflect that they are in units of number-of-pages.
>

Better, thanks.

The writeback arithmetic makes my head spin - I'd really like Fengguang
to go over this, please.

A quick visit from the spelling police:

>
> ...
>
> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
> enum bdi_stat_item {
> BDI_RECLAIMABLE,
> BDI_WRITEBACK,
> - BDI_DIRTIED,
> - BDI_WRITTEN,
> +
> + /*
> + * The three counters below reflects number of events of specific type
> + * happened since bdi_init(). The type is defined in comments below:

"The three counters below reflect the number of events of specific
types since bdi_init()"

> + */
> + BDI_DIRTIED, /* a page was dirtied */
> + BDI_WRITTEN, /* writeout completed for a page */
> + BDI_WRITTEN_BACK, /* a page went to writeback */
> +
> NR_BDI_STAT_ITEMS
> };
>
>
> ...
>
> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> return 0;
>
> /*
> - * global setpoint
> + * The strictlimit feature is a tool preventing mistrusted filesystems
> + * to grow a large number of dirty pages before throttling. For such

"from growing"

> + * filesystems balance_dirty_pages always checks bdi counters against
> + * bdi limits. Even if global "nr_dirty" is under "freerun". This is
> + * especially important for fuse who sets bdi->max_ratio to 1% by

s/who/which/

> + * default. Without strictlimit feature, fuse writeback may consume
> + * arbitrary amount of RAM because it is accounted in
> + * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
>
> ...
>
> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> * keep that period small to reduce time lags).
> */
> step = 0;
> +
> + /*
> + * For strictlimit case, balanced_dirty_ratelimit was calculated

balance_dirty_ratelimit?

> + * above based on bdi counters and limits (see bdi_position_ratio()).
> + * Hence, to calculate "step" properly, we have to use bdi_dirty as
> + * "dirty" and bdi_setpoint as "setpoint".
> + *
> + * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
> + * it's possible that bdi_thresh is close to zero due to inactivity
> + * of backing device (see the implementation of bdi_dirty_limit()).
> + */
> + if (unlikely(strictlimit)) {
> + dirty = bdi_dirty;
> + if (bdi_dirty < 8)
> + setpoint = bdi_dirty + 1;
> + else
>
> ...
>
> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> goto snapshot;
>
> + /*
> + * Skip periods when backing dev was idle due to abscence of pages

"absence"

> + * under writeback (when over_bground_thresh() returns false)
> + */
> + if (test_bit(BDI_idle, &bdi->state) &&
> + bdi->writeback_nr_stamp == writeback)
>
> ...
>

2013-07-03 11:01:54

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH] mm: strictlimit feature -v2

07/02/2013 11:38 PM, Andrew Morton wrote:
> On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <[email protected]> wrote:
>
>> From: Miklos Szeredi <[email protected]>
>>
>> The feature prevents mistrusted filesystems from growing a large number of dirty
>> pages before throttling. For such filesystems balance_dirty_pages always
>> checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
>> "freerun", it's not allowed to skip the bdi checks. The only use case for now is
>> fuse: it sets bdi max_ratio to 1% by default and system administrators are
>> supposed to expect that this limit won't be exceeded.
>>
>> The feature is on if the address space is marked with the AS_STRICTLIMIT flag.
>> A filesystem may set the flag when it initializes a new inode.
>>
>> Changed in v2 (thanks to Andrew Morton):
>> - added a few explanatory comments
>> - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
>> stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
>> fields to reflect that they are in units of number-of-pages.
>>
> Better, thanks.
>
> The writeback arithmetic makes my head spin - I'd really like Fengguang
> to go over this, please.
>
> A quick visit from the spelling police:

Great! Thank you, Andrew. I'll wait for Fengguang's feedback for a while
before respinning.

>
>> ...
>>
>> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
>> enum bdi_stat_item {
>> BDI_RECLAIMABLE,
>> BDI_WRITEBACK,
>> - BDI_DIRTIED,
>> - BDI_WRITTEN,
>> +
>> + /*
>> + * The three counters below reflects number of events of specific type
>> + * happened since bdi_init(). The type is defined in comments below:
> "The three counters below reflect the number of events of specific
> types since bdi_init()"
>
>> + */
>> + BDI_DIRTIED, /* a page was dirtied */
>> + BDI_WRITTEN, /* writeout completed for a page */
>> + BDI_WRITTEN_BACK, /* a page went to writeback */
>> +
>> NR_BDI_STAT_ITEMS
>> };
>>
>>
>> ...
>>
>> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>> return 0;
>>
>> /*
>> - * global setpoint
>> + * The strictlimit feature is a tool preventing mistrusted filesystems
>> + * to grow a large number of dirty pages before throttling. For such
> "from growing"
>
>> + * filesystems balance_dirty_pages always checks bdi counters against
>> + * bdi limits. Even if global "nr_dirty" is under "freerun". This is
>> + * especially important for fuse who sets bdi->max_ratio to 1% by
> s/who/which/
>
>> + * default. Without strictlimit feature, fuse writeback may consume
>> + * arbitrary amount of RAM because it is accounted in
>> + * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
>>
>> ...
>>
>> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>> * keep that period small to reduce time lags).
>> */
>> step = 0;
>> +
>> + /*
>> + * For strictlimit case, balanced_dirty_ratelimit was calculated
> balance_dirty_ratelimit?
>
>> + * above based on bdi counters and limits (see bdi_position_ratio()).
>> + * Hence, to calculate "step" properly, we have to use bdi_dirty as
>> + * "dirty" and bdi_setpoint as "setpoint".
>> + *
>> + * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
>> + * it's possible that bdi_thresh is close to zero due to inactivity
>> + * of backing device (see the implementation of bdi_dirty_limit()).
>> + */
>> + if (unlikely(strictlimit)) {
>> + dirty = bdi_dirty;
>> + if (bdi_dirty < 8)
>> + setpoint = bdi_dirty + 1;
>> + else
>>
>> ...
>>
>> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>> if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>> goto snapshot;
>>
>> + /*
>> + * Skip periods when backing dev was idle due to abscence of pages
> "absence"
>
>> + * under writeback (when over_bground_thresh() returns false)
>> + */
>> + if (test_bit(BDI_idle, &bdi->state) &&
>> + bdi->writeback_nr_stamp == writeback)
>>
>> ...
>>
>
>

2013-07-03 23:16:48

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH] mm: strictlimit feature -v2

On Wed 03-07-13 15:01:19, Maxim Patlasov wrote:
> 07/02/2013 11:38 PM, Andrew Morton wrote:
> >On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <[email protected]> wrote:
> >
> >>From: Miklos Szeredi <[email protected]>
> >>
> >>The feature prevents mistrusted filesystems from growing a large number of dirty
> >>pages before throttling. For such filesystems balance_dirty_pages always
> >>checks bdi counters against bdi limits. I.e. even if global "nr_dirty" is under
> >>"freerun", it's not allowed to skip the bdi checks. The only use case for now is
> >>fuse: it sets bdi max_ratio to 1% by default and system administrators are
> >>supposed to expect that this limit won't be exceeded.
> >>
> >>The feature is on if the address space is marked with the AS_STRICTLIMIT flag.
> >>A filesystem may set the flag when it initializes a new inode.
> >>
> >>Changed in v2 (thanks to Andrew Morton):
> >> - added a few explanatory comments
> >> - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
> >> stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
> >> fields to reflect that they are in units of number-of-pages.
> >>
> >Better, thanks.
> >
> >The writeback arithmetic makes my head spin - I'd really like Fengguang
> >to go over this, please.
> >
> >A quick visit from the spelling police:
>
> Great! Thank you, Andrew. I'll wait for Fengguang's feedback for a
> while before respinning.
Sorry for the bad mail threading, but I noticed the thread only now and
I don't have the emails with your patches in my mailbox anymore. Below is a
review of your strictlimit patch. In principle, I'm OK with the idea (I
even wanted to have a similar ability, e.g. for NFS mounts) but I have some
reservations regarding the implementation:

> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
> index 4beb8e3..00a28af 100644
> --- a/fs/fuse/inode.c
> +++ b/fs/fuse/inode.c
> @@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
> inode->i_flags |= S_NOCMTIME;
> inode->i_generation = generation;
> inode->i_data.backing_dev_info = &fc->bdi;
> + set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
> fuse_init_inode(inode, attr);
> unlock_new_inode(inode);
> } else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
It seems wrong to use address space bits for this. Using BDI capabilities for
this would look more appropriate. Sure, you couldn't then always tune this on a
per-fs basis since filesystems can share a BDI (e.g. when they are on different
partitions of one disk), but if several filesystems are sharing a BDI and some
want 'strict' behavior while some don't, I don't believe the resulting
behavior would be sane - e.g. a non-strict fs could be pushing the bdi over the
per-bdi limit without any throttling while a strictly limited fs would be
continuously stalled. Or am I missing a reason why this is set on the address
space?

As a bonus you don't have to pass the 'strictlimit' flag to the writeback
functions when it is a bdi flag.
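
A minimal sketch of the alternative suggested here (the capability name and
bit value are hypothetical; bdi->capabilities is an existing field): strictness
becomes a property of the bdi, so writeback code can query it directly instead
of plumbing a flag through:

	/* Hypothetical capability bit; value chosen from the unused range. */
	#define BDI_CAP_STRICTLIMIT	0x00000400

	static inline bool bdi_strictlimit(struct backing_dev_info *bdi)
	{
		return bdi->capabilities & BDI_CAP_STRICTLIMIT;
	}

balance_dirty_pages() could then derive strictlimit from
mapping->backing_dev_info and the extra parameter would disappear.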

> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index c388155..6b12d01 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -33,6 +33,8 @@ enum bdi_state {
> BDI_sync_congested, /* The sync queue is getting full */
> BDI_registered, /* bdi_register() was done */
> BDI_writeback_running, /* Writeback is in progress */
> + BDI_idle, /* No pages under writeback at the moment of
> + * last update of write bw */
> BDI_unused, /* Available bits start here */
> };
>
> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
> enum bdi_stat_item {
> BDI_RECLAIMABLE,
> BDI_WRITEBACK,
> - BDI_DIRTIED,
> - BDI_WRITTEN,
> +
> + /*
> + * The three counters below reflects number of events of specific type
> + * happened since bdi_init(). The type is defined in comments below:
> + */
> + BDI_DIRTIED, /* a page was dirtied */
> + BDI_WRITTEN, /* writeout completed for a page */
> + BDI_WRITTEN_BACK, /* a page went to writeback */
> +
> NR_BDI_STAT_ITEMS
> };
>
> @@ -73,9 +82,12 @@ struct backing_dev_info {
>
> struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
>
> - unsigned long bw_time_stamp; /* last time write bw is updated */
> - unsigned long dirtied_stamp;
> - unsigned long written_stamp; /* pages written at bw_time_stamp */
> + unsigned long bw_time_stamp; /* last time (in jiffies) write bw
> + * is updated */
> + unsigned long dirtied_nr_stamp;
> + unsigned long written_nr_stamp; /* pages written at bw_time_stamp */
> + unsigned long writeback_nr_stamp; /* pages sent to writeback at
> + * bw_time_stamp */
> unsigned long write_bandwidth; /* the estimated write bandwidth */
> unsigned long avg_write_bandwidth; /* further smoothed write bw */
>
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 4514ad7..83c7434 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
> return 0;
>
> /*
> - * global setpoint
> + * The strictlimit feature is a tool preventing mistrusted filesystems
> + * from growing a large number of dirty pages before throttling. For
> + * such filesystems balance_dirty_pages always checks bdi counters
> + * against bdi limits, even if global "nr_dirty" is under "freerun".
> + * This is especially important for fuse, which sets bdi->max_ratio to
> + * 1% by default. Without the strictlimit feature, fuse writeback may
> + * consume an arbitrary amount of RAM because it is accounted in
> + * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
> *
> - * setpoint - dirty 3
> - * f(dirty) := 1.0 + (----------------)
> - * limit - setpoint
> + * Here, in bdi_position_ratio(), we calculate pos_ratio based on
> + * two values: bdi_dirty and bdi_thresh. Let's consider an example:
> + * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
> + * limits are set by default to 10% and 20% (background and throttle).
> + * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
> + * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
> + * about ~6K pages (as the average of background and throttle bdi
> + * limits). The 3rd order polynomial will provide positive feedback if
> + * bdi_dirty is under bdi_setpoint and vice versa.
> *
> - * it's a 3rd order polynomial that subjects to
> + * Note that we cannot use global counters in these calculations
> + * because we want to throttle a process writing to a strictlimit
> + * address space much earlier than the global "freerun" is reached
> + * (~23MB vs. ~2.3GB in the example above).
> + */
> + if (unlikely(strictlimit)) {
> + if (bdi_dirty < 8)
> + return 2 << RATELIMIT_CALC_SHIFT;
> +
> + if (bdi_dirty >= bdi_thresh)
> + return 0;
> +
> + bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
> + bdi_setpoint /= 2;
> +
> + if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
> + return 0;
> +
> + pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
> + bdi_thresh);
> + return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
> + }
But if the global limits are exceeded while a BDI in strict mode is below its
limit, this would allow dirtying on that BDI, which seems wrong. Also, the
logic in bdi_position_ratio() is already supposed to take bdi limits into
account (although the math is somewhat convoluted), so you shouldn't have to
touch it. Only maybe the increasing of bdi_thresh to (limit - dirty) / 8 might
be too much for strict limiting, so that may need some tweaking (although
setting it at 8 pages as your patch does seems *too* strict to me).

> +
> + /*
> + * global setpoint
> *
> - * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
> - * (2) f(setpoint) = 1.0 => the balance point
> - * (3) f(limit) = 0 => the hard limit
> - * (4) df/dx <= 0 => negative feedback control
> - * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
> - * => fast response on large errors; small oscillation near setpoint
> + * See comment for pos_ratio_polynom().
> */
> setpoint = (freerun + limit) / 2;
> - x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
> - limit - setpoint + 1);
> - pos_ratio = x;
> - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
> - pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
> + pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
>
> /*
> * We have computed basic pos_ratio above based on global situation. If
> @@ -892,7 +951,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
> unsigned long dirtied,
> - unsigned long elapsed)
> + unsigned long elapsed,
> + bool strictlimit)
> {
> unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
> unsigned long limit = hard_dirty_limit(thresh);
> @@ -910,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> * The dirty rate will match the writeout rate in long term, except
> * when dirty pages are truncated by userspace or re-dirtied by FS.
> */
> - dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
> + dirty_rate = (dirtied - bdi->dirtied_nr_stamp) * HZ / elapsed;
>
> pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
> - bdi_thresh, bdi_dirty);
> + bdi_thresh, bdi_dirty, strictlimit);
> /*
> * task_ratelimit reflects each dd's dirty rate for the past 200ms.
> */
> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
> * keep that period small to reduce time lags).
> */
> step = 0;
> +
> + /*
> + * For strictlimit case, balanced_dirty_ratelimit was calculated
> + * above based on bdi counters and limits (see bdi_position_ratio()).
> + * Hence, to calculate "step" properly, we have to use bdi_dirty as
> + * "dirty" and bdi_setpoint as "setpoint".
> + *
> + * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
> + * it's possible that bdi_thresh is close to zero due to inactivity
> + * of backing device (see the implementation of bdi_dirty_limit()).
> + */
> + if (unlikely(strictlimit)) {
> + dirty = bdi_dirty;
> + if (bdi_dirty < 8)
> + setpoint = bdi_dirty + 1;
> + else
> + setpoint = (bdi_thresh +
> + bdi_dirty_limit(bdi, bg_thresh)) / 2;
> + }
> +
> if (dirty < setpoint) {
> x = min(bdi->balanced_dirty_ratelimit,
> min(balanced_dirty_ratelimit, task_ratelimit));
> @@ -1034,12 +1114,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
> - unsigned long start_time)
> + unsigned long start_time,
> + bool strictlimit)
> {
> unsigned long now = jiffies;
> unsigned long elapsed = now - bdi->bw_time_stamp;
> unsigned long dirtied;
> unsigned long written;
> + unsigned long writeback;
>
> /*
> * rate-limit, only update once every 200ms.
> @@ -1049,6 +1131,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>
> dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
> written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
> + writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);
>
> /*
> * Skip quiet periods when disk bandwidth is under-utilized.
> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
> if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
> goto snapshot;
>
> + /*
> + * Skip periods when the backing dev was idle due to absence of pages
> + * under writeback (when over_bground_thresh() returns false)
> + */
> + if (test_bit(BDI_idle, &bdi->state) &&
> + bdi->writeback_nr_stamp == writeback)
> + goto snapshot;
> +
Hum, I understand the desire behind BDI_idle but that seems to be solving
only a special case, doesn't it? When there is small traffic on the bdi, the
bandwidth will get updated anyway and the computed bandwidth will go down
from the real maximum value when there are enough pages to write. So I'm not
sure how much this really helps. Plus this 'idle' logic seems completely
independent of the balance_dirty_pages() tweaks, so it would be better done
as a separate patch (in case you have convincing reasons we really need that
logic).

> if (thresh) {
> global_update_bandwidth(thresh, dirty, now);
> bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
> bdi_thresh, bdi_dirty,
> - dirtied, elapsed);
> + dirtied, elapsed, strictlimit);
> }
> bdi_update_write_bandwidth(bdi, elapsed, written);
>
> snapshot:
> - bdi->dirtied_stamp = dirtied;
> - bdi->written_stamp = written;
> + bdi->dirtied_nr_stamp = dirtied;
> + bdi->written_nr_stamp = written;
> bdi->bw_time_stamp = now;
> +
> + bdi->writeback_nr_stamp = writeback;
> + if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
> + set_bit(BDI_idle, &bdi->state);
> + else
> + clear_bit(BDI_idle, &bdi->state);
> }
>
> static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> @@ -1077,13 +1174,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
> unsigned long dirty,
> unsigned long bdi_thresh,
> unsigned long bdi_dirty,
> - unsigned long start_time)
> + unsigned long start_time,
> + bool strictlimit)
> {
> if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
> return;
> spin_lock(&bdi->wb.list_lock);
> __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
> - bdi_thresh, bdi_dirty, start_time);
> + bdi_thresh, bdi_dirty, start_time, strictlimit);
> spin_unlock(&bdi->wb.list_lock);
> }
>
> @@ -1226,6 +1324,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> unsigned long dirty_ratelimit;
> unsigned long pos_ratio;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> + bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
> unsigned long start_time = jiffies;
>
> for (;;) {
> @@ -1250,7 +1349,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> */
> freerun = dirty_freerun_ceiling(dirty_thresh,
> background_thresh);
> - if (nr_dirty <= freerun) {
> + if (nr_dirty <= freerun && !strictlimit) {
> current->dirty_paused_when = now;
> current->nr_dirtied = 0;
> current->nr_dirtied_pause =
I'd rather change this to check bdi_dirty <= bdi_freerun in strictlimit
case.

> @@ -1258,7 +1357,7 @@ static void balance_dirty_pages(struct address_space *mapping,
> break;
> }
>
> - if (unlikely(!writeback_in_progress(bdi)))
> + if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
> bdi_start_background_writeback(bdi);
This can then go away.

> /*
> @@ -1296,19 +1395,24 @@ static void balance_dirty_pages(struct address_space *mapping,
> bdi_stat(bdi, BDI_WRITEBACK);
> }
>
> + if (unlikely(!writeback_in_progress(bdi)) &&
> + bdi_dirty > bdi_thresh / 4)
> + bdi_start_background_writeback(bdi);
> +
Why is this?

> dirty_exceeded = (bdi_dirty > bdi_thresh) &&
> - (nr_dirty > dirty_thresh);
> + ((nr_dirty > dirty_thresh) || strictlimit);
> if (dirty_exceeded && !bdi->dirty_exceeded)
> bdi->dirty_exceeded = 1;
>
> bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
> nr_dirty, bdi_thresh, bdi_dirty,
> - start_time);
> + start_time, strictlimit);
>
> dirty_ratelimit = bdi->dirty_ratelimit;
> pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
> background_thresh, nr_dirty,
> - bdi_thresh, bdi_dirty);
> + bdi_thresh, bdi_dirty,
> + strictlimit);
> task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
> RATELIMIT_CALC_SHIFT;
> max_pause = bdi_max_pause(bdi, bdi_dirty);
> @@ -1362,6 +1466,8 @@ static void balance_dirty_pages(struct address_space *mapping,
> }
>
> pause:
> + if (unlikely(!writeback_in_progress(bdi)))
> + bdi_start_background_writeback(bdi);
> trace_balance_dirty_pages(bdi,
> dirty_thresh,
> background_thresh,
And this shouldn't be necessary either after updating the freerun test
properly.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-07-05 13:15:57

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH] mm: strictlimit feature -v2

Hi Jan,

First of all, thanks a lot for the review, I highly appreciate it. I agree
with most of your comments; please see inline replies below.

To make further reviews easier, I'm going to send the next version in
the form of a small mm-only patchset (i.e. putting all that "fuse
writeback cache policy" stuff aside).

07/04/2013 03:16 AM, Jan Kara wrote:
> On Wed 03-07-13 15:01:19, Maxim Patlasov wrote:
>> 07/02/2013 11:38 PM, Andrew Morton wrote:
>>> On Tue, 02 Jul 2013 21:44:47 +0400 Maxim Patlasov <[email protected]> wrote:
>>>
>>>> From: Miklos Szeredi <[email protected]>
>>>>
>>>> The feature prevents mistrusted filesystems from growing a large number of
>>>> dirty pages before throttling. For such filesystems balance_dirty_pages
>>>> always checks bdi counters against bdi limits; i.e. even if global "nr_dirty"
>>>> is under "freerun", it's not allowed to skip bdi checks. The only use case
>>>> for now is fuse: it sets bdi max_ratio to 1% by default and system
>>>> administrators are supposed to expect that this limit won't be exceeded.
>>>>
>>>> The feature is on if address space is marked by AS_STRICTLIMIT flag.
>>>> A filesystem may set the flag when it initializes a new inode.
>>>>
>>>> Changed in v2 (thanks to Andrew Morton):
>>>> - added a few explanatory comments
>>>> - cleaned up the mess in backing_dev_info foo_stamp fields: now it's clearly
>>>> stated that bw_time_stamp is measured in jiffies; renamed other foo_stamp
>>>> fields to reflect that they are in units of number-of-pages.
>>>>
>>> Better, thanks.
>>>
>>> The writeback arithmetic makes my head spin - I'd really like Fengguang
>>> to go over this, please.
>>>
>>> A quick visit from the spelling police:
>> Great! Thank you, Andrew. I'll wait for Fengguang's feedback for a
>> while before respin.
> Sorry for the bad mail threading, but I've only noticed the thread now and
> I don't have the email with your patches in my mailbox anymore. Below is a
> review of your strictlimit patch. In principle, I'm OK with the idea (I
> even wanted to have a similar ability e.g. for NFS mounts) but I have some
> reservations regarding the implementation:
>
>> diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
>> index 4beb8e3..00a28af 100644
>> --- a/fs/fuse/inode.c
>> +++ b/fs/fuse/inode.c
>> @@ -305,6 +305,7 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
>> inode->i_flags |= S_NOCMTIME;
>> inode->i_generation = generation;
>> inode->i_data.backing_dev_info = &fc->bdi;
>> + set_bit(AS_STRICTLIMIT, &inode->i_data.flags);
>> fuse_init_inode(inode, attr);
>> unlock_new_inode(inode);
>> } else if ((inode->i_mode ^ attr->mode) & S_IFMT) {
> It seems wrong to use address space bits for this. Using BDI capabilities for
> this would look more appropriate. Sure, you couldn't then always tune this on
> a per-fs basis since filesystems can share a BDI (e.g. when they are on
> different partitions of one disk), but if several filesystems shared a BDI
> and some wanted 'strict' behavior while others didn't, I don't believe the
> resulting behavior would be sane - e.g. a non-strict fs could push the bdi
> over the per-bdi limit without any throttling while the strictly limited fs
> would be continuously stalled. Or am I missing a reason why this is set on
> the address space?
>
> As a bonus you don't have to pass the 'strictlimit' flag to writeback
> functions when it is a bdi flag.

Completely agree. The address-space flag was Miklos' idea; I should have
converted it to a bdi capability from the beginning.
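
For the record, the conversion I have in mind is tiny. A sketch only -- the
capability name and bit value below are made up, not an existing API:

/* include/linux/backing-dev.h: a new capability bit (hypothetical value) */
#define BDI_CAP_STRICTLIMIT	0x00000400

/* fs/fuse/inode.c: mark the whole connection instead of each address space */
fc->bdi.capabilities |= BDI_CAP_STRICTLIMIT;

/* mm/page-writeback.c: derive the flag locally instead of passing it down */
bool strictlimit = !!(bdi->capabilities & BDI_CAP_STRICTLIMIT);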

>
>> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
>> index c388155..6b12d01 100644
>> --- a/include/linux/backing-dev.h
>> +++ b/include/linux/backing-dev.h
>> @@ -33,6 +33,8 @@ enum bdi_state {
>> BDI_sync_congested, /* The sync queue is getting full */
>> BDI_registered, /* bdi_register() was done */
>> BDI_writeback_running, /* Writeback is in progress */
>> + BDI_idle, /* No pages under writeback at the moment of
>> + * last update of write bw */
>> BDI_unused, /* Available bits start here */
>> };
>>
>> @@ -41,8 +43,15 @@ typedef int (congested_fn)(void *, int);
>> enum bdi_stat_item {
>> BDI_RECLAIMABLE,
>> BDI_WRITEBACK,
>> - BDI_DIRTIED,
>> - BDI_WRITTEN,
>> +
>> + /*
>> + * The three counters below reflect the number of events of a specific
>> + * type that have happened since bdi_init(), as described below:
>> + */
>> + BDI_DIRTIED, /* a page was dirtied */
>> + BDI_WRITTEN, /* writeout completed for a page */
>> + BDI_WRITTEN_BACK, /* a page went to writeback */
>> +
>> NR_BDI_STAT_ITEMS
>> };
>>
>> @@ -73,9 +82,12 @@ struct backing_dev_info {
>>
>> struct percpu_counter bdi_stat[NR_BDI_STAT_ITEMS];
>>
>> - unsigned long bw_time_stamp; /* last time write bw is updated */
>> - unsigned long dirtied_stamp;
>> - unsigned long written_stamp; /* pages written at bw_time_stamp */
>> + unsigned long bw_time_stamp; /* last time (in jiffies) write bw
>> + * is updated */
>> + unsigned long dirtied_nr_stamp;
>> + unsigned long written_nr_stamp; /* pages written at bw_time_stamp */
>> + unsigned long writeback_nr_stamp; /* pages sent to writeback at
>> + * bw_time_stamp */
>> unsigned long write_bandwidth; /* the estimated write bandwidth */
>> unsigned long avg_write_bandwidth; /* further smoothed write bw */
>>
>> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
>> index 4514ad7..83c7434 100644
>> --- a/mm/page-writeback.c
>> +++ b/mm/page-writeback.c
>> @@ -680,28 +712,55 @@ static unsigned long bdi_position_ratio(struct backing_dev_info *bdi,
>> return 0;
>>
>> /*
>> - * global setpoint
>> + * The strictlimit feature is a tool preventing mistrusted filesystems
>> + * from growing a large number of dirty pages before throttling. For
>> + * such filesystems balance_dirty_pages always checks bdi counters
>> + * against bdi limits, even if global "nr_dirty" is under "freerun".
>> + * This is especially important for fuse, which sets bdi->max_ratio to
>> + * 1% by default. Without the strictlimit feature, fuse writeback may
>> + * consume an arbitrary amount of RAM because it is accounted in
>> + * NR_WRITEBACK_TEMP which is not involved in calculating "nr_dirty".
>> *
>> - * setpoint - dirty 3
>> - * f(dirty) := 1.0 + (----------------)
>> - * limit - setpoint
>> + * Here, in bdi_position_ratio(), we calculate pos_ratio based on
>> + * two values: bdi_dirty and bdi_thresh. Let's consider an example:
>> + * total amount of RAM is 16GB, bdi->max_ratio is equal to 1%, global
>> + * limits are set by default to 10% and 20% (background and throttle).
>> + * Then bdi_thresh is 1% of 20% of 16GB. This amounts to ~8K pages.
>> + * bdi_dirty_limit(bdi, bg_thresh) is about ~4K pages. bdi_setpoint is
>> + * about ~6K pages (as the average of background and throttle bdi
>> + * limits). The 3rd order polynomial will provide positive feedback if
>> + * bdi_dirty is under bdi_setpoint and vice versa.
>> *
>> - * it's a 3rd order polynomial that subjects to
>> + * Note that we cannot use global counters in these calculations
>> + * because we want to throttle a process writing to a strictlimit
>> + * address space much earlier than the global "freerun" is reached
>> + * (~23MB vs. ~2.3GB in the example above).
>> + */
>> + if (unlikely(strictlimit)) {
>> + if (bdi_dirty < 8)
>> + return 2 << RATELIMIT_CALC_SHIFT;
>> +
>> + if (bdi_dirty >= bdi_thresh)
>> + return 0;
>> +
>> + bdi_setpoint = bdi_thresh + bdi_dirty_limit(bdi, bg_thresh);
>> + bdi_setpoint /= 2;
>> +
>> + if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
>> + return 0;
>> +
>> + pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
>> + bdi_thresh);
>> + return min_t(long long, pos_ratio, 2 << RATELIMIT_CALC_SHIFT);
>> + }
> But if the global limits are exceeded while a BDI in strict mode is below its
> limit, this would allow dirtying on that BDI, which seems wrong. Also, the
> logic in bdi_position_ratio() is already supposed to take bdi limits into
> account (although the math is somewhat convoluted), so you shouldn't have to
> touch it. Only maybe the increasing of bdi_thresh to (limit - dirty) / 8 might
> be too much for strict limiting, so that may need some tweaking (although
> setting it at 8 pages as your patch does seems *too* strict to me).

I agree that if the global nr_dirty comes close to the global limit (or has
already exceeded it), we must take the global pos_ratio into consideration
(even if bdi_dirty << bdi_thresh). But I don't think we can keep
bdi_position_ratio() intact. Because: 1) the math there goes crazy if
dirty < freerun; 2) "(limit - dirty) / 8" is useless for FUSE, which accounts
internal writeback in NR_WRITEBACK_TEMP; i.e. you can easily observe "dirty"
close to zero while bdi_dirty is huge. Please let me know if you need more
details about the NR_WRITEBACK_TEMP peculiarities.
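
So I'd keep a strictlimit branch, but make it respect the global state, e.g.
by clamping the bdi pos_ratio with the global one. A rough, untested sketch,
assuming the global pos_ratio has already been computed above this branch:

	if (unlikely(strictlimit)) {
		long long bdi_pos_ratio;

		if (bdi_dirty < 8)
			return min_t(long long, pos_ratio * 2,
				     2 << RATELIMIT_CALC_SHIFT);

		if (bdi_dirty >= bdi_thresh)
			return 0;

		bdi_setpoint = (bdi_thresh +
				bdi_dirty_limit(bdi, bg_thresh)) / 2;
		if (bdi_setpoint == 0 || bdi_setpoint == bdi_thresh)
			return 0;

		bdi_pos_ratio = pos_ratio_polynom(bdi_setpoint, bdi_dirty,
						  bdi_thresh);
		/* honor both limits: never dirty faster than the global
		 * state allows, never exceed the per-bdi limit */
		return min(pos_ratio, bdi_pos_ratio);
	}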

>
>> +
>> + /*
>> + * global setpoint
>> *
>> - * (1) f(freerun) = 2.0 => rampup dirty_ratelimit reasonably fast
>> - * (2) f(setpoint) = 1.0 => the balance point
>> - * (3) f(limit) = 0 => the hard limit
>> - * (4) df/dx <= 0 => negative feedback control
>> - * (5) the closer to setpoint, the smaller |df/dx| (and the reverse)
>> - * => fast response on large errors; small oscillation near setpoint
>> + * See comment for pos_ratio_polynom().
>> */
>> setpoint = (freerun + limit) / 2;
>> - x = div_s64(((s64)setpoint - (s64)dirty) << RATELIMIT_CALC_SHIFT,
>> - limit - setpoint + 1);
>> - pos_ratio = x;
>> - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
>> - pos_ratio = pos_ratio * x >> RATELIMIT_CALC_SHIFT;
>> - pos_ratio += 1 << RATELIMIT_CALC_SHIFT;
>> + pos_ratio = pos_ratio_polynom(setpoint, dirty, limit);
>>
>> /*
>> * We have computed basic pos_ratio above based on global situation. If
>> @@ -892,7 +951,8 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>> unsigned long bdi_thresh,
>> unsigned long bdi_dirty,
>> unsigned long dirtied,
>> - unsigned long elapsed)
>> + unsigned long elapsed,
>> + bool strictlimit)
>> {
>> unsigned long freerun = dirty_freerun_ceiling(thresh, bg_thresh);
>> unsigned long limit = hard_dirty_limit(thresh);
>> @@ -910,10 +970,10 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>> * The dirty rate will match the writeout rate in long term, except
>> * when dirty pages are truncated by userspace or re-dirtied by FS.
>> */
>> - dirty_rate = (dirtied - bdi->dirtied_stamp) * HZ / elapsed;
>> + dirty_rate = (dirtied - bdi->dirtied_nr_stamp) * HZ / elapsed;
>>
>> pos_ratio = bdi_position_ratio(bdi, thresh, bg_thresh, dirty,
>> - bdi_thresh, bdi_dirty);
>> + bdi_thresh, bdi_dirty, strictlimit);
>> /*
>> * task_ratelimit reflects each dd's dirty rate for the past 200ms.
>> */
>> @@ -994,6 +1054,26 @@ static void bdi_update_dirty_ratelimit(struct backing_dev_info *bdi,
>> * keep that period small to reduce time lags).
>> */
>> step = 0;
>> +
>> + /*
>> + * For strictlimit case, balanced_dirty_ratelimit was calculated
>> + * above based on bdi counters and limits (see bdi_position_ratio()).
>> + * Hence, to calculate "step" properly, we have to use bdi_dirty as
>> + * "dirty" and bdi_setpoint as "setpoint".
>> + *
>> + * We rampup dirty_ratelimit forcibly if bdi_dirty is low because
>> + * it's possible that bdi_thresh is close to zero due to inactivity
>> + * of backing device (see the implementation of bdi_dirty_limit()).
>> + */
>> + if (unlikely(strictlimit)) {
>> + dirty = bdi_dirty;
>> + if (bdi_dirty < 8)
>> + setpoint = bdi_dirty + 1;
>> + else
>> + setpoint = (bdi_thresh +
>> + bdi_dirty_limit(bdi, bg_thresh)) / 2;
>> + }
>> +
>> if (dirty < setpoint) {
>> x = min(bdi->balanced_dirty_ratelimit,
>> min(balanced_dirty_ratelimit, task_ratelimit));
>> @@ -1034,12 +1114,14 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>> unsigned long dirty,
>> unsigned long bdi_thresh,
>> unsigned long bdi_dirty,
>> - unsigned long start_time)
>> + unsigned long start_time,
>> + bool strictlimit)
>> {
>> unsigned long now = jiffies;
>> unsigned long elapsed = now - bdi->bw_time_stamp;
>> unsigned long dirtied;
>> unsigned long written;
>> + unsigned long writeback;
>>
>> /*
>> * rate-limit, only update once every 200ms.
>> @@ -1049,6 +1131,7 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>>
>> dirtied = percpu_counter_read(&bdi->bdi_stat[BDI_DIRTIED]);
>> written = percpu_counter_read(&bdi->bdi_stat[BDI_WRITTEN]);
>> + writeback = bdi_stat_sum(bdi, BDI_WRITTEN_BACK);
>>
>> /*
>> * Skip quiet periods when disk bandwidth is under-utilized.
>> @@ -1057,18 +1140,32 @@ void __bdi_update_bandwidth(struct backing_dev_info *bdi,
>> if (elapsed > HZ && time_before(bdi->bw_time_stamp, start_time))
>> goto snapshot;
>>
>> + /*
>> + * Skip periods when the backing dev was idle due to absence of pages
>> + * under writeback (when over_bground_thresh() returns false)
>> + */
>> + if (test_bit(BDI_idle, &bdi->state) &&
>> + bdi->writeback_nr_stamp == writeback)
>> + goto snapshot;
>> +
> Hum, I understand the desire behind BDI_idle but that seems to be solving
> only a special case, doesn't it? When there is small traffic on the bdi, the
> bandwidth will get updated anyway and the computed bandwidth will go down
> from the real maximum value when there are enough pages to write. So I'm not
> sure how much this really helps. Plus this 'idle' logic seems completely
> independent of the balance_dirty_pages() tweaks, so it would be better done
> as a separate patch (in case you have convincing reasons we really need that
> logic).

Yes, I'll move it to a separate patch. The rationale behind BDI_idle looks
like this:

> BDI_idle along with BDI_WRITTEN_BACK exists to distinguish two cases:
>
> 1st. BDI_WRITTEN has not been incremented since we looked at it last
> time because the backing dev is unresponsive, i.e. it had some pages under
> writeback but has not made any progress for some reason.
>
> 2nd. BDI_WRITTEN has not been incremented since we looked at it last
> time because the backing dev had nothing to do, i.e. there are some dirty
> pages on the bdi, but they have not been passed to the backing dev yet. This
> is the case when bdi_dirty is under the bdi background threshold and the
> flusher refrains from flushing even if we woke it up explicitly by
> bdi_start_background_writeback.
>
> We have to skip bdi_update_write_bandwidth() in the 2nd case because
> otherwise bdi_update_write_bandwidth() will see written==0 and
> mistakenly decrease write_bandwidth. The criterion to skip is the
> following: BDI_idle is set (i.e. there were no pages under writeback
> when we looked at the bdi last time) && BDI_WRITTEN_BACK counter has not
> changed (i.e. no new pages have been sent to writeback since we looked at
> the bdi last time).

Note that BDI_idle is useful right now only for the strictlimit feature,
because w/o strictlimit we do not call bdi_update_write_bandwidth()
until "dirty" exceeds "freerun", and over_bground_thresh() guarantees
that the flusher has passed some amount of work to the backing dev by then.

>
>> if (thresh) {
>> global_update_bandwidth(thresh, dirty, now);
>> bdi_update_dirty_ratelimit(bdi, thresh, bg_thresh, dirty,
>> bdi_thresh, bdi_dirty,
>> - dirtied, elapsed);
>> + dirtied, elapsed, strictlimit);
>> }
>> bdi_update_write_bandwidth(bdi, elapsed, written);
>>
>> snapshot:
>> - bdi->dirtied_stamp = dirtied;
>> - bdi->written_stamp = written;
>> + bdi->dirtied_nr_stamp = dirtied;
>> + bdi->written_nr_stamp = written;
>> bdi->bw_time_stamp = now;
>> +
>> + bdi->writeback_nr_stamp = writeback;
>> + if (bdi_stat_sum(bdi, BDI_WRITEBACK) == 0)
>> + set_bit(BDI_idle, &bdi->state);
>> + else
>> + clear_bit(BDI_idle, &bdi->state);
>> }
>>
>> static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>> @@ -1077,13 +1174,14 @@ static void bdi_update_bandwidth(struct backing_dev_info *bdi,
>> unsigned long dirty,
>> unsigned long bdi_thresh,
>> unsigned long bdi_dirty,
>> - unsigned long start_time)
>> + unsigned long start_time,
>> + bool strictlimit)
>> {
>> if (time_is_after_eq_jiffies(bdi->bw_time_stamp + BANDWIDTH_INTERVAL))
>> return;
>> spin_lock(&bdi->wb.list_lock);
>> __bdi_update_bandwidth(bdi, thresh, bg_thresh, dirty,
>> - bdi_thresh, bdi_dirty, start_time);
>> + bdi_thresh, bdi_dirty, start_time, strictlimit);
>> spin_unlock(&bdi->wb.list_lock);
>> }
>>
>> @@ -1226,6 +1324,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>> unsigned long dirty_ratelimit;
>> unsigned long pos_ratio;
>> struct backing_dev_info *bdi = mapping->backing_dev_info;
>> + bool strictlimit = test_bit(AS_STRICTLIMIT, &mapping->flags);
>> unsigned long start_time = jiffies;
>>
>> for (;;) {
>> @@ -1250,7 +1349,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>> */
>> freerun = dirty_freerun_ceiling(dirty_thresh,
>> background_thresh);
>> - if (nr_dirty <= freerun) {
>> + if (nr_dirty <= freerun && !strictlimit) {
>> current->dirty_paused_when = now;
>> current->nr_dirtied = 0;
>> current->nr_dirtied_pause =
> I'd rather change this to check bdi_dirty <= bdi_freerun in strictlimit
> case.

An excellent idea, thank you very much! Some complication comes from the
fact that bdi_dirty and bdi_freerun have not been calculated yet at that
point, but we can check bdi_dirty <= bdi_freerun a bit later -- that
shouldn't make a big difference.
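
Something along these lines, perhaps -- a sketch only, placed in the
balance_dirty_pages() loop after bdi_thresh and bdi_dirty have been computed,
mirroring the global freerun check above it:

		if (strictlimit) {
			unsigned long bdi_bg_thresh =
				bdi_dirty_limit(bdi, background_thresh);
			unsigned long bdi_freerun =
				dirty_freerun_ceiling(bdi_thresh,
						      bdi_bg_thresh);

			if (bdi_dirty <= bdi_freerun) {
				current->dirty_paused_when = now;
				current->nr_dirtied = 0;
				current->nr_dirtied_pause =
					dirty_poll_interval(nr_dirty,
							    dirty_thresh);
				break;
			}
		}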

>
>> @@ -1258,7 +1357,7 @@ static void balance_dirty_pages(struct address_space *mapping,
>> break;
>> }
>>
>> - if (unlikely(!writeback_in_progress(bdi)))
>> + if (unlikely(!writeback_in_progress(bdi)) && !strictlimit)
>> bdi_start_background_writeback(bdi);
> This can then go away.

Yes.

>
>> /*
>> @@ -1296,19 +1395,24 @@ static void balance_dirty_pages(struct address_space *mapping,
>> bdi_stat(bdi, BDI_WRITEBACK);
>> }
>>
>> + if (unlikely(!writeback_in_progress(bdi)) &&
>> + bdi_dirty > bdi_thresh / 4)
>> + bdi_start_background_writeback(bdi);
>> +
> Why is this?

We attempted to avoid kicking the flusher every time we dive into
balance_dirty_pages(). This will surely go away.

>
>> dirty_exceeded = (bdi_dirty > bdi_thresh) &&
>> - (nr_dirty > dirty_thresh);
>> + ((nr_dirty > dirty_thresh) || strictlimit);
>> if (dirty_exceeded && !bdi->dirty_exceeded)
>> bdi->dirty_exceeded = 1;
>>
>> bdi_update_bandwidth(bdi, dirty_thresh, background_thresh,
>> nr_dirty, bdi_thresh, bdi_dirty,
>> - start_time);
>> + start_time, strictlimit);
>>
>> dirty_ratelimit = bdi->dirty_ratelimit;
>> pos_ratio = bdi_position_ratio(bdi, dirty_thresh,
>> background_thresh, nr_dirty,
>> - bdi_thresh, bdi_dirty);
>> + bdi_thresh, bdi_dirty,
>> + strictlimit);
>> task_ratelimit = ((u64)dirty_ratelimit * pos_ratio) >>
>> RATELIMIT_CALC_SHIFT;
>> max_pause = bdi_max_pause(bdi, bdi_dirty);
>> @@ -1362,6 +1466,8 @@ static void balance_dirty_pages(struct address_space *mapping,
>> }
>>
>> pause:
>> + if (unlikely(!writeback_in_progress(bdi)))
>> + bdi_start_background_writeback(bdi);
>> trace_balance_dirty_pages(bdi,
>> dirty_thresh,
>> background_thresh,
> And this shouldn't be necessary either after updating the freerun test
> properly.

Yep!

Thanks,
Maxim

2013-07-11 16:15:13

by Maxim Patlasov

[permalink] [raw]
Subject: [PATCH 07/16] fuse: Trust kernel i_mtime only -v2

Let the kernel maintain i_mtime locally:
- clear S_NOCMTIME
- implement i_op->update_time()
- flush mtime on fsync and last close
- update i_mtime explicitly on truncate and fallocate

The fuse inode flag FUSE_I_MTIME_DIRTY serves as an indication that the local
i_mtime should eventually be flushed to the server.

Changed in v2 (thanks to Brian):
- renamed FUSE_I_MTIME_UPDATED to FUSE_I_MTIME_DIRTY
- simplified fuse_set_mtime_local()
- abandoned optimizations: clearing the flag on some operations (like
direct write) is doable, but may lead to unexpected changes of
user-visible mtime.

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/dir.c | 109 ++++++++++++++++++++++++++++++++++++++++++++++++------
fs/fuse/file.c | 30 +++++++++++++--
fs/fuse/fuse_i.h | 6 ++-
fs/fuse/inode.c | 13 +++++-
4 files changed, 138 insertions(+), 20 deletions(-)

diff --git a/fs/fuse/dir.c b/fs/fuse/dir.c
index b755884..b834e46 100644
--- a/fs/fuse/dir.c
+++ b/fs/fuse/dir.c
@@ -854,8 +854,11 @@ static void fuse_fillattr(struct inode *inode, struct fuse_attr *attr,
struct fuse_conn *fc = get_fuse_conn(inode);

/* see the comment in fuse_change_attributes() */
- if (fc->writeback_cache && S_ISREG(inode->i_mode))
+ if (fc->writeback_cache && S_ISREG(inode->i_mode)) {
attr->size = i_size_read(inode);
+ attr->mtime = inode->i_mtime.tv_sec;
+ attr->mtimensec = inode->i_mtime.tv_nsec;
+ }

stat->dev = inode->i_sb->s_dev;
stat->ino = attr->ino;
@@ -1565,6 +1568,82 @@ void fuse_release_nowrite(struct inode *inode)
spin_unlock(&fc->lock);
}

+static void fuse_setattr_fill(struct fuse_conn *fc, struct fuse_req *req,
+ struct inode *inode,
+ struct fuse_setattr_in *inarg_p,
+ struct fuse_attr_out *outarg_p)
+{
+ req->in.h.opcode = FUSE_SETATTR;
+ req->in.h.nodeid = get_node_id(inode);
+ req->in.numargs = 1;
+ req->in.args[0].size = sizeof(*inarg_p);
+ req->in.args[0].value = inarg_p;
+ req->out.numargs = 1;
+ if (fc->minor < 9)
+ req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
+ else
+ req->out.args[0].size = sizeof(*outarg_p);
+ req->out.args[0].value = outarg_p;
+}
+
+/*
+ * Flush inode->i_mtime to the server
+ */
+int fuse_flush_mtime(struct file *file, bool nofail)
+{
+ struct inode *inode = file->f_mapping->host;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_req *req = NULL;
+ struct fuse_setattr_in inarg;
+ struct fuse_attr_out outarg;
+ int err;
+
+ if (nofail) {
+ req = fuse_get_req_nofail_nopages(fc, file);
+ } else {
+ req = fuse_get_req_nopages(fc);
+ if (IS_ERR(req))
+ return PTR_ERR(req);
+ }
+
+ memset(&inarg, 0, sizeof(inarg));
+ memset(&outarg, 0, sizeof(outarg));
+
+ inarg.valid |= FATTR_MTIME;
+ inarg.mtime = inode->i_mtime.tv_sec;
+ inarg.mtimensec = inode->i_mtime.tv_nsec;
+
+ fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
+ fuse_request_send(fc, req);
+ err = req->out.h.error;
+ fuse_put_request(fc, req);
+
+ if (!err)
+ clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+
+ return err;
+}
+
+/*
+ * S_NOCMTIME is clear, so we need to update inode->i_mtime manually.
+ */
+static void fuse_set_mtime_local(struct iattr *iattr, struct inode *inode)
+{
+ unsigned ivalid = iattr->ia_valid;
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ if ((ivalid & ATTR_MTIME) && (ivalid & ATTR_MTIME_SET)) {
+ /* client fs has just set mtime to iattr->ia_mtime */
+ inode->i_mtime = iattr->ia_mtime;
+ clear_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+ } else if ((ivalid & ATTR_MTIME) || (ivalid & ATTR_SIZE)) {
+ /* client fs doesn't know that we're updating i_mtime */
+ inode->i_mtime = current_fs_time(inode->i_sb);
+ set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+ }
+}
+
/*
* Set attributes, and at the same time refresh them.
*
@@ -1621,17 +1700,7 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
inarg.valid |= FATTR_LOCKOWNER;
inarg.lock_owner = fuse_lock_owner_id(fc, current->files);
}
- req->in.h.opcode = FUSE_SETATTR;
- req->in.h.nodeid = get_node_id(inode);
- req->in.numargs = 1;
- req->in.args[0].size = sizeof(inarg);
- req->in.args[0].value = &inarg;
- req->out.numargs = 1;
- if (fc->minor < 9)
- req->out.args[0].size = FUSE_COMPAT_ATTR_OUT_SIZE;
- else
- req->out.args[0].size = sizeof(outarg);
- req->out.args[0].value = &outarg;
+ fuse_setattr_fill(fc, req, inode, &inarg, &outarg);
fuse_request_send(fc, req);
err = req->out.h.error;
fuse_put_request(fc, req);
@@ -1648,6 +1717,10 @@ int fuse_do_setattr(struct inode *inode, struct iattr *attr,
}

spin_lock(&fc->lock);
+ /* the kernel maintains i_mtime locally */
+ if (fc->writeback_cache && S_ISREG(inode->i_mode))
+ fuse_set_mtime_local(attr, inode);
+
fuse_change_attributes_common(inode, &outarg.attr,
attr_timeout(&outarg));
oldsize = inode->i_size;
@@ -1872,6 +1945,17 @@ static int fuse_removexattr(struct dentry *entry, const char *name)
return err;
}

+static int fuse_update_time(struct inode *inode, struct timespec *now,
+ int flags)
+{
+ if (flags & S_MTIME) {
+ inode->i_mtime = *now;
+ set_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state);
+ BUG_ON(!S_ISREG(inode->i_mode));
+ }
+ return 0;
+}
+
static const struct inode_operations fuse_dir_inode_operations = {
.lookup = fuse_lookup,
.mkdir = fuse_mkdir,
@@ -1911,6 +1995,7 @@ static const struct inode_operations fuse_common_inode_operations = {
.getxattr = fuse_getxattr,
.listxattr = fuse_listxattr,
.removexattr = fuse_removexattr,
+ .update_time = fuse_update_time,
};

static const struct inode_operations fuse_symlink_inode_operations = {
diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 98bc0d0..09a9eeb 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,9 @@ static int fuse_open(struct inode *inode, struct file *file)

static int fuse_release(struct inode *inode, struct file *file)
{
+ if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
+ fuse_flush_mtime(file, true);
+
fuse_release_common(file, FUSE_RELEASE);

/* return value is ignored by VFS */
@@ -458,6 +461,12 @@ int fuse_fsync_common(struct file *file, loff_t start, loff_t end,

fuse_sync_writes(inode);

+ if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state)) {
+ int err = fuse_flush_mtime(file, false);
+ if (err)
+ goto out;
+ }
+
req = fuse_get_req_nopages(fc);
if (IS_ERR(req)) {
err = PTR_ERR(req);
@@ -948,16 +957,21 @@ static size_t fuse_send_write(struct fuse_req *req, struct fuse_io_priv *io,
return req->misc.write.out.size;
}

-void fuse_write_update_size(struct inode *inode, loff_t pos)
+bool fuse_write_update_size(struct inode *inode, loff_t pos)
{
struct fuse_conn *fc = get_fuse_conn(inode);
struct fuse_inode *fi = get_fuse_inode(inode);
+ bool ret = false;

spin_lock(&fc->lock);
fi->attr_version = ++fc->attr_version;
- if (pos > inode->i_size)
+ if (pos > inode->i_size) {
i_size_write(inode, pos);
+ ret = true;
+ }
spin_unlock(&fc->lock);
+
+ return ret;
}

static size_t fuse_send_write_pages(struct fuse_req *req, struct file *file,
@@ -2553,8 +2567,16 @@ static long fuse_file_fallocate(struct file *file, int mode, loff_t offset,
goto out;

/* we could have extended the file */
- if (!(mode & FALLOC_FL_KEEP_SIZE))
- fuse_write_update_size(inode, offset + length);
+ if (!(mode & FALLOC_FL_KEEP_SIZE)) {
+ bool changed = fuse_write_update_size(inode, offset + length);
+
+ if (changed && fc->writeback_cache) {
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ inode->i_mtime = current_fs_time(inode->i_sb);
+ set_bit(FUSE_I_MTIME_DIRTY, &fi->state);
+ }
+ }

if (mode & FALLOC_FL_PUNCH_HOLE)
truncate_pagecache_range(inode, offset, offset + length - 1);
diff --git a/fs/fuse/fuse_i.h b/fs/fuse/fuse_i.h
index a0062a0..c192e6f 100644
--- a/fs/fuse/fuse_i.h
+++ b/fs/fuse/fuse_i.h
@@ -115,6 +115,8 @@ struct fuse_inode {
enum {
/** Advise readdirplus */
FUSE_I_ADVISE_RDPLUS,
+ /** i_mtime has been updated locally; a flush to userspace needed */
+ FUSE_I_MTIME_DIRTY,
};

struct fuse_conn;
@@ -867,7 +869,9 @@ long fuse_ioctl_common(struct file *file, unsigned int cmd,
unsigned fuse_file_poll(struct file *file, poll_table *wait);
int fuse_dev_release(struct inode *inode, struct file *file);

-void fuse_write_update_size(struct inode *inode, loff_t pos);
+bool fuse_write_update_size(struct inode *inode, loff_t pos);
+
+int fuse_flush_mtime(struct file *file, bool nofail);

int fuse_do_setattr(struct inode *inode, struct iattr *attr,
struct file *file);
diff --git a/fs/fuse/inode.c b/fs/fuse/inode.c
index 121638d..591b46c 100644
--- a/fs/fuse/inode.c
+++ b/fs/fuse/inode.c
@@ -170,8 +170,11 @@ void fuse_change_attributes_common(struct inode *inode, struct fuse_attr *attr,
inode->i_blocks = attr->blocks;
inode->i_atime.tv_sec = attr->atime;
inode->i_atime.tv_nsec = attr->atimensec;
- inode->i_mtime.tv_sec = attr->mtime;
- inode->i_mtime.tv_nsec = attr->mtimensec;
+ /* mtime from server may be stale due to local buffered write */
+ if (!fc->writeback_cache || !S_ISREG(inode->i_mode)) {
+ inode->i_mtime.tv_sec = attr->mtime;
+ inode->i_mtime.tv_nsec = attr->mtimensec;
+ }
inode->i_ctime.tv_sec = attr->ctime;
inode->i_ctime.tv_nsec = attr->ctimensec;

@@ -249,6 +252,8 @@ static void fuse_init_inode(struct inode *inode, struct fuse_attr *attr)
{
inode->i_mode = attr->mode & S_IFMT;
inode->i_size = attr->size;
+ inode->i_mtime.tv_sec = attr->mtime;
+ inode->i_mtime.tv_nsec = attr->mtimensec;
if (S_ISREG(inode->i_mode)) {
fuse_init_common(inode);
fuse_init_file_inode(inode);
@@ -295,7 +300,9 @@ struct inode *fuse_iget(struct super_block *sb, u64 nodeid,
return NULL;

if ((inode->i_state & I_NEW)) {
- inode->i_flags |= S_NOATIME|S_NOCMTIME;
+ inode->i_flags |= S_NOATIME;
+ if (!fc->writeback_cache)
+ inode->i_flags |= S_NOCMTIME;
inode->i_generation = generation;
inode->i_data.backing_dev_info = &fc->bdi;
fuse_init_inode(inode, attr);

2013-07-11 16:18:50

by Maxim Patlasov

[permalink] [raw]
Subject: [PATCH 08/16] fuse: Flush files on wb close -v2

From: Pavel Emelyanov <[email protected]>

Any write request requires a file handle to report to the userspace. Thus
when we close a file (and free the fuse_file with this info) we have to
flush all the outstanding dirty pages.

filemap_write_and_wait() is enough because every page under fuse writeback
is accounted in ff->count. This delays actual close until all fuse wb is
completed.

In case the "write cache" is turned off, the flush is ensured by fuse_vma_close().

Changed in v2:
- updated patch to be applied smoothly on top of
"Trust kernel i_mtime only -v2".

Signed-off-by: Maxim Patlasov <[email protected]>
---
fs/fuse/file.c | 6 ++++++
1 files changed, 6 insertions(+), 0 deletions(-)

diff --git a/fs/fuse/file.c b/fs/fuse/file.c
index 09a9eeb..3778de1 100644
--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -291,6 +291,12 @@ static int fuse_open(struct inode *inode, struct file *file)

static int fuse_release(struct inode *inode, struct file *file)
{
+ struct fuse_conn *fc = get_fuse_conn(inode);
+
+ /* see fuse_vma_close() for !writeback_cache case */
+ if (fc->writeback_cache)
+ filemap_write_and_wait(file->f_mapping);
+
if (test_bit(FUSE_I_MTIME_DIRTY, &get_fuse_inode(inode)->state))
fuse_flush_mtime(file, true);

2013-07-19 16:50:43

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
> From: Pavel Emelyanov <[email protected]>
>
> The .writepages one is required to make each writeback request carry more than
> one page on it. The patch enables optimized behaviour unconditionally,
> i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.

I rewrote this a bit, so we won't have to do the thing in two passes, which
makes it simpler and more robust. Waiting for page writeback here is wrong
anyway, see comment above fuse_page_mkwrite(). BTW we had a race there because
fuse_page_mkwrite() didn't take the page lock. I've also fixed that up and
pushed a series containing these patches up to implementing ->writepages() to

git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git writepages

Passed some trivial testing but more is needed.

I'll get to the rest of the patches next week.

Thanks,
Miklos


Subject: fuse: Implement writepages callback
From: Pavel Emelyanov <[email protected]>
Date: Sat, 29 Jun 2013 21:45:29 +0400

The .writepages one is required to make each writeback request carry more than
one page on it. The patch enables optimized behaviour unconditionally,
i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.

[SzM: simplify, add comments]

Signed-off-by: Maxim Patlasov <[email protected]>
Signed-off-by: Miklos Szeredi <[email protected]>
---
fs/fuse/file.c | 139 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++
1 file changed, 139 insertions(+)

--- a/fs/fuse/file.c
+++ b/fs/fuse/file.c
@@ -1579,6 +1579,144 @@ static int fuse_writepage(struct page *p
return err;
}

+struct fuse_fill_wb_data {
+ struct fuse_req *req;
+ struct fuse_file *ff;
+ struct inode *inode;
+};
+
+static void fuse_writepages_send(struct fuse_fill_wb_data *data)
+{
+ struct fuse_req *req = data->req;
+ struct inode *inode = data->inode;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ req->ff = fuse_file_get(data->ff);
+ spin_lock(&fc->lock);
+ list_add_tail(&req->list, &fi->queued_writes);
+ fuse_flush_writepages(inode);
+ spin_unlock(&fc->lock);
+}
+
+static int fuse_writepages_fill(struct page *page,
+ struct writeback_control *wbc, void *_data)
+{
+ struct fuse_fill_wb_data *data = _data;
+ struct fuse_req *req = data->req;
+ struct inode *inode = data->inode;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct page *tmp_page;
+ int err;
+
+ if (req) {
+ BUG_ON(!req->num_pages);
+ if (req->num_pages == FUSE_MAX_PAGES_PER_REQ ||
+ (req->num_pages + 1) * PAGE_CACHE_SIZE > fc->max_write ||
+ req->pages[req->num_pages - 1]->index + 1 != page->index) {
+
+ fuse_writepages_send(data);
+ data->req = NULL;
+ }
+ }
+ err = -ENOMEM;
+ tmp_page = alloc_page(GFP_NOFS | __GFP_HIGHMEM);
+ if (!tmp_page)
+ goto out_unlock;
+
+ /*
+ * The page must not be redirtied until the writeout is completed
+ * (i.e. userspace has sent a reply to the write request). Otherwise
+ * there could be more than one temporary page instance for each real
+ * page.
+ *
+ * This is ensured by holding the page lock in page_mkwrite() while
+ * checking fuse_page_is_writeback(). We already hold the page lock
+ * since clear_page_dirty_for_io() and keep it held until we add the
+ * request to the fi->writepages list and increment req->num_pages.
+ * After this fuse_page_is_writeback() will indicate that the page is
+ * under writeback, so we can release the page lock.
+ */
+ if (data->req == NULL) {
+ struct fuse_inode *fi = get_fuse_inode(inode);
+
+ err = -ENOMEM;
+ req = fuse_request_alloc_nofs(FUSE_MAX_PAGES_PER_REQ);
+ if (!req) {
+ __free_page(tmp_page);
+ goto out_unlock;
+ }
+
+ fuse_write_fill(req, data->ff, page_offset(page), 0);
+ req->misc.write.in.write_flags |= FUSE_WRITE_CACHE;
+ req->in.argpages = 1;
+ req->background = 1;
+ req->num_pages = 0;
+ req->end = fuse_writepage_end;
+ req->inode = inode;
+
+ spin_lock(&fc->lock);
+ list_add(&req->writepages_entry, &fi->writepages);
+ spin_unlock(&fc->lock);
+
+ data->req = req;
+ }
+ set_page_writeback(page);
+
+ copy_highpage(tmp_page, page);
+ req->pages[req->num_pages] = tmp_page;
+ req->page_descs[req->num_pages].offset = 0;
+ req->page_descs[req->num_pages].length = PAGE_SIZE;
+
+ inc_bdi_stat(page->mapping->backing_dev_info, BDI_WRITEBACK);
+ inc_zone_page_state(tmp_page, NR_WRITEBACK_TEMP);
+ end_page_writeback(page);
+
+ /*
+ * Protected by fc->lock against concurrent access by
+ * fuse_page_is_writeback().
+ */
+ spin_lock(&fc->lock);
+ req->num_pages++;
+ spin_unlock(&fc->lock);
+
+ err = 0;
+out_unlock:
+ unlock_page(page);
+
+ return err;
+}
+
+static int fuse_writepages(struct address_space *mapping,
+ struct writeback_control *wbc)
+{
+ struct inode *inode = mapping->host;
+ struct fuse_conn *fc = get_fuse_conn(inode);
+ struct fuse_fill_wb_data data;
+ int err;
+
+ err = -EIO;
+ if (is_bad_inode(inode))
+ goto out;
+
+ data.req = NULL;
+ data.inode = inode;
+ data.ff = fuse_write_file(fc, get_fuse_inode(inode));
+ if (!data.ff)
+ goto out;
+
+ err = write_cache_pages(mapping, wbc, fuse_writepages_fill, &data);
+ if (data.req) {
+ /* Ignore errors if we can write at least one page */
+ BUG_ON(!data.req->num_pages);
+ fuse_writepages_send(&data);
+ err = 0;
+ }
+ fuse_file_put(data.ff, false);
+out:
+ return err;
+}
+
static int fuse_launder_page(struct page *page)
{
int err = 0;
@@ -2589,6 +2727,7 @@ static const struct file_operations fuse
static const struct address_space_operations fuse_file_aops = {
.readpage = fuse_readpage,
.writepage = fuse_writepage,
+ .writepages = fuse_writepages,
.launder_page = fuse_launder_page,
.readpages = fuse_readpages,
.set_page_dirty = __set_page_dirty_nobuffers,

2013-08-02 15:40:16

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

07/19/2013 08:50 PM, Miklos Szeredi wrote:
> On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
>> From: Pavel Emelyanov <[email protected]>
>>
>> The .writepages one is required to make each writeback request carry more than
>> one page on it. The patch enables optimized behaviour unconditionally,
>> i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0.
> I rewrote this a bit, so we won't have to do the thing in two passes, which
> makes it simpler and more robust. Waiting for page writeback here is wrong
> anyway, see comment above fuse_page_mkwrite(). BTW we had a race there because
> fuse_page_mkwrite() didn't take the page lock. I've also fixed that up and
> pushed a series containing these patches up to implementing ->writepages() to
>
> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git writepages
>
> Passed some trivial testing but more is needed.

Thanks a lot for the efforts. The approach you implemented looks promising,
but it introduces the following assumption: a page cannot become dirty
before we have a chance to wait on fuse writeback while holding the page
locked. This is already true for mmap-ed writes (due to your fixes) and
it seems doable for cached writes as well (like we do in
fuse_perform_write). But the assumption seems to be broken in the case of a
direct read from a local fs (e.g. ext4) into a memory region mmap-ed to a
file on a fuse fs. See how dio_bio_submit() marks pages dirty via
bio_set_pages_dirty(). I can't see any solution for this use-case. Do you?

Thanks,
Maxim

>
> I'll get to the rest of the patches next week.
>
> Thanks,
> Miklos
>

2013-08-06 16:25:37

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

On Fri, Aug 2, 2013 at 5:40 PM, Maxim Patlasov <[email protected]> wrote:
> 07/19/2013 08:50 PM, Miklos Szeredi wrote:
>
>> On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
>>>
>>> From: Pavel Emelyanov <[email protected]>
>>>
>>> The .writepages one is required to make each writeback request carry more
>>> than
>>> one page on it. The patch enables optimized behaviour unconditionally,
>>> i.e. mmap-ed writes will benefit from the patch even if
>>> fc->writeback_cache=0.
>>
>> I rewrote this a bit, so we won't have to do the thing in two passes,
>> which
>> makes it simpler and more robust. Waiting for page writeback here is
>> wrong
>> anyway, see comment above fuse_page_mkwrite(). BTW we had a race there
>> because
>> fuse_page_mkwrite() didn't take the page lock. I've also fixed that up
>> and
>> pushed a series containing these patches up to implementing ->writepages()
>> to
>>
>> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
>> writepages
>>
>> Passed some trivial testing but more is needed.
>
>
> Thanks a lot for the efforts. The approach you implemented looks promising,
> but it introduces the following assumption: a page cannot become dirty
> before we have a chance to wait on fuse writeback while holding the page
> locked. This is already true for mmap-ed writes (due to your fixes) and it
> seems doable for cached writes as well (like we do in fuse_perform_write).
> But the assumption seems to be broken in the case of a direct read from a
> local fs (e.g. ext4) into a memory region mmap-ed to a file on a fuse fs.
> See how dio_bio_submit() marks pages dirty via bio_set_pages_dirty(). I
> can't see any solution for this use-case. Do you?

Hmm. Direct IO on an mmaped file will do get_user_pages() which will
do the necessary page fault magic and ->page_mkwrite() will be called.
At least AFAICS.

The page cannot become dirty through a memory mapping without first
switching the pte from read-only to read-write. Page accounting
logic relies on this too. The other way the page can become dirty is
through write(2) on the fs. But we do get notified about that too.
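
FWIW, the pattern I mean looks roughly like this (sketched from the
writepages branch; the details there may differ slightly):

static int fuse_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
{
	struct page *page = vmf->page;
	struct inode *inode = file_inode(vma->vm_file);

	file_update_time(vma->vm_file);
	lock_page(page);
	if (page->mapping != inode->i_mapping) {
		unlock_page(page);
		return VM_FAULT_NOPAGE;		/* page was truncated */
	}

	/* returning VM_FAULT_LOCKED keeps the page locked until the pte has
	 * been made writable, so the page cannot be redirtied while a
	 * writeback request still references it */
	fuse_wait_on_page_writeback(inode, page->index);
	return VM_FAULT_LOCKED;
}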

Thanks,
Miklos

2013-08-09 15:02:08

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

Hi Miklos,

08/06/2013 08:25 PM, Miklos Szeredi wrote:
> On Fri, Aug 2, 2013 at 5:40 PM, Maxim Patlasov <[email protected]> wrote:
>> 07/19/2013 08:50 PM, Miklos Szeredi wrote:
>>
>>> On Sat, Jun 29, 2013 at 09:45:29PM +0400, Maxim Patlasov wrote:
>>>> From: Pavel Emelyanov <[email protected]>
>>>>
>>>> The .writepages one is required to make each writeback request carry more
>>>> than
>>>> one page on it. The patch enables optimized behaviour unconditionally,
>>>> i.e. mmap-ed writes will benefit from the patch even if
>>>> fc->writeback_cache=0.
>>> I rewrote this a bit, so we won't have to do the thing in two passes,
>>> which
>>> makes it simpler and more robust. Waiting for page writeback here is
>>> wrong
>>> anyway, see comment above fuse_page_mkwrite(). BTW we had a race there
>>> because
>>> fuse_page_mkwrite() didn't take the page lock. I've also fixed that up
>>> and
>>> pushed a series containing these patches up to implementing ->writepages()
>>> to
>>>
>>> git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse.git
>>> writepages
>>>
>>> Passed some trivial testing but more is needed.
>>
>> Thanks a lot for the efforts. The approach you implemented looks promising,
>> but it introduces the following assumption: a page cannot become dirty
>> before we have a chance to wait on fuse writeback while holding the page
>> locked. This is already true for mmap-ed writes (due to your fixes) and it
>> seems doable for cached writes as well (like we do in fuse_perform_write).
>> But the assumption seems to be broken in the case of a direct read from a
>> local fs (e.g. ext4) into a memory region mmap-ed to a file on a fuse fs.
>> See how dio_bio_submit() marks pages dirty via bio_set_pages_dirty(). I
>> can't see any solution for this use-case. Do you?
> Hmm. Direct IO on an mmaped file will do get_user_pages() which will
> do the necessary page fault magic and ->page_mkwrite() will be called.
> At least AFAICS.

Yes, I agree.

>
> The page cannot become dirty through a memory mapping without first
> switching the pte from read-only to read-write. Page accounting
> logic relies on this too. The other way the page can become dirty is
> through write(2) on the fs. But we do get notified about that too.

Yes, that's correct, but I don't understand why you disregard two other
cases of marking a page dirty (both related to a direct AIO read from a file
into a memory region mmap-ed to a fuse file):

1. dio_bio_submit() -->
bio_set_pages_dirty() -->
set_page_dirty_lock()

2. dio_bio_complete() -->
bio_check_pages_dirty() -->
bio_dirty_fn() -->
bio_set_pages_dirty() -->
set_page_dirty_lock()

As soon as a page became dirty through a memory mapping (exactly as you
explained), nothing would prevent it from being written back. And fuse will
call end_page_writeback almost immediately after copying the real page
to a temporary one. Then dio_bio_submit may re-dirty the page speculatively
w/o notifying fuse. And again, nothing would prevent it from being written
back once more. Hence we can end up with more than one temporary page in
fuse write-back. And a similar concern applies to the dio_bio_complete()
re-dirty.

This makes me think that we do need fuse_page_is_writeback() in
fuse_writepages_fill(). But it shouldn't be harmful because it will
almost always be a no-op due to the waits for fuse writeback in
->page_mkwrite() and in the course of handling write(2).
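
I.e. something as simple as this in fuse_writepages_fill(), right after
write_cache_pages() has handed us the locked page (a sketch):

	/* the page may still be under fuse writeback if it was re-dirtied
	 * behind our back (e.g. by dio_bio_submit()); wait it out so that
	 * at most one temporary page per real page is in flight */
	if (fuse_page_is_writeback(inode, page->index))
		fuse_wait_on_page_writeback(inode, page->index);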

Thanks,
Maxim

2013-08-30 10:12:11

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
> 08/06/2013 08:25 PM, Miklos Szeredi wrote:
> >Hmm. Direct IO on an mmaped file will do get_user_pages() which will
> >do the necessary page fault magic and ->page_mkwrite() will be called.
> >At least AFAICS.
>
> Yes, I agree.
>
> >
> >The page cannot become dirty through a memory mapping without first
> >switching the pte from read-only to read-write. Page accounting
> >logic relies on this too. The other way the page can become dirty is
> >through write(2) on the fs. But we do get notified about that too.
>
> Yes, that's correct, but I don't understand why you disregard two
> other cases of marking a page dirty (both related to a direct AIO read
> from a file into a memory region mmap-ed to a fuse file):
>
> 1. dio_bio_submit() -->
> bio_set_pages_dirty() -->
> set_page_dirty_lock()
>
> 2. dio_bio_complete() -->
> bio_check_pages_dirty() -->
> bio_dirty_fn() -->
> bio_set_pages_dirty() -->
> set_page_dirty_lock()
>
> As soon as a page became dirty through a memory mapping (exactly as
> you explained), nothing would prevent it from being written back. And
> fuse will call end_page_writeback almost immediately after copying
> the real page to a temporary one. Then dio_bio_submit may re-dirty
> the page speculatively w/o notifying fuse. And again, nothing would
> prevent it from being written back once more. Hence we can end up
> with more than one temporary page in fuse write-back. And a similar
> concern applies to the dio_bio_complete() re-dirty.
>
> This makes me think that we do need fuse_page_is_writeback() in
> fuse_writepages_fill(). But it shouldn't be harmful because it will
> almost always be a no-op due to the waits for fuse writeback in
> ->page_mkwrite() and in the course of handling write(2).

The problem is: if we need it in ->writepages, we need it in ->writepage too.
And that's where we can't have it because it would deadlock in reclaim.

There's a way to work around this:

- if the request is still in queue, just update it with the contents of the
new page

- if the request is already in userspace, create a new request, but only let
userspace have it once the previous request for the same page completes, so
the ordering is not messed up

But that's a lot of hairy code.
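
Just to illustrate the first case, it would be something like this (all
helper details hypothetical, single-page requests only):

static bool fuse_writepage_update_queued(struct fuse_req *old_req,
					 struct page *page)
{
	/* hypothetical: the caller found old_req covering page->index on
	 * fi->writepages and holds fc->lock */
	if (old_req->state == FUSE_REQ_INIT ||
	    old_req->state == FUSE_REQ_PENDING) {
		/* still kernel-side: refresh the temporary page in place */
		copy_highpage(old_req->pages[0], page);
		return true;
	}
	return false;	/* already read by userspace: must chain instead */
}

The second case additionally needs a per-page queue of requests on top of
that, with careful completion ordering.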

Any other ideas?

The best would be if we could get rid of the ugly temporary page requirement for
fuse writeback. The big blocker has always been direct reclaim: allocation must
not wait on fuse writebacks. AFAICS there's still a wait_on_page_writeback() in
relation to memcg. And it interacts with page migration which also has them.
Those are the really difficult ones...

The other offender, balance_dirty_pages(), is much easier and needs to be
taught about fuse behavior anyway.

Thoughts?

Thanks,
Miklos

2013-08-30 14:50:32

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

Hi Miklos,

08/30/2013 02:12 PM, Miklos Szeredi wrote:
> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
>> 08/06/2013 08:25 PM, Miklos Szeredi wrote:
>>> Hmm. Direct IO on an mmaped file will do get_user_pages() which will
>>> do the necessary page fault magic and ->page_mkwrite() will be called.
>>> At least AFAICS.
>> Yes, I agree.
>>
>>> The page cannot become dirty through a memory mapping without first
>>> switching the pte from read-only to read-write. Page accounting
>>> logic relies on this too. The other way the page can become dirty is
>>> through write(2) on the fs. But we do get notified about that too.
>> Yes, that's correct, but I don't understand why you disregard two
>> other cases of marking page dirty (both related to direct AIO read
>> from a file to a memory region mmap-ed to a fuse file):
>>
>> 1. dio_bio_submit() -->
>>      bio_set_pages_dirty() -->
>>        set_page_dirty_lock()
>>
>> 2. dio_bio_complete() -->
>>      bio_check_pages_dirty() -->
>>        bio_dirty_fn() -->
>>          bio_set_pages_dirty() -->
>>            set_page_dirty_lock()
>>
>> As soon as a page became dirty through a memory mapping (exactly as
>> you explained), nothing would prevent it from being written back. And
>> fuse will call end_page_writeback almost immediately after copying
>> the real page to a temporary one. Then dio_bio_submit may re-dirty
>> the page speculatively w/o notifying fuse. And again, from then on
>> nothing would prevent it from being written back once more. Hence we
>> can end up with more than one temporary page in fuse write-back. A
>> similar concern applies to the dio_bio_complete() re-dirty.
>>
>> This makes me think that we do need fuse_page_is_writeback() in
>> fuse_writepages_fill(). But it shouldn't be harmful because it will
>> be a no-op practically always, due to the waiting for fuse writeback
>> in ->page_mkwrite() and in the course of handling write(2).
> The problem is: if we need it in ->writepages, we need it in ->writepage too.
> And that's where we can't have it because it would deadlock in reclaim.

I thought we're protected from the deadlock by the following chunk (in
the very beginning of fuse_writepage):

> +     if (fuse_page_is_writeback(inode, page->index)) {
> +             if (wbc->sync_mode != WB_SYNC_ALL) {
> +                     redirty_page_for_writepage(wbc, page);
> +                     return 0;
> +             }
> +             fuse_wait_on_page_writeback(inode, page->index);
> +     }

Because the reclaimer never calls us with WB_SYNC_ALL (pageout() always
uses WB_SYNC_NONE). Did I miss something?
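
For reference, this is how reclaim builds its writeback control in
mm/vmscan.c::pageout():

        struct writeback_control wbc = {
                .sync_mode = WB_SYNC_NONE,
                .nr_to_write = SWAP_CLUSTER_MAX,
                .range_start = 0,
                .range_end = LLONG_MAX,
                .for_reclaim = 1,
        };

so the WB_SYNC_ALL branch above is never taken on that path.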

>
> There's a way to work around this:
>
> - if the request is still in the queue, just update it with the contents of
> the new page
>
> - if the request is already in userspace, create a new request, but only let
> userspace have it once the previous request for the same page completes, so
> the ordering is not messed up
>
> But that's a lot of hairy code.

Is that exactly how NFS solves a similar problem?

>
> Any other ideas?
>
> The best would be if we could get rid of the ugly temporary page requirement for
> fuse writeback. The big blocker has always been direct reclaim: allocation must
> not wait on fuse writebacks. AFAICS there's still a wait_on_page_writeback() in
> relation to memcg. And it interacts with page migration which also has them.
> Those are the really difficult ones...

Yes, I agree. I think there are quite a few reasons not to keep the original
page under writeback for long, and not to make that dependent on a userspace
process either.

>
> The other offender, balance_dirty_pages(), is much easier and needs to be
> taught about fuse behavior anyway.

BTW, the strictlimit feature (including enabling it for fuse) is already in
linux-next.
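
The fuse side of it should be essentially a one-liner in
fuse_bdi_init(), something like:

        /*
         * BDI_CAP_STRICTLIMIT makes balance_dirty_pages() enforce the
         * per-bdi dirty limit even while the global limits are not
         * exceeded, keeping the amount of dirty fuse data bounded.
         */
        fc->bdi.capabilities = BDI_CAP_MAP_COPY | BDI_CAP_STRICTLIMIT;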

Thanks,
Maxim

2013-09-03 10:31:14

by Miklos Szeredi

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote:
> Hi Miklos,
>
> 08/30/2013 02:12 PM, Miklos Szeredi wrote:
> >On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
> >>08/06/2013 08:25 PM, Miklos Szeredi wrote:
> >>>Hmm. Direct IO on an mmaped file will do get_user_pages() which will
> >>>do the necessary page fault magic and ->page_mkwrite() will be called.
> >>>At least AFAICS.
> >>Yes, I agree.
> >>
> >>>The page cannot become dirty through a memory mapping without first
> >>>switching the pte from read-only to read-write. Page accounting
> >>>logic relies on this too. The other way the page can become dirty is
> >>>through write(2) on the fs. But we do get notified about that too.
> >>Yes, that's correct, but I don't understand why you disregard two
> >>other cases of marking page dirty (both related to direct AIO read
> >>from a file to a memory region mmap-ed to a fuse file):
> >>
> >>1. dio_bio_submit() -->
> >> bio_set_pages_dirty() -->
> >> set_page_dirty_lock()
> >>
> >>2. dio_bio_complete() -->
> >> bio_check_pages_dirty() -->
> >> bio_dirty_fn() -->
> >> bio_set_pages_dirty() -->
> >> set_page_dirty_lock()
> >>
> >>As soon as a page became dirty through a memory mapping (exactly as
> >>you explained), nothing would prevent it from being written back. And
> >>fuse will call end_page_writeback almost immediately after copying
> >>the real page to a temporary one. Then dio_bio_submit may re-dirty
> >>the page speculatively w/o notifying fuse. And again, from then on
> >>nothing would prevent it from being written back once more. Hence we
> >>can end up with more than one temporary page in fuse write-back. A
> >>similar concern applies to the dio_bio_complete() re-dirty.
> >>
> >>This makes me think that we do need fuse_page_is_writeback() in
> >>fuse_writepages_fill(). But it shouldn't be harmful because it will
> >>be a no-op practically always, due to the waiting for fuse writeback
> >>in ->page_mkwrite() and in the course of handling write(2).
> >The problem is: if we need it in ->writepages, we need it in ->writepage too.
> >And that's where we can't have it because it would deadlock in reclaim.
>
> I thought we're protected from the deadlock by the following chunk
> (in the very beginning of fuse_writepage):
>
> >+    if (fuse_page_is_writeback(inode, page->index)) {
> >+            if (wbc->sync_mode != WB_SYNC_ALL) {
> >+                    redirty_page_for_writepage(wbc, page);
> >+                    return 0;
> >+            }
> >+            fuse_wait_on_page_writeback(inode, page->index);
> >+    }
>
> Because the reclaimer never calls us with WB_SYNC_ALL (pageout() always
> uses WB_SYNC_NONE). Did I miss something?

Yeah, we could have that in ->writepage() too. And apparently that would work:
reclaim would just leave us alone.

Then there's sync(2), which does use WB_SYNC_ALL. Yet for an unprivileged fuse
mount we don't want ->writepages() to block, because that's a pretty clear DoS
issue.

So we are left with this:

> >There's a way to work around this:
> >
> > - if the request is still in the queue, just update it with the contents of
> > the new page
> >
> > - if the request is already in userspace, create a new request, but only
> > let userspace have it once the previous request for the same page
> > completes, so the ordering is not messed up
> >
> >But that's a lot of hairy code.
>
> Is it exactly how NFS solves similar problem?

NFS will apparently just block if there's a request outstanding and we are in
WB_SYNC_ALL mode. Which is somewhat simpler.

Thanks,
Miklos

2013-09-03 16:03:03

by Maxim Patlasov

[permalink] [raw]
Subject: Re: [PATCH 10/16] fuse: Implement writepages callback

09/03/2013 02:31 PM, Miklos Szeredi wrote:
> On Fri, Aug 30, 2013 at 06:50:18PM +0400, Maxim Patlasov wrote:
>> Hi Miklos,
>>
>> 08/30/2013 02:12 PM, Miklos Szeredi wrote:
>>> On Fri, Aug 09, 2013 at 07:02:12PM +0400, Maxim Patlasov wrote:
>>>> 08/06/2013 08:25 PM, Miklos Szeredi wrote:
>>>>> Hmm. Direct IO on an mmaped file will do get_user_pages() which will
>>>>> do the necessary page fault magic and ->page_mkwrite() will be called.
>>>>> At least AFAICS.
>>>> Yes, I agree.
>>>>
>>>>> The page cannot become dirty through a memory mapping without first
>>>>> switching the pte from read-only to read-write. Page accounting
>>>>> logic relies on this too. The other way the page can become dirty is
>>>>> through write(2) on the fs. But we do get notified about that too.
>>>> Yes, that's correct, but I don't understand why you disregard two
>>>> other cases of marking page dirty (both related to direct AIO read
>>>> from a file to a memory region mmap-ed to a fuse file):
>>>> 1. dio_bio_submit() -->
>>>>      bio_set_pages_dirty() -->
>>>>        set_page_dirty_lock()
>>>>
>>>> 2. dio_bio_complete() -->
>>>>      bio_check_pages_dirty() -->
>>>>        bio_dirty_fn() -->
>>>>          bio_set_pages_dirty() -->
>>>>            set_page_dirty_lock()
>>>>
>>>> As soon as a page became dirty through a memory mapping (exactly as
>>>> you explained), nothing would prevent it from being written back. And
>>>> fuse will call end_page_writeback almost immediately after copying
>>>> the real page to a temporary one. Then dio_bio_submit may re-dirty
>>>> the page speculatively w/o notifying fuse. And again, from then on
>>>> nothing would prevent it from being written back once more. Hence we
>>>> can end up with more than one temporary page in fuse write-back. A
>>>> similar concern applies to the dio_bio_complete() re-dirty.
>>>>
>>>> This makes me think that we do need fuse_page_is_writeback() in
>>>> fuse_writepages_fill(). But it shouldn't be harmful because it will
>>>> be a no-op practically always, due to the waiting for fuse writeback
>>>> in ->page_mkwrite() and in the course of handling write(2).
>>> The problem is: if we need it in ->writepages, we need it in ->writepage too.
>>> And that's where we can't have it because it would deadlock in reclaim.
>> I thought we're protected from the deadlock by the following chunk
>> (in the very beginning of fuse_writepage):
>>
>>> +   if (fuse_page_is_writeback(inode, page->index)) {
>>> +           if (wbc->sync_mode != WB_SYNC_ALL) {
>>> +                   redirty_page_for_writepage(wbc, page);
>>> +                   return 0;
>>> +           }
>>> +           fuse_wait_on_page_writeback(inode, page->index);
>>> +   }
>> Because the reclaimer never calls us with WB_SYNC_ALL (pageout() always
>> uses WB_SYNC_NONE). Did I miss something?
> Yeah, we could have that in ->writepage() too. And apparently that would work:
> reclaim would just leave us alone.
>
> Then there's sync(2), which does use WB_SYNC_ALL. Yet for an unprivileged fuse
> mount we don't want ->writepages() to block, because that's a pretty clear DoS
> issue.

Yes, I agree, but those cases (sync(2) hitting a page that is under fuse
writeback started by the flusher and then re-dirtied by one of those direct
AIO reads) should be very rare. I'd suggest going ahead and putting up with
it for now: unprivileged users won't be able to use the writeback_cache
option until the sysadmin enables allow_wbcache in fusermount.

>
> So we are left with this:

Yes. Can we implement it as a separate fix, after this patch-set is
included?

>
>>> There's a way to work around this:
>>>
>>> - if the request is still in the queue, just update it with the contents of
>>> the new page
>>>
>>> - if the request is already in userspace, create a new request, but only
>>> let userspace have it once the previous request for the same page
>>> completes, so the ordering is not messed up
>>>
>>> But that's a lot of hairy code.
>> Is it exactly how NFS solves similar problem?
> NFS will apparently just block if there's a request outstanding and we are in
> WB_SYNC_ALL mode. Which is somewhat simpler.

Yes, indeed.

Thanks,
Maxim