2012-11-01 08:01:07

by Darrick J. Wong

[permalink] [raw]
Subject: [RFC PATCH v2 0/3] mm/fs: Implement faster stable page writes on filesystems

Hi all,

This patchset makes some key modifications to the original 'stable page writes'
patchset. First, it provides users (devices and filesystems) of a
backing_dev_info the ability to declare whether or not it is necessary to
ensure that page contents cannot change during writeout, whereas the current
code assumes that this is true. Second, it relaxes the wait_on_page_writeback
calls so that they only occur if something needs it. Third, it fixes up (most)
of the remaining filesystems to use this improved conditional-wait logic in the
hopes of providing stable page writes on all filesystems.

It is hoped that (for people not using checksumming devices, anyway) this
patchset will give back unnecessary performance decreases since the original
stable page write patchset went into 3.0. It seems possible, though, that iscsi
and raid5 may wish to use the new stable page write support to enable zero-copy
writeout.

Unfortunately, it seems that ext3 is still broken wrt stable page writes. One
workaround would be to use ext4 instead, or avoid the use of ext3.ko + DIF/DIX.
Hopefully it doesn't take long to sort out.

Another thing I noticed is that there are several filesystems that call
wait_on_page_writeback before returning VM_FAULT_LOCKED in their page_mkwrite
delegates. It might be possible to convert some of these to
wait_for_stable_pages unless there's some other reason that we always want to
wait for writeback.

Finally, if a filesystem wants the VM to help it provide stable pages, it's now
possible to use the *_require_stable_pages() functions to turn that on. It
might be useful for checksumming data blocks during write.

This patchset has been lightly tested on 3.7.0-rc3 on x64.

--D


2012-11-01 07:59:52

by Darrick J. Wong

[permalink] [raw]
Subject: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

Fix up the filesystems that provide their own ->page_mkwrite handlers to
provide stable page writes if necessary.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/9p/vfs_file.c | 1 +
fs/afs/write.c | 4 ++--
fs/ceph/addr.c | 1 +
fs/cifs/file.c | 1 +
fs/ocfs2/mmap.c | 1 +
fs/ubifs/file.c | 4 ++--
6 files changed, 8 insertions(+), 4 deletions(-)


diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
index c2483e9..aa253f0 100644
--- a/fs/9p/vfs_file.c
+++ b/fs/9p/vfs_file.c
@@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
lock_page(page);
if (page->mapping != inode->i_mapping)
goto out_unlock;
+ wait_on_stable_page_write(page);

return VM_FAULT_LOCKED;
out_unlock:
diff --git a/fs/afs/write.c b/fs/afs/write.c
index 9aa52d9..39eb2a4 100644
--- a/fs/afs/write.c
+++ b/fs/afs/write.c
@@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
#ifdef CONFIG_AFS_FSCACHE
fscache_wait_on_page_write(vnode->cache, page);
#endif
-
+ wait_on_stable_page_write(page);
_leave(" = 0");
- return 0;
+ return VM_FAULT_LOCKED;
}
diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
index 6690269..e9734bf 100644
--- a/fs/ceph/addr.c
+++ b/fs/ceph/addr.c
@@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
set_page_dirty(page);
up_read(&mdsc->snap_rwsem);
ret = VM_FAULT_LOCKED;
+ wait_on_stable_page_write(page);
} else {
if (ret == -ENOMEM)
ret = VM_FAULT_OOM;
diff --git a/fs/cifs/file.c b/fs/cifs/file.c
index edb25b4..a8770bf 100644
--- a/fs/cifs/file.c
+++ b/fs/cifs/file.c
@@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
struct page *page = vmf->page;

lock_page(page);
+ wait_on_stable_page_write(page);
return VM_FAULT_LOCKED;
}

diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
index 47a87dd..a0027b1 100644
--- a/fs/ocfs2/mmap.c
+++ b/fs/ocfs2/mmap.c
@@ -124,6 +124,7 @@ static int __ocfs2_page_mkwrite(struct file *file, struct buffer_head *di_bh,
fsdata);
BUG_ON(ret != len);
ret = VM_FAULT_LOCKED;
+ wait_on_stable_page_write(page);
out:
return ret;
}
diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
index 5bc7781..cb0d3aa 100644
--- a/fs/ubifs/file.c
+++ b/fs/ubifs/file.c
@@ -1522,8 +1522,8 @@ static int ubifs_vm_page_mkwrite(struct vm_area_struct *vma,
ubifs_release_dirty_inode_budget(c, ui);
}

- unlock_page(page);
- return 0;
+ wait_on_stable_page_write(page);
+ return VM_FAULT_LOCKED;

out_unlock:
unlock_page(page);

2012-11-01 08:01:09

by Darrick J. Wong

[permalink] [raw]
Subject: [PATCH 1/3] bdi: Track users that require stable page writes

This creates a per-backing-device counter that tracks the number of users which
require pages to be held immutable during writeout. Eventually it will be used
to waive wait_for_page_writeback() if nobody requires stable pages.

Signed-off-by: Darrick J. Wong <[email protected]>
---
Documentation/ABI/testing/sysfs-class-bdi | 7 +++++
block/blk-integrity.c | 4 +++
include/linux/backing-dev.h | 16 ++++++++++++
include/linux/blkdev.h | 10 ++++++++
mm/backing-dev.c | 38 +++++++++++++++++++++++++++++
5 files changed, 75 insertions(+)


diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
index 5f50097..218a618 100644
--- a/Documentation/ABI/testing/sysfs-class-bdi
+++ b/Documentation/ABI/testing/sysfs-class-bdi
@@ -48,3 +48,10 @@ max_ratio (read-write)
most of the write-back cache. For example in case of an NFS
mount that is prone to get stuck, or a FUSE mount which cannot
be trusted to play fair.
+
+stable_pages_required (read-write)
+
+ If set, the backing device requires that all pages comprising a write
+ request must not be changed until writeout is complete. The system
+ administrator can turn this on if the hardware does not do so already.
+ However, once enabled, this flag cannot be disabled.
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index da2a818..cf2dd95 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
} else
bi->name = bi_unsupported_name;

+ queue_require_stable_pages(disk->queue);
+
return 0;
}
EXPORT_SYMBOL(blk_integrity_register);
@@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk)
if (!disk || !disk->integrity)
return;

+ queue_unrequire_stable_pages(disk->queue);
+
bi = disk->integrity;

kobject_uevent(&bi->kobj, KOBJ_REMOVE);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2a9a9ab..0554f5d 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -109,6 +109,7 @@ struct backing_dev_info {
struct dentry *debug_dir;
struct dentry *debug_stats;
#endif
+ atomic_t stable_page_users;
};

int bdi_init(struct backing_dev_info *bdi);
@@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);

+static inline void bdi_require_stable_pages(struct backing_dev_info *bdi)
+{
+ atomic_inc(&bdi->stable_page_users);
+}
+
+static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi)
+{
+ atomic_dec(&bdi->stable_page_users);
+}
+
+static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
+{
+ return atomic_read(&bdi->stable_page_users) > 0;
+}
+
static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
{
return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 1756001..bf927c0 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -458,6 +458,16 @@ struct request_queue {
(1 << QUEUE_FLAG_SAME_COMP) | \
(1 << QUEUE_FLAG_ADD_RANDOM))

+static inline void queue_require_stable_pages(struct request_queue *q)
+{
+ bdi_require_stable_pages(&q->backing_dev_info);
+}
+
+static inline void queue_unrequire_stable_pages(struct request_queue *q)
+{
+ bdi_unrequire_stable_pages(&q->backing_dev_info);
+}
+
static inline void queue_lockdep_assert_held(struct request_queue *q)
{
if (q->queue_lock)
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d3ca2b3..dd9f5ed 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev,
}
BDI_SHOW(max_ratio, bdi->max_ratio)

+static ssize_t stable_pages_required_store(struct device *dev,
+ struct device_attribute *attr,
+ const char *buf, size_t count)
+{
+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
+ unsigned int spw;
+ ssize_t ret;
+
+ ret = kstrtouint(buf, 10, &spw);
+ if (ret < 0)
+ return ret;
+
+ /*
+ * SPW could be enabled due to hw requirement, so don't
+ * let users disable it.
+ */
+ if (bdi_cap_stable_pages_required(bdi) && spw == 0)
+ return -EINVAL;
+
+ if (spw != 0)
+ atomic_inc(&bdi->stable_page_users);
+
+ return count;
+}
+
+static ssize_t stable_pages_required_show(struct device *dev,
+ struct device_attribute *attr,
+ char *page)
+{
+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+ return snprintf(page, PAGE_SIZE-1, "%d\n",
+ bdi_cap_stable_pages_required(bdi) ? 1 : 0);
+}
+
#define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)

static struct device_attribute bdi_dev_attrs[] = {
__ATTR_RW(read_ahead_kb),
__ATTR_RW(min_ratio),
__ATTR_RW(max_ratio),
+ __ATTR_RW(stable_pages_required),
__ATTR_NULL,
};

@@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi)
bdi->write_bandwidth = INIT_BW;
bdi->avg_write_bandwidth = INIT_BW;

+ atomic_set(&bdi->stable_page_users, 0);
+
err = fprop_local_init_percpu(&bdi->completions);

if (err) {

2012-11-01 08:01:05

by Darrick J. Wong

[permalink] [raw]
Subject: [PATCH 2/3] mm: Only enforce stable page writes if the backing device requires it

Create a helper function to check if a backing device requires stable page
writes and, if so, performs the necessary wait. Then, make it so that all
points in the memory manager that handle making pages writable use the helper
function. This should provide stable page write support to most filesystems,
while eliminating unnecessary waiting for devices that don't require the
feature.

Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/buffer.c | 2 +-
fs/ext4/inode.c | 2 +-
include/linux/pagemap.h | 1 +
mm/filemap.c | 3 ++-
mm/page-writeback.c | 11 +++++++++++
5 files changed, 16 insertions(+), 3 deletions(-)


diff --git a/fs/buffer.c b/fs/buffer.c
index b5f0442..cac3007 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2334,7 +2334,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
if (unlikely(ret < 0))
goto out_unlock;
set_page_dirty(page);
- wait_on_page_writeback(page);
+ wait_on_stable_page_write(page);
return 0;
out_unlock:
unlock_page(page);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..948d68a 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4814,7 +4814,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
/* Wait so that we don't change page under IO */
- wait_on_page_writeback(page);
+ wait_on_stable_page_write(page);
ret = VM_FAULT_LOCKED;
goto out;
}
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index e42c762..c28da25 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -398,6 +398,7 @@ static inline void wait_on_page_writeback(struct page *page)
}

extern void end_page_writeback(struct page *page);
+void wait_on_stable_page_write(struct page *page);

/*
* Add an arbitrary waiter to a page's wait queue
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..ee46141 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1728,6 +1728,7 @@ int filemap_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
* see the dirty page and writeprotect it again.
*/
set_page_dirty(page);
+ wait_on_stable_page_write(page);
out:
sb_end_pagefault(inode->i_sb);
return ret;
@@ -2274,7 +2275,7 @@ repeat:
return NULL;
}
found:
- wait_on_page_writeback(page);
+ wait_on_stable_page_write(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..916dae1 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2275,3 +2275,14 @@ int mapping_tagged(struct address_space *mapping, int tag)
return radix_tree_tagged(&mapping->page_tree, tag);
}
EXPORT_SYMBOL(mapping_tagged);
+
+void wait_on_stable_page_write(struct page *page)
+{
+ struct backing_dev_info *bdi = page->mapping->backing_dev_info;
+
+ if (!bdi_cap_stable_pages_required(bdi))
+ return;
+
+ wait_on_page_writeback(page);
+}
+EXPORT_SYMBOL_GPL(wait_on_stable_page_write);

2012-11-01 12:36:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

On Thu 01-11-12 00:58:29, Darrick J. Wong wrote:
> Fix up the filesystems that provide their own ->page_mkwrite handlers to
> provide stable page writes if necessary.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/9p/vfs_file.c | 1 +
> fs/afs/write.c | 4 ++--
> fs/ceph/addr.c | 1 +
> fs/cifs/file.c | 1 +
> fs/ocfs2/mmap.c | 1 +
> fs/ubifs/file.c | 4 ++--
> 6 files changed, 8 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index c2483e9..aa253f0 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> lock_page(page);
> if (page->mapping != inode->i_mapping)
> goto out_unlock;
> + wait_on_stable_page_write(page);
>
> return VM_FAULT_LOCKED;
> out_unlock:
> diff --git a/fs/afs/write.c b/fs/afs/write.c
> index 9aa52d9..39eb2a4 100644
> --- a/fs/afs/write.c
> +++ b/fs/afs/write.c
> @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> #ifdef CONFIG_AFS_FSCACHE
> fscache_wait_on_page_write(vnode->cache, page);
> #endif
> -
> + wait_on_stable_page_write(page);
> _leave(" = 0");
> - return 0;
> + return VM_FAULT_LOCKED;
> }
Oh, I missed these two since I've got confused by
fscache_wait_on_page_write().

> diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
> index 47a87dd..a0027b1 100644
> --- a/fs/ocfs2/mmap.c
> +++ b/fs/ocfs2/mmap.c
> @@ -124,6 +124,7 @@ static int __ocfs2_page_mkwrite(struct file *file, struct buffer_head *di_bh,
> fsdata);
> BUG_ON(ret != len);
> ret = VM_FAULT_LOCKED;
> + wait_on_stable_page_write(page);
> out:
> return ret;
> }
Actually, this is not so easy. ocfs2 doesn't use
grab_cache_page_write_begin() so you have to modify write_begin() path as
well. And then you don't have to modify __ocfs2_page_mkwrite() because it
uses ocfs2_write_begin(). Preferably teach it to use
grab_cache_page_write_begin()...

And I think you are missing ncpfs. Because ncp_file_mmap does not set
.mkwrite - it should use filemap_page_mkwrite() I think.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-01 13:28:53

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 2/3] mm: Only enforce stable page writes if the backing device requires it

On Thu 01-11-12 00:58:21, Darrick J. Wong wrote:
> Create a helper function to check if a backing device requires stable page
> writes and, if so, performs the necessary wait. Then, make it so that all
> points in the memory manager that handle making pages writable use the helper
> function. This should provide stable page write support to most filesystems,
> while eliminating unnecessary waiting for devices that don't require the
> feature.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/buffer.c | 2 +-
> fs/ext4/inode.c | 2 +-
> include/linux/pagemap.h | 1 +
> mm/filemap.c | 3 ++-
> mm/page-writeback.c | 11 +++++++++++
> 5 files changed, 16 insertions(+), 3 deletions(-)
>
..
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 830893b..916dae1 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2275,3 +2275,14 @@ int mapping_tagged(struct address_space *mapping, int tag)
> return radix_tree_tagged(&mapping->page_tree, tag);
> }
> EXPORT_SYMBOL(mapping_tagged);
> +
> +void wait_on_stable_page_write(struct page *page)
> +{
> + struct backing_dev_info *bdi = page->mapping->backing_dev_info;
> +
> + if (!bdi_cap_stable_pages_required(bdi))
> + return;
> +
> + wait_on_page_writeback(page);
> +}
> +EXPORT_SYMBOL_GPL(wait_on_stable_page_write);
Just one nit: Maybe "wait_if_stable_write()" would describe the function
better? Otherwise the patch looks OK.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-01 13:31:58

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/3] bdi: Track users that require stable page writes

On Thu 01-11-12 00:58:13, Darrick J. Wong wrote:
> This creates a per-backing-device counter that tracks the number of users which
> require pages to be held immutable during writeout. Eventually it will be used
> to waive wait_for_page_writeback() if nobody requires stable pages.
As I wrote in another mail, maybe a combination of bdi and sb flag would
make things simpler (less chances for errors). But I can live with this as
well...

Honza
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-class-bdi | 7 +++++
> block/blk-integrity.c | 4 +++
> include/linux/backing-dev.h | 16 ++++++++++++
> include/linux/blkdev.h | 10 ++++++++
> mm/backing-dev.c | 38 +++++++++++++++++++++++++++++
> 5 files changed, 75 insertions(+)
>
>
> diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
> index 5f50097..218a618 100644
> --- a/Documentation/ABI/testing/sysfs-class-bdi
> +++ b/Documentation/ABI/testing/sysfs-class-bdi
> @@ -48,3 +48,10 @@ max_ratio (read-write)
> most of the write-back cache. For example in case of an NFS
> mount that is prone to get stuck, or a FUSE mount which cannot
> be trusted to play fair.
> +
> +stable_pages_required (read-write)
> +
> + If set, the backing device requires that all pages comprising a write
> + request must not be changed until writeout is complete. The system
> + administrator can turn this on if the hardware does not do so already.
> + However, once enabled, this flag cannot be disabled.
> diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> index da2a818..cf2dd95 100644
> --- a/block/blk-integrity.c
> +++ b/block/blk-integrity.c
> @@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> } else
> bi->name = bi_unsupported_name;
>
> + queue_require_stable_pages(disk->queue);
> +
> return 0;
> }
> EXPORT_SYMBOL(blk_integrity_register);
> @@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk)
> if (!disk || !disk->integrity)
> return;
>
> + queue_unrequire_stable_pages(disk->queue);
> +
> bi = disk->integrity;
>
> kobject_uevent(&bi->kobj, KOBJ_REMOVE);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 2a9a9ab..0554f5d 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -109,6 +109,7 @@ struct backing_dev_info {
> struct dentry *debug_dir;
> struct dentry *debug_stats;
> #endif
> + atomic_t stable_page_users;
> };
>
> int bdi_init(struct backing_dev_info *bdi);
> @@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
> int pdflush_proc_obsolete(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp, loff_t *ppos);
>
> +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi)
> +{
> + atomic_inc(&bdi->stable_page_users);
> +}
> +
> +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi)
> +{
> + atomic_dec(&bdi->stable_page_users);
> +}
> +
> +static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
> +{
> + return atomic_read(&bdi->stable_page_users) > 0;
> +}
> +
> static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> {
> return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1756001..bf927c0 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -458,6 +458,16 @@ struct request_queue {
> (1 << QUEUE_FLAG_SAME_COMP) | \
> (1 << QUEUE_FLAG_ADD_RANDOM))
>
> +static inline void queue_require_stable_pages(struct request_queue *q)
> +{
> + bdi_require_stable_pages(&q->backing_dev_info);
> +}
> +
> +static inline void queue_unrequire_stable_pages(struct request_queue *q)
> +{
> + bdi_unrequire_stable_pages(&q->backing_dev_info);
> +}
> +
> static inline void queue_lockdep_assert_held(struct request_queue *q)
> {
> if (q->queue_lock)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index d3ca2b3..dd9f5ed 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev,
> }
> BDI_SHOW(max_ratio, bdi->max_ratio)
>
> +static ssize_t stable_pages_required_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> + unsigned int spw;
> + ssize_t ret;
> +
> + ret = kstrtouint(buf, 10, &spw);
> + if (ret < 0)
> + return ret;
> +
> + /*
> + * SPW could be enabled due to hw requirement, so don't
> + * let users disable it.
> + */
> + if (bdi_cap_stable_pages_required(bdi) && spw == 0)
> + return -EINVAL;
> +
> + if (spw != 0)
> + atomic_inc(&bdi->stable_page_users);
> +
> + return count;
> +}
> +
> +static ssize_t stable_pages_required_show(struct device *dev,
> + struct device_attribute *attr,
> + char *page)
> +{
> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +
> + return snprintf(page, PAGE_SIZE-1, "%d\n",
> + bdi_cap_stable_pages_required(bdi) ? 1 : 0);
> +}
> +
> #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
>
> static struct device_attribute bdi_dev_attrs[] = {
> __ATTR_RW(read_ahead_kb),
> __ATTR_RW(min_ratio),
> __ATTR_RW(max_ratio),
> + __ATTR_RW(stable_pages_required),
> __ATTR_NULL,
> };
>
> @@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi)
> bdi->write_bandwidth = INIT_BW;
> bdi->avg_write_bandwidth = INIT_BW;
>
> + atomic_set(&bdi->stable_page_users, 0);
> +
> err = fprop_local_init_percpu(&bdi->completions);
>
> if (err) {
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-01 18:22:51

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 1/3] bdi: Track users that require stable page writes

On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
> This creates a per-backing-device counter that tracks the number of users which
> require pages to be held immutable during writeout. Eventually it will be used
> to waive wait_for_page_writeback() if nobody requires stable pages.
>

There is two things I do not like:
1. Please remind me why we need the users counter?
If the device needs it, it will always be needed. The below
queue_unrequire_stable_pages call at blk_integrity_unregister
only happens at device destruction time, no?

It was also said that maybe individual filesystems would need
stable pages where other FSs using the same BDI do not. But
since the FS is at the driving seat in any case, it can just
do wait_for_writeback() regardless and can care less about
the bdi flag and the other FSs. Actually all those FSs already
do this this, and do not need any help. So this reason is
mute.

2. I hate the atomic_read for every mkwrite. I think we can do
better, since as you noted it can never be turned off, only
on, at init time. And because of 1. above it is not dynamic.
I think I like your previous simple bool better.

But I do like a lot the new request_queue API you added here.
Thanks

Boaz
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> Documentation/ABI/testing/sysfs-class-bdi | 7 +++++
> block/blk-integrity.c | 4 +++
> include/linux/backing-dev.h | 16 ++++++++++++
> include/linux/blkdev.h | 10 ++++++++
> mm/backing-dev.c | 38 +++++++++++++++++++++++++++++
> 5 files changed, 75 insertions(+)
>
>
> diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
> index 5f50097..218a618 100644
> --- a/Documentation/ABI/testing/sysfs-class-bdi
> +++ b/Documentation/ABI/testing/sysfs-class-bdi
> @@ -48,3 +48,10 @@ max_ratio (read-write)
> most of the write-back cache. For example in case of an NFS
> mount that is prone to get stuck, or a FUSE mount which cannot
> be trusted to play fair.
> +
> +stable_pages_required (read-write)
> +
> + If set, the backing device requires that all pages comprising a write
> + request must not be changed until writeout is complete. The system
> + administrator can turn this on if the hardware does not do so already.
> + However, once enabled, this flag cannot be disabled.
> diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> index da2a818..cf2dd95 100644
> --- a/block/blk-integrity.c
> +++ b/block/blk-integrity.c
> @@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> } else
> bi->name = bi_unsupported_name;
>
> + queue_require_stable_pages(disk->queue);
> +
> return 0;
> }
> EXPORT_SYMBOL(blk_integrity_register);
> @@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk)
> if (!disk || !disk->integrity)
> return;
>
> + queue_unrequire_stable_pages(disk->queue);
> +
> bi = disk->integrity;
>
> kobject_uevent(&bi->kobj, KOBJ_REMOVE);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 2a9a9ab..0554f5d 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -109,6 +109,7 @@ struct backing_dev_info {
> struct dentry *debug_dir;
> struct dentry *debug_stats;
> #endif
> + atomic_t stable_page_users;
> };
>
> int bdi_init(struct backing_dev_info *bdi);
> @@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
> int pdflush_proc_obsolete(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp, loff_t *ppos);
>
> +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi)
> +{
> + atomic_inc(&bdi->stable_page_users);
> +}
> +
> +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi)
> +{
> + atomic_dec(&bdi->stable_page_users);
> +}
> +
> +static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
> +{
> + return atomic_read(&bdi->stable_page_users) > 0;
> +}
> +
> static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> {
> return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 1756001..bf927c0 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -458,6 +458,16 @@ struct request_queue {
> (1 << QUEUE_FLAG_SAME_COMP) | \
> (1 << QUEUE_FLAG_ADD_RANDOM))
>
> +static inline void queue_require_stable_pages(struct request_queue *q)
> +{
> + bdi_require_stable_pages(&q->backing_dev_info);
> +}
> +
> +static inline void queue_unrequire_stable_pages(struct request_queue *q)
> +{
> + bdi_unrequire_stable_pages(&q->backing_dev_info);
> +}
> +
> static inline void queue_lockdep_assert_held(struct request_queue *q)
> {
> if (q->queue_lock)
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index d3ca2b3..dd9f5ed 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev,
> }
> BDI_SHOW(max_ratio, bdi->max_ratio)
>
> +static ssize_t stable_pages_required_store(struct device *dev,
> + struct device_attribute *attr,
> + const char *buf, size_t count)
> +{
> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> + unsigned int spw;
> + ssize_t ret;
> +
> + ret = kstrtouint(buf, 10, &spw);
> + if (ret < 0)
> + return ret;
> +
> + /*
> + * SPW could be enabled due to hw requirement, so don't
> + * let users disable it.
> + */
> + if (bdi_cap_stable_pages_required(bdi) && spw == 0)
> + return -EINVAL;
> +
> + if (spw != 0)
> + atomic_inc(&bdi->stable_page_users);
> +
> + return count;
> +}
> +
> +static ssize_t stable_pages_required_show(struct device *dev,
> + struct device_attribute *attr,
> + char *page)
> +{
> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +
> + return snprintf(page, PAGE_SIZE-1, "%d\n",
> + bdi_cap_stable_pages_required(bdi) ? 1 : 0);
> +}
> +
> #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
>
> static struct device_attribute bdi_dev_attrs[] = {
> __ATTR_RW(read_ahead_kb),
> __ATTR_RW(min_ratio),
> __ATTR_RW(max_ratio),
> + __ATTR_RW(stable_pages_required),
> __ATTR_NULL,
> };
>
> @@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi)
> bdi->write_bandwidth = INIT_BW;
> bdi->avg_write_bandwidth = INIT_BW;
>
> + atomic_set(&bdi->stable_page_users, 0);
> +
> err = fprop_local_init_percpu(&bdi->completions);
>
> if (err) {
>

2012-11-01 18:44:55

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
> Fix up the filesystems that provide their own ->page_mkwrite handlers to
> provide stable page writes if necessary.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/9p/vfs_file.c | 1 +
> fs/afs/write.c | 4 ++--
> fs/ceph/addr.c | 1 +
> fs/cifs/file.c | 1 +
> fs/ocfs2/mmap.c | 1 +
> fs/ubifs/file.c | 4 ++--
> 6 files changed, 8 insertions(+), 4 deletions(-)
>
>
> diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> index c2483e9..aa253f0 100644
> --- a/fs/9p/vfs_file.c
> +++ b/fs/9p/vfs_file.c
> @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> lock_page(page);
> if (page->mapping != inode->i_mapping)
> goto out_unlock;
> + wait_on_stable_page_write(page);
>

Good god thanks, yes please ;-)

> return VM_FAULT_LOCKED;
> out_unlock:
> diff --git a/fs/afs/write.c b/fs/afs/write.c
> index 9aa52d9..39eb2a4 100644
> --- a/fs/afs/write.c
> +++ b/fs/afs/write.c
> @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)

afs, is it not a network filesystem? which means that it has it's own emulated none-block-device
BDI, registered internally. So if you do need stable pages someone should call
bdi_require_stable_pages()

But again since it is a network filesystem I don't see how it is needed, and/or it might be
taken care of already.

> #ifdef CONFIG_AFS_FSCACHE
> fscache_wait_on_page_write(vnode->cache, page);
> #endif
> -
> + wait_on_stable_page_write(page);
> _leave(" = 0");
> - return 0;
> + return VM_FAULT_LOCKED;
> }
> diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c

CEPH for sure has it's own "emulated none-block-device BDI". This one is also
a pure networking filesystem.

And it already does what it needs to do with wait_on_writeback().

So i do not think you should touch CEPH

> index 6690269..e9734bf 100644
> --- a/fs/ceph/addr.c
> +++ b/fs/ceph/addr.c
> @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> set_page_dirty(page);
> up_read(&mdsc->snap_rwsem);
> ret = VM_FAULT_LOCKED;
> + wait_on_stable_page_write(page);
> } else {
> if (ret == -ENOMEM)
> ret = VM_FAULT_OOM;
> diff --git a/fs/cifs/file.c b/fs/cifs/file.c

Cifs also self-BDI network filesystem, but

> index edb25b4..a8770bf 100644
> --- a/fs/cifs/file.c
> +++ b/fs/cifs/file.c
> @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> struct page *page = vmf->page;
>
> lock_page(page);

It waits by locking the page, that's cifs naive way of waiting for writeback

> + wait_on_stable_page_write(page);

Instead it could do better and not override page_mkwrite at all, and all it needs
to do is call bdi_require_stable_pages() at it's own registered BDI

> return VM_FAULT_LOCKED;
> }
>
> diff --git a/fs/ocfs2/mmap.c b/fs/ocfs2/mmap.c
> index 47a87dd..a0027b1 100644
> --- a/fs/ocfs2/mmap.c
> +++ b/fs/ocfs2/mmap.c
> @@ -124,6 +124,7 @@ static int __ocfs2_page_mkwrite(struct file *file, struct buffer_head *di_bh,
> fsdata);
> BUG_ON(ret != len);
> ret = VM_FAULT_LOCKED;
> + wait_on_stable_page_write(page);
> out:
> return ret;
> }
> diff --git a/fs/ubifs/file.c b/fs/ubifs/file.c
> index 5bc7781..cb0d3aa 100644
> --- a/fs/ubifs/file.c
> +++ b/fs/ubifs/file.c
> @@ -1522,8 +1522,8 @@ static int ubifs_vm_page_mkwrite(struct vm_area_struct *vma,
> ubifs_release_dirty_inode_budget(c, ui);
> }
>
> - unlock_page(page);
> - return 0;
> + wait_on_stable_page_write(page);

ubifs has it's special ubi block device. So someone needs to call bdi_require_stable_pages()
for this to work.

I think that here too. The existing code, like cifs, calls page_lock, as a way of
waiting for writeback.

So this is certainly not finished.

> + return VM_FAULT_LOCKED;
>
> out_unlock:
> unlock_page(page);
>

Cheers
Boaz

2012-11-01 18:59:12

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 1/3] bdi: Track users that require stable page writes

On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote:
> On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
> > This creates a per-backing-device counter that tracks the number of users which
> > require pages to be held immutable during writeout. Eventually it will be used
> > to waive wait_for_page_writeback() if nobody requires stable pages.
> >
>
> There is two things I do not like:
> 1. Please remind me why we need the users counter?
> If the device needs it, it will always be needed. The below
> queue_unrequire_stable_pages call at blk_integrity_unregister
> only happens at device destruction time, no?
>
> It was also said that maybe individual filesystems would need
> stable pages where other FSs using the same BDI do not. But
> since the FS is at the driving seat in any case, it can just
> do wait_for_writeback() regardless and can care less about
> the bdi flag and the other FSs. Actually all those FSs already
> do this this, and do not need any help. So this reason is
> mute.

The counter exists so that a filesystem can forcibly enable stable page writes
even if the underlying device doesn't require it, because the generic fs/mm
waiting only happens if stable_pages_required=1. The idea here was to allow a
filesystem that needs stable page writes for its own purposes (i.e. data block
checksumming) to be able to enforce the requirement even if the disk doesn't
care (or doesn't exist).

But maybe there are no such filesystems?

<shrug> I don't really like the idea that you can dirty a page during writeout,
which means that the disk could write the before page, the after page, or some
weird mix of the two, and all you get is a promise that the after page will get
rewritten if the power doesn't go out. Not to mention that it could happen
again. :)

> 2. I hate the atomic_read for every mkwrite. I think we can do
> better, since as you noted it can never be turned off, only
> on, at init time. And because of 1. above it is not dynamic.
> I think I like your previous simple bool better.

I doubt the counter would change much; I could probably change it to something
less heavyweight if it's really a problem.

> But I do like a lot the new request_queue API you added here.
> Thanks

Good!

--D
>
> Boaz
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > Documentation/ABI/testing/sysfs-class-bdi | 7 +++++
> > block/blk-integrity.c | 4 +++
> > include/linux/backing-dev.h | 16 ++++++++++++
> > include/linux/blkdev.h | 10 ++++++++
> > mm/backing-dev.c | 38 +++++++++++++++++++++++++++++
> > 5 files changed, 75 insertions(+)
> >
> >
> > diff --git a/Documentation/ABI/testing/sysfs-class-bdi b/Documentation/ABI/testing/sysfs-class-bdi
> > index 5f50097..218a618 100644
> > --- a/Documentation/ABI/testing/sysfs-class-bdi
> > +++ b/Documentation/ABI/testing/sysfs-class-bdi
> > @@ -48,3 +48,10 @@ max_ratio (read-write)
> > most of the write-back cache. For example in case of an NFS
> > mount that is prone to get stuck, or a FUSE mount which cannot
> > be trusted to play fair.
> > +
> > +stable_pages_required (read-write)
> > +
> > + If set, the backing device requires that all pages comprising a write
> > + request must not be changed until writeout is complete. The system
> > + administrator can turn this on if the hardware does not do so already.
> > + However, once enabled, this flag cannot be disabled.
> > diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> > index da2a818..cf2dd95 100644
> > --- a/block/blk-integrity.c
> > +++ b/block/blk-integrity.c
> > @@ -420,6 +420,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> > } else
> > bi->name = bi_unsupported_name;
> >
> > + queue_require_stable_pages(disk->queue);
> > +
> > return 0;
> > }
> > EXPORT_SYMBOL(blk_integrity_register);
> > @@ -438,6 +440,8 @@ void blk_integrity_unregister(struct gendisk *disk)
> > if (!disk || !disk->integrity)
> > return;
> >
> > + queue_unrequire_stable_pages(disk->queue);
> > +
> > bi = disk->integrity;
> >
> > kobject_uevent(&bi->kobj, KOBJ_REMOVE);
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 2a9a9ab..0554f5d 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -109,6 +109,7 @@ struct backing_dev_info {
> > struct dentry *debug_dir;
> > struct dentry *debug_stats;
> > #endif
> > + atomic_t stable_page_users;
> > };
> >
> > int bdi_init(struct backing_dev_info *bdi);
> > @@ -307,6 +308,21 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
> > int pdflush_proc_obsolete(struct ctl_table *table, int write,
> > void __user *buffer, size_t *lenp, loff_t *ppos);
> >
> > +static inline void bdi_require_stable_pages(struct backing_dev_info *bdi)
> > +{
> > + atomic_inc(&bdi->stable_page_users);
> > +}
> > +
> > +static inline void bdi_unrequire_stable_pages(struct backing_dev_info *bdi)
> > +{
> > + atomic_dec(&bdi->stable_page_users);
> > +}
> > +
> > +static inline bool bdi_cap_stable_pages_required(struct backing_dev_info *bdi)
> > +{
> > + return atomic_read(&bdi->stable_page_users) > 0;
> > +}
> > +
> > static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> > {
> > return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 1756001..bf927c0 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -458,6 +458,16 @@ struct request_queue {
> > (1 << QUEUE_FLAG_SAME_COMP) | \
> > (1 << QUEUE_FLAG_ADD_RANDOM))
> >
> > +static inline void queue_require_stable_pages(struct request_queue *q)
> > +{
> > + bdi_require_stable_pages(&q->backing_dev_info);
> > +}
> > +
> > +static inline void queue_unrequire_stable_pages(struct request_queue *q)
> > +{
> > + bdi_unrequire_stable_pages(&q->backing_dev_info);
> > +}
> > +
> > static inline void queue_lockdep_assert_held(struct request_queue *q)
> > {
> > if (q->queue_lock)
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index d3ca2b3..dd9f5ed 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -221,12 +221,48 @@ static ssize_t max_ratio_store(struct device *dev,
> > }
> > BDI_SHOW(max_ratio, bdi->max_ratio)
> >
> > +static ssize_t stable_pages_required_store(struct device *dev,
> > + struct device_attribute *attr,
> > + const char *buf, size_t count)
> > +{
> > + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> > + unsigned int spw;
> > + ssize_t ret;
> > +
> > + ret = kstrtouint(buf, 10, &spw);
> > + if (ret < 0)
> > + return ret;
> > +
> > + /*
> > + * SPW could be enabled due to hw requirement, so don't
> > + * let users disable it.
> > + */
> > + if (bdi_cap_stable_pages_required(bdi) && spw == 0)
> > + return -EINVAL;
> > +
> > + if (spw != 0)
> > + atomic_inc(&bdi->stable_page_users);
> > +
> > + return count;
> > +}
> > +
> > +static ssize_t stable_pages_required_show(struct device *dev,
> > + struct device_attribute *attr,
> > + char *page)
> > +{
> > + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> > +
> > + return snprintf(page, PAGE_SIZE-1, "%d\n",
> > + bdi_cap_stable_pages_required(bdi) ? 1 : 0);
> > +}
> > +
> > #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
> >
> > static struct device_attribute bdi_dev_attrs[] = {
> > __ATTR_RW(read_ahead_kb),
> > __ATTR_RW(min_ratio),
> > __ATTR_RW(max_ratio),
> > + __ATTR_RW(stable_pages_required),
> > __ATTR_NULL,
> > };
> >
> > @@ -650,6 +686,8 @@ int bdi_init(struct backing_dev_info *bdi)
> > bdi->write_bandwidth = INIT_BW;
> > bdi->avg_write_bandwidth = INIT_BW;
> >
> > + atomic_set(&bdi->stable_page_users, 0);
> > +
> > err = fprop_local_init_percpu(&bdi->completions);
> >
> > if (err) {
> >
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-11-01 20:23:10

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

On Thu, 1 Nov 2012 11:43:26 -0700
Boaz Harrosh <[email protected]> wrote:

> On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
> > Fix up the filesystems that provide their own ->page_mkwrite handlers to
> > provide stable page writes if necessary.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> > fs/9p/vfs_file.c | 1 +
> > fs/afs/write.c | 4 ++--
> > fs/ceph/addr.c | 1 +
> > fs/cifs/file.c | 1 +
> > fs/ocfs2/mmap.c | 1 +
> > fs/ubifs/file.c | 4 ++--
> > 6 files changed, 8 insertions(+), 4 deletions(-)
> >
> >
> > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> > index c2483e9..aa253f0 100644
> > --- a/fs/9p/vfs_file.c
> > +++ b/fs/9p/vfs_file.c
> > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > lock_page(page);
> > if (page->mapping != inode->i_mapping)
> > goto out_unlock;
> > + wait_on_stable_page_write(page);
> >
>
> Good god thanks, yes please ;-)
>
> > return VM_FAULT_LOCKED;
> > out_unlock:
> > diff --git a/fs/afs/write.c b/fs/afs/write.c
> > index 9aa52d9..39eb2a4 100644
> > --- a/fs/afs/write.c
> > +++ b/fs/afs/write.c
> > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
>
> afs, is it not a network filesystem? which means that it has it's own emulated none-block-device
> BDI, registered internally. So if you do need stable pages someone should call
> bdi_require_stable_pages()
>
> But again since it is a network filesystem I don't see how it is needed, and/or it might be
> taken care of already.
>
> > #ifdef CONFIG_AFS_FSCACHE
> > fscache_wait_on_page_write(vnode->cache, page);
> > #endif
> > -
> > + wait_on_stable_page_write(page);
> > _leave(" = 0");
> > - return 0;
> > + return VM_FAULT_LOCKED;
> > }
> > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
>
> CEPH for sure has it's own "emulated none-block-device BDI". This one is also
> a pure networking filesystem.
>
> And it already does what it needs to do with wait_on_writeback().
>
> So i do not think you should touch CEPH
>
> > index 6690269..e9734bf 100644
> > --- a/fs/ceph/addr.c
> > +++ b/fs/ceph/addr.c
> > @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > set_page_dirty(page);
> > up_read(&mdsc->snap_rwsem);
> > ret = VM_FAULT_LOCKED;
> > + wait_on_stable_page_write(page);
> > } else {
> > if (ret == -ENOMEM)
> > ret = VM_FAULT_OOM;
> > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
>
> Cifs also self-BDI network filesystem, but
>
> > index edb25b4..a8770bf 100644
> > --- a/fs/cifs/file.c
> > +++ b/fs/cifs/file.c
> > @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > struct page *page = vmf->page;
> >
> > lock_page(page);
>
> It waits by locking the page, that's cifs naive way of waiting for writeback
>
> > + wait_on_stable_page_write(page);
>
> Instead it could do better and not override page_mkwrite at all, and all it needs
> to do is call bdi_require_stable_pages() at it's own registered BDI
>

Hmm...I don't know...

I've never been crazy about using the page lock for this, but in the
absence of a better way to guarantee stable pages, it was what I ended
up with at the time. cifs_writepages will hold the page lock until
kernel_sendmsg returns. At that point the TCP layer will have copied
off the page data so it's safe to release it.

With this change though, we're going to end up blocking until the
writeback flag clears, right? And I think that will happen when the
reply comes in? So, we'll end up blocking for much longer than is
really necessary in page_mkwrite with this change.

--
Jeff Layton <[email protected]>

2012-11-01 22:25:00

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

On 11/01/2012 01:22 PM, Jeff Layton wrote:
> Hmm...I don't know...
>
> I've never been crazy about using the page lock for this, but in the
> absence of a better way to guarantee stable pages, it was what I ended
> up with at the time. cifs_writepages will hold the page lock until
> kernel_sendmsg returns. At that point the TCP layer will have copied
> off the page data so it's safe to release it.
>
> With this change though, we're going to end up blocking until the
> writeback flag clears, right? And I think that will happen when the
> reply comes in? So, we'll end up blocking for much longer than is
> really necessary in page_mkwrite with this change.
>

Hmm OK, that is a very good point. In that case it is just a simple
nack on Darrick's hunk to cifs. cifs is fine and should not be touched

Boaz

2012-11-01 22:49:31

by Darrick J. Wong

[permalink] [raw]
Subject: Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

On Thu, Nov 01, 2012 at 04:22:54PM -0400, Jeff Layton wrote:
> On Thu, 1 Nov 2012 11:43:26 -0700
> Boaz Harrosh <[email protected]> wrote:
>
> > On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
> > > Fix up the filesystems that provide their own ->page_mkwrite handlers to
> > > provide stable page writes if necessary.
> > >
> > > Signed-off-by: Darrick J. Wong <[email protected]>
> > > ---
> > > fs/9p/vfs_file.c | 1 +
> > > fs/afs/write.c | 4 ++--
> > > fs/ceph/addr.c | 1 +
> > > fs/cifs/file.c | 1 +
> > > fs/ocfs2/mmap.c | 1 +
> > > fs/ubifs/file.c | 4 ++--
> > > 6 files changed, 8 insertions(+), 4 deletions(-)
> > >
> > >
> > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> > > index c2483e9..aa253f0 100644
> > > --- a/fs/9p/vfs_file.c
> > > +++ b/fs/9p/vfs_file.c
> > > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > lock_page(page);
> > > if (page->mapping != inode->i_mapping)
> > > goto out_unlock;
> > > + wait_on_stable_page_write(page);
> > >
> >
> > Good god thanks, yes please ;-)
> >
> > > return VM_FAULT_LOCKED;
> > > out_unlock:
> > > diff --git a/fs/afs/write.c b/fs/afs/write.c
> > > index 9aa52d9..39eb2a4 100644
> > > --- a/fs/afs/write.c
> > > +++ b/fs/afs/write.c
> > > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> >
> > afs, is it not a network filesystem? which means that it has it's own emulated none-block-device
> > BDI, registered internally. So if you do need stable pages someone should call
> > bdi_require_stable_pages()
> >
> > But again since it is a network filesystem I don't see how it is needed, and/or it might be
> > taken care of already.
> >
> > > #ifdef CONFIG_AFS_FSCACHE
> > > fscache_wait_on_page_write(vnode->cache, page);
> > > #endif
> > > -
> > > + wait_on_stable_page_write(page);
> > > _leave(" = 0");
> > > - return 0;
> > > + return VM_FAULT_LOCKED;
> > > }
> > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> >
> > CEPH for sure has it's own "emulated none-block-device BDI". This one is also
> > a pure networking filesystem.
> >
> > And it already does what it needs to do with wait_on_writeback().
> >
> > So i do not think you should touch CEPH
> >
> > > index 6690269..e9734bf 100644
> > > --- a/fs/ceph/addr.c
> > > +++ b/fs/ceph/addr.c
> > > @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > set_page_dirty(page);
> > > up_read(&mdsc->snap_rwsem);
> > > ret = VM_FAULT_LOCKED;
> > > + wait_on_stable_page_write(page);
> > > } else {
> > > if (ret == -ENOMEM)
> > > ret = VM_FAULT_OOM;
> > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> >
> > Cifs also self-BDI network filesystem, but
> >
> > > index edb25b4..a8770bf 100644
> > > --- a/fs/cifs/file.c
> > > +++ b/fs/cifs/file.c
> > > @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > struct page *page = vmf->page;
> > >
> > > lock_page(page);
> >
> > It waits by locking the page, that's cifs naive way of waiting for writeback
> >
> > > + wait_on_stable_page_write(page);
> >
> > Instead it could do better and not override page_mkwrite at all, and all it needs
> > to do is call bdi_require_stable_pages() at it's own registered BDI
> >
>
> Hmm...I don't know...
>
> I've never been crazy about using the page lock for this, but in the
> absence of a better way to guarantee stable pages, it was what I ended
> up with at the time. cifs_writepages will hold the page lock until
> kernel_sendmsg returns. At that point the TCP layer will have copied
> off the page data so it's safe to release it.
>
> With this change though, we're going to end up blocking until the
> writeback flag clears, right? And I think that will happen when the
> reply comes in? So, we'll end up blocking for much longer than is
> really necessary in page_mkwrite with this change.

That's a very good point to make-- network FSes can stop the stable-waiting
after the request is sent. Can I interest you in a new page flag (PG_stable)
that indicates when a page has to be held for stable write? Along with a
modification to wait_on_stable_page_write that uses the new PG_stable flag
instead of just writeback? Then, you can clear PG_stable right after the
sendmsg() and release the page for further activity without having to overload
the page lock.

I wrote a patch that does exactly that as part of my work to defer the
integrity checksumming until the last possible instant. However, I haven't
gotten that part to work yet, so I left the PG_stable patch out of this
submission. On the other hand, it sounds like you could use it.

--D
>
> --
> Jeff Layton <[email protected]>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-11-01 22:57:11

by Boaz Harrosh

[permalink] [raw]
Subject: Re: [PATCH 1/3] bdi: Track users that require stable page writes

On 11/01/2012 11:57 AM, Darrick J. Wong wrote:
> On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote:
>> On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
>>> This creates a per-backing-device counter that tracks the number of users which
>>> require pages to be held immutable during writeout. Eventually it will be used
>>> to waive wait_for_page_writeback() if nobody requires stable pages.
>>>
>>
>> There is two things I do not like:
>> 1. Please remind me why we need the users counter?
>> If the device needs it, it will always be needed. The below
>> queue_unrequire_stable_pages call at blk_integrity_unregister
>> only happens at device destruction time, no?
>>
>> It was also said that maybe individual filesystems would need
>> stable pages where other FSs using the same BDI do not. But
>> since the FS is at the driving seat in any case, it can just
>> do wait_for_writeback() regardless and can care less about
>> the bdi flag and the other FSs. Actually all those FSs already
>> do this this, and do not need any help. So this reason is
>> mute.
>
> The counter exists so that a filesystem can forcibly enable stable page writes
> even if the underlying device doesn't require it, because the generic fs/mm
> waiting only happens if stable_pages_required=1. The idea here was to allow a
> filesystem that needs stable page writes for its own purposes (i.e. data block
> checksumming) to be able to enforce the requirement even if the disk doesn't
> care (or doesn't exist).
>

But the filesystem does not need BDI flag to do that, It can just call
wait_on_page_writeback() directly and or any other waiting like cifs does,
and this way will not affect any other partitions of the same BDI. So this flag
is never needed by the FS, it is always to service the device.

> But maybe there are no such filesystems?
>

Exactly, all the FSs that do care, already take care of it.

> <shrug> I don't really like the idea that you can dirty a page during writeout,
> which means that the disk could write the before page, the after page, or some
> weird mix of the two, and all you get is a promise that the after page will get
> rewritten if the power doesn't go out. Not to mention that it could happen
> again. :)
>

I guess that's life. But what is the difference between a modification of a page
that comes 1 nanosecond before the end_write_back, and another one that came
1 nano after end_write_back. To the disk it looks like the same load pattern.

I do have in mind a way that we can minimize these redirty-while-writeback by 90% but
I don't care enough (And am not paid) to fix it, so the contention is currently worse
then what it could be.

>> 2. I hate the atomic_read for every mkwrite. I think we can do
>> better, since as you noted it can never be turned off, only
>> on, at init time. And because of 1. above it is not dynamic.
>> I think I like your previous simple bool better.
>
> I doubt the counter would change much; I could probably change it to something
> less heavyweight if it's really a problem.
>

I hope you are convinced that a counter is not needed, and a simple bool like
before is enough

Thanks
Boaz

2012-11-01 23:15:40

by Jan Kara

[permalink] [raw]
Subject: Re: [PATCH 1/3] bdi: Track users that require stable page writes

On Thu 01-11-12 15:56:34, Boaz Harrosh wrote:
> On 11/01/2012 11:57 AM, Darrick J. Wong wrote:
> > On Thu, Nov 01, 2012 at 11:21:22AM -0700, Boaz Harrosh wrote:
> >> On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
> >>> This creates a per-backing-device counter that tracks the number of users which
> >>> require pages to be held immutable during writeout. Eventually it will be used
> >>> to waive wait_for_page_writeback() if nobody requires stable pages.
> >>>
> >>
> >> There is two things I do not like:
> >> 1. Please remind me why we need the users counter?
> >> If the device needs it, it will always be needed. The below
> >> queue_unrequire_stable_pages call at blk_integrity_unregister
> >> only happens at device destruction time, no?
> >>
> >> It was also said that maybe individual filesystems would need
> >> stable pages where other FSs using the same BDI do not. But
> >> since the FS is at the driving seat in any case, it can just
> >> do wait_for_writeback() regardless and can care less about
> >> the bdi flag and the other FSs. Actually all those FSs already
> >> do this this, and do not need any help. So this reason is
> >> mute.
> >
> > The counter exists so that a filesystem can forcibly enable stable page writes
> > even if the underlying device doesn't require it, because the generic fs/mm
> > waiting only happens if stable_pages_required=1. The idea here was to allow a
> > filesystem that needs stable page writes for its own purposes (i.e. data block
> > checksumming) to be able to enforce the requirement even if the disk doesn't
> > care (or doesn't exist).
> >
>
> But the filesystem does not need BDI flag to do that, It can just call
> wait_on_page_writeback() directly and or any other waiting like cifs does,
> and this way will not affect any other partitions of the same BDI. So this flag
> is never needed by the FS, it is always to service the device.
But if they use some generic VFS functions, these VFS functions need to
know whether to wait or not - e.g. grab_cache_page_write_begin() or
block_page_mkwrite().

> > But maybe there are no such filesystems?
> >
>
> Exactly, all the FSs that do care, already take care of it.
I'm not exactly sure whether FSs that do care didn't start to use generic
functions which now always call wait_on_page_writeback() for quite a few
releases...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-02 00:36:21

by Jeffrey Layton

[permalink] [raw]
Subject: Re: [PATCH 3/3] fs: Fix remaining filesystems to wait for stable page writeback

On Thu, 1 Nov 2012 15:47:30 -0700
"Darrick J. Wong" <[email protected]> wrote:

> On Thu, Nov 01, 2012 at 04:22:54PM -0400, Jeff Layton wrote:
> > On Thu, 1 Nov 2012 11:43:26 -0700
> > Boaz Harrosh <[email protected]> wrote:
> >
> > > On 11/01/2012 12:58 AM, Darrick J. Wong wrote:
> > > > Fix up the filesystems that provide their own ->page_mkwrite handlers to
> > > > provide stable page writes if necessary.
> > > >
> > > > Signed-off-by: Darrick J. Wong <[email protected]>
> > > > ---
> > > > fs/9p/vfs_file.c | 1 +
> > > > fs/afs/write.c | 4 ++--
> > > > fs/ceph/addr.c | 1 +
> > > > fs/cifs/file.c | 1 +
> > > > fs/ocfs2/mmap.c | 1 +
> > > > fs/ubifs/file.c | 4 ++--
> > > > 6 files changed, 8 insertions(+), 4 deletions(-)
> > > >
> > > >
> > > > diff --git a/fs/9p/vfs_file.c b/fs/9p/vfs_file.c
> > > > index c2483e9..aa253f0 100644
> > > > --- a/fs/9p/vfs_file.c
> > > > +++ b/fs/9p/vfs_file.c
> > > > @@ -620,6 +620,7 @@ v9fs_vm_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > lock_page(page);
> > > > if (page->mapping != inode->i_mapping)
> > > > goto out_unlock;
> > > > + wait_on_stable_page_write(page);
> > > >
> > >
> > > Good god thanks, yes please ;-)
> > >
> > > > return VM_FAULT_LOCKED;
> > > > out_unlock:
> > > > diff --git a/fs/afs/write.c b/fs/afs/write.c
> > > > index 9aa52d9..39eb2a4 100644
> > > > --- a/fs/afs/write.c
> > > > +++ b/fs/afs/write.c
> > > > @@ -758,7 +758,7 @@ int afs_page_mkwrite(struct vm_area_struct *vma, struct page *page)
> > >
> > > afs, is it not a network filesystem? which means that it has it's own emulated none-block-device
> > > BDI, registered internally. So if you do need stable pages someone should call
> > > bdi_require_stable_pages()
> > >
> > > But again since it is a network filesystem I don't see how it is needed, and/or it might be
> > > taken care of already.
> > >
> > > > #ifdef CONFIG_AFS_FSCACHE
> > > > fscache_wait_on_page_write(vnode->cache, page);
> > > > #endif
> > > > -
> > > > + wait_on_stable_page_write(page);
> > > > _leave(" = 0");
> > > > - return 0;
> > > > + return VM_FAULT_LOCKED;
> > > > }
> > > > diff --git a/fs/ceph/addr.c b/fs/ceph/addr.c
> > >
> > > CEPH for sure has it's own "emulated none-block-device BDI". This one is also
> > > a pure networking filesystem.
> > >
> > > And it already does what it needs to do with wait_on_writeback().
> > >
> > > So i do not think you should touch CEPH
> > >
> > > > index 6690269..e9734bf 100644
> > > > --- a/fs/ceph/addr.c
> > > > +++ b/fs/ceph/addr.c
> > > > @@ -1208,6 +1208,7 @@ static int ceph_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > set_page_dirty(page);
> > > > up_read(&mdsc->snap_rwsem);
> > > > ret = VM_FAULT_LOCKED;
> > > > + wait_on_stable_page_write(page);
> > > > } else {
> > > > if (ret == -ENOMEM)
> > > > ret = VM_FAULT_OOM;
> > > > diff --git a/fs/cifs/file.c b/fs/cifs/file.c
> > >
> > > Cifs also self-BDI network filesystem, but
> > >
> > > > index edb25b4..a8770bf 100644
> > > > --- a/fs/cifs/file.c
> > > > +++ b/fs/cifs/file.c
> > > > @@ -2997,6 +2997,7 @@ cifs_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > struct page *page = vmf->page;
> > > >
> > > > lock_page(page);
> > >
> > > It waits by locking the page, that's cifs naive way of waiting for writeback
> > >
> > > > + wait_on_stable_page_write(page);
> > >
> > > Instead it could do better and not override page_mkwrite at all, and all it needs
> > > to do is call bdi_require_stable_pages() at it's own registered BDI
> > >
> >
> > Hmm...I don't know...
> >
> > I've never been crazy about using the page lock for this, but in the
> > absence of a better way to guarantee stable pages, it was what I ended
> > up with at the time. cifs_writepages will hold the page lock until
> > kernel_sendmsg returns. At that point the TCP layer will have copied
> > off the page data so it's safe to release it.
> >
> > With this change though, we're going to end up blocking until the
> > writeback flag clears, right? And I think that will happen when the
> > reply comes in? So, we'll end up blocking for much longer than is
> > really necessary in page_mkwrite with this change.
>
> That's a very good point to make-- network FSes can stop the stable-waiting
> after the request is sent.

Well, it depends...

If the fs in question uses kernel_sendpage (or the equivalent) then the
page will be inlined into the fraglist of the skb. If you use that,
then you can't just drop it after the send. It's also possible that the
fs doesn't care about page stability at all (like if signatures aren't
being used).

So I think you probably need to account for several different
possibilities of "page stability lifetimes" here...

> Can I interest you in a new page flag (PG_stable)
> that indicates when a page has to be held for stable write? Along with a
> modification to wait_on_stable_page_write that uses the new PG_stable flag
> instead of just writeback? Then, you can clear PG_stable right after the
> sendmsg() and release the page for further activity without having to overload
> the page lock.
>
> I wrote a patch that does exactly that as part of my work to defer the
> integrity checksumming until the last possible instant. However, I haven't
> gotten that part to work yet, so I left the PG_stable patch out of this
> submission. On the other hand, it sounds like you could use it.
>

That sounds much more suitable for CIFS and possibly for others too.

--
Jeff Layton <[email protected]>