LinuxLists.cc - semi-stable page writes

[permalink] [raw]

Subject: [RFC PATCH 2/2] mm: Gate stable page writes on the bdi flag

Create a helper function to decide if a particular address space is backed by a
device that requires stable page writes. For each wait_on_page_writeback call
that was inserted in the original stable page write patchset, change it to wait
only if the helper says it's necessary. This should eliminate unnecessary
waiting for devices that don't require page contents to be immutable during
writeout.

Signed-off-by: Darrick J. Wong <[email protected]>
---

fs/buffer.c | 3 ++-
fs/ext4/inode.c | 4 ++--
include/linux/fs.h | 2 ++
mm/filemap.c | 3 ++-
mm/page-writeback.c | 7 +++++++
5 files changed, 15 insertions(+), 4 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index b5f0442..f508ece 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2334,7 +2334,8 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
if (unlikely(ret < 0))
goto out_unlock;
set_page_dirty(page);
- wait_on_page_writeback(page);
+ if (page->mapping && mapping_needs_stable_writes(page->mapping))
+ wait_on_page_writeback(page);
return 0;
out_unlock:
unlock_page(page);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index b3c243b..82908b8 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4813,8 +4813,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
- /* Wait so that we don't change page under IO */
- wait_on_page_writeback(page);
+ if (mapping_needs_stable_writes(mapping))
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
goto out;
}
diff --git a/include/linux/fs.h b/include/linux/fs.h
index b33cfc9..d99db0e 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -466,6 +466,8 @@ struct block_device {
struct percpu_rw_semaphore bd_block_size_semaphore;
};

+int mapping_needs_stable_writes(struct address_space *mapping);
+
/*
* Radix-tree tags, for tagging dirty and writeback pages within the pagecache
* radix trees
diff --git a/mm/filemap.c b/mm/filemap.c
index 83efee7..aedc6bd 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2274,7 +2274,8 @@ repeat:
return NULL;
}
found:
- wait_on_page_writeback(page);
+ if (mapping_needs_stable_writes(mapping))
+ wait_on_page_writeback(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 830893b..d0f042c 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2275,3 +2275,10 @@ int mapping_tagged(struct address_space *mapping, int tag)
return radix_tree_tagged(&mapping->page_tree, tag);
}
EXPORT_SYMBOL(mapping_tagged);
+
+int mapping_needs_stable_writes(struct address_space *mapping)
+{
+ return mapping->backing_dev_info &&
+ bdi_cap_stable_writes(mapping->backing_dev_info);
+}
+EXPORT_SYMBOL_GPL(mapping_needs_stable_writes);

2012-10-27 01:35:24

[permalink] [raw]

Subject: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
stable page writes. It also plumbs in a sysfs attribute so that admins can
check the device status.

Signed-off-by: Darrick J. Wong <[email protected]>
---

block/blk-integrity.c | 8 ++++++++
include/linux/backing-dev.h | 6 ++++++
mm/backing-dev.c | 11 +++++++++++
3 files changed, 25 insertions(+)

diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index da2a818..d05d7b3 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -384,6 +384,7 @@ EXPORT_SYMBOL(blk_integrity_is_initialized);
int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
{
struct blk_integrity *bi;
+ struct backing_dev_info *bdi;

BUG_ON(disk == NULL);

@@ -420,6 +421,9 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
} else
bi->name = bi_unsupported_name;

+ bdi = &disk->queue->backing_dev_info;
+ bdi->capabilities |= BDI_CAP_STABLE_WRITES;
+
return 0;
}
EXPORT_SYMBOL(blk_integrity_register);
@@ -434,10 +438,14 @@ EXPORT_SYMBOL(blk_integrity_register);
void blk_integrity_unregister(struct gendisk *disk)
{
struct blk_integrity *bi;
+ struct backing_dev_info *bdi;

if (!disk || !disk->integrity)
return;

+ bdi = &disk->queue->backing_dev_info;
+ bdi->capabilities &= ~BDI_CAP_STABLE_WRITES;
+
bi = disk->integrity;

kobject_uevent(&bi->kobj, KOBJ_REMOVE);
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index 2a9a9ab..b82a8bb 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -253,6 +253,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
#define BDI_CAP_EXEC_MAP 0x00000040
#define BDI_CAP_NO_ACCT_WB 0x00000080
#define BDI_CAP_SWAP_BACKED 0x00000100
+#define BDI_CAP_STABLE_WRITES 0x00000200

#define BDI_CAP_VMFLAGS \
(BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
@@ -307,6 +308,11 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
int pdflush_proc_obsolete(struct ctl_table *table, int write,
void __user *buffer, size_t *lenp, loff_t *ppos);

+static inline bool bdi_cap_stable_writes(struct backing_dev_info *bdi)
+{
+ return bdi->capabilities & BDI_CAP_STABLE_WRITES;
+}
+
static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
{
return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
diff --git a/mm/backing-dev.c b/mm/backing-dev.c
index d3ca2b3..a2f4c08 100644
--- a/mm/backing-dev.c
+++ b/mm/backing-dev.c
@@ -221,12 +221,23 @@ static ssize_t max_ratio_store(struct device *dev,
}
BDI_SHOW(max_ratio, bdi->max_ratio)

+static ssize_t stable_page_writes_show(struct device *dev,
+ struct device_attribute *attr, char *page)
+{
+ struct backing_dev_info *bdi = dev_get_drvdata(dev);
+
+ return snprintf(page, PAGE_SIZE-1, "%d\n",
+ bdi->capabilities & BDI_CAP_STABLE_WRITES ? 1 : 0);
+}
+
#define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
+#define __ATTR_RO(attr) __ATTR(attr, 0444, attr##_show, NULL)

static struct device_attribute bdi_dev_attrs[] = {
__ATTR_RW(read_ahead_kb),
__ATTR_RW(min_ratio),
__ATTR_RW(max_ratio),
+ __ATTR_RO(stable_page_writes),
__ATTR_NULL,
};

2012-10-29 18:14:00

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Fri 26-10-12 18:35:24, Darrick J. Wong wrote:
> This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> stable page writes. It also plumbs in a sysfs attribute so that admins can
> check the device status.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
I guess Jens Axboe <[email protected]> would be the best target for this
patch (so that he can merge it). The patch looks OK to me. You can add:
Reviewed-by: Jan Kara <[email protected]>

Honza

> ---
>
> block/blk-integrity.c | 8 ++++++++
> include/linux/backing-dev.h | 6 ++++++
> mm/backing-dev.c | 11 +++++++++++
> 3 files changed, 25 insertions(+)
>
> diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> index da2a818..d05d7b3 100644
> --- a/block/blk-integrity.c
> +++ b/block/blk-integrity.c
> @@ -384,6 +384,7 @@ EXPORT_SYMBOL(blk_integrity_is_initialized);
> int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> {
> struct blk_integrity *bi;
> + struct backing_dev_info *bdi;
>
> BUG_ON(disk == NULL);
>
> @@ -420,6 +421,9 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> } else
> bi->name = bi_unsupported_name;
>
> + bdi = &disk->queue->backing_dev_info;
> + bdi->capabilities |= BDI_CAP_STABLE_WRITES;
> +
> return 0;
> }
> EXPORT_SYMBOL(blk_integrity_register);
> @@ -434,10 +438,14 @@ EXPORT_SYMBOL(blk_integrity_register);
> void blk_integrity_unregister(struct gendisk *disk)
> {
> struct blk_integrity *bi;
> + struct backing_dev_info *bdi;
>
> if (!disk || !disk->integrity)
> return;
>
> + bdi = &disk->queue->backing_dev_info;
> + bdi->capabilities &= ~BDI_CAP_STABLE_WRITES;
> +
> bi = disk->integrity;
>
> kobject_uevent(&bi->kobj, KOBJ_REMOVE);
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index 2a9a9ab..b82a8bb 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -253,6 +253,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
> #define BDI_CAP_EXEC_MAP 0x00000040
> #define BDI_CAP_NO_ACCT_WB 0x00000080
> #define BDI_CAP_SWAP_BACKED 0x00000100
> +#define BDI_CAP_STABLE_WRITES 0x00000200
>
> #define BDI_CAP_VMFLAGS \
> (BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
> @@ -307,6 +308,11 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
> int pdflush_proc_obsolete(struct ctl_table *table, int write,
> void __user *buffer, size_t *lenp, loff_t *ppos);
>
> +static inline bool bdi_cap_stable_writes(struct backing_dev_info *bdi)
> +{
> + return bdi->capabilities & BDI_CAP_STABLE_WRITES;
> +}
> +
> static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> {
> return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
> diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> index d3ca2b3..a2f4c08 100644
> --- a/mm/backing-dev.c
> +++ b/mm/backing-dev.c
> @@ -221,12 +221,23 @@ static ssize_t max_ratio_store(struct device *dev,
> }
> BDI_SHOW(max_ratio, bdi->max_ratio)
>
> +static ssize_t stable_page_writes_show(struct device *dev,
> + struct device_attribute *attr, char *page)
> +{
> + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> +
> + return snprintf(page, PAGE_SIZE-1, "%d\n",
> + bdi->capabilities & BDI_CAP_STABLE_WRITES ? 1 : 0);
> +}
> +
> #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
> +#define __ATTR_RO(attr) __ATTR(attr, 0444, attr##_show, NULL)
>
> static struct device_attribute bdi_dev_attrs[] = {
> __ATTR_RW(read_ahead_kb),
> __ATTR_RW(min_ratio),
> __ATTR_RW(max_ratio),
> + __ATTR_RO(stable_page_writes),
> __ATTR_NULL,
> };
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-10-29 18:28:59

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] mm: Gate stable page writes on the bdi flag

On Fri 26-10-12 18:36:08, Darrick J. Wong wrote:
> Create a helper function to decide if a particular address space is backed by a
> device that requires stable page writes. For each wait_on_page_writeback call
> that was inserted in the original stable page write patchset, change it to wait
> only if the helper says it's necessary. This should eliminate unnecessary
> waiting for devices that don't require page contents to be immutable during
> writeout.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> fs/buffer.c | 3 ++-
> fs/ext4/inode.c | 4 ++--
> include/linux/fs.h | 2 ++
> mm/filemap.c | 3 ++-
> mm/page-writeback.c | 7 +++++++
> 5 files changed, 15 insertions(+), 4 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index b5f0442..f508ece 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2334,7 +2334,8 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> if (unlikely(ret < 0))
> goto out_unlock;
> set_page_dirty(page);
> - wait_on_page_writeback(page);
> + if (page->mapping && mapping_needs_stable_writes(page->mapping))
> + wait_on_page_writeback(page);
The page is locked and we checked whether it belongs to the right mapping
just after we locked it down. So you can safely skip the page->mapping
test.

> return 0;
> out_unlock:
> unlock_page(page);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index b3c243b..82908b8 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4813,8 +4813,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (page_has_buffers(page)) {
> if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> ext4_bh_unmapped)) {
> - /* Wait so that we don't change page under IO */
> - wait_on_page_writeback(page);
> + if (mapping_needs_stable_writes(mapping))
> + wait_on_page_writeback(page);
> ret = VM_FAULT_LOCKED;
> goto out;
> }
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index b33cfc9..d99db0e 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -466,6 +466,8 @@ struct block_device {
> struct percpu_rw_semaphore bd_block_size_semaphore;
> };
>
> +int mapping_needs_stable_writes(struct address_space *mapping);
> +
> /*
> * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
> * radix trees
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83efee7..aedc6bd 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2274,7 +2274,8 @@ repeat:
> return NULL;
> }
> found:
> - wait_on_page_writeback(page);
> + if (mapping_needs_stable_writes(mapping))
> + wait_on_page_writeback(page);
> return page;
> }
> EXPORT_SYMBOL(grab_cache_page_write_begin);
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 830893b..d0f042c 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2275,3 +2275,10 @@ int mapping_tagged(struct address_space *mapping, int tag)
> return radix_tree_tagged(&mapping->page_tree, tag);
> }
> EXPORT_SYMBOL(mapping_tagged);
> +
> +int mapping_needs_stable_writes(struct address_space *mapping)
> +{
> + return mapping->backing_dev_info &&
> + bdi_cap_stable_writes(mapping->backing_dev_info);
> +}
> +EXPORT_SYMBOL_GPL(mapping_needs_stable_writes);
Traditional name for these shortcuts is "mapping_cap_..." and put them in
linux/backing-dev.h. And you don't have to check
mapping->backing_dev_info.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-10-29 18:30:51

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Mon 29-10-12 19:13:58, Jan Kara wrote:
> On Fri 26-10-12 18:35:24, Darrick J. Wong wrote:
> > This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> > stable page writes. It also plumbs in a sysfs attribute so that admins can
> > check the device status.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> I guess Jens Axboe <[email protected]> would be the best target for this
> patch (so that he can merge it). The patch looks OK to me. You can add:
> Reviewed-by: Jan Kara <[email protected]>
One more thing popped up in my mind: What about NFS, Ceph or md RAID5?
These could (at least theoretically) care about stable writes as well. I'm
not sure if they really started to use them but it would be good to at
least let them know.

Honza

> > ---
> >
> > block/blk-integrity.c | 8 ++++++++
> > include/linux/backing-dev.h | 6 ++++++
> > mm/backing-dev.c | 11 +++++++++++
> > 3 files changed, 25 insertions(+)
> >
> > diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> > index da2a818..d05d7b3 100644
> > --- a/block/blk-integrity.c
> > +++ b/block/blk-integrity.c
> > @@ -384,6 +384,7 @@ EXPORT_SYMBOL(blk_integrity_is_initialized);
> > int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> > {
> > struct blk_integrity *bi;
> > + struct backing_dev_info *bdi;
> >
> > BUG_ON(disk == NULL);
> >
> > @@ -420,6 +421,9 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> > } else
> > bi->name = bi_unsupported_name;
> >
> > + bdi = &disk->queue->backing_dev_info;
> > + bdi->capabilities |= BDI_CAP_STABLE_WRITES;
> > +
> > return 0;
> > }
> > EXPORT_SYMBOL(blk_integrity_register);
> > @@ -434,10 +438,14 @@ EXPORT_SYMBOL(blk_integrity_register);
> > void blk_integrity_unregister(struct gendisk *disk)
> > {
> > struct blk_integrity *bi;
> > + struct backing_dev_info *bdi;
> >
> > if (!disk || !disk->integrity)
> > return;
> >
> > + bdi = &disk->queue->backing_dev_info;
> > + bdi->capabilities &= ~BDI_CAP_STABLE_WRITES;
> > +
> > bi = disk->integrity;
> >
> > kobject_uevent(&bi->kobj, KOBJ_REMOVE);
> > diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> > index 2a9a9ab..b82a8bb 100644
> > --- a/include/linux/backing-dev.h
> > +++ b/include/linux/backing-dev.h
> > @@ -253,6 +253,7 @@ int bdi_set_max_ratio(struct backing_dev_info *bdi, unsigned int max_ratio);
> > #define BDI_CAP_EXEC_MAP 0x00000040
> > #define BDI_CAP_NO_ACCT_WB 0x00000080
> > #define BDI_CAP_SWAP_BACKED 0x00000100
> > +#define BDI_CAP_STABLE_WRITES 0x00000200
> >
> > #define BDI_CAP_VMFLAGS \
> > (BDI_CAP_READ_MAP | BDI_CAP_WRITE_MAP | BDI_CAP_EXEC_MAP)
> > @@ -307,6 +308,11 @@ long wait_iff_congested(struct zone *zone, int sync, long timeout);
> > int pdflush_proc_obsolete(struct ctl_table *table, int write,
> > void __user *buffer, size_t *lenp, loff_t *ppos);
> >
> > +static inline bool bdi_cap_stable_writes(struct backing_dev_info *bdi)
> > +{
> > + return bdi->capabilities & BDI_CAP_STABLE_WRITES;
> > +}
> > +
> > static inline bool bdi_cap_writeback_dirty(struct backing_dev_info *bdi)
> > {
> > return !(bdi->capabilities & BDI_CAP_NO_WRITEBACK);
> > diff --git a/mm/backing-dev.c b/mm/backing-dev.c
> > index d3ca2b3..a2f4c08 100644
> > --- a/mm/backing-dev.c
> > +++ b/mm/backing-dev.c
> > @@ -221,12 +221,23 @@ static ssize_t max_ratio_store(struct device *dev,
> > }
> > BDI_SHOW(max_ratio, bdi->max_ratio)
> >
> > +static ssize_t stable_page_writes_show(struct device *dev,
> > + struct device_attribute *attr, char *page)
> > +{
> > + struct backing_dev_info *bdi = dev_get_drvdata(dev);
> > +
> > + return snprintf(page, PAGE_SIZE-1, "%d\n",
> > + bdi->capabilities & BDI_CAP_STABLE_WRITES ? 1 : 0);
> > +}
> > +
> > #define __ATTR_RW(attr) __ATTR(attr, 0644, attr##_show, attr##_store)
> > +#define __ATTR_RO(attr) __ATTR(attr, 0444, attr##_show, NULL)
> >
> > static struct device_attribute bdi_dev_attrs[] = {
> > __ATTR_RW(read_ahead_kb),
> > __ATTR_RW(min_ratio),
> > __ATTR_RW(max_ratio),
> > + __ATTR_RO(stable_page_writes),
> > __ATTR_NULL,
> > };
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-10-29 22:01:22

by Dave Chinner

[permalink] [raw]

Subject: Re: semi-stable page writes

On Fri, Oct 26, 2012 at 03:19:09AM -0700, Darrick J. Wong wrote:
> Hi everyone,
>
> Are people still annoyed about writes taking unexpectedly long amounts of tme
> due to the stable page write patchset? I'm guessing yes...

I haven't heard anyone except th elunatic fringe complain
recently...

> I'm close to posting a patchset that (a) gates the wait_on_page_writeback calls
> on a flag that you can set in the bdi to indicate that you need stable writes
> (which blk_integrity_register will set);

I'd prefer stable pages by default (e.g. btrfs needs it for sane
data crc calculations), with an option to turn it off.

> (b) (ab)uses a page flag bit (PG_slab)
> to indicate that a page is actually being sent out to disk hardware; and (c)

I don't think you can do that. You can send slab allocated memory to
disk (e.g. kmalloc()d memory) and XFS definitely does that for
sub-page sized metadata. I'm pretty sure that means the PG_slab
flag is not available for (ab)use in the IO path....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-29 23:48:20

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Mon, 29 Oct 2012 19:30:51 +0100 Jan Kara <[email protected]> wrote:

> On Mon 29-10-12 19:13:58, Jan Kara wrote:
> > On Fri 26-10-12 18:35:24, Darrick J. Wong wrote:
> > > This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> > > stable page writes. It also plumbs in a sysfs attribute so that admins can
> > > check the device status.
> > >
> > > Signed-off-by: Darrick J. Wong <[email protected]>
> > I guess Jens Axboe <[email protected]> would be the best target for this
> > patch (so that he can merge it). The patch looks OK to me. You can add:
> > Reviewed-by: Jan Kara <[email protected]>
> One more thing popped up in my mind: What about NFS, Ceph or md RAID5?
> These could (at least theoretically) care about stable writes as well. I'm
> not sure if they really started to use them but it would be good to at
> least let them know.
>

What exactly are the semantics of BDI_CAP_STABLE_WRITES ?

If I set it for md/RAID5, do I get a cast-iron guarantee that no byte in any
page submitted for write will ever change until after I call bio_endio()?

If so, is this true for all filesystems? - I would expect a bigger patch would
be needed for that.

If not, what exactly are the semantics?

Thanks,
NeilBrown

Attachments:

signature.asc (828.00 B)

2012-10-30 00:10:08

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Tue 30-10-12 10:48:37, NeilBrown wrote:
> On Mon, 29 Oct 2012 19:30:51 +0100 Jan Kara <[email protected]> wrote:
>
> > On Mon 29-10-12 19:13:58, Jan Kara wrote:
> > > On Fri 26-10-12 18:35:24, Darrick J. Wong wrote:
> > > > This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> > > > stable page writes. It also plumbs in a sysfs attribute so that admins can
> > > > check the device status.
> > > >
> > > > Signed-off-by: Darrick J. Wong <[email protected]>
> > > I guess Jens Axboe <[email protected]> would be the best target for this
> > > patch (so that he can merge it). The patch looks OK to me. You can add:
> > > Reviewed-by: Jan Kara <[email protected]>
> > One more thing popped up in my mind: What about NFS, Ceph or md RAID5?
> > These could (at least theoretically) care about stable writes as well. I'm
> > not sure if they really started to use them but it would be good to at
> > least let them know.
> >
>
> What exactly are the semantics of BDI_CAP_STABLE_WRITES ?
>
> If I set it for md/RAID5, do I get a cast-iron guarantee that no byte in any
> page submitted for write will ever change until after I call bio_endio()?
Yes.

> If so, is this true for all filesystems? - I would expect a bigger patch would
> be needed for that.
Actually the code is in kernel for quite some time already. The problem
is it is always enabled causing unnecessary performance issues for some
workloads. So these patches try to be more selective in when the code gets
enabled.

Regarding "all filesystems" question: If we update filemap_page_mkwrite()
to call wait_on_page_writeback() then it should be for all filesystems.

Honza

2012-10-30 00:34:28

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Tue, 30 Oct 2012 01:10:08 +0100 Jan Kara <[email protected]> wrote:

> On Tue 30-10-12 10:48:37, NeilBrown wrote:
> > On Mon, 29 Oct 2012 19:30:51 +0100 Jan Kara <[email protected]> wrote:
> >
> > > On Mon 29-10-12 19:13:58, Jan Kara wrote:
> > > > On Fri 26-10-12 18:35:24, Darrick J. Wong wrote:
> > > > > This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> > > > > stable page writes. It also plumbs in a sysfs attribute so that admins can
> > > > > check the device status.
> > > > >
> > > > > Signed-off-by: Darrick J. Wong <[email protected]>
> > > > I guess Jens Axboe <[email protected]> would be the best target for this
> > > > patch (so that he can merge it). The patch looks OK to me. You can add:
> > > > Reviewed-by: Jan Kara <[email protected]>
> > > One more thing popped up in my mind: What about NFS, Ceph or md RAID5?
> > > These could (at least theoretically) care about stable writes as well. I'm
> > > not sure if they really started to use them but it would be good to at
> > > least let them know.
> > >
> >
> > What exactly are the semantics of BDI_CAP_STABLE_WRITES ?
> >
> > If I set it for md/RAID5, do I get a cast-iron guarantee that no byte in any
> > page submitted for write will ever change until after I call bio_endio()?
> Yes.
>
> > If so, is this true for all filesystems? - I would expect a bigger patch would
> > be needed for that.
> Actually the code is in kernel for quite some time already. The problem
> is it is always enabled causing unnecessary performance issues for some
> workloads. So these patches try to be more selective in when the code gets
> enabled.
>
> Regarding "all filesystems" question: If we update filemap_page_mkwrite()
> to call wait_on_page_writeback() then it should be for all filesystems.

Cool. I didn't realise it had progressed that far.

I guess it is time to look at the possibility of removing the
'copy-into-cache' step for full-page, well-aligned bi_iovecs.

I assume this applies to swap-out as well ?? It has been a minor source of
frustration that when you swap-out to RAID1, you can occasionally get
different data on the two devices because memory changed between the two DMA
events.

NeilBrown

Attachments:

signature.asc (828.00 B)

2012-10-30 01:00:27

by Theodore Ts'o

[permalink] [raw]

Subject: Re: semi-stable page writes

On Tue, Oct 30, 2012 at 09:01:22AM +1100, Dave Chinner wrote:
> On Fri, Oct 26, 2012 at 03:19:09AM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> >
> > Are people still annoyed about writes taking unexpectedly long amounts of tme
> > due to the stable page write patchset? I'm guessing yes...
>
> I haven't heard anyone except th elunatic fringe complain
> recently...

We are currently carrying a patch in the Google kernel which
unconditionally disables stable page writes specifically because it
introduced significant latencies that were unacceptable for some of
our (internal) customers of said production kernel.

I'll leave it to others to decide whether the Google production kernel
is part of the lunatic fringe or not. :-)

I would certainly welcome some option which allows the stable page
writes to be selectively enabled or disabled. I think it would be
better to only take the performance hit if the underyling hardware
requires it (i.e., for iSCSI, or for DIF/DIX) or some other part of
the storage stack (whether it be the file system or the dm layer), but
if people want to make it a mount option, I could with that.

I suspect disabling stable writes via a mount option or sysfs tunable
would be much more error prone, and hence much more of a support issue
for the enterprise distributions, however. So if it is done via
tunable, the kernel should warn, loudly, if it's an configuration that
will lead to problems (i.e., because btrfs wants to do data
checksumming, or because it's required by iSCSI, or whatever).
Otherwise it's going to be a support nightmare.

IMO, it would be better to have the system automatically do the right
thing, though. If there is no need for stable page writes, why pay
the performance penalty for it?

- Ted

2012-10-30 04:10:22

by Martin K. Petersen

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

>>>>> "Darrick" == Darrick J Wong <[email protected]> writes:

Darrick> This creates BDI_CAP_STABLE_WRITES, which indicates that a
Darrick> device requires stable page writes. It also plumbs in a sysfs
Darrick> attribute so that admins can check the device status.

Might be nice to make the sysfs knob tweakable. Also, don't forget to
add a suitable blurb to Documentation/ABI/.

Feel free to add my Acked-by:

--
Martin K. Petersen Oracle Linux Engineering

2012-10-30 04:48:44

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Tue, 30 Oct 2012 00:10:22 -0400 "Martin K. Petersen"
<[email protected]> wrote:

> >>>>> "Darrick" == Darrick J Wong <[email protected]> writes:
>
> Darrick> This creates BDI_CAP_STABLE_WRITES, which indicates that a
> Darrick> device requires stable page writes. It also plumbs in a sysfs
> Darrick> attribute so that admins can check the device status.
>
> Might be nice to make the sysfs knob tweakable. Also, don't forget to
> add a suitable blurb to Documentation/ABI/.

It isn't at all clear to me that having the sysfs knob 'tweakable' is a good
idea.
From the md/raid5 perspective, I would want to know for certain whether the
pages in a give bio are guaranteed not to change, or if they might.
I could set the BDI_CAP_STABLE_WRITES and believe they will never change, or
test the BDI_CAP_STABLE_WRITES and let that tell me if they might change or
not. But if the bit can be changed at any moment, then it can never be
trusted and so becomes worthless to me.

At the very least it should only be possible to change it when there is no
IO in flight.

NeilBrown

>
> Feel free to add my Acked-by:
>

Attachments:

signature.asc (828.00 B)

2012-10-30 13:38:28

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Tue 30-10-12 11:34:41, NeilBrown wrote:
> On Tue, 30 Oct 2012 01:10:08 +0100 Jan Kara <[email protected]> wrote:
>
> > On Tue 30-10-12 10:48:37, NeilBrown wrote:
> > > On Mon, 29 Oct 2012 19:30:51 +0100 Jan Kara <[email protected]> wrote:
> > >
> > > > On Mon 29-10-12 19:13:58, Jan Kara wrote:
> > > > > On Fri 26-10-12 18:35:24, Darrick J. Wong wrote:
> > > > > > This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> > > > > > stable page writes. It also plumbs in a sysfs attribute so that admins can
> > > > > > check the device status.
> > > > > >
> > > > > > Signed-off-by: Darrick J. Wong <[email protected]>
> > > > > I guess Jens Axboe <[email protected]> would be the best target for this
> > > > > patch (so that he can merge it). The patch looks OK to me. You can add:
> > > > > Reviewed-by: Jan Kara <[email protected]>
> > > > One more thing popped up in my mind: What about NFS, Ceph or md RAID5?
> > > > These could (at least theoretically) care about stable writes as well. I'm
> > > > not sure if they really started to use them but it would be good to at
> > > > least let them know.
> > > >
> > >
> > > What exactly are the semantics of BDI_CAP_STABLE_WRITES ?
> > >
> > > If I set it for md/RAID5, do I get a cast-iron guarantee that no byte in any
> > > page submitted for write will ever change until after I call bio_endio()?
> > Yes.
> >
> > > If so, is this true for all filesystems? - I would expect a bigger patch would
> > > be needed for that.
> > Actually the code is in kernel for quite some time already. The problem
> > is it is always enabled causing unnecessary performance issues for some
> > workloads. So these patches try to be more selective in when the code gets
> > enabled.
> >
> > Regarding "all filesystems" question: If we update filemap_page_mkwrite()
> > to call wait_on_page_writeback() then it should be for all filesystems.
>
> Cool. I didn't realise it had progressed that far.
>
> I guess it is time to look at the possibility of removing the
> 'copy-into-cache' step for full-page, well-aligned bi_iovecs.
>
> I assume this applies to swap-out as well ?? It has been a minor source of
> frustration that when you swap-out to RAID1, you can occasionally get
> different data on the two devices because memory changed between the two DMA
> events.
Really? I'm somewhat surprised. I was under the impression that when a
page is added to a swap cache it is unmapped so there should be no
modification to it possible while it is being swapped out. But maybe it
could get mapped back and modified after we unlock the page and submit the
bio. So mm/memory.c:do_swap_page() might need wait_on_page_writeback() as
well. But I'm not an expert on swap code. I guess I'll experiment with this
a bit. Thanks for a pointer.

Honza

2012-10-30 12:19:41

by Martin K. Petersen

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

>>>>> "Neil" == NeilBrown <[email protected]> writes:

Neil,

>> Might be nice to make the sysfs knob tweakable. Also, don't forget to
>> add a suitable blurb to Documentation/ABI/.

Neil> It isn't at all clear to me that having the sysfs knob
Neil> 'tweakable' is a good idea. From the md/raid5 perspective, I
Neil> would want to know for certain whether the pages in a give bio
Neil> are guaranteed not to change, or if they might. I could set the
Neil> BDI_CAP_STABLE_WRITES and believe they will never change, or test
Neil> the BDI_CAP_STABLE_WRITES and let that tell me if they might
Neil> change or not. But if the bit can be changed at any moment, then
Neil> it can never be trusted and so becomes worthless to me.

I was mostly interested in being able to turn it on for devices that
haven't explicitly done so. I agree that turning it off can be
problematic.

--
Martin K. Petersen Oracle Linux Engineering

2012-10-30 20:14:31

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Tue, Oct 30, 2012 at 08:19:41AM -0400, Martin K. Petersen wrote:
> >>>>> "Neil" == NeilBrown <[email protected]> writes:
>
> Neil,
>
> >> Might be nice to make the sysfs knob tweakable. Also, don't forget to
> >> add a suitable blurb to Documentation/ABI/.
>
> Neil> It isn't at all clear to me that having the sysfs knob
> Neil> 'tweakable' is a good idea. From the md/raid5 perspective, I
> Neil> would want to know for certain whether the pages in a give bio
> Neil> are guaranteed not to change, or if they might. I could set the
> Neil> BDI_CAP_STABLE_WRITES and believe they will never change, or test
> Neil> the BDI_CAP_STABLE_WRITES and let that tell me if they might
> Neil> change or not. But if the bit can be changed at any moment, then
> Neil> it can never be trusted and so becomes worthless to me.
>
> I was mostly interested in being able to turn it on for devices that
> haven't explicitly done so. I agree that turning it off can be
> problematic.

I'm ok with having a tunable that can turn it on, but atm I can't really think
of a convincing reason to let people turn it /off/. If people yell loud enough
I'll add it, but I'd rather not have to distinguish between "on because user
set it on" vs "on because hw needs it".

It'd be nice if the presence BDI_CAP_STABLE_WRITES meant that all filesystems
would wait on page writes. Hrm, guess I'll see about adding that to the patch
set. Though ISTR that at least the vfat and ext2 maintainers weren't
interested the last time I asked.

--D
>
> --
> Martin K. Petersen Oracle Linux Engineering
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-10-30 20:40:49

[permalink] [raw]

Subject: Re: semi-stable page writes

On Tue, Oct 30, 2012 at 09:01:22AM +1100, Dave Chinner wrote:
> On Fri, Oct 26, 2012 at 03:19:09AM -0700, Darrick J. Wong wrote:
> > Hi everyone,
> >
> > Are people still annoyed about writes taking unexpectedly long amounts of tme
> > due to the stable page write patchset? I'm guessing yes...
>
> I haven't heard anyone except th elunatic fringe complain
> recently...
>
> > I'm close to posting a patchset that (a) gates the wait_on_page_writeback calls
> > on a flag that you can set in the bdi to indicate that you need stable writes
> > (which blk_integrity_register will set);
>
> I'd prefer stable pages by default (e.g. btrfs needs it for sane
> data crc calculations), with an option to turn it off.
>
> > (b) (ab)uses a page flag bit (PG_slab)
> > to indicate that a page is actually being sent out to disk hardware; and (c)
>
> I don't think you can do that. You can send slab allocated memory to
> disk (e.g. kmalloc()d memory) and XFS definitely does that for
> sub-page sized metadata. I'm pretty sure that means the PG_slab
> flag is not available for (ab)use in the IO path....

I gave up on PG_slab and declared my own PG_ bit. Unfortunately, atm I can't
remember which bit of code marks the page ptes so that they have to go back
through page_mkwrite, where we can trap the write. Hopefully for a shorter
duration.

Also, I was wondering -- is it possible to pursue a dual strategy? If we can
obtain a memory page without sleeping or causing any writeback, then use the
page as a bounce buffer. Otherwise, just wait like we do now. It looks as
though one could use __GFP_NORETRY | __GFP_NO_MEMALLOC to see if the allocator
can give out a page without having to run reclaim...?

--D
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-10-30 21:49:11

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Tue, 30 Oct 2012 14:38:25 +0100 Jan Kara <[email protected]> wrote:

> On Tue 30-10-12 11:34:41, NeilBrown wrote:
> > On Tue, 30 Oct 2012 01:10:08 +0100 Jan Kara <[email protected]> wrote:
> >
> > > On Tue 30-10-12 10:48:37, NeilBrown wrote:
> > > > On Mon, 29 Oct 2012 19:30:51 +0100 Jan Kara <[email protected]> wrote:
> > > >
> > > > > On Mon 29-10-12 19:13:58, Jan Kara wrote:
> > > > > > On Fri 26-10-12 18:35:24, Darrick J. Wong wrote:
> > > > > > > This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> > > > > > > stable page writes. It also plumbs in a sysfs attribute so that admins can
> > > > > > > check the device status.
> > > > > > >
> > > > > > > Signed-off-by: Darrick J. Wong <[email protected]>
> > > > > > I guess Jens Axboe <[email protected]> would be the best target for this
> > > > > > patch (so that he can merge it). The patch looks OK to me. You can add:
> > > > > > Reviewed-by: Jan Kara <[email protected]>
> > > > > One more thing popped up in my mind: What about NFS, Ceph or md RAID5?
> > > > > These could (at least theoretically) care about stable writes as well. I'm
> > > > > not sure if they really started to use them but it would be good to at
> > > > > least let them know.
> > > > >
> > > >
> > > > What exactly are the semantics of BDI_CAP_STABLE_WRITES ?
> > > >
> > > > If I set it for md/RAID5, do I get a cast-iron guarantee that no byte in any
> > > > page submitted for write will ever change until after I call bio_endio()?
> > > Yes.
> > >
> > > > If so, is this true for all filesystems? - I would expect a bigger patch would
> > > > be needed for that.
> > > Actually the code is in kernel for quite some time already. The problem
> > > is it is always enabled causing unnecessary performance issues for some
> > > workloads. So these patches try to be more selective in when the code gets
> > > enabled.
> > >
> > > Regarding "all filesystems" question: If we update filemap_page_mkwrite()
> > > to call wait_on_page_writeback() then it should be for all filesystems.
> >
> > Cool. I didn't realise it had progressed that far.
> >
> > I guess it is time to look at the possibility of removing the
> > 'copy-into-cache' step for full-page, well-aligned bi_iovecs.
> >
> > I assume this applies to swap-out as well ?? It has been a minor source of
> > frustration that when you swap-out to RAID1, you can occasionally get
> > different data on the two devices because memory changed between the two DMA
> > events.
> Really? I'm somewhat surprised. I was under the impression that when a
> page is added to a swap cache it is unmapped so there should be no
> modification to it possible while it is being swapped out. But maybe it
> could get mapped back and modified after we unlock the page and submit the
> bio. So mm/memory.c:do_swap_page() might need wait_on_page_writeback() as
> well. But I'm not an expert on swap code. I guess I'll experiment with this
> a bit. Thanks for a pointer.
>
> Honza

Thanks for looking into it. I should mention that it was some years ago that
this occasional inconsistency in RAID1 was reported and that I concluded that
it as due to swap (though I don't recall how deeply I examined the code).
It could well be different now. I never bothered pursuing it because I don't
think that behaviour is really wrong, just mildly annoying.

Thanks,
NeilBrown

Attachments:

signature.asc (828.00 B)

2012-10-30 22:14:23

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Tue, 30 Oct 2012 13:14:24 -0700 "Darrick J. Wong"
<[email protected]> wrote:

> On Tue, Oct 30, 2012 at 08:19:41AM -0400, Martin K. Petersen wrote:
> > >>>>> "Neil" == NeilBrown <[email protected]> writes:
> >
> > Neil,
> >
> > >> Might be nice to make the sysfs knob tweakable. Also, don't forget to
> > >> add a suitable blurb to Documentation/ABI/.
> >
> > Neil> It isn't at all clear to me that having the sysfs knob
> > Neil> 'tweakable' is a good idea. From the md/raid5 perspective, I
> > Neil> would want to know for certain whether the pages in a give bio
> > Neil> are guaranteed not to change, or if they might. I could set the
> > Neil> BDI_CAP_STABLE_WRITES and believe they will never change, or test
> > Neil> the BDI_CAP_STABLE_WRITES and let that tell me if they might
> > Neil> change or not. But if the bit can be changed at any moment, then
> > Neil> it can never be trusted and so becomes worthless to me.
> >
> > I was mostly interested in being able to turn it on for devices that
> > haven't explicitly done so. I agree that turning it off can be
> > problematic.
>
> I'm ok with having a tunable that can turn it on, but atm I can't really think
> of a convincing reason to let people turn it /off/. If people yell loud enough
> I'll add it, but I'd rather not have to distinguish between "on because user
> set it on" vs "on because hw needs it".
>
> It'd be nice if the presence BDI_CAP_STABLE_WRITES meant that all filesystems
> would wait on page writes. Hrm, guess I'll see about adding that to the patch
> set. Though ISTR that at least the vfat and ext2 maintainers weren't
> interested the last time I asked.

I'm still a little foggy on the exact semantics and use-cases for this flag.
I'll try to piece together the bits that I know and ask you to tell me what
I've missed or what I've got wrong.

Stable writes are valuable when the filesystem or device driver calculates
some sort of 'redundancy' information based on the page in memory that is
about to be written.
This could be:
integrity data that will be sent with the page to the storage device
parity over a number of pages that will be written to a separate device
(i.e. RAID5/RAID6).
MAC or similar checksum that will be sent with the data over the network
and will be validated by the receiving device, which can then ACK or
NAK depending on correctness.

These could be implemented in the filesystem or in the device driver, so
either should be able to request stable writes. If neither request stable
writes, then the cost of stable writes should be avoided.

For the device driver (or network transport), not getting stable writes when
requested might be a performance issue, or it might be a correctness issue.
e.g. if an unstable write causes a MAC to be wrong, the network layer can
simply arrange a re-transmit. If an unstable write causes RAID5 parity to
be wrong, that unstable write could cause data corruption.

For the filesystem, the requirement to provide stable writes could just be a
performance cost (a few extra waits) or it could require significant
re-working of the code (you say vfat and ext2 aren't really comfortable with
supporting them).

Finally there is the VFS/VM which needs to provide support for stable
writes. It already does - your flag seems to just allow clients of the
VFS/VM to indicate whether stable writes are required.

So there seem to be several cases:

1/ The filesystem wants to always use stable writes. It just sets the flag,
and the device will see stable writes whether it cares or not. This would
happen if the filesystem is adding integrity data.
2/ The device would prefer stable writes if possible. This would apply to
iscsi (??) as it needs to add some sort of checksum before putting the
data on the network
3/ The device absolutely requires stable writes, or needs to provide
stability itself by taking a copy (md/RAID5 does this). So it needs to
know whether each write is stable, or it cannot take advantage of stable
writes.

So I see a need for 2 flags here.
The first one is set by the device or transport to say "I would prefer
stable writes if possible".
The second is set by the filesystem, either because it has its own needs, or
because it sees the first flag set on the device and chooses to honour it.
The VFS/VM would act based on this second flag, and devices like md/RAID5
would set the first flag, and assume writes are stable if the second flag is
also set.

This implies that setting that second flag must be handled synchronously by
the filesystem, so that the device doesn't see the flag set until the
filesystem has committed to honouring it. That seems to make a mount (or
remount) option the safest way to set it.

Comments?

NeilBrown

Attachments:

signature.asc (828.00 B)

2012-10-30 22:40:05

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On 10/26/2012 06:35 PM, Darrick J. Wong wrote:
> This creates BDI_CAP_STABLE_WRITES, which indicates that a device requires
> stable page writes. It also plumbs in a sysfs attribute so that admins can
> check the device status.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
<>
> @@ -434,10 +438,14 @@ EXPORT_SYMBOL(blk_integrity_register);
> void blk_integrity_unregister(struct gendisk *disk)
> {
> struct blk_integrity *bi;
> + struct backing_dev_info *bdi;
>
> if (!disk || !disk->integrity)
> return;
>
> + bdi = &disk->queue->backing_dev_info;
> + bdi->capabilities &= ~BDI_CAP_STABLE_WRITES;
> +

Once this patchset is in we'll need to add one such code site to
iscsi in the case data_integrity is turned on.

You are welcome to add such a patch to your patchset, (I can show
you where) but it will take a little while before I have the time to
tests such a change.

Currently if data_integrity is turned on iscsi reverts to copy-full
networking, as an hackish attempt to narrow the window of pages changing.
With such a patch in place, we can remove the hack.

But please note that your changes are only half the picture. Because not
all filesystems support stable pages, so it is not like if I turn this flag
on I will magically be protected. A more complete picture is if the FS could
communicate back to iscsi if it actually will provide stable pages, for it
to start trusting stable pages again.

But since iscsi's protection is just an hack, and an admin that mounts
a none supporting FS can always just turn off data_integrity for that
particular LUN, then I'd say it's fine. Even with this half of the
patchset iscsi should be fixed to use it.

But for the likes of future dm-raid5 that wants to be copy-less, you will
need in the future, a way for the FS to communicate back if it will actually
support stable-pages or the block device needs to revert to old tricks.

Thanks
Boaz

> bi = disk->integrity;
>
> kobject_uevent(&bi->kobj, KOBJ_REMOVE);

2012-10-30 23:31:15

by Dave Chinner

[permalink] [raw]

Subject: Re: semi-stable page writes

On Mon, Oct 29, 2012 at 09:00:27PM -0400, Theodore Ts'o wrote:
> On Tue, Oct 30, 2012 at 09:01:22AM +1100, Dave Chinner wrote:
> > On Fri, Oct 26, 2012 at 03:19:09AM -0700, Darrick J. Wong wrote:
> > > Hi everyone,
> > >
> > > Are people still annoyed about writes taking unexpectedly long amounts of tme
> > > due to the stable page write patchset? I'm guessing yes...
> >
> > I haven't heard anyone except th elunatic fringe complain
> > recently...
>
> We are currently carrying a patch in the Google kernel which
> unconditionally disables stable page writes specifically because it
> introduced significant latencies that were unacceptable for some of
> our (internal) customers of said production kernel.
>
> I'll leave it to others to decide whether the Google production kernel
> is part of the lunatic fringe or not. :-)

Google is, and has the resources to maintain a lunatic fringe kernel
;)

Besides, we've discussed google's problem before, and it came down
to bad application design (i.e. no buffering to protect against
variable filesystem/storage latency) and not stable pages being the
source of the problem. Turning off stable pages was a hack to work
around a badly designed application stack....

....

> IMO, it would be better to have the system automatically do the right
> thing, though. If there is no need for stable page writes, why pay
> the performance penalty for it?

Yes, and the right thing to do is to put correctness before
performance. Stable pages are needed for correctness in a lot of
cases, so that shoul dbe the default. If the user has performance
problems, then they can turn it off. At no time should the default
require tuning to get correct behaviour. case in point: filesystems
default to "barriers = on".

Safe should *always* be the default choice, even if it is sometimes
slow.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-30 23:43:35

by Dave Chinner

[permalink] [raw]

Subject: Re: semi-stable page writes

On Tue, Oct 30, 2012 at 01:40:37PM -0700, Darrick J. Wong wrote:
> On Tue, Oct 30, 2012 at 09:01:22AM +1100, Dave Chinner wrote:
> > On Fri, Oct 26, 2012 at 03:19:09AM -0700, Darrick J. Wong wrote:
> > > Hi everyone,
> > >
> > > Are people still annoyed about writes taking unexpectedly long amounts of tme
> > > due to the stable page write patchset? I'm guessing yes...
> >
> > I haven't heard anyone except th elunatic fringe complain
> > recently...
> >
> > > I'm close to posting a patchset that (a) gates the wait_on_page_writeback calls
> > > on a flag that you can set in the bdi to indicate that you need stable writes
> > > (which blk_integrity_register will set);
> >
> > I'd prefer stable pages by default (e.g. btrfs needs it for sane
> > data crc calculations), with an option to turn it off.
> >
> > > (b) (ab)uses a page flag bit (PG_slab)
> > > to indicate that a page is actually being sent out to disk hardware; and (c)
> >
> > I don't think you can do that. You can send slab allocated memory to
> > disk (e.g. kmalloc()d memory) and XFS definitely does that for
> > sub-page sized metadata. I'm pretty sure that means the PG_slab
> > flag is not available for (ab)use in the IO path....
>
> I gave up on PG_slab and declared my own PG_ bit. Unfortunately, atm I can't
> remember which bit of code marks the page ptes so that they have to go back
> through page_mkwrite, where we can trap the write. Hopefully for a shorter
> duration.

clear_page_dirty_for_io(), IIRC.

> Also, I was wondering -- is it possible to pursue a dual strategy? If we can
> obtain a memory page without sleeping or causing any writeback, then use the
> page as a bounce buffer. Otherwise, just wait like we do now.

Using bounce buffers for all IO is not a feasible solution. Way too
much overhead copying data, not to mention we are already suffering
from the problem of flusher threads going CPU bound trying to issue
enough IO to keep high bandwidth storage fully utilised...

Cheers,

Dave.
--
Dave Chinner
[email protected]

2012-10-30 23:58:41

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On 10/30/2012 03:14 PM, NeilBrown wrote:
> On Tue, 30 Oct 2012 13:14:24 -0700 "Darrick J. Wong"
> <[email protected]> wrote:
>
>> On Tue, Oct 30, 2012 at 08:19:41AM -0400, Martin K. Petersen wrote:
>>>>>>>> "Neil" == NeilBrown <[email protected]> writes:
>>>
<>
> So I see a need for 2 flags here.

Yes that was my thought too. We need two flags. The FS should communicate
it's capabilities as well.

> The first one is set by the device or transport to say "I would prefer
> stable writes if possible".
> The second is set by the filesystem, either because it has its own needs, or
> because it sees the first flag set on the device and chooses to honour it.
> The VFS/VM would act based on this second flag, and devices like md/RAID5
> would set the first flag, and assume writes are stable if the second flag is
> also set.
>
> This implies that setting that second flag must be handled synchronously by
> the filesystem, so that the device doesn't see the flag set until the
> filesystem has committed to honouring it. That seems to make a mount (or
> remount) option the safest way to set it.
>

I think I do not like any mount option or any other tuneable. With the
block device stating it's needs and the FS confirming on it's capability
then I do not see how reverting that decision by admin can be any good.
Any overrides by an admin would then just be a bug.

> Comments?
>
> NeilBrown
>

Thanks
Boaz

2012-10-31 08:57:33

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Wed, Oct 31, 2012 at 09:14:41AM +1100, NeilBrown wrote:
> On Tue, 30 Oct 2012 13:14:24 -0700 "Darrick J. Wong"
> <[email protected]> wrote:
>
> > On Tue, Oct 30, 2012 at 08:19:41AM -0400, Martin K. Petersen wrote:
> > > >>>>> "Neil" == NeilBrown <[email protected]> writes:
> > >
> > > Neil,
> > >
> > > >> Might be nice to make the sysfs knob tweakable. Also, don't forget to
> > > >> add a suitable blurb to Documentation/ABI/.
> > >
> > > Neil> It isn't at all clear to me that having the sysfs knob
> > > Neil> 'tweakable' is a good idea. From the md/raid5 perspective, I
> > > Neil> would want to know for certain whether the pages in a give bio
> > > Neil> are guaranteed not to change, or if they might. I could set the
> > > Neil> BDI_CAP_STABLE_WRITES and believe they will never change, or test
> > > Neil> the BDI_CAP_STABLE_WRITES and let that tell me if they might
> > > Neil> change or not. But if the bit can be changed at any moment, then
> > > Neil> it can never be trusted and so becomes worthless to me.
> > >
> > > I was mostly interested in being able to turn it on for devices that
> > > haven't explicitly done so. I agree that turning it off can be
> > > problematic.
> >
> > I'm ok with having a tunable that can turn it on, but atm I can't really think
> > of a convincing reason to let people turn it /off/. If people yell loud enough
> > I'll add it, but I'd rather not have to distinguish between "on because user
> > set it on" vs "on because hw needs it".
> >
> > It'd be nice if the presence BDI_CAP_STABLE_WRITES meant that all filesystems
> > would wait on page writes. Hrm, guess I'll see about adding that to the patch
> > set. Though ISTR that at least the vfat and ext2 maintainers weren't
> > interested the last time I asked.
>
> I'm still a little foggy on the exact semantics and use-cases for this flag.
> I'll try to piece together the bits that I know and ask you to tell me what
> I've missed or what I've got wrong.
>
> Stable writes are valuable when the filesystem or device driver calculates
> some sort of 'redundancy' information based on the page in memory that is
> about to be written.
> This could be:
> integrity data that will be sent with the page to the storage device
> parity over a number of pages that will be written to a separate device
> (i.e. RAID5/RAID6).
> MAC or similar checksum that will be sent with the data over the network
> and will be validated by the receiving device, which can then ACK or
> NAK depending on correctness.
>
> These could be implemented in the filesystem or in the device driver, so
> either should be able to request stable writes. If neither request stable
> writes, then the cost of stable writes should be avoided.

Most of the changes are in the VM; the only place where filesystem-specific
bits are needed are for things like ext4 which override page_mkwrite. I'm not
sure what you mean by the filesystem requesting stable writes; unless I'm
missing something (it's late), it's only the storage device that has any sort
of needs. The filesystem is free either to accomodate the need for stable
writes, or ignore that and deal with the IO errors.

> For the device driver (or network transport), not getting stable writes when
> requested might be a performance issue, or it might be a correctness issue.
> e.g. if an unstable write causes a MAC to be wrong, the network layer can
> simply arrange a re-transmit. If an unstable write causes RAID5 parity to
> be wrong, that unstable write could cause data corruption.
>
> For the filesystem, the requirement to provide stable writes could just be a
> performance cost (a few extra waits) or it could require significant
> re-working of the code (you say vfat and ext2 aren't really comfortable with
> supporting them).

I vaguely recall that the reason for skipping vfat was that the maintainer
didn't like the idea of paying the wait costs even on a disk that didn't
require it. I think we're addressing that. As for ext2 there wasn't much
comment, though I think Ted or someone mentioned that since it was "stable"
code, it wasn't worth fixing. In any case, patching filemap_page_mkwrite to
have a write_on_page_writeback seems to have fixed a number of filesystems
(hfs, reiser, ext2...) with little fuss.

> Finally there is the VFS/VM which needs to provide support for stable
> writes. It already does - your flag seems to just allow clients of the
> VFS/VM to indicate whether stable writes are required.
>
> So there seem to be several cases:
>
> 1/ The filesystem wants to always use stable writes. It just sets the flag,
> and the device will see stable writes whether it cares or not. This would
> happen if the filesystem is adding integrity data.
> 2/ The device would prefer stable writes if possible. This would apply to
> iscsi (??) as it needs to add some sort of checksum before putting the
> data on the network
> 3/ The device absolutely requires stable writes, or needs to provide
> stability itself by taking a copy (md/RAID5 does this). So it needs to
> know whether each write is stable, or it cannot take advantage of stable
> writes.
>
>
> So I see a need for 2 flags here.
> The first one is set by the device or transport to say "I would prefer
> stable writes if possible".

Yep, that seems to correspond with what the patch provides right now.

> The second is set by the filesystem, either because it has its own needs, or
> because it sees the first flag set on the device and chooses to honour it.
> The VFS/VM would act based on this second flag, and devices like md/RAID5
> would set the first flag, and assume writes are stable if the second flag is
> also set.

This is a trickier piece... I guess this means that the bdi has to keep track
of how many clients are using it (it looks like multiple fses on a partitioned
disk share the same bdi) and of those clients, which ones claim to support
stable page writes. If all clients do, then the disk can assume that all
writes headed to it will be stable. If even one doesn't, then I guess the
device falls back to whatever strategy it used before (the raid5 copying).

I guess we also need a way to pass these flags through md/dm if appropriate.

> This implies that setting that second flag must be handled synchronously by
> the filesystem, so that the device doesn't see the flag set until the
> filesystem has committed to honouring it. That seems to make a mount (or
> remount) option the safest way to set it.

Yeah, I was thinking that fill_super and kill_sb might be good sites for
whatever calls we need to make back to the bdi to make the stable page write
declarations.

--D

> Comments?
>
> NeilBrown

2012-10-31 08:58:23

[permalink] [raw]

Subject: Re: [RFC PATCH 2/2] mm: Gate stable page writes on the bdi flag

On Mon, Oct 29, 2012 at 07:28:57PM +0100, Jan Kara wrote:
> On Fri 26-10-12 18:36:08, Darrick J. Wong wrote:
> > Create a helper function to decide if a particular address space is backed by a
> > device that requires stable page writes. For each wait_on_page_writeback call
> > that was inserted in the original stable page write patchset, change it to wait
> > only if the helper says it's necessary. This should eliminate unnecessary
> > waiting for devices that don't require page contents to be immutable during
> > writeout.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> >
> > fs/buffer.c | 3 ++-
> > fs/ext4/inode.c | 4 ++--
> > include/linux/fs.h | 2 ++
> > mm/filemap.c | 3 ++-
> > mm/page-writeback.c | 7 +++++++
> > 5 files changed, 15 insertions(+), 4 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index b5f0442..f508ece 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2334,7 +2334,8 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> > if (unlikely(ret < 0))
> > goto out_unlock;
> > set_page_dirty(page);
> > - wait_on_page_writeback(page);
> > + if (page->mapping && mapping_needs_stable_writes(page->mapping))
> > + wait_on_page_writeback(page);
> The page is locked and we checked whether it belongs to the right mapping
> just after we locked it down. So you can safely skip the page->mapping
> test.
>
> > return 0;
> > out_unlock:
> > unlock_page(page);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index b3c243b..82908b8 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4813,8 +4813,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > if (page_has_buffers(page)) {
> > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > ext4_bh_unmapped)) {
> > - /* Wait so that we don't change page under IO */
> > - wait_on_page_writeback(page);
> > + if (mapping_needs_stable_writes(mapping))
> > + wait_on_page_writeback(page);
> > ret = VM_FAULT_LOCKED;
> > goto out;
> > }
> > diff --git a/include/linux/fs.h b/include/linux/fs.h
> > index b33cfc9..d99db0e 100644
> > --- a/include/linux/fs.h
> > +++ b/include/linux/fs.h
> > @@ -466,6 +466,8 @@ struct block_device {
> > struct percpu_rw_semaphore bd_block_size_semaphore;
> > };
> >
> > +int mapping_needs_stable_writes(struct address_space *mapping);
> > +
> > /*
> > * Radix-tree tags, for tagging dirty and writeback pages within the pagecache
> > * radix trees
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 83efee7..aedc6bd 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2274,7 +2274,8 @@ repeat:
> > return NULL;
> > }
> > found:
> > - wait_on_page_writeback(page);
> > + if (mapping_needs_stable_writes(mapping))
> > + wait_on_page_writeback(page);
> > return page;
> > }
> > EXPORT_SYMBOL(grab_cache_page_write_begin);
> > diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> > index 830893b..d0f042c 100644
> > --- a/mm/page-writeback.c
> > +++ b/mm/page-writeback.c
> > @@ -2275,3 +2275,10 @@ int mapping_tagged(struct address_space *mapping, int tag)
> > return radix_tree_tagged(&mapping->page_tree, tag);
> > }
> > EXPORT_SYMBOL(mapping_tagged);
> > +
> > +int mapping_needs_stable_writes(struct address_space *mapping)
> > +{
> > + return mapping->backing_dev_info &&
> > + bdi_cap_stable_writes(mapping->backing_dev_info);
> > +}
> > +EXPORT_SYMBOL_GPL(mapping_needs_stable_writes);
> Traditional name for these shortcuts is "mapping_cap_..." and put them in
> linux/backing-dev.h. And you don't have to check
> mapping->backing_dev_info.

Ok, I've removed both of those checks. Thank you for the comments!

--D
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-10-31 09:05:34

[permalink] [raw]

Subject: Re: semi-stable page writes

On Wed, Oct 31, 2012 at 10:43:31AM +1100, Dave Chinner wrote:
> On Tue, Oct 30, 2012 at 01:40:37PM -0700, Darrick J. Wong wrote:
> > On Tue, Oct 30, 2012 at 09:01:22AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 26, 2012 at 03:19:09AM -0700, Darrick J. Wong wrote:
> > > > Hi everyone,
> > > >
> > > > Are people still annoyed about writes taking unexpectedly long amounts of tme
> > > > due to the stable page write patchset? I'm guessing yes...
> > >
> > > I haven't heard anyone except th elunatic fringe complain
> > > recently...
> > >
> > > > I'm close to posting a patchset that (a) gates the wait_on_page_writeback calls
> > > > on a flag that you can set in the bdi to indicate that you need stable writes
> > > > (which blk_integrity_register will set);
> > >
> > > I'd prefer stable pages by default (e.g. btrfs needs it for sane
> > > data crc calculations), with an option to turn it off.
> > >
> > > > (b) (ab)uses a page flag bit (PG_slab)
> > > > to indicate that a page is actually being sent out to disk hardware; and (c)
> > >
> > > I don't think you can do that. You can send slab allocated memory to
> > > disk (e.g. kmalloc()d memory) and XFS definitely does that for
> > > sub-page sized metadata. I'm pretty sure that means the PG_slab
> > > flag is not available for (ab)use in the IO path....
> >
> > I gave up on PG_slab and declared my own PG_ bit. Unfortunately, atm I can't
> > remember which bit of code marks the page ptes so that they have to go back
> > through page_mkwrite, where we can trap the write. Hopefully for a shorter
> > duration.
>
> clear_page_dirty_for_io(), IIRC.

Yep, thanks. My memory is a bit rusty due to recent downtime. :/

Now to figure out if I can safely call that from deep inside the SCSI dispatch
functions as part of deferred-checksumming. I have a bad feeling that we have
to lock the page, which implies sleeping, and (unless they fixed this) the SCSI
dispatch functions hold the scsi host lock while running, which means we can't
sleep.

> > Also, I was wondering -- is it possible to pursue a dual strategy? If we can
> > obtain a memory page without sleeping or causing any writeback, then use the
> > page as a bounce buffer. Otherwise, just wait like we do now.
>
> Using bounce buffers for all IO is not a feasible solution. Way too
> much overhead copying data, not to mention we are already suffering
> from the problem of flusher threads going CPU bound trying to issue
> enough IO to keep high bandwidth storage fully utilised...

Ok.

--D

2012-10-31 11:45:45

[permalink] [raw]

Subject: Re: semi-stable page writes

On Wed 31-10-12 10:30:21, Dave Chinner wrote:
> On Mon, Oct 29, 2012 at 09:00:27PM -0400, Theodore Ts'o wrote:
> > On Tue, Oct 30, 2012 at 09:01:22AM +1100, Dave Chinner wrote:
> > > On Fri, Oct 26, 2012 at 03:19:09AM -0700, Darrick J. Wong wrote:
> > > > Hi everyone,
> > > >
> > > > Are people still annoyed about writes taking unexpectedly long amounts of tme
> > > > due to the stable page write patchset? I'm guessing yes...
> > >
> > > I haven't heard anyone except th elunatic fringe complain
> > > recently...
> >
> > We are currently carrying a patch in the Google kernel which
> > unconditionally disables stable page writes specifically because it
> > introduced significant latencies that were unacceptable for some of
> > our (internal) customers of said production kernel.
> >
> > I'll leave it to others to decide whether the Google production kernel
> > is part of the lunatic fringe or not. :-)
>
> Google is, and has the resources to maintain a lunatic fringe kernel
> ;)
>
> Besides, we've discussed google's problem before, and it came down
> to bad application design (i.e. no buffering to protect against
> variable filesystem/storage latency) and not stable pages being the
> source of the problem. Turning off stable pages was a hack to work
> around a badly designed application stack....
Well, so far I heard like 4 or 5 complaints about performance that were
tracked down to stable pages. Likely the most convincing was a case of an
application mmaping 1 GB file and randomly changing bits here and there.
Throughput of that application dropped to about a third with stable pages
(which surprised me at the first sight but after doing the math it's
obvious).

As much as I agree that the problems can be solved in the applications
(if you have the liberty to modify them...), reported problems seem to be
common enough so that we try to do better than we do now.

> > IMO, it would be better to have the system automatically do the right
> > thing, though. If there is no need for stable page writes, why pay
> > the performance penalty for it?
>
> Yes, and the right thing to do is to put correctness before performance.
> Stable pages are needed for correctness in a lot of cases, so that shoul
> dbe the default. If the user has performance problems, then they can turn
> it off. At no time should the default require tuning to get correct
> behaviour. case in point: filesystems default to "barriers = on".
I agree here. But still that does not rule out the possibility of getting
it right in the kernel without having to enable stable pages in all cases.
I.e., if btrfs needs stable pages, let it set the flag that it needs them.
Same for DIF/DIX, RAID5 or whatever else...

Honza

2012-10-31 11:56:17

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Wed 31-10-12 01:56:04, Darrick J. Wong wrote:
> On Wed, Oct 31, 2012 at 09:14:41AM +1100, NeilBrown wrote:
> > On Tue, 30 Oct 2012 13:14:24 -0700 "Darrick J. Wong"
> > <[email protected]> wrote:
> >
> > > On Tue, Oct 30, 2012 at 08:19:41AM -0400, Martin K. Petersen wrote:
> > > > >>>>> "Neil" == NeilBrown <[email protected]> writes:
> > > >
> > > > Neil,
> > > >
> > > > >> Might be nice to make the sysfs knob tweakable. Also, don't forget to
> > > > >> add a suitable blurb to Documentation/ABI/.
> > > >
> > > > Neil> It isn't at all clear to me that having the sysfs knob
> > > > Neil> 'tweakable' is a good idea. From the md/raid5 perspective, I
> > > > Neil> would want to know for certain whether the pages in a give bio
> > > > Neil> are guaranteed not to change, or if they might. I could set the
> > > > Neil> BDI_CAP_STABLE_WRITES and believe they will never change, or test
> > > > Neil> the BDI_CAP_STABLE_WRITES and let that tell me if they might
> > > > Neil> change or not. But if the bit can be changed at any moment, then
> > > > Neil> it can never be trusted and so becomes worthless to me.
> > > >
> > > > I was mostly interested in being able to turn it on for devices that
> > > > haven't explicitly done so. I agree that turning it off can be
> > > > problematic.
> > >
> > > I'm ok with having a tunable that can turn it on, but atm I can't really think
> > > of a convincing reason to let people turn it /off/. If people yell loud enough
> > > I'll add it, but I'd rather not have to distinguish between "on because user
> > > set it on" vs "on because hw needs it".
> > >
> > > It'd be nice if the presence BDI_CAP_STABLE_WRITES meant that all filesystems
> > > would wait on page writes. Hrm, guess I'll see about adding that to the patch
> > > set. Though ISTR that at least the vfat and ext2 maintainers weren't
> > > interested the last time I asked.
> >
> > I'm still a little foggy on the exact semantics and use-cases for this flag.
> > I'll try to piece together the bits that I know and ask you to tell me what
> > I've missed or what I've got wrong.
> >
> > Stable writes are valuable when the filesystem or device driver calculates
> > some sort of 'redundancy' information based on the page in memory that is
> > about to be written.
> > This could be:
> > integrity data that will be sent with the page to the storage device
> > parity over a number of pages that will be written to a separate device
> > (i.e. RAID5/RAID6).
> > MAC or similar checksum that will be sent with the data over the network
> > and will be validated by the receiving device, which can then ACK or
> > NAK depending on correctness.
> >
> > These could be implemented in the filesystem or in the device driver, so
> > either should be able to request stable writes. If neither request stable
> > writes, then the cost of stable writes should be avoided.
>
> Most of the changes are in the VM; the only place where filesystem-specific
> bits are needed are for things like ext4 which override page_mkwrite. I'm not
> sure what you mean by the filesystem requesting stable writes; unless I'm
> missing something (it's late), it's only the storage device that has any sort
> of needs. The filesystem is free either to accomodate the need for stable
> writes, or ignore that and deal with the IO errors.
As it was noted somewhere else in the thread, if you compute data
checksums in the filesystem, you need stable pages as well.

> > For the device driver (or network transport), not getting stable writes when
> > requested might be a performance issue, or it might be a correctness issue.
> > e.g. if an unstable write causes a MAC to be wrong, the network layer can
> > simply arrange a re-transmit. If an unstable write causes RAID5 parity to
> > be wrong, that unstable write could cause data corruption.
> >
> > For the filesystem, the requirement to provide stable writes could just be a
> > performance cost (a few extra waits) or it could require significant
> > re-working of the code (you say vfat and ext2 aren't really comfortable with
> > supporting them).
>
> I vaguely recall that the reason for skipping vfat was that the maintainer
> didn't like the idea of paying the wait costs even on a disk that didn't
> require it. I think we're addressing that. As for ext2 there wasn't much
> comment, though I think Ted or someone mentioned that since it was "stable"
> code, it wasn't worth fixing. In any case, patching filemap_page_mkwrite to
> have a write_on_page_writeback seems to have fixed a number of filesystems
> (hfs, reiser, ext2...) with little fuss.
Yup, I sent you a patch for this a while ago. If stable pages are
conditional only for case which require them (or make a good use of them),
I have no problem with making all filesystems provide them. After changing
filemap_page_mkwrite(), there are 5 filesystems (ncpfs, ceph, cifs, ubifs,
ocfs2) which won't support stable pages. I can fix these once we agree on
the generic infrastructure.

> > Finally there is the VFS/VM which needs to provide support for stable
> > writes. It already does - your flag seems to just allow clients of the
> > VFS/VM to indicate whether stable writes are required.
> >
> > So there seem to be several cases:
> >
> > 1/ The filesystem wants to always use stable writes. It just sets the flag,
> > and the device will see stable writes whether it cares or not. This would
> > happen if the filesystem is adding integrity data.
> > 2/ The device would prefer stable writes if possible. This would apply to
> > iscsi (??) as it needs to add some sort of checksum before putting the
> > data on the network
> > 3/ The device absolutely requires stable writes, or needs to provide
> > stability itself by taking a copy (md/RAID5 does this). So it needs to
> > know whether each write is stable, or it cannot take advantage of stable
> > writes.
> >
> >
> > So I see a need for 2 flags here.
> > The first one is set by the device or transport to say "I would prefer
> > stable writes if possible".
>
> Yep, that seems to correspond with what the patch provides right now.
>
> > The second is set by the filesystem, either because it has its own needs, or
> > because it sees the first flag set on the device and chooses to honour it.
> > The VFS/VM would act based on this second flag, and devices like md/RAID5
> > would set the first flag, and assume writes are stable if the second flag is
> > also set.
>
> This is a trickier piece... I guess this means that the bdi has to keep track
> of how many clients are using it (it looks like multiple fses on a partitioned
> disk share the same bdi) and of those clients, which ones claim to support
> stable page writes. If all clients do, then the disk can assume that all
> writes headed to it will be stable. If even one doesn't, then I guess the
> device falls back to whatever strategy it used before (the raid5 copying).
>
> I guess we also need a way to pass these flags through md/dm if appropriate.
I'd like to keep things simple. It's not hard to make all filesystems
support stable pages so let's just do that to remove one variable from the
picture. Then we are in the situation where storage can request stable
pages by setting bdi bit and filesystem will obey. Also filesystem itself can
request stable pages by setting the bit and generic functions in mm will
take that into account. No need for two flags...

You are right that we need a mechanism to push the flags from the devices
through various storage layers up into the bdi filesystem sees to make
things reliable.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-10-31 19:37:06

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Wed, Oct 31, 2012 at 12:56:14PM +0100, Jan Kara wrote:
> On Wed 31-10-12 01:56:04, Darrick J. Wong wrote:
> > On Wed, Oct 31, 2012 at 09:14:41AM +1100, NeilBrown wrote:
> > > On Tue, 30 Oct 2012 13:14:24 -0700 "Darrick J. Wong"
> > > <[email protected]> wrote:
> > >
> > > > On Tue, Oct 30, 2012 at 08:19:41AM -0400, Martin K. Petersen wrote:
> > > > > >>>>> "Neil" == NeilBrown <[email protected]> writes:
> > > > >
> > > > > Neil,
> > > > >
> > > > > >> Might be nice to make the sysfs knob tweakable. Also, don't forget to
> > > > > >> add a suitable blurb to Documentation/ABI/.
> > > > >
> > > > > Neil> It isn't at all clear to me that having the sysfs knob
> > > > > Neil> 'tweakable' is a good idea. From the md/raid5 perspective, I
> > > > > Neil> would want to know for certain whether the pages in a give bio
> > > > > Neil> are guaranteed not to change, or if they might. I could set the
> > > > > Neil> BDI_CAP_STABLE_WRITES and believe they will never change, or test
> > > > > Neil> the BDI_CAP_STABLE_WRITES and let that tell me if they might
> > > > > Neil> change or not. But if the bit can be changed at any moment, then
> > > > > Neil> it can never be trusted and so becomes worthless to me.
> > > > >
> > > > > I was mostly interested in being able to turn it on for devices that
> > > > > haven't explicitly done so. I agree that turning it off can be
> > > > > problematic.
> > > >
> > > > I'm ok with having a tunable that can turn it on, but atm I can't really think
> > > > of a convincing reason to let people turn it /off/. If people yell loud enough
> > > > I'll add it, but I'd rather not have to distinguish between "on because user
> > > > set it on" vs "on because hw needs it".
> > > >
> > > > It'd be nice if the presence BDI_CAP_STABLE_WRITES meant that all filesystems
> > > > would wait on page writes. Hrm, guess I'll see about adding that to the patch
> > > > set. Though ISTR that at least the vfat and ext2 maintainers weren't
> > > > interested the last time I asked.
> > >
> > > I'm still a little foggy on the exact semantics and use-cases for this flag.
> > > I'll try to piece together the bits that I know and ask you to tell me what
> > > I've missed or what I've got wrong.
> > >
> > > Stable writes are valuable when the filesystem or device driver calculates
> > > some sort of 'redundancy' information based on the page in memory that is
> > > about to be written.
> > > This could be:
> > > integrity data that will be sent with the page to the storage device
> > > parity over a number of pages that will be written to a separate device
> > > (i.e. RAID5/RAID6).
> > > MAC or similar checksum that will be sent with the data over the network
> > > and will be validated by the receiving device, which can then ACK or
> > > NAK depending on correctness.
> > >
> > > These could be implemented in the filesystem or in the device driver, so
> > > either should be able to request stable writes. If neither request stable
> > > writes, then the cost of stable writes should be avoided.
> >
> > Most of the changes are in the VM; the only place where filesystem-specific
> > bits are needed are for things like ext4 which override page_mkwrite. I'm not
> > sure what you mean by the filesystem requesting stable writes; unless I'm
> > missing something (it's late), it's only the storage device that has any sort
> > of needs. The filesystem is free either to accomodate the need for stable
> > writes, or ignore that and deal with the IO errors.
> As it was noted somewhere else in the thread, if you compute data
> checksums in the filesystem, you need stable pages as well.

Which filesystems other than btrfs do this?

iirc btrfs already provides its own stable page writes by virtue of its COW
nature, though I'm not sure if that holds when nodatacow is set.

> > > For the device driver (or network transport), not getting stable writes when
> > > requested might be a performance issue, or it might be a correctness issue.
> > > e.g. if an unstable write causes a MAC to be wrong, the network layer can
> > > simply arrange a re-transmit. If an unstable write causes RAID5 parity to
> > > be wrong, that unstable write could cause data corruption.
> > >
> > > For the filesystem, the requirement to provide stable writes could just be a
> > > performance cost (a few extra waits) or it could require significant
> > > re-working of the code (you say vfat and ext2 aren't really comfortable with
> > > supporting them).
> >
> > I vaguely recall that the reason for skipping vfat was that the maintainer
> > didn't like the idea of paying the wait costs even on a disk that didn't
> > require it. I think we're addressing that. As for ext2 there wasn't much
> > comment, though I think Ted or someone mentioned that since it was "stable"
> > code, it wasn't worth fixing. In any case, patching filemap_page_mkwrite to
> > have a write_on_page_writeback seems to have fixed a number of filesystems
> > (hfs, reiser, ext2...) with little fuss.
> Yup, I sent you a patch for this a while ago. If stable pages are
> conditional only for case which require them (or make a good use of them),
> I have no problem with making all filesystems provide them. After changing
> filemap_page_mkwrite(), there are 5 filesystems (ncpfs, ceph, cifs, ubifs,
> ocfs2) which won't support stable pages. I can fix these once we agree on
> the generic infrastructure.

Yesterday I added a patch to my tree that's similar to yours. :)

Actually, I created a function so that we don't have to open-code the if and
wait pieces.

It looks like ceph, cifs, and ocfs2 won't be difficult to modify. ncpfs seems
to use the generic vm_ops and should be covered. That leaves ubifs, which
seems to return VM_FAULT_MINOR after unlocking the page.

> > > Finally there is the VFS/VM which needs to provide support for stable
> > > writes. It already does - your flag seems to just allow clients of the
> > > VFS/VM to indicate whether stable writes are required.
> > >
> > > So there seem to be several cases:
> > >
> > > 1/ The filesystem wants to always use stable writes. It just sets the flag,
> > > and the device will see stable writes whether it cares or not. This would
> > > happen if the filesystem is adding integrity data.
> > > 2/ The device would prefer stable writes if possible. This would apply to
> > > iscsi (??) as it needs to add some sort of checksum before putting the
> > > data on the network
> > > 3/ The device absolutely requires stable writes, or needs to provide
> > > stability itself by taking a copy (md/RAID5 does this). So it needs to
> > > know whether each write is stable, or it cannot take advantage of stable
> > > writes.
> > >
> > >
> > > So I see a need for 2 flags here.
> > > The first one is set by the device or transport to say "I would prefer
> > > stable writes if possible".
> >
> > Yep, that seems to correspond with what the patch provides right now.
> >
> > > The second is set by the filesystem, either because it has its own needs, or
> > > because it sees the first flag set on the device and chooses to honour it.
> > > The VFS/VM would act based on this second flag, and devices like md/RAID5
> > > would set the first flag, and assume writes are stable if the second flag is
> > > also set.
> >
> > This is a trickier piece... I guess this means that the bdi has to keep track
> > of how many clients are using it (it looks like multiple fses on a partitioned
> > disk share the same bdi) and of those clients, which ones claim to support
> > stable page writes. If all clients do, then the disk can assume that all
> > writes headed to it will be stable. If even one doesn't, then I guess the
> > device falls back to whatever strategy it used before (the raid5 copying).
> >
> > I guess we also need a way to pass these flags through md/dm if appropriate.
> I'd like to keep things simple. It's not hard to make all filesystems
> support stable pages so let's just do that to remove one variable from the
> picture. Then we are in the situation where storage can request stable
> pages by setting bdi bit and filesystem will obey. Also filesystem itself can
> request stable pages by setting the bit and generic functions in mm will
> take that into account. No need for two flags...
>
> You are right that we need a mechanism to push the flags from the devices
> through various storage layers up into the bdi filesystem sees to make
> things reliable.

md/dm will call blk_integrity_register, so pushing the "stable page writes
required" flag through the various layers is already taken care of. If devices
and filesystems can both indicate that they want stable page writes, I'll have
to keep track of however many users there are.

It does seem like less work to fix all the filesystems than to dork around with
another flag.

--D
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-10-31 23:12:53

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On 10/31/2012 12:36 PM, Darrick J. Wong wrote:
> On Wed, Oct 31, 2012 at 12:56:14PM +0100, Jan Kara wrote:
<snip>
>> You are right that we need a mechanism to push the flags from the devices
>> through various storage layers up into the bdi filesystem sees to make
>> things reliable.
>
> md/dm will call blk_integrity_register, so pushing the "stable page writes
> required" flag through the various layers is already taken care of. If devices
> and filesystems can both indicate that they want stable page writes, I'll have
> to keep track of however many users there are.
>

Please note again the iscsi case. Say the admin defined half of an md iscsi devices
with data_integrity and half without.

For me I would like an OR. If any underline device needs "stable pages" they all
get them.

Please also provide - or how easy is to make an API - for the like of iscsi that
given a request_queue, set the BDI's "stable pages" on. Something like:

/* stable_pages can only be turned on never off */
blk_set_stable_pages(struct request_queue);

> It does seem like less work to fix all the filesystems than to dork around with
> another flag.

Sure if that is possible, that will be perfect, then I do not need to keep
the old "unstable pages" code around at all.

Thanks for working on this
Boaz

2012-11-01 05:51:25

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Wed, Oct 31, 2012 at 04:12:53PM -0700, Boaz Harrosh wrote:
> On 10/31/2012 12:36 PM, Darrick J. Wong wrote:
> > On Wed, Oct 31, 2012 at 12:56:14PM +0100, Jan Kara wrote:
> <snip>
> >> You are right that we need a mechanism to push the flags from the devices
> >> through various storage layers up into the bdi filesystem sees to make
> >> things reliable.
> >
> > md/dm will call blk_integrity_register, so pushing the "stable page writes
> > required" flag through the various layers is already taken care of. If devices
> > and filesystems can both indicate that they want stable page writes, I'll have
> > to keep track of however many users there are.
> >
>
> Please note again the iscsi case. Say the admin defined half of an md iscsi devices
> with data_integrity and half without.
>
> For me I would like an OR. If any underline device needs "stable pages" they all
> get them.
>
> Please also provide - or how easy is to make an API - for the like of iscsi that
> given a request_queue, set the BDI's "stable pages" on. Something like:
>
> /* stable_pages can only be turned on never off */
> blk_set_stable_pages(struct request_queue);

I'll have to test this further, but I think at least DM requires that a given
dm device's sub-devices all have the same integrity profile, which atm seems to
imply that stable writes will be turned on for everything or not at all.

Of course, all that goes out the door with iscsi since it doesn't necessarily
care about DIF/DIX but wants stable pages anyway. It'd be useful to be able to
play around with iscsi too. Could you point me to where in the iscsi code it
does the copy-and-checksum behavior?

--D

> > It does seem like less work to fix all the filesystems than to dork around with
> > another flag.
>
> Sure if that is possible, that will be perfect, then I do not need to keep
> the old "unstable pages" code around at all.
>
> Thanks for working on this
> Boaz
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html

2012-11-01 06:25:39

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On 10/31/2012 10:51 PM, Darrick J. Wong wrote:
>
> Of course, all that goes out the door with iscsi since it doesn't necessarily
> care about DIF/DIX but wants stable pages anyway. It'd be useful to be able to
> play around with iscsi too. Could you point me to where in the iscsi code it
> does the copy-and-checksum behavior?
>

I'll send you an untested proof of concept patch tomorrow. So at least you can
see where the code is. But no guaranty that it actually works.

You will need to dig into iscsi docs to enable it in config files, look for
data_digest= or something like that.

header digest is when only the headers are check-summed and data-digest is when
all the data is check-summed.

> --D
>

Cheers
Boaz

2012-11-01 08:59:45

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On Wed 31-10-12 12:36:52, Darrick J. Wong wrote:
> On Wed, Oct 31, 2012 at 12:56:14PM +0100, Jan Kara wrote:
> > I'd like to keep things simple. It's not hard to make all filesystems
> > support stable pages so let's just do that to remove one variable from the
> > picture. Then we are in the situation where storage can request stable
> > pages by setting bdi bit and filesystem will obey. Also filesystem itself can
> > request stable pages by setting the bit and generic functions in mm will
> > take that into account. No need for two flags...
> >
> > You are right that we need a mechanism to push the flags from the devices
> > through various storage layers up into the bdi filesystem sees to make
> > things reliable.
>
> md/dm will call blk_integrity_register, so pushing the "stable page writes
> required" flag through the various layers is already taken care of. If devices
> and filesystems can both indicate that they want stable page writes, I'll have
> to keep track of however many users there are.
Oh, right. When the device is partitioned and more filesystems are used,
it would be reasonable to clear the flag when there are no stable-page
users. Actually, thinking more about this case, two flags would make things
more transparent. One flag will be in bdi and one in superblock. When
device needs stable pages, it sets bdi flag. If filesystem needs stable pages,
it sets the sb flag. This way we don't have to track any users.

I was thinking about transferring the flag from bdi to sb on mount so that
we can check only sb flag but that actually won't work because of writes
directly to the block device (all block device inodes share one
superblock).

Thoughts?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2012-11-01 17:24:48

[permalink] [raw]

Subject: Re: [RFC PATCH 1/2] bdi: Create a flag to indicate that a backing device needs stable page writes

On 11/01/2012 01:59 AM, Jan Kara wrote:
<>
> (all block device inodes share one superblock).
>

Really? that is not so good is it, for other obvious reasons.
Why is it not one superblock per BDI? That would be more obvious
to me.

> Thoughts?

It's a really bad design. I think it is worth fixing. For the above
problem, as well as a much better fit with our current thread-per-bdi,
and the rest of the Kernel model. No?

>
> Honza
>

Thanks
Boaz

2012-11-01 22:42:18