We've recently discovered a workload at Google where the page
stablization patches (specifically commit 0e499890c1f: ext4: wait for
writeback to complete while making pages writable) resulted in a
**major** performance regression. As in, kernel threads that were
writing to log files were getting hit by up to 2 seconds stalls, which
very badly hurt a particular application. Reverting this commit fixed
the performance regression.
The main reason for the page stablizatoin patches was for DIF/DIX
support, right? So I'm wondering if we should just disable the calls
to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
i.e., something like this.
What do people think? I have a feeling this is going to be very
controversial....
- Ted
ext4: Disable page stablization if DIF/DIX not enabled
Requiring processes which are writing to files which are under writeback
until the writeback is complete can result in massive performance hits.
This is especially true if writes are being throttled due to I/O cgroup
limits and the application is running on an especially busy server.
If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
since that's the main case where this is needed, and page stablization
can be very painful.
Signed-off-by: "Theodore Ts'o" <[email protected]>
diff --git a/fs/buffer.c b/fs/buffer.c
index 1a30db7..d25c60f 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = -EAGAIN;
goto out_unlock;
}
+#ifdef CONFIG_BLKDEV_INTEGRITY
wait_on_page_writeback(page);
+#endif
return 0;
out_unlock:
unlock_page(page);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 5f8081c..01f86c5 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+#ifdef CONFIG_BLKDEV_INTEGRITY
/* Wait so that we don't change page under IO */
wait_on_page_writeback(page);
+#endif
ret = VM_FAULT_LOCKED;
goto out;
}
On 3/7/12 5:40 PM, Theodore Ts'o wrote:
> We've recently discovered a workload at Google where the page
> stablization patches (specifically commit 0e499890c1f: ext4: wait for
> writeback to complete while making pages writable) resulted in a
> **major** performance regression. As in, kernel threads that were
> writing to log files were getting hit by up to 2 seconds stalls, which
> very badly hurt a particular application. Reverting this commit fixed
> the performance regression.
>
> The main reason for the page stablizatoin patches was for DIF/DIX
> support, right? So I'm wondering if we should just disable the calls
> to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> i.e., something like this.
Can you devise a non-secret testcase that demonstrates this?
Thanks,
-Eric
> What do people think? I have a feeling this is going to be very
> controversial....
>
> - Ted
>
> ext4: Disable page stablization if DIF/DIX not enabled
>
> Requiring processes which are writing to files which are under writeback
> until the writeback is complete can result in massive performance hits.
> This is especially true if writes are being throttled due to I/O cgroup
> limits and the application is running on an especially busy server.
>
> If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
> since that's the main case where this is needed, and page stablization
> can be very painful.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1a30db7..d25c60f 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> ret = -EAGAIN;
> goto out_unlock;
> }
> +#ifdef CONFIG_BLKDEV_INTEGRITY
> wait_on_page_writeback(page);
> +#endif
> return 0;
> out_unlock:
> unlock_page(page);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5f8081c..01f86c5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (page_has_buffers(page)) {
> if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> ext4_bh_unmapped)) {
> +#ifdef CONFIG_BLKDEV_INTEGRITY
> /* Wait so that we don't change page under IO */
> wait_on_page_writeback(page);
> +#endif
> ret = VM_FAULT_LOCKED;
> goto out;
> }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 07, 2012 at 05:54:11PM -0600, Eric Sandeen wrote:
> On 3/7/12 5:40 PM, Theodore Ts'o wrote:
> > We've recently discovered a workload at Google where the page
> > stablization patches (specifically commit 0e499890c1f: ext4: wait for
> > writeback to complete while making pages writable) resulted in a
> > **major** performance regression. As in, kernel threads that were
> > writing to log files were getting hit by up to 2 seconds stalls, which
> > very badly hurt a particular application. Reverting this commit fixed
> > the performance regression.
> >
> > The main reason for the page stablizatoin patches was for DIF/DIX
> > support, right? So I'm wondering if we should just disable the calls
> > to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> > i.e., something like this.
>
> Can you devise a non-secret testcase that demonstrates this?
It sure would be nice if the block device could know if it requires stable
writeback, and the fs could figure that out.... though iirc there was more to
my patchset than just these two wait_on_page_writeback() calls.
--D
>
> Thanks,
> -Eric
>
> > What do people think? I have a feeling this is going to be very
> > controversial....
> >
> > - Ted
> >
> > ext4: Disable page stablization if DIF/DIX not enabled
> >
> > Requiring processes which are writing to files which are under writeback
> > until the writeback is complete can result in massive performance hits.
> > This is especially true if writes are being throttled due to I/O cgroup
> > limits and the application is running on an especially busy server.
> >
> > If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
> > since that's the main case where this is needed, and page stablization
> > can be very painful.
> >
> > Signed-off-by: "Theodore Ts'o" <[email protected]>
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 1a30db7..d25c60f 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> > ret = -EAGAIN;
> > goto out_unlock;
> > }
> > +#ifdef CONFIG_BLKDEV_INTEGRITY
> > wait_on_page_writeback(page);
> > +#endif
> > return 0;
> > out_unlock:
> > unlock_page(page);
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 5f8081c..01f86c5 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > if (page_has_buffers(page)) {
> > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > ext4_bh_unmapped)) {
> > +#ifdef CONFIG_BLKDEV_INTEGRITY
> > /* Wait so that we don't change page under IO */
> > wait_on_page_writeback(page);
> > +#endif
> > ret = VM_FAULT_LOCKED;
> > goto out;
> > }
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
On 03/07/2012 03:40 PM, Theodore Ts'o wrote:
>
> We've recently discovered a workload at Google where the page
> stablization patches (specifically commit 0e499890c1f: ext4: wait for
> writeback to complete while making pages writable) resulted in a
> **major** performance regression. As in, kernel threads that were
> writing to log files were getting hit by up to 2 seconds stalls, which
> very badly hurt a particular application.
That 2 seconds hit I think I know how to fix somewhat with a smarter
write-back. I want to talk about this in LSF with people
> Reverting this commit fixed the performance regression.
>
> The main reason for the page stablizatoin patches was for DIF/DIX
> support, right? So I'm wondering if we should just disable the calls
> to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> i.e., something like this.
>
> What do people think? I have a feeling this is going to be very
> controversial....
>
NACK
It's not a CONFIG_ thing it's: Is this particular device needs stable pages?
As I stated many times before, the device should have a property that
says if it needs stable pages or not. The candidates for stable pages are:
- DIF/DIX enabled devices
- RAID-1/4/5/6 devices
- iscsi devices with data digest signature
- Any other checksum enabled block device.
A fedora distro will have CONFIG_BLKDEV_INTEGRITY set then you are always
out of luck, even with devices that can care less.
Please submit a proper patch, even a temporary mount option. But this is
ABI. The best is to find where to export it as part of the device's
properties sysfs dir. And inspect that
> - Ted
>
Thanks
Boaz
> ext4: Disable page stablization if DIF/DIX not enabled
>
> Requiring processes which are writing to files which are under writeback
> until the writeback is complete can result in massive performance hits.
> This is especially true if writes are being throttled due to I/O cgroup
> limits and the application is running on an especially busy server.
>
> If CONFIG_BLKDEV_INTEGRITY is not enabled, disable page stablization,
> since that's the main case where this is needed, and page stablization
> can be very painful.
>
> Signed-off-by: "Theodore Ts'o" <[email protected]>
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1a30db7..d25c60f 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2333,7 +2333,9 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> ret = -EAGAIN;
> goto out_unlock;
> }
> +#ifdef CONFIG_BLKDEV_INTEGRITY
> wait_on_page_writeback(page);
> +#endif
> return 0;
> out_unlock:
> unlock_page(page);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 5f8081c..01f86c5 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4638,8 +4638,10 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (page_has_buffers(page)) {
> if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> ext4_bh_unmapped)) {
> +#ifdef CONFIG_BLKDEV_INTEGRITY
> /* Wait so that we don't change page under IO */
> wait_on_page_writeback(page);
> +#endif
> ret = VM_FAULT_LOCKED;
> goto out;
> }
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Mar 07, 2012 at 04:05:10PM -0800, Darrick J. Wong wrote:
> On Wed, Mar 07, 2012 at 05:54:11PM -0600, Eric Sandeen wrote:
> > On 3/7/12 5:40 PM, Theodore Ts'o wrote:
> > > We've recently discovered a workload at Google where the page
> > > stablization patches (specifically commit 0e499890c1f: ext4: wait for
> > > writeback to complete while making pages writable) resulted in a
> > > **major** performance regression. As in, kernel threads that were
> > > writing to log files were getting hit by up to 2 seconds stalls, which
> > > very badly hurt a particular application. Reverting this commit fixed
> > > the performance regression.
> > >
> > > The main reason for the page stablizatoin patches was for DIF/DIX
> > > support, right? So I'm wondering if we should just disable the calls
> > > to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
> > > i.e., something like this.
> >
> > Can you devise a non-secret testcase that demonstrates this?
>
> It sure would be nice if the block device could know if it requires stable
> writeback, and the fs could figure that out.... though iirc there was more to
> my patchset than just these two wait_on_page_writeback() calls.
Ted,
Would something along the lines of the following patch address your concern in
a somewhat more flexible manner?
Provide a BDI flag to indicate that a device requires stable pages during
writeback. Use the flag to skip wait_on_page_writeback() if we don't have such
a device that needs it.
(Obviously this needs to be refactored a bit...)
Signed-off-by: Darrick J. Wong <[email protected]>
---
block/blk-integrity.c | 7 +++++++
fs/buffer.c | 2 +-
fs/ext4/inode.c | 2 +-
include/linux/backing-dev.h | 3 +++
include/linux/fs.h | 1 +
include/linux/mm.h | 13 +------------
include/linux/pagemap.h | 14 ++++++++++++++
mm/filemap.c | 2 +-
mm/memory.c | 14 ++++++++++++++
mm/page-writeback.c | 10 ++++++++++
10 files changed, 53 insertions(+), 15 deletions(-)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index da2a818..f2d51f9 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -420,6 +420,10 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
} else
bi->name = bi_unsupported_name;
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+ disk->queue->backing_dev_info.state |= (1 << BDI_stable_writes);
+#endif
+
return 0;
}
EXPORT_SYMBOL(blk_integrity_register);
@@ -438,6 +442,9 @@ void blk_integrity_unregister(struct gendisk *disk)
if (!disk || !disk->integrity)
return;
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+ disk->queue->backing_dev_info.state &= ~(1 << BDI_stable_writes);
+#endif
bi = disk->integrity;
kobject_uevent(&bi->kobj, KOBJ_REMOVE);
diff --git a/fs/buffer.c b/fs/buffer.c
index 1a30db7..d994d3d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2333,7 +2333,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = -EAGAIN;
goto out_unlock;
}
- wait_on_page_writeback(page);
+ wait_on_stable_page_writeback(page);
return 0;
out_unlock:
unlock_page(page);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index e94ac91..79e6c90 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -4724,7 +4724,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
/* Wait so that we don't change page under IO */
- wait_on_page_writeback(page);
+ wait_on_stable_page_writeback(page);
ret = VM_FAULT_LOCKED;
goto out;
}
diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
index b1038bd..a28fecb 100644
--- a/include/linux/backing-dev.h
+++ b/include/linux/backing-dev.h
@@ -32,6 +32,9 @@ enum bdi_state {
BDI_sync_congested, /* The sync queue is getting full */
BDI_registered, /* bdi_register() was done */
BDI_writeback_running, /* Writeback is in progress */
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+ BDI_stable_writes, /* Pages must not change during write */
+#endif
BDI_unused, /* Available bits start here */
};
diff --git a/include/linux/fs.h b/include/linux/fs.h
index 69cd5bb..d1eb3ce 100644
--- a/include/linux/fs.h
+++ b/include/linux/fs.h
@@ -709,6 +709,7 @@ struct block_device {
#define PAGECACHE_TAG_TOWRITE 2
int mapping_tagged(struct address_space *mapping, int tag);
+int mapping_needs_stable_writes(struct address_space *as);
/*
* Might pages of this file be mapped into userspace?
diff --git a/include/linux/mm.h b/include/linux/mm.h
index 17b27cd..a069bcf 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -783,18 +783,7 @@ void page_address_init(void);
#define PAGE_MAPPING_KSM 2
#define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
-extern struct address_space swapper_space;
-static inline struct address_space *page_mapping(struct page *page)
-{
- struct address_space *mapping = page->mapping;
-
- VM_BUG_ON(PageSlab(page));
- if (unlikely(PageSwapCache(page)))
- mapping = &swapper_space;
- else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
- mapping = NULL;
- return mapping;
-}
+struct address_space *page_mapping(struct page *page);
/* Neutral page->mapping pointer to address_space or anon_vma or other */
static inline void *page_rmapping(struct page *page)
diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
index cfaaa69..0f82b91 100644
--- a/include/linux/pagemap.h
+++ b/include/linux/pagemap.h
@@ -392,6 +392,20 @@ static inline void wait_on_page_writeback(struct page *page)
wait_on_page_bit(page, PG_writeback);
}
+static inline void wait_on_stable_page_writeback(struct page *page)
+{
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+ struct address_space *as;
+
+ as = page_mapping(page);
+ if (!as)
+ return;
+
+ if (mapping_needs_stable_writes(as))
+ wait_on_page_writeback(page);
+#endif
+}
+
extern void end_page_writeback(struct page *page);
/*
diff --git a/mm/filemap.c b/mm/filemap.c
index b662757..08ffa94 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2361,7 +2361,7 @@ repeat:
return NULL;
}
found:
- wait_on_page_writeback(page);
+ wait_on_stable_page_writeback(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
diff --git a/mm/memory.c b/mm/memory.c
index fa2f04e..40288e5 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -67,6 +67,20 @@
#include "internal.h"
+extern struct address_space swapper_space;
+struct address_space *page_mapping(struct page *page)
+{
+ struct address_space *mapping = page->mapping;
+
+ VM_BUG_ON(PageSlab(page));
+ if (unlikely(PageSwapCache(page)))
+ mapping = &swapper_space;
+ else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
+ mapping = NULL;
+ return mapping;
+}
+EXPORT_SYMBOL(page_mapping);
+
#ifndef CONFIG_NEED_MULTIPLE_NODES
/* use the per-pgdat data instead for discontigmem - mbligh */
unsigned long max_mapnr;
diff --git a/mm/page-writeback.c b/mm/page-writeback.c
index 363ba70..dc86f8f 100644
--- a/mm/page-writeback.c
+++ b/mm/page-writeback.c
@@ -2239,3 +2239,13 @@ int mapping_tagged(struct address_space *mapping, int tag)
return radix_tree_tagged(&mapping->page_tree, tag);
}
EXPORT_SYMBOL(mapping_tagged);
+
+int mapping_needs_stable_writes(struct address_space *as)
+{
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+ if (as->backing_dev_info->state & (1 << BDI_stable_writes))
+ return 1;
+#endif
+ return 0;
+}
+EXPORT_SYMBOL_GPL(mapping_needs_stable_writes);
> Can you devise a non-secret testcase that demonstrates this?
Hmm. I bet you could get fio to do it. Giant file, random mmap()
writes, spin until the CPU overwhelms writeback?
- z
On 03/07/2012 06:18 PM, Darrick J. Wong wrote:
> On Wed, Mar 07, 2012 at 04:05:10PM -0800, Darrick J. Wong wrote:
>> On Wed, Mar 07, 2012 at 05:54:11PM -0600, Eric Sandeen wrote:
>>> On 3/7/12 5:40 PM, Theodore Ts'o wrote:
>>>> We've recently discovered a workload at Google where the page
>>>> stablization patches (specifically commit 0e499890c1f: ext4: wait for
>>>> writeback to complete while making pages writable) resulted in a
>>>> **major** performance regression. As in, kernel threads that were
>>>> writing to log files were getting hit by up to 2 seconds stalls, which
>>>> very badly hurt a particular application. Reverting this commit fixed
>>>> the performance regression.
>>>>
>>>> The main reason for the page stablizatoin patches was for DIF/DIX
>>>> support, right? So I'm wondering if we should just disable the calls
>>>> to wait_on_page_writeback if CONFIG_BLKDEV_INTEGRITY is not defined.
>>>> i.e., something like this.
>>>
>>> Can you devise a non-secret testcase that demonstrates this?
>>
>> It sure would be nice if the block device could know if it requires stable
>> writeback, and the fs could figure that out.... though iirc there was more to
>> my patchset than just these two wait_on_page_writeback() calls.
>
> Ted,
>
> Would something along the lines of the following patch address your concern in
> a somewhat more flexible manner?
>
> Provide a BDI flag to indicate that a device requires stable pages during
> writeback. Use the flag to skip wait_on_page_writeback() if we don't have such
> a device that needs it.
>
> (Obviously this needs to be refactored a bit...)
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> block/blk-integrity.c | 7 +++++++
> fs/buffer.c | 2 +-
> fs/ext4/inode.c | 2 +-
> include/linux/backing-dev.h | 3 +++
> include/linux/fs.h | 1 +
> include/linux/mm.h | 13 +------------
> include/linux/pagemap.h | 14 ++++++++++++++
> mm/filemap.c | 2 +-
> mm/memory.c | 14 ++++++++++++++
> mm/page-writeback.c | 10 ++++++++++
> 10 files changed, 53 insertions(+), 15 deletions(-)
>
> diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> index da2a818..f2d51f9 100644
> --- a/block/blk-integrity.c
> +++ b/block/blk-integrity.c
> @@ -420,6 +420,10 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> } else
> bi->name = bi_unsupported_name;
>
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
Ooof I feel like a mute.
No also iscsi data digest needs stable pages. Please remove any reference to
CONFIG_BLK_DEV_INTEGRITY. It needs to be supported for other
transport needs.
Just make the default off. (Which it is)
I'll check to see how to add this for iscsi, when data digest is on
Thanks
Boaz
> + disk->queue->backing_dev_info.state |= (1 << BDI_stable_writes);
> +#endif
> +
> return 0;
> }
> EXPORT_SYMBOL(blk_integrity_register);
> @@ -438,6 +442,9 @@ void blk_integrity_unregister(struct gendisk *disk)
> if (!disk || !disk->integrity)
> return;
>
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> + disk->queue->backing_dev_info.state &= ~(1 << BDI_stable_writes);
> +#endif
> bi = disk->integrity;
>
> kobject_uevent(&bi->kobj, KOBJ_REMOVE);
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 1a30db7..d994d3d 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2333,7 +2333,7 @@ int __block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> ret = -EAGAIN;
> goto out_unlock;
> }
> - wait_on_page_writeback(page);
> + wait_on_stable_page_writeback(page);
> return 0;
> out_unlock:
> unlock_page(page);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index e94ac91..79e6c90 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -4724,7 +4724,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> ext4_bh_unmapped)) {
> /* Wait so that we don't change page under IO */
> - wait_on_page_writeback(page);
> + wait_on_stable_page_writeback(page);
> ret = VM_FAULT_LOCKED;
> goto out;
> }
> diff --git a/include/linux/backing-dev.h b/include/linux/backing-dev.h
> index b1038bd..a28fecb 100644
> --- a/include/linux/backing-dev.h
> +++ b/include/linux/backing-dev.h
> @@ -32,6 +32,9 @@ enum bdi_state {
> BDI_sync_congested, /* The sync queue is getting full */
> BDI_registered, /* bdi_register() was done */
> BDI_writeback_running, /* Writeback is in progress */
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> + BDI_stable_writes, /* Pages must not change during write */
> +#endif
> BDI_unused, /* Available bits start here */
> };
>
> diff --git a/include/linux/fs.h b/include/linux/fs.h
> index 69cd5bb..d1eb3ce 100644
> --- a/include/linux/fs.h
> +++ b/include/linux/fs.h
> @@ -709,6 +709,7 @@ struct block_device {
> #define PAGECACHE_TAG_TOWRITE 2
>
> int mapping_tagged(struct address_space *mapping, int tag);
> +int mapping_needs_stable_writes(struct address_space *as);
>
> /*
> * Might pages of this file be mapped into userspace?
> diff --git a/include/linux/mm.h b/include/linux/mm.h
> index 17b27cd..a069bcf 100644
> --- a/include/linux/mm.h
> +++ b/include/linux/mm.h
> @@ -783,18 +783,7 @@ void page_address_init(void);
> #define PAGE_MAPPING_KSM 2
> #define PAGE_MAPPING_FLAGS (PAGE_MAPPING_ANON | PAGE_MAPPING_KSM)
>
> -extern struct address_space swapper_space;
> -static inline struct address_space *page_mapping(struct page *page)
> -{
> - struct address_space *mapping = page->mapping;
> -
> - VM_BUG_ON(PageSlab(page));
> - if (unlikely(PageSwapCache(page)))
> - mapping = &swapper_space;
> - else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> - mapping = NULL;
> - return mapping;
> -}
> +struct address_space *page_mapping(struct page *page);
>
> /* Neutral page->mapping pointer to address_space or anon_vma or other */
> static inline void *page_rmapping(struct page *page)
> diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h
> index cfaaa69..0f82b91 100644
> --- a/include/linux/pagemap.h
> +++ b/include/linux/pagemap.h
> @@ -392,6 +392,20 @@ static inline void wait_on_page_writeback(struct page *page)
> wait_on_page_bit(page, PG_writeback);
> }
>
> +static inline void wait_on_stable_page_writeback(struct page *page)
> +{
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> + struct address_space *as;
> +
> + as = page_mapping(page);
> + if (!as)
> + return;
> +
> + if (mapping_needs_stable_writes(as))
> + wait_on_page_writeback(page);
> +#endif
> +}
> +
> extern void end_page_writeback(struct page *page);
>
> /*
> diff --git a/mm/filemap.c b/mm/filemap.c
> index b662757..08ffa94 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2361,7 +2361,7 @@ repeat:
> return NULL;
> }
> found:
> - wait_on_page_writeback(page);
> + wait_on_stable_page_writeback(page);
> return page;
> }
> EXPORT_SYMBOL(grab_cache_page_write_begin);
> diff --git a/mm/memory.c b/mm/memory.c
> index fa2f04e..40288e5 100644
> --- a/mm/memory.c
> +++ b/mm/memory.c
> @@ -67,6 +67,20 @@
>
> #include "internal.h"
>
> +extern struct address_space swapper_space;
> +struct address_space *page_mapping(struct page *page)
> +{
> + struct address_space *mapping = page->mapping;
> +
> + VM_BUG_ON(PageSlab(page));
> + if (unlikely(PageSwapCache(page)))
> + mapping = &swapper_space;
> + else if ((unsigned long)mapping & PAGE_MAPPING_ANON)
> + mapping = NULL;
> + return mapping;
> +}
> +EXPORT_SYMBOL(page_mapping);
> +
> #ifndef CONFIG_NEED_MULTIPLE_NODES
> /* use the per-pgdat data instead for discontigmem - mbligh */
> unsigned long max_mapnr;
> diff --git a/mm/page-writeback.c b/mm/page-writeback.c
> index 363ba70..dc86f8f 100644
> --- a/mm/page-writeback.c
> +++ b/mm/page-writeback.c
> @@ -2239,3 +2239,13 @@ int mapping_tagged(struct address_space *mapping, int tag)
> return radix_tree_tagged(&mapping->page_tree, tag);
> }
> EXPORT_SYMBOL(mapping_tagged);
> +
> +int mapping_needs_stable_writes(struct address_space *as)
> +{
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> + if (as->backing_dev_info->state & (1 << BDI_stable_writes))
> + return 1;
> +#endif
> + return 0;
> +}
> +EXPORT_SYMBOL_GPL(mapping_needs_stable_writes);
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 03/07/2012 07:00 PM, Boaz Harrosh wrote:
> On 03/07/2012 06:18 PM, Darrick J. Wong wrote:
>> On Wed, Mar 07, 2012 at 04:05:10PM -0800, Darrick J. Wong wrote:
<snip>
>> diff --git a/block/blk-integrity.c b/block/blk-integrity.c
>> index da2a818..f2d51f9 100644
>> --- a/block/blk-integrity.c
>> +++ b/block/blk-integrity.c
>> @@ -420,6 +420,10 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
>> } else
>> bi->name = bi_unsupported_name;
>>
>> +#ifdef CONFIG_BLK_DEV_INTEGRITY
>
> Ooof I feel like a mute.
>
> No also iscsi data digest needs stable pages. Please remove any reference to
> CONFIG_BLK_DEV_INTEGRITY. It needs to be supported for other
> transport needs.
> Just make the default off. (Which it is)
>
> I'll check to see how to add this for iscsi, when data digest is on
>
> Thanks
> Boaz
>
>> + disk->queue->backing_dev_info.state |= (1 << BDI_stable_writes);
Actually I will have an hard time accessing the BDI from the iscsi LLD
Can you please put this flag as part of the topology parameters exported
from a request_queue. And inspect it from there.
It should be a block device property. not a BDI one.
Thanks
Boaz
>> +#endif
>> +
>> return 0;
>> }
>> EXPORT_SYMBOL(blk_integrity_register);
<snip>
>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:
Boaz> As I stated many times before, the device should have a property
Boaz> that says if it needs stable pages or not. The candidates for
Boaz> stable pages are:
Boaz> - DIF/DIX enabled devices
Boaz> - RAID-1/4/5/6 devices
Boaz> - iscsi devices with data digest signature
Boaz> - Any other checksum enabled block device.
Boaz> A fedora distro will have CONFIG_BLKDEV_INTEGRITY set then you are
Boaz> always out of luck, even with devices that can care less.
Boaz> Please submit a proper patch, even a temporary mount option. But
Boaz> this is ABI. The best is to find where to export it as part of the
Boaz> device's properties sysfs dir.
We could do something like this:
diff --git a/block/blk-settings.c b/block/blk-settings.c
index 5680b91..442a0df 100644
--- a/block/blk-settings.c
+++ b/block/blk-settings.c
@@ -125,6 +125,7 @@ void blk_set_default_limits(struct queue_limits *lim)
lim->io_opt = 0;
lim->misaligned = 0;
lim->cluster = 1;
+ lim->needs_stable_pages = false;
}
EXPORT_SYMBOL(blk_set_default_limits);
@@ -571,6 +572,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
t->cluster &= b->cluster;
t->discard_zeroes_data &= b->discard_zeroes_data;
+ t->needs_stable_pages &= b->needs_stable_pages;
+
/* Physical block size a multiple of the logical block size? */
if (t->physical_block_size & (t->logical_block_size - 1)) {
t->physical_block_size = t->logical_block_size;
diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
index 5b85d91..d464aca 100644
--- a/block/blk-sysfs.c
+++ b/block/blk-sysfs.c
@@ -161,6 +161,11 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
return queue_var_show(queue_discard_zeroes_data(q), page);
}
+static ssize_t queue_needs_stable_pages_show(struct request_queue *q, char *page)
+{
+ return queue_var_show(q->limits.needs_stable_pages, page);
+}
+
static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
{
return sprintf(page, "%llu\n",
@@ -364,6 +369,11 @@ static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
.show = queue_discard_zeroes_data_show,
};
+static struct queue_sysfs_entry queue_needs_stable_pages_entry = {
+ .attr = {.name = "needs_stable_pages", .mode = S_IRUGO },
+ .show = queue_needs_stable_pages_show,
+};
+
static struct queue_sysfs_entry queue_write_same_max_entry = {
.attr = {.name = "write_same_max_bytes", .mode = S_IRUGO },
.show = queue_write_same_max_show,
@@ -416,6 +426,7 @@ static struct attribute *default_attrs[] = {
&queue_discard_granularity_entry.attr,
&queue_discard_max_entry.attr,
&queue_discard_zeroes_data_entry.attr,
+ &queue_needs_stable_pages_entry.attr,
&queue_write_same_max_entry.attr,
&queue_nonrot_entry.attr,
&queue_nomerges_entry.attr,
diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
index 26eff46..146bed4 100644
--- a/drivers/scsi/sd.c
+++ b/drivers/scsi/sd.c
@@ -1752,10 +1752,11 @@ static void sd_read_protection_type(struct scsi_disk *sdkp, unsigned char *buffe
return;
}
- if (scsi_host_dif_capable(sdp->host, type))
+ if (scsi_host_dif_capable(sdp->host, type)) {
sd_printk(KERN_NOTICE, sdkp,
"Enabling DIF Type %u protection\n", type);
- else
+ sdkp->disk->queue->limits.needs_stable_pages = true;
+ } else
sd_printk(KERN_NOTICE, sdkp,
"Disabling DIF Type %u protection\n", type);
}
diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
index 0cb39ff..9dc330c 100644
--- a/drivers/scsi/sd_dif.c
+++ b/drivers/scsi/sd_dif.c
@@ -338,6 +338,8 @@ void sd_dif_config_host(struct scsi_disk *sdkp)
sd_printk(KERN_NOTICE, sdkp,
"Enabling DIX %s protection\n", disk->integrity->name);
+ disk->queue->limits.needs_stable_pages = true;
+
/* Signal to block layer that we support sector tagging */
if (dif && type && sdkp->ATO) {
if (type == SD_DIF_TYPE3_PROTECTION)
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 92956b7..a5a33db 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -266,6 +266,8 @@ struct queue_limits {
unsigned char discard_misaligned;
unsigned char cluster;
unsigned char discard_zeroes_data;
+
+ bool needs_stable_pages;
};
struct request_queue {
On 03/07/2012 07:45 PM, Martin K. Petersen wrote:
>>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:
>
> Boaz> As I stated many times before, the device should have a property
> Boaz> that says if it needs stable pages or not. The candidates for
> Boaz> stable pages are:
>
> Boaz> - DIF/DIX enabled devices
> Boaz> - RAID-1/4/5/6 devices
> Boaz> - iscsi devices with data digest signature
> Boaz> - Any other checksum enabled block device.
>
> Boaz> A fedora distro will have CONFIG_BLKDEV_INTEGRITY set then you are
> Boaz> always out of luck, even with devices that can care less.
>
> Boaz> Please submit a proper patch, even a temporary mount option. But
> Boaz> this is ABI. The best is to find where to export it as part of the
> Boaz> device's properties sysfs dir.
>
> We could do something like this:
>
Yes, this one is perfect.
Combined with Darrick's patch to actually inspect the flag at the filesystem level
is the solution I want.
When submitted I will also send a patch to set .needs_stable_pages in iscsi
when needed.
Thanks, Martin
Boaz
> diff --git a/block/blk-settings.c b/block/blk-settings.c
> index 5680b91..442a0df 100644
> --- a/block/blk-settings.c
> +++ b/block/blk-settings.c
> @@ -125,6 +125,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> lim->io_opt = 0;
> lim->misaligned = 0;
> lim->cluster = 1;
> + lim->needs_stable_pages = false;
> }
> EXPORT_SYMBOL(blk_set_default_limits);
>
> @@ -571,6 +572,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> t->cluster &= b->cluster;
> t->discard_zeroes_data &= b->discard_zeroes_data;
>
> + t->needs_stable_pages &= b->needs_stable_pages;
> +
> /* Physical block size a multiple of the logical block size? */
> if (t->physical_block_size & (t->logical_block_size - 1)) {
> t->physical_block_size = t->logical_block_size;
> diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> index 5b85d91..d464aca 100644
> --- a/block/blk-sysfs.c
> +++ b/block/blk-sysfs.c
> @@ -161,6 +161,11 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
> return queue_var_show(queue_discard_zeroes_data(q), page);
> }
>
> +static ssize_t queue_needs_stable_pages_show(struct request_queue *q, char *page)
> +{
> + return queue_var_show(q->limits.needs_stable_pages, page);
> +}
> +
> static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
> {
> return sprintf(page, "%llu\n",
> @@ -364,6 +369,11 @@ static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
> .show = queue_discard_zeroes_data_show,
> };
>
> +static struct queue_sysfs_entry queue_needs_stable_pages_entry = {
> + .attr = {.name = "needs_stable_pages", .mode = S_IRUGO },
> + .show = queue_needs_stable_pages_show,
> +};
> +
> static struct queue_sysfs_entry queue_write_same_max_entry = {
> .attr = {.name = "write_same_max_bytes", .mode = S_IRUGO },
> .show = queue_write_same_max_show,
> @@ -416,6 +426,7 @@ static struct attribute *default_attrs[] = {
> &queue_discard_granularity_entry.attr,
> &queue_discard_max_entry.attr,
> &queue_discard_zeroes_data_entry.attr,
> + &queue_needs_stable_pages_entry.attr,
> &queue_write_same_max_entry.attr,
> &queue_nonrot_entry.attr,
> &queue_nomerges_entry.attr,
> diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> index 26eff46..146bed4 100644
> --- a/drivers/scsi/sd.c
> +++ b/drivers/scsi/sd.c
> @@ -1752,10 +1752,11 @@ static void sd_read_protection_type(struct scsi_disk *sdkp, unsigned char *buffe
> return;
> }
>
> - if (scsi_host_dif_capable(sdp->host, type))
> + if (scsi_host_dif_capable(sdp->host, type)) {
> sd_printk(KERN_NOTICE, sdkp,
> "Enabling DIF Type %u protection\n", type);
> - else
> + sdkp->disk->queue->limits.needs_stable_pages = true;
> + } else
> sd_printk(KERN_NOTICE, sdkp,
> "Disabling DIF Type %u protection\n", type);
> }
> diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
> index 0cb39ff..9dc330c 100644
> --- a/drivers/scsi/sd_dif.c
> +++ b/drivers/scsi/sd_dif.c
> @@ -338,6 +338,8 @@ void sd_dif_config_host(struct scsi_disk *sdkp)
> sd_printk(KERN_NOTICE, sdkp,
> "Enabling DIX %s protection\n", disk->integrity->name);
>
> + disk->queue->limits.needs_stable_pages = true;
> +
> /* Signal to block layer that we support sector tagging */
> if (dif && type && sdkp->ATO) {
> if (type == SD_DIF_TYPE3_PROTECTION)
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 92956b7..a5a33db 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -266,6 +266,8 @@ struct queue_limits {
> unsigned char discard_misaligned;
> unsigned char cluster;
> unsigned char discard_zeroes_data;
> +
> + bool needs_stable_pages;
> };
>
> struct request_queue {
Hi Boaz, Martin, Ted,
On Wed, 7 Mar 2012, Boaz Harrosh wrote:
> On 03/07/2012 07:45 PM, Martin K. Petersen wrote:
> >>>>>> "Boaz" == Boaz Harrosh <[email protected]> writes:
> >
> > Boaz> As I stated many times before, the device should have a property
> > Boaz> that says if it needs stable pages or not. The candidates for
> > Boaz> stable pages are:
> >
> > Boaz> - DIF/DIX enabled devices
> > Boaz> - RAID-1/4/5/6 devices
> > Boaz> - iscsi devices with data digest signature
> > Boaz> - Any other checksum enabled block device.
> >
> > Boaz> A fedora distro will have CONFIG_BLKDEV_INTEGRITY set then you are
> > Boaz> always out of luck, even with devices that can care less.
> >
> > Boaz> Please submit a proper patch, even a temporary mount option. But
> > Boaz> this is ABI. The best is to find where to export it as part of the
> > Boaz> device's properties sysfs dir.
> >
> > We could do something like this:
> >
>
> Yes, this one is perfect.
>
> Combined with Darrick's patch to actually inspect the flag at the
> filesystem level is the solution I want.
This avoids the problem for devices that don't need stable pages, but
doesn't help for those that do (btrfs, raid, iscsi, dif/dix, etc.). It
seems to me like a more elegant solution would be to COW the page in the
address_space so that you get stable writeback pages without blocking.
That's clearly more complex, and I'm sure there are a range of issues
involved in making that work, but I would hope that it would be doable
with generic MM infrastructure so that everyone would benefit.
I would love to talk to some MM people at LSF about what it would take to
make this work...
sage
>
> When submitted I will also send a patch to set .needs_stable_pages in
> iscsi when needed.
>
> Thanks, Martin
> Boaz
>
> > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > index 5680b91..442a0df 100644
> > --- a/block/blk-settings.c
> > +++ b/block/blk-settings.c
> > @@ -125,6 +125,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> > lim->io_opt = 0;
> > lim->misaligned = 0;
> > lim->cluster = 1;
> > + lim->needs_stable_pages = false;
> > }
> > EXPORT_SYMBOL(blk_set_default_limits);
> >
> > @@ -571,6 +572,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> > t->cluster &= b->cluster;
> > t->discard_zeroes_data &= b->discard_zeroes_data;
> >
> > + t->needs_stable_pages &= b->needs_stable_pages;
> > +
> > /* Physical block size a multiple of the logical block size? */
> > if (t->physical_block_size & (t->logical_block_size - 1)) {
> > t->physical_block_size = t->logical_block_size;
> > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > index 5b85d91..d464aca 100644
> > --- a/block/blk-sysfs.c
> > +++ b/block/blk-sysfs.c
> > @@ -161,6 +161,11 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
> > return queue_var_show(queue_discard_zeroes_data(q), page);
> > }
> >
> > +static ssize_t queue_needs_stable_pages_show(struct request_queue *q, char *page)
> > +{
> > + return queue_var_show(q->limits.needs_stable_pages, page);
> > +}
> > +
> > static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
> > {
> > return sprintf(page, "%llu\n",
> > @@ -364,6 +369,11 @@ static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
> > .show = queue_discard_zeroes_data_show,
> > };
> >
> > +static struct queue_sysfs_entry queue_needs_stable_pages_entry = {
> > + .attr = {.name = "needs_stable_pages", .mode = S_IRUGO },
> > + .show = queue_needs_stable_pages_show,
> > +};
> > +
> > static struct queue_sysfs_entry queue_write_same_max_entry = {
> > .attr = {.name = "write_same_max_bytes", .mode = S_IRUGO },
> > .show = queue_write_same_max_show,
> > @@ -416,6 +426,7 @@ static struct attribute *default_attrs[] = {
> > &queue_discard_granularity_entry.attr,
> > &queue_discard_max_entry.attr,
> > &queue_discard_zeroes_data_entry.attr,
> > + &queue_needs_stable_pages_entry.attr,
> > &queue_write_same_max_entry.attr,
> > &queue_nonrot_entry.attr,
> > &queue_nomerges_entry.attr,
> > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > index 26eff46..146bed4 100644
> > --- a/drivers/scsi/sd.c
> > +++ b/drivers/scsi/sd.c
> > @@ -1752,10 +1752,11 @@ static void sd_read_protection_type(struct scsi_disk *sdkp, unsigned char *buffe
> > return;
> > }
> >
> > - if (scsi_host_dif_capable(sdp->host, type))
> > + if (scsi_host_dif_capable(sdp->host, type)) {
> > sd_printk(KERN_NOTICE, sdkp,
> > "Enabling DIF Type %u protection\n", type);
> > - else
> > + sdkp->disk->queue->limits.needs_stable_pages = true;
> > + } else
> > sd_printk(KERN_NOTICE, sdkp,
> > "Disabling DIF Type %u protection\n", type);
> > }
> > diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
> > index 0cb39ff..9dc330c 100644
> > --- a/drivers/scsi/sd_dif.c
> > +++ b/drivers/scsi/sd_dif.c
> > @@ -338,6 +338,8 @@ void sd_dif_config_host(struct scsi_disk *sdkp)
> > sd_printk(KERN_NOTICE, sdkp,
> > "Enabling DIX %s protection\n", disk->integrity->name);
> >
> > + disk->queue->limits.needs_stable_pages = true;
> > +
> > /* Signal to block layer that we support sector tagging */
> > if (dif && type && sdkp->ATO) {
> > if (type == SD_DIF_TYPE3_PROTECTION)
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 92956b7..a5a33db 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -266,6 +266,8 @@ struct queue_limits {
> > unsigned char discard_misaligned;
> > unsigned char cluster;
> > unsigned char discard_zeroes_data;
> > +
> > + bool needs_stable_pages;
> > };
> >
> > struct request_queue {
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
On Wed, Mar 07, 2012 at 10:27:43PM -0800, Sage Weil wrote:
>
> This avoids the problem for devices that don't need stable pages, but
> doesn't help for those that do (btrfs, raid, iscsi, dif/dix, etc.). It
> seems to me like a more elegant solution would be to COW the page in the
> address_space so that you get stable writeback pages without blocking.
> That's clearly more complex, and I'm sure there are a range of issues
> involved in making that work, but I would hope that it would be doable
> with generic MM infrastructure so that everyone would benefit.
Well, even doing a COW (or anything that involves messing with page
tables) is not free. So even if we can make the cost of stable
writeback pages cheaper, if we can completely avoid the cost, this
would be good. I'd also rather fix the performance regression sooner
rather than later, and I suspect the COW solution is not something
that could be prepared in time for the upcoming merge window.
Martin, would you be willing to try to get your patch submitted for
the upcoming merge window? Or I'd be willing to carry your patch and
then rework Darrick's to use the exported flag, and carry it in my
tree, maybe that would be better.
> I would love to talk to some MM people at LSF about what it would take to
> make this work...
Sure, long term, I'm much more sympathetic to iSCSI than I am about
DIF/DIX (which due to drive manufacturer's pricing strategies I don't
think it will ever become mainstream --- which was my main concern;
why should we be inflicting pretty severe performance regressions for
the common case, just to improve things for obscure high-end hardware?
It's similar to how Solaris trashed 1 and 2 processor performance,
because they were optimizing things for their high margin SunFires).
So getting something which makes page stablization not suck so much in
the long term seems like a fine goal. But even though we've worked on
improve SMP scalability in a way that didn't sacrifice 2 and 4-way
processors, we've still supported compiling with !CONFIG_SMP....
- Ted
>
> sage
>
>
>
> >
> > When submitted I will also send a patch to set .needs_stable_pages in
> > iscsi when needed.
> >
> > Thanks, Martin
> > Boaz
> >
> > > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > > index 5680b91..442a0df 100644
> > > --- a/block/blk-settings.c
> > > +++ b/block/blk-settings.c
> > > @@ -125,6 +125,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> > > lim->io_opt = 0;
> > > lim->misaligned = 0;
> > > lim->cluster = 1;
> > > + lim->needs_stable_pages = false;
> > > }
> > > EXPORT_SYMBOL(blk_set_default_limits);
> > >
> > > @@ -571,6 +572,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> > > t->cluster &= b->cluster;
> > > t->discard_zeroes_data &= b->discard_zeroes_data;
> > >
> > > + t->needs_stable_pages &= b->needs_stable_pages;
> > > +
> > > /* Physical block size a multiple of the logical block size? */
> > > if (t->physical_block_size & (t->logical_block_size - 1)) {
> > > t->physical_block_size = t->logical_block_size;
> > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > index 5b85d91..d464aca 100644
> > > --- a/block/blk-sysfs.c
> > > +++ b/block/blk-sysfs.c
> > > @@ -161,6 +161,11 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
> > > return queue_var_show(queue_discard_zeroes_data(q), page);
> > > }
> > >
> > > +static ssize_t queue_needs_stable_pages_show(struct request_queue *q, char *page)
> > > +{
> > > + return queue_var_show(q->limits.needs_stable_pages, page);
> > > +}
> > > +
> > > static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
> > > {
> > > return sprintf(page, "%llu\n",
> > > @@ -364,6 +369,11 @@ static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
> > > .show = queue_discard_zeroes_data_show,
> > > };
> > >
> > > +static struct queue_sysfs_entry queue_needs_stable_pages_entry = {
> > > + .attr = {.name = "needs_stable_pages", .mode = S_IRUGO },
> > > + .show = queue_needs_stable_pages_show,
> > > +};
> > > +
> > > static struct queue_sysfs_entry queue_write_same_max_entry = {
> > > .attr = {.name = "write_same_max_bytes", .mode = S_IRUGO },
> > > .show = queue_write_same_max_show,
> > > @@ -416,6 +426,7 @@ static struct attribute *default_attrs[] = {
> > > &queue_discard_granularity_entry.attr,
> > > &queue_discard_max_entry.attr,
> > > &queue_discard_zeroes_data_entry.attr,
> > > + &queue_needs_stable_pages_entry.attr,
> > > &queue_write_same_max_entry.attr,
> > > &queue_nonrot_entry.attr,
> > > &queue_nomerges_entry.attr,
> > > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > > index 26eff46..146bed4 100644
> > > --- a/drivers/scsi/sd.c
> > > +++ b/drivers/scsi/sd.c
> > > @@ -1752,10 +1752,11 @@ static void sd_read_protection_type(struct scsi_disk *sdkp, unsigned char *buffe
> > > return;
> > > }
> > >
> > > - if (scsi_host_dif_capable(sdp->host, type))
> > > + if (scsi_host_dif_capable(sdp->host, type)) {
> > > sd_printk(KERN_NOTICE, sdkp,
> > > "Enabling DIF Type %u protection\n", type);
> > > - else
> > > + sdkp->disk->queue->limits.needs_stable_pages = true;
> > > + } else
> > > sd_printk(KERN_NOTICE, sdkp,
> > > "Disabling DIF Type %u protection\n", type);
> > > }
> > > diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
> > > index 0cb39ff..9dc330c 100644
> > > --- a/drivers/scsi/sd_dif.c
> > > +++ b/drivers/scsi/sd_dif.c
> > > @@ -338,6 +338,8 @@ void sd_dif_config_host(struct scsi_disk *sdkp)
> > > sd_printk(KERN_NOTICE, sdkp,
> > > "Enabling DIX %s protection\n", disk->integrity->name);
> > >
> > > + disk->queue->limits.needs_stable_pages = true;
> > > +
> > > /* Signal to block layer that we support sector tagging */
> > > if (dif && type && sdkp->ATO) {
> > > if (type == SD_DIF_TYPE3_PROTECTION)
> > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > index 92956b7..a5a33db 100644
> > > --- a/include/linux/blkdev.h
> > > +++ b/include/linux/blkdev.h
> > > @@ -266,6 +266,8 @@ struct queue_limits {
> > > unsigned char discard_misaligned;
> > > unsigned char cluster;
> > > unsigned char discard_zeroes_data;
> > > +
> > > + bool needs_stable_pages;
> > > };
> > >
> > > struct request_queue {
> >
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
> >
On Wed, Mar 07, 2012 at 09:39:50PM -0500, Zach Brown wrote:
>
> >Can you devise a non-secret testcase that demonstrates this?
>
> Hmm. I bet you could get fio to do it. Giant file, random mmap()
> writes, spin until the CPU overwhelms writeback?
Kick off a bunch of fio processes, each in separate I/O cgroups set up
so that each of the processes get a "fair" amount of the I/O
bandwidth. (This is quite common in cloud deployments where you are
packing a huge number of tasks onto a single box; whether the tasks
are inside virtual machines or containers don't really matter for the
purpose of this exercise. We basically need to simulate a system
where the disks are busy.)
Then in one of those cgroups, create a process which is constantly
appending to a file using buffered I/O; this could be a log file, or
an application-level journal file; and measure the latency of that
write system call. Every so often, writeback will push the dirty
pages corresponding to the log/journal file to disk. When that
happens, and page stablization is enabled, the latency of that write
system call will spike.
And any time you have a distributed system where you are depending on
a large number of RPC/SOAP/Service Oriented Architecture Enterpise
Service Bus calls (I don't really care which buzzword you use, but IBM
and Oracle really like the last one :-), long-tail latencies are what
kill your responsiveness and predictability. Especially when a thread
goes away for a second or more...
- Ted
>>>>> "Ted" == Ted Ts'o <[email protected]> writes:
Ted> Martin, would you be willing to try to get your patch submitted for
Ted> the upcoming merge window? Or I'd be willing to carry your patch
Ted> and then rework Darrick's to use the exported flag, and carry it in
Ted> my tree, maybe that would be better.
There's probably going to be some conflicts due to both topology updates
and the write same changes I have pending. So it's probably best that I
submit this patch as part of my kits for Jens and James. Should go out
today.
Ted> why should we be inflicting pretty severe performance regressions
Ted> for the common case, just to improve things for obscure high-end
Ted> hardware?
I'm perfectly ok with that now that we have established that there are
real world workloads that do suffer with the wait in place.
--
Martin K. Petersen Oracle Linux Engineering
On Thu, 8 Mar 2012, Ted Ts'o wrote:
> On Wed, Mar 07, 2012 at 10:27:43PM -0800, Sage Weil wrote:
> >
> > This avoids the problem for devices that don't need stable pages, but
> > doesn't help for those that do (btrfs, raid, iscsi, dif/dix, etc.). It
> > seems to me like a more elegant solution would be to COW the page in the
> > address_space so that you get stable writeback pages without blocking.
> > That's clearly more complex, and I'm sure there are a range of issues
> > involved in making that work, but I would hope that it would be doable
> > with generic MM infrastructure so that everyone would benefit.
>
> Well, even doing a COW (or anything that involves messing with page
> tables) is not free. So even if we can make the cost of stable
> writeback pages cheaper, if we can completely avoid the cost, this
> would be good. I'd also rather fix the performance regression sooner
> rather than later, and I suspect the COW solution is not something
> that could be prepared in time for the upcoming merge window.
Definitely. This patch looks like a fine approach for your situation. I
just don't want the subject to come up without talking about a general
solution. And it's very interesting to hear about a (simple) workload
that is affected by the wait_on_page_writeback().
Thanks-
sage
> Martin, would you be willing to try to get your patch submitted for
> the upcoming merge window? Or I'd be willing to carry your patch and
> then rework Darrick's to use the exported flag, and carry it in my
> tree, maybe that would be better.
>
> > I would love to talk to some MM people at LSF about what it would take to
> > make this work...
>
> Sure, long term, I'm much more sympathetic to iSCSI than I am about
> DIF/DIX (which due to drive manufacturer's pricing strategies I don't
> think it will ever become mainstream --- which was my main concern;
> why should we be inflicting pretty severe performance regressions for
> the common case, just to improve things for obscure high-end hardware?
> It's similar to how Solaris trashed 1 and 2 processor performance,
> because they were optimizing things for their high margin SunFires).
> So getting something which makes page stablization not suck so much in
> the long term seems like a fine goal. But even though we've worked on
> improve SMP scalability in a way that didn't sacrifice 2 and 4-way
> processors, we've still supported compiling with !CONFIG_SMP....
>
> - Ted
>
> >
> > sage
> >
> >
> >
> > >
> > > When submitted I will also send a patch to set .needs_stable_pages in
> > > iscsi when needed.
> > >
> > > Thanks, Martin
> > > Boaz
> > >
> > > > diff --git a/block/blk-settings.c b/block/blk-settings.c
> > > > index 5680b91..442a0df 100644
> > > > --- a/block/blk-settings.c
> > > > +++ b/block/blk-settings.c
> > > > @@ -125,6 +125,7 @@ void blk_set_default_limits(struct queue_limits *lim)
> > > > lim->io_opt = 0;
> > > > lim->misaligned = 0;
> > > > lim->cluster = 1;
> > > > + lim->needs_stable_pages = false;
> > > > }
> > > > EXPORT_SYMBOL(blk_set_default_limits);
> > > >
> > > > @@ -571,6 +572,8 @@ int blk_stack_limits(struct queue_limits *t, struct queue_limits *b,
> > > > t->cluster &= b->cluster;
> > > > t->discard_zeroes_data &= b->discard_zeroes_data;
> > > >
> > > > + t->needs_stable_pages &= b->needs_stable_pages;
> > > > +
> > > > /* Physical block size a multiple of the logical block size? */
> > > > if (t->physical_block_size & (t->logical_block_size - 1)) {
> > > > t->physical_block_size = t->logical_block_size;
> > > > diff --git a/block/blk-sysfs.c b/block/blk-sysfs.c
> > > > index 5b85d91..d464aca 100644
> > > > --- a/block/blk-sysfs.c
> > > > +++ b/block/blk-sysfs.c
> > > > @@ -161,6 +161,11 @@ static ssize_t queue_discard_zeroes_data_show(struct request_queue *q, char *pag
> > > > return queue_var_show(queue_discard_zeroes_data(q), page);
> > > > }
> > > >
> > > > +static ssize_t queue_needs_stable_pages_show(struct request_queue *q, char *page)
> > > > +{
> > > > + return queue_var_show(q->limits.needs_stable_pages, page);
> > > > +}
> > > > +
> > > > static ssize_t queue_write_same_max_show(struct request_queue *q, char *page)
> > > > {
> > > > return sprintf(page, "%llu\n",
> > > > @@ -364,6 +369,11 @@ static struct queue_sysfs_entry queue_discard_zeroes_data_entry = {
> > > > .show = queue_discard_zeroes_data_show,
> > > > };
> > > >
> > > > +static struct queue_sysfs_entry queue_needs_stable_pages_entry = {
> > > > + .attr = {.name = "needs_stable_pages", .mode = S_IRUGO },
> > > > + .show = queue_needs_stable_pages_show,
> > > > +};
> > > > +
> > > > static struct queue_sysfs_entry queue_write_same_max_entry = {
> > > > .attr = {.name = "write_same_max_bytes", .mode = S_IRUGO },
> > > > .show = queue_write_same_max_show,
> > > > @@ -416,6 +426,7 @@ static struct attribute *default_attrs[] = {
> > > > &queue_discard_granularity_entry.attr,
> > > > &queue_discard_max_entry.attr,
> > > > &queue_discard_zeroes_data_entry.attr,
> > > > + &queue_needs_stable_pages_entry.attr,
> > > > &queue_write_same_max_entry.attr,
> > > > &queue_nonrot_entry.attr,
> > > > &queue_nomerges_entry.attr,
> > > > diff --git a/drivers/scsi/sd.c b/drivers/scsi/sd.c
> > > > index 26eff46..146bed4 100644
> > > > --- a/drivers/scsi/sd.c
> > > > +++ b/drivers/scsi/sd.c
> > > > @@ -1752,10 +1752,11 @@ static void sd_read_protection_type(struct scsi_disk *sdkp, unsigned char *buffe
> > > > return;
> > > > }
> > > >
> > > > - if (scsi_host_dif_capable(sdp->host, type))
> > > > + if (scsi_host_dif_capable(sdp->host, type)) {
> > > > sd_printk(KERN_NOTICE, sdkp,
> > > > "Enabling DIF Type %u protection\n", type);
> > > > - else
> > > > + sdkp->disk->queue->limits.needs_stable_pages = true;
> > > > + } else
> > > > sd_printk(KERN_NOTICE, sdkp,
> > > > "Disabling DIF Type %u protection\n", type);
> > > > }
> > > > diff --git a/drivers/scsi/sd_dif.c b/drivers/scsi/sd_dif.c
> > > > index 0cb39ff..9dc330c 100644
> > > > --- a/drivers/scsi/sd_dif.c
> > > > +++ b/drivers/scsi/sd_dif.c
> > > > @@ -338,6 +338,8 @@ void sd_dif_config_host(struct scsi_disk *sdkp)
> > > > sd_printk(KERN_NOTICE, sdkp,
> > > > "Enabling DIX %s protection\n", disk->integrity->name);
> > > >
> > > > + disk->queue->limits.needs_stable_pages = true;
> > > > +
> > > > /* Signal to block layer that we support sector tagging */
> > > > if (dif && type && sdkp->ATO) {
> > > > if (type == SD_DIF_TYPE3_PROTECTION)
> > > > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > > > index 92956b7..a5a33db 100644
> > > > --- a/include/linux/blkdev.h
> > > > +++ b/include/linux/blkdev.h
> > > > @@ -266,6 +266,8 @@ struct queue_limits {
> > > > unsigned char discard_misaligned;
> > > > unsigned char cluster;
> > > > unsigned char discard_zeroes_data;
> > > > +
> > > > + bool needs_stable_pages;
> > > > };
> > > >
> > > > struct request_queue {
> > >
> > > --
> > > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > > the body of a message to [email protected]
> > > More majordomo info at http://vger.kernel.org/majordomo-info.html
> > >
> > >
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
On Thu, Mar 08, 2012 at 10:54:19AM -0500, Ted Ts'o wrote:
> On Wed, Mar 07, 2012 at 09:39:50PM -0500, Zach Brown wrote:
> >
> > >Can you devise a non-secret testcase that demonstrates this?
> >
> > Hmm. I bet you could get fio to do it. Giant file, random mmap()
> > writes, spin until the CPU overwhelms writeback?
>
> Kick off a bunch of fio processes, each in separate I/O cgroups set up
> so that each of the processes get a "fair" amount of the I/O
> bandwidth. (This is quite common in cloud deployments where you are
> packing a huge number of tasks onto a single box; whether the tasks
> are inside virtual machines or containers don't really matter for the
> purpose of this exercise. We basically need to simulate a system
> where the disks are busy.)
>
> Then in one of those cgroups, create a process which is constantly
> appending to a file using buffered I/O; this could be a log file, or
> an application-level journal file; and measure the latency of that
> write system call. Every so often, writeback will push the dirty
> pages corresponding to the log/journal file to disk. When that
> happens, and page stablization is enabled, the latency of that write
> system call will spike.
>
> And any time you have a distributed system where you are depending on
> a large number of RPC/SOAP/Service Oriented Architecture Enterpise
> Service Bus calls (I don't really care which buzzword you use, but IBM
> and Oracle really like the last one :-), long-tail latencies are what
> kill your responsiveness and predictability. Especially when a thread
> goes away for a second or more...
But, why are we writeback for a second or more? Aren't there other
parts of this we would want to fix as well?
I'm not against only turning on stable pages when they are needed, but
the code that isn't the default tends to be somewhat less used. So it
does increase testing burden when we do want stable pages, and it tends
to make for awkward bugs that are hard to reproduce because someone
neglects to mention it.
IMHO it's much more important to nail down the 2 second writeback
latency. That's not good.
-chris
On 03/08/2012 10:09 AM, Chris Mason wrote:
>
> But, why are we writeback for a second or more? Aren't there other
> parts of this we would want to fix as well?
>
> I'm not against only turning on stable pages when they are needed, but
> the code that isn't the default tends to be somewhat less used. So it
> does increase testing burden when we do want stable pages, and it tends
> to make for awkward bugs that are hard to reproduce because someone
> neglects to mention it.
>
> IMHO it's much more important to nail down the 2 second writeback
> latency. That's not good.
>
I think I understand this one. It's do to the sync nature introduced
by page_waiting in mkwrite.
The system is loaded everything is somewhat 2 second or more in a lag.
The 2 sec (or more) comes from the max-dirty-limit/disk-speed so any
IO you'll submit will probably be on stable disk 2 sec later. (In theory,
any power fail will loose all dirty pages which is in our case
max-dirty-limit)
Now usually that's fine because everything is queued and waits a bit
evenly distributed and you wait, theoretically, only the rate of your
IO. But here, all of a sudden, you are not allowed to be queued and you
are waiting for the head of queue to be actually done, and the app is
just frozen.
Actually now when I think of it the pages were already submitted for
them to be waited on. So the 2-sec is the depth of the block+scsi+target
queues. I guess they can be pretty deep.
I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
the last n pages of an inode who's mtime is less then 2-sec. This would
solve any sequential writer wait penalty, which is Ted's case
Thanks
Boaz
On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> On 03/08/2012 10:09 AM, Chris Mason wrote:
> >
> > But, why are we writeback for a second or more? Aren't there other
> > parts of this we would want to fix as well?
> >
> > I'm not against only turning on stable pages when they are needed, but
> > the code that isn't the default tends to be somewhat less used. So it
> > does increase testing burden when we do want stable pages, and it tends
> > to make for awkward bugs that are hard to reproduce because someone
> > neglects to mention it.
> >
> > IMHO it's much more important to nail down the 2 second writeback
> > latency. That's not good.
> >
>
> I think I understand this one. It's do to the sync nature introduced
> by page_waiting in mkwrite.
Pages go from dirty to writeback for a few reasons. Background
writeout, or O_DIRECT or someone running sync
background writeout shouldn't be queueing up so much work that
synchronous writeout has a 2 second delay.
If the latencies are coming from something that was run through
fsync...well there's not too much we can do about that. The problem is
that our page_mkwrite call isn't starting the IO it is just waiting on
it, so we can't bump the priority on it.
-chris
Chris Mason <[email protected]> writes:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>> I think I understand this one. It's do to the sync nature introduced
>> by page_waiting in mkwrite.
>
> Pages go from dirty to writeback for a few reasons. Background
> writeout, or O_DIRECT or someone running sync
>
> background writeout shouldn't be queueing up so much work that
> synchronous writeout has a 2 second delay.
So now we're back to figuring out how to tell how long I/O will take?
If writeback is issuing random access I/Os to spinning media, you can
bet it might be a while. Today, you could lower nr_requests to some
obscenely small number to improve worst-case latency. I thought there
was some talk about improving the intelligence of writeback in this
regard, but it's a tough problem, especially given that writeback isn't
the only cook in the kitchen.
Cheers,
Jeff
On 03/08/2012 12:37 PM, Chris Mason wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>> On 03/08/2012 10:09 AM, Chris Mason wrote:
>>>
>>> But, why are we writeback for a second or more? Aren't there other
>>> parts of this we would want to fix as well?
>>>
>>> I'm not against only turning on stable pages when they are needed, but
>>> the code that isn't the default tends to be somewhat less used. So it
>>> does increase testing burden when we do want stable pages, and it tends
>>> to make for awkward bugs that are hard to reproduce because someone
>>> neglects to mention it.
>>>
>>> IMHO it's much more important to nail down the 2 second writeback
>>> latency. That's not good.
>>>
>>
>> I think I understand this one. It's do to the sync nature introduced
>> by page_waiting in mkwrite.
>
> Pages go from dirty to writeback for a few reasons. Background
> writeout, or O_DIRECT or someone running sync
>
> background writeout shouldn't be queueing up so much work that
> synchronous writeout has a 2 second delay.
>
> If the latencies are coming from something that was run through
> fsync...well there's not too much we can do about that. The problem is
> that our page_mkwrite call isn't starting the IO it is just waiting on
> it, so we can't bump the priority on it.
>
I agree. I think the logger model is: write, than sync
Before they used to be waiting on the sync phase now their waiting
on write, when concurrent with sync. I would like to inspect this situation.
I agree with you that it's just shifting heaviness that is now hiding somewhere
else.
> -chris
Thanks
Boaz
On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> Chris Mason <[email protected]> writes:
>
> > On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> >> I think I understand this one. It's do to the sync nature introduced
> >> by page_waiting in mkwrite.
> >
> > Pages go from dirty to writeback for a few reasons. Background
> > writeout, or O_DIRECT or someone running sync
> >
> > background writeout shouldn't be queueing up so much work that
> > synchronous writeout has a 2 second delay.
>
> So now we're back to figuring out how to tell how long I/O will take?
> If writeback is issuing random access I/Os to spinning media, you can
> bet it might be a while. Today, you could lower nr_requests to some
> obscenely small number to improve worst-case latency. I thought there
> was some talk about improving the intelligence of writeback in this
> regard, but it's a tough problem, especially given that writeback isn't
> the only cook in the kitchen.
I'm not against Ted's original idea, but I'd hate to miss the chance to
blame Jens for the block layer making filesystems slow.
Since there is a reliable way to reproduce, it makes sense to track it
down a little more instead of papering over the wait_on_page_writeback
call.
-chris
On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
>
> So now we're back to figuring out how to tell how long I/O will take?
> If writeback is issuing random access I/Os to spinning media, you can
> bet it might be a while. Today, you could lower nr_requests to some
> obscenely small number to improve worst-case latency. I thought there
> was some talk about improving the intelligence of writeback in this
> regard, but it's a tough problem, especially given that writeback isn't
> the only cook in the kitchen.
... and it gets worse if there is any kind of I/O prioritization going
on via ionice(), or (as was the case in our example) I/O cgroups were
being used to provide proportional I/O rate controls. I don't think
it's realistic to assume the writeback code can predict how long I/O
will take when it does a submission.
BTW, I'd have to check (having not looked at the application code in
depth; the bug was primarily solved by bisection and reverting the
problem commit) but I'm not entirely sure the thread doing the write
was calling fsync(); the main issue as I understand things was that
the application wasn't expecting the write(2) system call would block
unexpectedly for long periods of time while doing small buffered,
appending I/O's. (Again, for the kind of work that distributed
systems do, 99th percentile latency is important!)
- Ted
On Thu, Mar 08, 2012 at 04:12:21PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> >
> > So now we're back to figuring out how to tell how long I/O will take?
> > If writeback is issuing random access I/Os to spinning media, you can
> > bet it might be a while. Today, you could lower nr_requests to some
> > obscenely small number to improve worst-case latency. I thought there
> > was some talk about improving the intelligence of writeback in this
> > regard, but it's a tough problem, especially given that writeback isn't
> > the only cook in the kitchen.
>
> ... and it gets worse if there is any kind of I/O prioritization going
> on via ionice(), or (as was the case in our example) I/O cgroups were
> being used to provide proportional I/O rate controls. I don't think
> it's realistic to assume the writeback code can predict how long I/O
> will take when it does a submission.
cgroups do make it much harder because it could be a simple IO priority
inversion. The latencies are just going to be a fact of life for now
and the best choice is to skip the stable pages.
-chris
On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>
> I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
> the last n pages of an inode who's mtime is less then 2-sec. This would
> solve any sequential writer wait penalty, which is Ted's case
That won't work in general, *unless* 2 seconds is enough time that the
appending writer is done writing to that particular 4k page and moved
on to the next 4k block, so nothing touches that page and potentially
blocks for however long it takes for the queues to drain.
Let's take another example, suppose you have a file-backed mmap
region, and you modify the page, and now let's suppose the process is
under enough memory pressure that the page cleaner decides to initiate
writeback of the page. Now suppose you get unlucky (this is the 1% or
0.1% case; remember, 99th or 99.9 percentile latencies matter), and
you try to modify the page in question again. ***THUNK*** your
process takes a page fault, and is frozen solid in amber for
potentially seconds until the I/O queues drain.
Hmm.... let's turn this around. If the issue is checksum calculation,
how about trying to solve this problem in some cases by deferring the
checksum calculation until right before the block I/O layer is going
to schedule the write (i.e., have the I/O submitter provide a callback
function which calculates the checksum, which gets called by the BIO
layer at the very last moment)?
This won't work in all cases (I can see this getting really messy in
the software RAID-5/6 case if you don't want to memory copies) but it
might solve the problem in at least some of the cases where people
care about this.
- Ted
On Thu, Mar 08, 2012 at 04:24:12PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> >
> > I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
> > the last n pages of an inode who's mtime is less then 2-sec. This would
> > solve any sequential writer wait penalty, which is Ted's case
>
> That won't work in general, *unless* 2 seconds is enough time that the
> appending writer is done writing to that particular 4k page and moved
> on to the next 4k block, so nothing touches that page and potentially
> blocks for however long it takes for the queues to drain.
>
> Let's take another example, suppose you have a file-backed mmap
> region, and you modify the page, and now let's suppose the process is
> under enough memory pressure that the page cleaner decides to initiate
> writeback of the page. Now suppose you get unlucky (this is the 1% or
> 0.1% case; remember, 99th or 99.9 percentile latencies matter), and
> you try to modify the page in question again. ***THUNK*** your
> process takes a page fault, and is frozen solid in amber for
> potentially seconds until the I/O queues drain.
>
> Hmm.... let's turn this around. If the issue is checksum calculation,
> how about trying to solve this problem in some cases by deferring the
> checksum calculation until right before the block I/O layer is going
> to schedule the write (i.e., have the I/O submitter provide a callback
> function which calculates the checksum, which gets called by the BIO
> layer at the very last moment)?
Btrfs currently does this, and the DIF code is by definition right
before the write. The pages really only get set writeback when they are
being sent in flight, so the waiting being done by the stable pages
patch is file_write or page_mkwrite being polite and waiting for the IO
to finish before changing the page.
-chris
On Thu, Mar 08, 2012 at 04:38:08PM -0500, Chris Mason wrote:
> Btrfs currently does this, and the DIF code is by definition right
> before the write. The pages really only get set writeback when they are
> being sent in flight, so the waiting being done by the stable pages
> patch is file_write or page_mkwrite being polite and waiting for the IO
> to finish before changing the page.
Right before submission to the bio layer? Or right before the device
driver sends the request to the host bus adapter? I was thinking of
the latter....
- Ted
On 03/08/2012 01:24 PM, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
>>
>> I have a theory of how we can fix that 2-sec wait, by avoiding writeback of
>> the last n pages of an inode who's mtime is less then 2-sec. This would
>> solve any sequential writer wait penalty, which is Ted's case
>
> That won't work in general, *unless* 2 seconds is enough time that the
> appending writer is done writing to that particular 4k page and moved
> on to the next 4k block, so nothing touches that page and potentially
> blocks for however long it takes for the queues to drain.
>
Exactly. Is that not the case for a sequential writer. It modifies
the top page for a while inode time keeps incrementing, then eventually
it advances to the next page, now the before the last page is never
touched again. and can be submitted for writeout.
> Let's take another example, suppose you have a file-backed mmap
> region, and you modify the page, and now let's suppose the process is
> under enough memory pressure that the page cleaner decides to initiate
> writeback of the page. Now suppose you get unlucky (this is the 1% or
> 0.1% case; remember, 99th or 99.9 percentile latencies matter), and
> you try to modify the page in question again. ***THUNK*** your
> process takes a page fault, and is frozen solid in amber for
> potentially seconds until the I/O queues drain.
>
As I said, if the IO is random you are in though luck, and BTW mmap
is not a must, just simple write() call will behave just the same
since it sits in the same mkwrite().
But that was not your case. Your case was an appending logger.
But my new theory is that your case is not the "writeback" to
app-write collision, but the app-write to sync() collision which
is a different case.
> Hmm.... let's turn this around. If the issue is checksum calculation,
> how about trying to solve this problem in some cases by deferring the
> checksum calculation until right before the block I/O layer is going
> to schedule the write (i.e., have the I/O submitter provide a callback
> function which calculates the checksum, which gets called by the BIO
> layer at the very last moment)?
>
iscsi is that case, but the problem is not when "calculates the checksum"
but: when "changing page state" your schema can work but you will need to
add a new page state, dirty => writeback => stable (new state).
> This won't work in all cases (I can see this getting really messy in
> the software RAID-5/6 case if you don't want to memory copies) but it
> might solve the problem in at least some of the cases where people
> care about this.
>
> - Ted
Thanks
Boaz
On Thu, Mar 08, 2012 at 12:50:42PM -0800, Boaz Harrosh wrote:
> On 03/08/2012 12:37 PM, Chris Mason wrote:
> > On Thu, Mar 08, 2012 at 12:20:26PM -0800, Boaz Harrosh wrote:
> >> On 03/08/2012 10:09 AM, Chris Mason wrote:
> >>>
> >>> But, why are we writeback for a second or more? Aren't there other
> >>> parts of this we would want to fix as well?
> >>>
> >>> I'm not against only turning on stable pages when they are needed, but
> >>> the code that isn't the default tends to be somewhat less used. So it
> >>> does increase testing burden when we do want stable pages, and it tends
> >>> to make for awkward bugs that are hard to reproduce because someone
> >>> neglects to mention it.
> >>>
> >>> IMHO it's much more important to nail down the 2 second writeback
> >>> latency. That's not good.
> >>>
> >>
> >> I think I understand this one. It's do to the sync nature introduced
> >> by page_waiting in mkwrite.
> >
> > Pages go from dirty to writeback for a few reasons. Background
> > writeout, or O_DIRECT or someone running sync
> >
> > background writeout shouldn't be queueing up so much work that
> > synchronous writeout has a 2 second delay.
> >
> > If the latencies are coming from something that was run through
> > fsync...well there's not too much we can do about that. The problem is
> > that our page_mkwrite call isn't starting the IO it is just waiting on
> > it, so we can't bump the priority on it.
> >
>
> I agree. I think the logger model is: write, than sync
>
> Before they used to be waiting on the sync phase now their waiting
> on write, when concurrent with sync. I would like to inspect this situation.
> I agree with you that it's just shifting heaviness that is now hiding somewhere
> else.
I'd argue that the application is broken, not the kernel. IO
latencies are always going to occur, so if the application is
sensitive to them, then the app needs to take measures to minimise
the effect of potential latencies. Double/ring buffering with async
IO dispatch is the general method of avoiding significant latencies
in IO aggregator applications like loggers...
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Mar 08, 2012 at 04:41:48PM -0500, Ted Ts'o wrote:
> On Thu, Mar 08, 2012 at 04:38:08PM -0500, Chris Mason wrote:
> > Btrfs currently does this, and the DIF code is by definition right
> > before the write. The pages really only get set writeback when they are
> > being sent in flight, so the waiting being done by the stable pages
> > patch is file_write or page_mkwrite being polite and waiting for the IO
> > to finish before changing the page.
>
> Right before submission to the bio layer? Or right before the device
> driver sends the request to the host bus adapter? I was thinking of
> the latter....
blktrace can probably give us numbers for how big that wait is, but
hopefully it is very small. If it is too big, you can always bump
nr_requests.
Btrfs crcs before submit_bio, not sure when the dif code does it.
-chris
>>>>> "Chris" == Chris Mason <[email protected]> writes:
Chris> Btrfs crcs before submit_bio, not sure when the dif code does it.
Slightly later. At make_request time.
--
Martin K. Petersen Oracle Linux Engineering
On Thu, Mar 08, 2012 at 04:20:54PM -0500, Chris Mason wrote:
> On Thu, Mar 08, 2012 at 04:12:21PM -0500, Ted Ts'o wrote:
> > On Thu, Mar 08, 2012 at 03:42:52PM -0500, Jeff Moyer wrote:
> > >
> > > So now we're back to figuring out how to tell how long I/O will take?
> > > If writeback is issuing random access I/Os to spinning media, you can
> > > bet it might be a while. Today, you could lower nr_requests to some
> > > obscenely small number to improve worst-case latency. I thought there
> > > was some talk about improving the intelligence of writeback in this
> > > regard, but it's a tough problem, especially given that writeback isn't
> > > the only cook in the kitchen.
> >
> > ... and it gets worse if there is any kind of I/O prioritization going
> > on via ionice(), or (as was the case in our example) I/O cgroups were
> > being used to provide proportional I/O rate controls. I don't think
> > it's realistic to assume the writeback code can predict how long I/O
> > will take when it does a submission.
>
> cgroups do make it much harder because it could be a simple IO priority
> inversion. The latencies are just going to be a fact of life for now
> and the best choice is to skip the stable pages.
They have always been a fact of life - just ask anyone that has to
deal with deterministic or "real-time" IO applications.
Unpredictable IO path latencies are not a new problem, and it
doesn't take stable pages to cause sigificant holdoffs in the
writing to a file. For example: writeback triggers triggers delayed
allocation, which locks the extent map and then blocks behind an
allocation already in progress or has to do IO to read in freespace
metadata. The next write comes along from another thread/process and
it has to map a new page and that now blocks on the extent map lock
and won't progress until the delayed allocation in progress
completes....
IO latencies are pretty much unavoidable, so the best thing to do is
to write applications that care about latency to minimise it's
impact as much as possible. Simple techniques like double buffering
and async IO dispatch techniques to decouple the IO stream from the
process/threads that are doing real work are the usual ways of
dealing with this problem.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Thu, Mar 08, 2012 at 08:02:07PM -0500, Chris Mason wrote:
> > Right before submission to the bio layer? Or right before the device
> > driver sends the request to the host bus adapter? I was thinking of
> > the latter....
>
> blktrace can probably give us numbers for how big that wait is, but
> hopefully it is very small. If it is too big, you can always bump
> nr_requests.
Not necessarily; things can get stalled for quite a while due to
ionice or io rate throttling. There's quite a lot that can happen
after submit_bio, at the cfq layer, for example.
- Ted
On 03/08/2012 08:43 AM, Sage Weil wrote:
> On Thu, 8 Mar 2012, Ted Ts'o wrote:
>> On Wed, Mar 07, 2012 at 10:27:43PM -0800, Sage Weil wrote:
>>>
>>> This avoids the problem for devices that don't need stable pages, but
>>> doesn't help for those that do (btrfs, raid, iscsi, dif/dix, etc.). It
>>> seems to me like a more elegant solution would be to COW the page in the
>>> address_space so that you get stable writeback pages without blocking.
>>> That's clearly more complex, and I'm sure there are a range of issues
>>> involved in making that work, but I would hope that it would be doable
>>> with generic MM infrastructure so that everyone would benefit.
>>
>> Well, even doing a COW (or anything that involves messing with page
>> tables) is not free. So even if we can make the cost of stable
>> writeback pages cheaper, if we can completely avoid the cost, this
>> would be good. I'd also rather fix the performance regression sooner
>> rather than later, and I suspect the COW solution is not something
>> that could be prepared in time for the upcoming merge window.
>
> Definitely. This patch looks like a fine approach for your situation. I
> just don't want the subject to come up without talking about a general
> solution. And it's very interesting to hear about a (simple) workload
> that is affected by the wait_on_page_writeback().
I'll add a simple workload. I have a soft real-time program that has
two threads. One of them fallocates some files, mmaps them, mlocks
them, and touches all the pages to prefault them. (This thread has no
real-time constraints -- it just needs to keep up.) The other thread
writes to the files.
On Windows, this works very well. On Linux without stable pages, it
almost works. With stable pages, it's a complete disaster. No amount
of minimizing the amount of time that pages under writeback can cause
writers to sleep will help -- writers *must not wait for io* when
writing mlocked, prefaulted pages for my code to work.
(The other issue involves file_update_time. I'll send a fix eventually.)
FWIW, it would be really nice if there was a way to lock a mapping so
hard that accesses are guaranteed to not even cause soft faults. We're
far from being able to do that now, though.
--Andy
On 03/14/2012 07:10 PM, Andy Lutomirski wrote:
> On 03/08/2012 08:43 AM, Sage Weil wrote:
>> On Thu, 8 Mar 2012, Ted Ts'o wrote:
>>> On Wed, Mar 07, 2012 at 10:27:43PM -0800, Sage Weil wrote:
>>>>
>>>> This avoids the problem for devices that don't need stable pages, but
>>>> doesn't help for those that do (btrfs, raid, iscsi, dif/dix, etc.). It
>>>> seems to me like a more elegant solution would be to COW the page in the
>>>> address_space so that you get stable writeback pages without blocking.
>>>> That's clearly more complex, and I'm sure there are a range of issues
>>>> involved in making that work, but I would hope that it would be doable
>>>> with generic MM infrastructure so that everyone would benefit.
>>>
>>> Well, even doing a COW (or anything that involves messing with page
>>> tables) is not free. So even if we can make the cost of stable
>>> writeback pages cheaper, if we can completely avoid the cost, this
>>> would be good. I'd also rather fix the performance regression sooner
>>> rather than later, and I suspect the COW solution is not something
>>> that could be prepared in time for the upcoming merge window.
>>
>> Definitely. This patch looks like a fine approach for your situation. I
>> just don't want the subject to come up without talking about a general
>> solution. And it's very interesting to hear about a (simple) workload
>> that is affected by the wait_on_page_writeback().
>
> I'll add a simple workload. I have a soft real-time program that has
> two threads. One of them fallocates some files, mmaps them, mlocks
> them, and touches all the pages to prefault them. (This thread has no
> real-time constraints -- it just needs to keep up.) The other thread
> writes to the files.
>
> On Windows, this works very well. On Linux without stable pages, it
> almost works. With stable pages, it's a complete disaster. No amount
> of minimizing the amount of time that pages under writeback can cause
> writers to sleep will help -- writers *must not wait for io* when
> writing mlocked, prefaulted pages for my code to work.
>
Right, this is Windows shit. If your goal is to never wait, IO
as fast as possible, and use the least amount of CPU time
then it's exactly the opposite of what you want to do.
You want to do async Direct IO.
Also as Dave Chinner said "Double/ring buffering with async
IO dispatch"
BTW Even On windows there are much better ways to do this. Also
there ring buffering with async direct IO.
> (The other issue involves file_update_time. I'll send a fix eventually.)
>
> FWIW, it would be really nice if there was a way to lock a mapping so
> hard that accesses are guaranteed to not even cause soft faults. We're
> far from being able to do that now, though.
>
> --Andy
Cheers
Boaz
On Wed, Mar 14, 2012 at 9:46 PM, Boaz Harrosh <[email protected]> wrote:
> On 03/14/2012 07:10 PM, Andy Lutomirski wrote:
>> On 03/08/2012 08:43 AM, Sage Weil wrote:
>> I'll add a simple workload. ?I have a soft real-time program that has
>> two threads. ?One of them fallocates some files, mmaps them, mlocks
>> them, and touches all the pages to prefault them. ?(This thread has no
>> real-time constraints -- it just needs to keep up.) ?The other thread
>> writes to the files.
>>
>> On Windows, this works very well. ?On Linux without stable pages, it
>> almost works. ?With stable pages, it's a complete disaster. ?No amount
>> of minimizing the amount of time that pages under writeback can cause
>> writers to sleep will help -- writers *must not wait for io* when
>> writing mlocked, prefaulted pages for my code to work.
>>
>
> Right, this is Windows shit. If your goal is to never wait, IO
> as fast as possible, and use the least amount of CPU time
> then it's exactly the opposite of what you want to do.
> You want to do async Direct IO.
I don't care if the io is as fast as possible and I don't care about
cpu time (that's what multicore is for). I care about real-time
threads not waiting. And I'm not sure why the superior performance on
Windows is "shit".
In any case, there's no possible way that async direct I/O is as
low-latency as memcpy for short writes.
>
> Also as Dave Chinner said "Double/ring buffering with async
> IO dispatch"
That's certainly an option. In fact, any program that has issues with
write latency due to stable pages or any other cause can do this. But
it's more code, it's more likely to lose data on an application crash,
and it's annoying since the kernel *already* has a perfectly good page
cache and mmap mechanism that ought to work.
Also, double buffering is kind of a PITA when other threads might be
trying to real the same file at the same time. That means I have to
get them all to find the same mapping, which can't live on an ext4
filesystem, requiring a daemon.
So I still claim my use case is valid, and I think Linux should handle
it better than it does now. The proposed fix of getting rid of stable
pages when they're not needed is good enough for me.
--Andy