Hi all,
Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
discussion about how to deal with the situation where one program tells the
kernel to write a block to disk, the kernel computes the checksum of that data,
and then a second program begins writing to that same block before the disk HBA
can DMA the memory block, thereby causing the disk to complain about being sent
invalid checksums.
By my recollection, several solutions were proposed, such as retrying the I/O
with a new checksum; using the VM to track and stop page writes during I/O;
creating shadow pages so that subsequent memory writes go to a different page;
punting the integrity failure to the app to let it deal with it; and only
allowing DIF mode when O_DIRECT is enabled.
Unfortunately, I didn't get a sense that any sort of consensus had been reached
(much less anything patched into the kernel) and since I was able to write a
trivial program to trigger the write problem, I'm pretty sure that this has not
been fixed upstream. (FYI, using O_DIRECT still seems fine.)
Below is a simple if naive solution: (ab)use the bounce buffering code to copy
the memory page just prior to calculating the checksum, and send the copy and
the checksum to the disk controller. With this patch applied, the invalid
guard tag messages go away. An optimization would be to perform the copy only
when memory contents change, but I wanted to ask peoples' opinions before
continuing. I don't imagine bounce buffering is particularly speedy, though I
haven't noticed any corruption errors or weirdness yet.
Anyway, I'm mostly wondering: what do people think of this as a starting point
to fixing the DIF checksum problem?
--
block integrity: Fix write after integrity checksum calculation problem
The kernel does not prohibit writes to a dirty page during a write IO. This
leads to the race where a memory write occurs after the integrity data has been
calculated but before the hardware initiates the DMA operation; when this
happens, the data does not match the checksum and the disk fails the write due
to an invalid checksum.
An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot
a page's contents just prior to calculating the checksum, and sending the copy
and its checksum to the hardware. This is probably terrible for performance.
Signed-off-by: Darrick J. Wong <[email protected]>
---
block/Kconfig | 1
block/blk-core.c | 2 -
block/blk-integrity.c | 2 +
fs/bio-integrity.c | 12 +++-
include/linux/bio.h | 4 +
include/linux/blkdev.h | 12 ++++
mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++
7 files changed, 154 insertions(+), 8 deletions(-)
diff --git a/block/Kconfig b/block/Kconfig
index 60be1e0..ed89f19 100644
--- a/block/Kconfig
+++ b/block/Kconfig
@@ -66,6 +66,7 @@ config BLK_DEV_BSG
If unsure, say Y.
config BLK_DEV_INTEGRITY
+ depends on BOUNCE=y
bool "Block layer data integrity support"
---help---
Some storage devices allow extra information to be
diff --git a/block/blk-core.c b/block/blk-core.c
index 3cc3bed..81a4e50 100644
--- a/block/blk-core.c
+++ b/block/blk-core.c
@@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio)
*/
blk_partition_remap(bio);
- if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
+ if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio))
goto end_io;
if (old_sector != -1)
diff --git a/block/blk-integrity.c b/block/blk-integrity.c
index 54bcba6..24843f8 100644
--- a/block/blk-integrity.c
+++ b/block/blk-integrity.c
@@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
BUG_ON(disk == NULL);
+ init_integrity_pool();
+
if (disk->integrity == NULL) {
bi = kmem_cache_alloc(integrity_cachep,
GFP_KERNEL | __GFP_ZERO);
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index df4fdfd..9eee904 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
* block device's integrity function. In the READ case, the buffer
* will be prepared for DMA and a suitable end_io handler set up.
*/
-int bio_integrity_prep(struct bio *bio)
+int bio_integrity_prep(struct bio **pbio)
{
+ struct bio *bio;
struct bio_integrity_payload *bip;
struct blk_integrity *bi;
struct request_queue *q;
@@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio)
unsigned int bytes, offset, i;
unsigned int sectors;
- bi = bdev_get_integrity(bio->bi_bdev);
- q = bdev_get_queue(bio->bi_bdev);
+ bi = bdev_get_integrity((*pbio)->bi_bdev);
+ q = bdev_get_queue((*pbio)->bi_bdev);
BUG_ON(bi == NULL);
- BUG_ON(bio_integrity(bio));
+ BUG_ON(bio_integrity(*pbio));
+
+ blk_queue_integrity_bounce(q, pbio);
+ bio = *pbio;
sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
diff --git a/include/linux/bio.h b/include/linux/bio.h
index 36d10f0..1834934 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
for_each_bio(_bio) \
bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
-#define bio_integrity(bio) (bio->bi_integrity != NULL)
+#define bio_integrity(bio) ((bio)->bi_integrity != NULL)
extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
@@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns
extern int bio_integrity_enabled(struct bio *bio);
extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
-extern int bio_integrity_prep(struct bio *);
+extern int bio_integrity_prep(struct bio **);
extern void bio_integrity_endio(struct bio *, int);
extern void bio_integrity_advance(struct bio *, unsigned int);
extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
index 8a082a5..1c9fc40 100644
--- a/include/linux/blkdev.h
+++ b/include/linux/blkdev.h
@@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
#define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
#define BLK_MIN_SG_TIMEOUT (7 * HZ)
+#ifdef CONFIG_BLK_DEV_INTEGRITY
+extern int init_integrity_pool(void);
+extern void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio);
+#else
+#define init_integrity_pool() do { } while (0);
+static inline void blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio)
+{
+}
+#endif
+
#ifdef CONFIG_BOUNCE
extern int init_emergency_isa_pool(void);
extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
diff --git a/mm/bounce.c b/mm/bounce.c
index 1481de6..c0ce533 100644
--- a/mm/bounce.c
+++ b/mm/bounce.c
@@ -21,7 +21,19 @@
#define POOL_SIZE 64
#define ISA_POOL_SIZE 16
-static mempool_t *page_pool, *isa_page_pool;
+static mempool_t *integrity_pool, *page_pool, *isa_page_pool;
+
+int init_integrity_pool(void)
+{
+ if (integrity_pool)
+ return 0;
+
+ integrity_pool = mempool_create_page_pool(POOL_SIZE, 0);
+ BUG_ON(!integrity_pool);
+ printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE);
+
+ return 0;
+}
#ifdef CONFIG_HIGHMEM
static __init int init_emergency_pool(void)
@@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err)
bounce_end_io(bio, page_pool, err);
}
+static void bounce_end_integrity_io_write(struct bio *bio, int err)
+{
+ bounce_end_io(bio, integrity_pool, err);
+}
+
static void bounce_end_io_write_isa(struct bio *bio, int err)
{
@@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
}
EXPORT_SYMBOL(blk_queue_bounce);
+
+/*
+ * Fix the problem of pages getting dirtied after the integrity checksum
+ * calculation by copying all writes to a shadow buffer prior to computing
+ * checksums.
+ */
+static void __blk_queue_integrity_bounce(struct request_queue *q,
+ struct bio **bio_orig,
+ mempool_t *pool)
+{
+ struct page *page;
+ struct bio *bio = NULL;
+ int i;
+ struct bio_vec *to, *from;
+
+ bio_for_each_segment(from, *bio_orig, i) {
+ char *vto, *vfrom;
+ page = from->bv_page;
+
+ /*
+ * irk, bounce it
+ */
+ if (!bio) {
+ unsigned int cnt = (*bio_orig)->bi_vcnt;
+
+ bio = bio_alloc(GFP_NOIO, cnt);
+ memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec));
+ }
+
+
+ to = bio->bi_io_vec + i;
+
+ to->bv_page = mempool_alloc(pool, q->bounce_gfp);
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ inc_zone_page_state(to->bv_page, NR_BOUNCE);
+
+ flush_dcache_page(from->bv_page);
+ vto = page_address(to->bv_page) + to->bv_offset;
+ vfrom = kmap(from->bv_page) + from->bv_offset;
+ memcpy(vto, vfrom, to->bv_len);
+ kunmap(from->bv_page);
+ }
+
+ /*
+ * no pages bounced
+ */
+ if (!bio)
+ return;
+
+ trace_block_bio_bounce(q, *bio_orig);
+
+ /*
+ * at least one page was bounced, fill in possible non-highmem
+ * pages
+ */
+ __bio_for_each_segment(from, *bio_orig, i, 0) {
+ to = bio_iovec_idx(bio, i);
+ if (!to->bv_page) {
+ to->bv_page = from->bv_page;
+ to->bv_len = from->bv_len;
+ to->bv_offset = from->bv_offset;
+ }
+ }
+
+ bio->bi_bdev = (*bio_orig)->bi_bdev;
+ bio->bi_flags |= (1 << BIO_BOUNCED);
+ bio->bi_sector = (*bio_orig)->bi_sector;
+ bio->bi_rw = (*bio_orig)->bi_rw;
+
+ bio->bi_vcnt = (*bio_orig)->bi_vcnt;
+ bio->bi_idx = (*bio_orig)->bi_idx;
+ bio->bi_size = (*bio_orig)->bi_size;
+
+ if (pool == integrity_pool)
+ bio->bi_end_io = bounce_end_integrity_io_write;
+ else
+ bio->bi_end_io = bounce_end_io_write_isa;
+
+ bio->bi_private = *bio_orig;
+ *bio_orig = bio;
+}
+
+void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig)
+{
+ mempool_t *pool;
+
+ /* only do this for writes */
+ if (bio_data_dir(*bio_orig) != WRITE)
+ return;
+ /*
+ * Data-less bio, nothing to bounce
+ */
+ if (!bio_has_data(*bio_orig))
+ return;
+
+ if (!(q->bounce_gfp & GFP_DMA)) {
+ BUG_ON(!integrity_pool);
+ pool = integrity_pool;
+ } else {
+ BUG_ON(!isa_page_pool);
+ pool = isa_page_pool;
+ }
+
+ /*
+ * slow path
+ */
+ __blk_queue_integrity_bounce(q, bio_orig, pool);
+}
+EXPORT_SYMBOL(blk_queue_integrity_bounce);
On 02/21/2011 06:00 PM, Darrick J. Wong wrote:
> Hi all,
>
> Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> discussion about how to deal with the situation where one program tells the
> kernel to write a block to disk, the kernel computes the checksum of that data,
> and then a second program begins writing to that same block before the disk HBA
> can DMA the memory block, thereby causing the disk to complain about being sent
> invalid checksums.
The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
of ext4 it should work much better. (If you still have problems please report
them, those FSs advertise stable pages write-out)
This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
and syncing with write-out for example by taking the page-lock. Currently each
FS is to itself because in VFS it would force the behaviour on FSs that it does
not make sense to.
Note that the proper solution does not copy any data, just forces the app to
wait before changing write-out pages.
Boaz
>
> By my recollection, several solutions were proposed, such as retrying the I/O
> with a new checksum; using the VM to track and stop page writes during I/O;
> creating shadow pages so that subsequent memory writes go to a different page;
> punting the integrity failure to the app to let it deal with it; and only
> allowing DIF mode when O_DIRECT is enabled.
>
> Unfortunately, I didn't get a sense that any sort of consensus had been reached
> (much less anything patched into the kernel) and since I was able to write a
> trivial program to trigger the write problem, I'm pretty sure that this has not
> been fixed upstream. (FYI, using O_DIRECT still seems fine.)
>
> Below is a simple if naive solution: (ab)use the bounce buffering code to copy
> the memory page just prior to calculating the checksum, and send the copy and
> the checksum to the disk controller. With this patch applied, the invalid
> guard tag messages go away. An optimization would be to perform the copy only
> when memory contents change, but I wanted to ask peoples' opinions before
> continuing. I don't imagine bounce buffering is particularly speedy, though I
> haven't noticed any corruption errors or weirdness yet.
>
> Anyway, I'm mostly wondering: what do people think of this as a starting point
> to fixing the DIF checksum problem?
>
> --
> block integrity: Fix write after integrity checksum calculation problem
>
> The kernel does not prohibit writes to a dirty page during a write IO. This
> leads to the race where a memory write occurs after the integrity data has been
> calculated but before the hardware initiates the DMA operation; when this
> happens, the data does not match the checksum and the disk fails the write due
> to an invalid checksum.
>
> An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot
> a page's contents just prior to calculating the checksum, and sending the copy
> and its checksum to the hardware. This is probably terrible for performance.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> block/Kconfig | 1
> block/blk-core.c | 2 -
> block/blk-integrity.c | 2 +
> fs/bio-integrity.c | 12 +++-
> include/linux/bio.h | 4 +
> include/linux/blkdev.h | 12 ++++
> mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++
> 7 files changed, 154 insertions(+), 8 deletions(-)
>
> diff --git a/block/Kconfig b/block/Kconfig
> index 60be1e0..ed89f19 100644
> --- a/block/Kconfig
> +++ b/block/Kconfig
> @@ -66,6 +66,7 @@ config BLK_DEV_BSG
> If unsure, say Y.
>
> config BLK_DEV_INTEGRITY
> + depends on BOUNCE=y
> bool "Block layer data integrity support"
> ---help---
> Some storage devices allow extra information to be
> diff --git a/block/blk-core.c b/block/blk-core.c
> index 3cc3bed..81a4e50 100644
> --- a/block/blk-core.c
> +++ b/block/blk-core.c
> @@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio)
> */
> blk_partition_remap(bio);
>
> - if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
> + if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio))
> goto end_io;
>
> if (old_sector != -1)
> diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> index 54bcba6..24843f8 100644
> --- a/block/blk-integrity.c
> +++ b/block/blk-integrity.c
> @@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
>
> BUG_ON(disk == NULL);
>
> + init_integrity_pool();
> +
> if (disk->integrity == NULL) {
> bi = kmem_cache_alloc(integrity_cachep,
> GFP_KERNEL | __GFP_ZERO);
> diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
> index df4fdfd..9eee904 100644
> --- a/fs/bio-integrity.c
> +++ b/fs/bio-integrity.c
> @@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
> * block device's integrity function. In the READ case, the buffer
> * will be prepared for DMA and a suitable end_io handler set up.
> */
> -int bio_integrity_prep(struct bio *bio)
> +int bio_integrity_prep(struct bio **pbio)
> {
> + struct bio *bio;
> struct bio_integrity_payload *bip;
> struct blk_integrity *bi;
> struct request_queue *q;
> @@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio)
> unsigned int bytes, offset, i;
> unsigned int sectors;
>
> - bi = bdev_get_integrity(bio->bi_bdev);
> - q = bdev_get_queue(bio->bi_bdev);
> + bi = bdev_get_integrity((*pbio)->bi_bdev);
> + q = bdev_get_queue((*pbio)->bi_bdev);
> BUG_ON(bi == NULL);
> - BUG_ON(bio_integrity(bio));
> + BUG_ON(bio_integrity(*pbio));
> +
> + blk_queue_integrity_bounce(q, pbio);
> + bio = *pbio;
>
> sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
>
> diff --git a/include/linux/bio.h b/include/linux/bio.h
> index 36d10f0..1834934 100644
> --- a/include/linux/bio.h
> +++ b/include/linux/bio.h
> @@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
> for_each_bio(_bio) \
> bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
>
> -#define bio_integrity(bio) (bio->bi_integrity != NULL)
> +#define bio_integrity(bio) ((bio)->bi_integrity != NULL)
>
> extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
> extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
> @@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns
> extern int bio_integrity_enabled(struct bio *bio);
> extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
> extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
> -extern int bio_integrity_prep(struct bio *);
> +extern int bio_integrity_prep(struct bio **);
> extern void bio_integrity_endio(struct bio *, int);
> extern void bio_integrity_advance(struct bio *, unsigned int);
> extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
> diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> index 8a082a5..1c9fc40 100644
> --- a/include/linux/blkdev.h
> +++ b/include/linux/blkdev.h
> @@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
> #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
> #define BLK_MIN_SG_TIMEOUT (7 * HZ)
>
> +#ifdef CONFIG_BLK_DEV_INTEGRITY
> +extern int init_integrity_pool(void);
> +extern void blk_queue_integrity_bounce(struct request_queue *q,
> + struct bio **bio);
> +#else
> +#define init_integrity_pool() do { } while (0);
> +static inline void blk_queue_integrity_bounce(struct request_queue *q,
> + struct bio **bio)
> +{
> +}
> +#endif
> +
> #ifdef CONFIG_BOUNCE
> extern int init_emergency_isa_pool(void);
> extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
> diff --git a/mm/bounce.c b/mm/bounce.c
> index 1481de6..c0ce533 100644
> --- a/mm/bounce.c
> +++ b/mm/bounce.c
> @@ -21,7 +21,19 @@
> #define POOL_SIZE 64
> #define ISA_POOL_SIZE 16
>
> -static mempool_t *page_pool, *isa_page_pool;
> +static mempool_t *integrity_pool, *page_pool, *isa_page_pool;
> +
> +int init_integrity_pool(void)
> +{
> + if (integrity_pool)
> + return 0;
> +
> + integrity_pool = mempool_create_page_pool(POOL_SIZE, 0);
> + BUG_ON(!integrity_pool);
> + printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE);
> +
> + return 0;
> +}
>
> #ifdef CONFIG_HIGHMEM
> static __init int init_emergency_pool(void)
> @@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err)
> bounce_end_io(bio, page_pool, err);
> }
>
> +static void bounce_end_integrity_io_write(struct bio *bio, int err)
> +{
> + bounce_end_io(bio, integrity_pool, err);
> +}
> +
> static void bounce_end_io_write_isa(struct bio *bio, int err)
> {
>
> @@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
> }
>
> EXPORT_SYMBOL(blk_queue_bounce);
> +
> +/*
> + * Fix the problem of pages getting dirtied after the integrity checksum
> + * calculation by copying all writes to a shadow buffer prior to computing
> + * checksums.
> + */
> +static void __blk_queue_integrity_bounce(struct request_queue *q,
> + struct bio **bio_orig,
> + mempool_t *pool)
> +{
> + struct page *page;
> + struct bio *bio = NULL;
> + int i;
> + struct bio_vec *to, *from;
> +
> + bio_for_each_segment(from, *bio_orig, i) {
> + char *vto, *vfrom;
> + page = from->bv_page;
> +
> + /*
> + * irk, bounce it
> + */
> + if (!bio) {
> + unsigned int cnt = (*bio_orig)->bi_vcnt;
> +
> + bio = bio_alloc(GFP_NOIO, cnt);
> + memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec));
> + }
> +
> +
> + to = bio->bi_io_vec + i;
> +
> + to->bv_page = mempool_alloc(pool, q->bounce_gfp);
> + to->bv_len = from->bv_len;
> + to->bv_offset = from->bv_offset;
> + inc_zone_page_state(to->bv_page, NR_BOUNCE);
> +
> + flush_dcache_page(from->bv_page);
> + vto = page_address(to->bv_page) + to->bv_offset;
> + vfrom = kmap(from->bv_page) + from->bv_offset;
> + memcpy(vto, vfrom, to->bv_len);
> + kunmap(from->bv_page);
> + }
> +
> + /*
> + * no pages bounced
> + */
> + if (!bio)
> + return;
> +
> + trace_block_bio_bounce(q, *bio_orig);
> +
> + /*
> + * at least one page was bounced, fill in possible non-highmem
> + * pages
> + */
> + __bio_for_each_segment(from, *bio_orig, i, 0) {
> + to = bio_iovec_idx(bio, i);
> + if (!to->bv_page) {
> + to->bv_page = from->bv_page;
> + to->bv_len = from->bv_len;
> + to->bv_offset = from->bv_offset;
> + }
> + }
> +
> + bio->bi_bdev = (*bio_orig)->bi_bdev;
> + bio->bi_flags |= (1 << BIO_BOUNCED);
> + bio->bi_sector = (*bio_orig)->bi_sector;
> + bio->bi_rw = (*bio_orig)->bi_rw;
> +
> + bio->bi_vcnt = (*bio_orig)->bi_vcnt;
> + bio->bi_idx = (*bio_orig)->bi_idx;
> + bio->bi_size = (*bio_orig)->bi_size;
> +
> + if (pool == integrity_pool)
> + bio->bi_end_io = bounce_end_integrity_io_write;
> + else
> + bio->bi_end_io = bounce_end_io_write_isa;
> +
> + bio->bi_private = *bio_orig;
> + *bio_orig = bio;
> +}
> +
> +void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig)
> +{
> + mempool_t *pool;
> +
> + /* only do this for writes */
> + if (bio_data_dir(*bio_orig) != WRITE)
> + return;
> + /*
> + * Data-less bio, nothing to bounce
> + */
> + if (!bio_has_data(*bio_orig))
> + return;
> +
> + if (!(q->bounce_gfp & GFP_DMA)) {
> + BUG_ON(!integrity_pool);
> + pool = integrity_pool;
> + } else {
> + BUG_ON(!isa_page_pool);
> + pool = isa_page_pool;
> + }
> +
> + /*
> + * slow path
> + */
> + __blk_queue_integrity_bounce(q, bio_orig, pool);
> +}
> +EXPORT_SYMBOL(blk_queue_integrity_bounce);
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Hi Boaz,
On Mon 21-02-11 21:45:51, Boaz Harrosh wrote:
> On 02/21/2011 06:00 PM, Darrick J. Wong wrote:
> > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> > discussion about how to deal with the situation where one program tells the
> > kernel to write a block to disk, the kernel computes the checksum of that data,
> > and then a second program begins writing to that same block before the disk HBA
> > can DMA the memory block, thereby causing the disk to complain about being sent
> > invalid checksums.
>
> The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
> of ext4 it should work much better. (If you still have problems please report
> them, those FSs advertise stable pages write-out)
Do they? I've just checked ext4 and xfs and they don't seem to enforce
stable pages. They do lock the page (which implicitely happens in mm code
for any filesystem BTW) but this is not enough. You have to wait for
PageWriteback to get cleared and only btrfs does that.
> This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
> and syncing with write-out for example by taking the page-lock. Currently each
> FS is to itself because in VFS it would force the behaviour on FSs that it does
> not make sense to.
Yes, it's easy to fix but at a performance cost for any application doing
frequent rewrites regardless whether integrity features are used or not.
And I don't think that's a good thing. I even remember someone measured the
hit last time this came up and it was rather noticeable.
> Note that the proper solution does not copy any data, just forces the app to
> wait before changing write-out pages.
I think that's up for discussion. In fact what is going to be faster
depends pretty much on your system config. If you have enough CPU/RAM
bandwidth compared to storage speed, you're better of doing copying. If
you can barely saturate storage with your CPU/RAM, waiting is probably
better for you.
Moreover if you do data copyout, you push the performance cost only on
users of the integrity feature which is nice. But on the other hand users
of integrity take the cost even if they are not doing rewrites.
A solution which is technically plausible and penalizing only rewrites
of data-integrity protected pages would be a use of shadow pages as Darrick
describes below. So I'd lean towards that long term. But for now I think
Darrick's solution is OK to make the integrity feature actually useful and
later someone can try something more clever.
Honza
> > By my recollection, several solutions were proposed, such as retrying the I/O
> > with a new checksum; using the VM to track and stop page writes during I/O;
> > creating shadow pages so that subsequent memory writes go to a different page;
> > punting the integrity failure to the app to let it deal with it; and only
> > allowing DIF mode when O_DIRECT is enabled.
> >
> > Unfortunately, I didn't get a sense that any sort of consensus had been reached
> > (much less anything patched into the kernel) and since I was able to write a
> > trivial program to trigger the write problem, I'm pretty sure that this has not
> > been fixed upstream. (FYI, using O_DIRECT still seems fine.)
> >
> > Below is a simple if naive solution: (ab)use the bounce buffering code to copy
> > the memory page just prior to calculating the checksum, and send the copy and
> > the checksum to the disk controller. With this patch applied, the invalid
> > guard tag messages go away. An optimization would be to perform the copy only
> > when memory contents change, but I wanted to ask peoples' opinions before
> > continuing. I don't imagine bounce buffering is particularly speedy, though I
> > haven't noticed any corruption errors or weirdness yet.
> >
> > Anyway, I'm mostly wondering: what do people think of this as a starting point
> > to fixing the DIF checksum problem?
> >
> > --
> > block integrity: Fix write after integrity checksum calculation problem
> >
> > The kernel does not prohibit writes to a dirty page during a write IO. This
> > leads to the race where a memory write occurs after the integrity data has been
> > calculated but before the hardware initiates the DMA operation; when this
> > happens, the data does not match the checksum and the disk fails the write due
> > to an invalid checksum.
> >
> > An awkward fix for now is to (ab)use the bounce buffering mechanism to snapshot
> > a page's contents just prior to calculating the checksum, and sending the copy
> > and its checksum to the hardware. This is probably terrible for performance.
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> >
> > block/Kconfig | 1
> > block/blk-core.c | 2 -
> > block/blk-integrity.c | 2 +
> > fs/bio-integrity.c | 12 +++-
> > include/linux/bio.h | 4 +
> > include/linux/blkdev.h | 12 ++++
> > mm/bounce.c | 129 ++++++++++++++++++++++++++++++++++++++++++++++++
> > 7 files changed, 154 insertions(+), 8 deletions(-)
> >
> > diff --git a/block/Kconfig b/block/Kconfig
> > index 60be1e0..ed89f19 100644
> > --- a/block/Kconfig
> > +++ b/block/Kconfig
> > @@ -66,6 +66,7 @@ config BLK_DEV_BSG
> > If unsure, say Y.
> >
> > config BLK_DEV_INTEGRITY
> > + depends on BOUNCE=y
> > bool "Block layer data integrity support"
> > ---help---
> > Some storage devices allow extra information to be
> > diff --git a/block/blk-core.c b/block/blk-core.c
> > index 3cc3bed..81a4e50 100644
> > --- a/block/blk-core.c
> > +++ b/block/blk-core.c
> > @@ -1503,7 +1503,7 @@ static inline void __generic_make_request(struct bio *bio)
> > */
> > blk_partition_remap(bio);
> >
> > - if (bio_integrity_enabled(bio) && bio_integrity_prep(bio))
> > + if (bio_integrity_enabled(bio) && bio_integrity_prep(&bio))
> > goto end_io;
> >
> > if (old_sector != -1)
> > diff --git a/block/blk-integrity.c b/block/blk-integrity.c
> > index 54bcba6..24843f8 100644
> > --- a/block/blk-integrity.c
> > +++ b/block/blk-integrity.c
> > @@ -376,6 +376,8 @@ int blk_integrity_register(struct gendisk *disk, struct blk_integrity *template)
> >
> > BUG_ON(disk == NULL);
> >
> > + init_integrity_pool();
> > +
> > if (disk->integrity == NULL) {
> > bi = kmem_cache_alloc(integrity_cachep,
> > GFP_KERNEL | __GFP_ZERO);
> > diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
> > index df4fdfd..9eee904 100644
> > --- a/fs/bio-integrity.c
> > +++ b/fs/bio-integrity.c
> > @@ -393,8 +393,9 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
> > * block device's integrity function. In the READ case, the buffer
> > * will be prepared for DMA and a suitable end_io handler set up.
> > */
> > -int bio_integrity_prep(struct bio *bio)
> > +int bio_integrity_prep(struct bio **pbio)
> > {
> > + struct bio *bio;
> > struct bio_integrity_payload *bip;
> > struct blk_integrity *bi;
> > struct request_queue *q;
> > @@ -404,10 +405,13 @@ int bio_integrity_prep(struct bio *bio)
> > unsigned int bytes, offset, i;
> > unsigned int sectors;
> >
> > - bi = bdev_get_integrity(bio->bi_bdev);
> > - q = bdev_get_queue(bio->bi_bdev);
> > + bi = bdev_get_integrity((*pbio)->bi_bdev);
> > + q = bdev_get_queue((*pbio)->bi_bdev);
> > BUG_ON(bi == NULL);
> > - BUG_ON(bio_integrity(bio));
> > + BUG_ON(bio_integrity(*pbio));
> > +
> > + blk_queue_integrity_bounce(q, pbio);
> > + bio = *pbio;
> >
> > sectors = bio_integrity_hw_sectors(bi, bio_sectors(bio));
> >
> > diff --git a/include/linux/bio.h b/include/linux/bio.h
> > index 36d10f0..1834934 100644
> > --- a/include/linux/bio.h
> > +++ b/include/linux/bio.h
> > @@ -505,7 +505,7 @@ static inline struct bio *bio_list_get(struct bio_list *bl)
> > for_each_bio(_bio) \
> > bip_for_each_vec(_bvl, _bio->bi_integrity, _iter)
> >
> > -#define bio_integrity(bio) (bio->bi_integrity != NULL)
> > +#define bio_integrity(bio) ((bio)->bi_integrity != NULL)
> >
> > extern struct bio_integrity_payload *bio_integrity_alloc_bioset(struct bio *, gfp_t, unsigned int, struct bio_set *);
> > extern struct bio_integrity_payload *bio_integrity_alloc(struct bio *, gfp_t, unsigned int);
> > @@ -514,7 +514,7 @@ extern int bio_integrity_add_page(struct bio *, struct page *, unsigned int, uns
> > extern int bio_integrity_enabled(struct bio *bio);
> > extern int bio_integrity_set_tag(struct bio *, void *, unsigned int);
> > extern int bio_integrity_get_tag(struct bio *, void *, unsigned int);
> > -extern int bio_integrity_prep(struct bio *);
> > +extern int bio_integrity_prep(struct bio **);
> > extern void bio_integrity_endio(struct bio *, int);
> > extern void bio_integrity_advance(struct bio *, unsigned int);
> > extern void bio_integrity_trim(struct bio *, unsigned int, unsigned int);
> > diff --git a/include/linux/blkdev.h b/include/linux/blkdev.h
> > index 8a082a5..1c9fc40 100644
> > --- a/include/linux/blkdev.h
> > +++ b/include/linux/blkdev.h
> > @@ -599,6 +599,18 @@ extern unsigned long blk_max_low_pfn, blk_max_pfn;
> > #define BLK_DEFAULT_SG_TIMEOUT (60 * HZ)
> > #define BLK_MIN_SG_TIMEOUT (7 * HZ)
> >
> > +#ifdef CONFIG_BLK_DEV_INTEGRITY
> > +extern int init_integrity_pool(void);
> > +extern void blk_queue_integrity_bounce(struct request_queue *q,
> > + struct bio **bio);
> > +#else
> > +#define init_integrity_pool() do { } while (0);
> > +static inline void blk_queue_integrity_bounce(struct request_queue *q,
> > + struct bio **bio)
> > +{
> > +}
> > +#endif
> > +
> > #ifdef CONFIG_BOUNCE
> > extern int init_emergency_isa_pool(void);
> > extern void blk_queue_bounce(struct request_queue *q, struct bio **bio);
> > diff --git a/mm/bounce.c b/mm/bounce.c
> > index 1481de6..c0ce533 100644
> > --- a/mm/bounce.c
> > +++ b/mm/bounce.c
> > @@ -21,7 +21,19 @@
> > #define POOL_SIZE 64
> > #define ISA_POOL_SIZE 16
> >
> > -static mempool_t *page_pool, *isa_page_pool;
> > +static mempool_t *integrity_pool, *page_pool, *isa_page_pool;
> > +
> > +int init_integrity_pool(void)
> > +{
> > + if (integrity_pool)
> > + return 0;
> > +
> > + integrity_pool = mempool_create_page_pool(POOL_SIZE, 0);
> > + BUG_ON(!integrity_pool);
> > + printk(KERN_INFO "integrity bounce pool size: %d pages\n", POOL_SIZE);
> > +
> > + return 0;
> > +}
> >
> > #ifdef CONFIG_HIGHMEM
> > static __init int init_emergency_pool(void)
> > @@ -151,6 +163,11 @@ static void bounce_end_io_write(struct bio *bio, int err)
> > bounce_end_io(bio, page_pool, err);
> > }
> >
> > +static void bounce_end_integrity_io_write(struct bio *bio, int err)
> > +{
> > + bounce_end_io(bio, integrity_pool, err);
> > +}
> > +
> > static void bounce_end_io_write_isa(struct bio *bio, int err)
> > {
> >
> > @@ -298,3 +315,113 @@ void blk_queue_bounce(struct request_queue *q, struct bio **bio_orig)
> > }
> >
> > EXPORT_SYMBOL(blk_queue_bounce);
> > +
> > +/*
> > + * Fix the problem of pages getting dirtied after the integrity checksum
> > + * calculation by copying all writes to a shadow buffer prior to computing
> > + * checksums.
> > + */
> > +static void __blk_queue_integrity_bounce(struct request_queue *q,
> > + struct bio **bio_orig,
> > + mempool_t *pool)
> > +{
> > + struct page *page;
> > + struct bio *bio = NULL;
> > + int i;
> > + struct bio_vec *to, *from;
> > +
> > + bio_for_each_segment(from, *bio_orig, i) {
> > + char *vto, *vfrom;
> > + page = from->bv_page;
> > +
> > + /*
> > + * irk, bounce it
> > + */
> > + if (!bio) {
> > + unsigned int cnt = (*bio_orig)->bi_vcnt;
> > +
> > + bio = bio_alloc(GFP_NOIO, cnt);
> > + memset(bio->bi_io_vec, 0, cnt * sizeof(struct bio_vec));
> > + }
> > +
> > +
> > + to = bio->bi_io_vec + i;
> > +
> > + to->bv_page = mempool_alloc(pool, q->bounce_gfp);
> > + to->bv_len = from->bv_len;
> > + to->bv_offset = from->bv_offset;
> > + inc_zone_page_state(to->bv_page, NR_BOUNCE);
> > +
> > + flush_dcache_page(from->bv_page);
> > + vto = page_address(to->bv_page) + to->bv_offset;
> > + vfrom = kmap(from->bv_page) + from->bv_offset;
> > + memcpy(vto, vfrom, to->bv_len);
> > + kunmap(from->bv_page);
> > + }
> > +
> > + /*
> > + * no pages bounced
> > + */
> > + if (!bio)
> > + return;
> > +
> > + trace_block_bio_bounce(q, *bio_orig);
> > +
> > + /*
> > + * at least one page was bounced, fill in possible non-highmem
> > + * pages
> > + */
> > + __bio_for_each_segment(from, *bio_orig, i, 0) {
> > + to = bio_iovec_idx(bio, i);
> > + if (!to->bv_page) {
> > + to->bv_page = from->bv_page;
> > + to->bv_len = from->bv_len;
> > + to->bv_offset = from->bv_offset;
> > + }
> > + }
> > +
> > + bio->bi_bdev = (*bio_orig)->bi_bdev;
> > + bio->bi_flags |= (1 << BIO_BOUNCED);
> > + bio->bi_sector = (*bio_orig)->bi_sector;
> > + bio->bi_rw = (*bio_orig)->bi_rw;
> > +
> > + bio->bi_vcnt = (*bio_orig)->bi_vcnt;
> > + bio->bi_idx = (*bio_orig)->bi_idx;
> > + bio->bi_size = (*bio_orig)->bi_size;
> > +
> > + if (pool == integrity_pool)
> > + bio->bi_end_io = bounce_end_integrity_io_write;
> > + else
> > + bio->bi_end_io = bounce_end_io_write_isa;
> > +
> > + bio->bi_private = *bio_orig;
> > + *bio_orig = bio;
> > +}
> > +
> > +void blk_queue_integrity_bounce(struct request_queue *q, struct bio **bio_orig)
> > +{
> > + mempool_t *pool;
> > +
> > + /* only do this for writes */
> > + if (bio_data_dir(*bio_orig) != WRITE)
> > + return;
> > + /*
> > + * Data-less bio, nothing to bounce
> > + */
> > + if (!bio_has_data(*bio_orig))
> > + return;
> > +
> > + if (!(q->bounce_gfp & GFP_DMA)) {
> > + BUG_ON(!integrity_pool);
> > + pool = integrity_pool;
> > + } else {
> > + BUG_ON(!isa_page_pool);
> > + pool = isa_page_pool;
> > + }
> > +
> > + /*
> > + * slow path
> > + */
> > + __blk_queue_integrity_bounce(q, bio_orig, pool);
> > +}
> > +EXPORT_SYMBOL(blk_queue_integrity_bounce);
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> >
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
--
Jan Kara <[email protected]>
SUSE Labs, CR
[ resend sorry if you get this twice ]
Excerpts from Jan Kara's message of 2011-02-22 06:42:22 -0500:
> Hi Boaz,
>
> On Mon 21-02-11 21:45:51, Boaz Harrosh wrote:
> > On 02/21/2011 06:00 PM, Darrick J. Wong wrote:
> > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> > > discussion about how to deal with the situation where one program tells the
> > > kernel to write a block to disk, the kernel computes the checksum of that data,
> > > and then a second program begins writing to that same block before the disk HBA
> > > can DMA the memory block, thereby causing the disk to complain about being sent
> > > invalid checksums.
> >
> > The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
> > of ext4 it should work much better. (If you still have problems please report
> > them, those FSs advertise stable pages write-out)
> Do they? I've just checked ext4 and xfs and they don't seem to enforce
> stable pages. They do lock the page (which implicitely happens in mm code
> for any filesystem BTW) but this is not enough. You have to wait for
> PageWriteback to get cleared and only btrfs does that.
>
> > This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
> > and syncing with write-out for example by taking the page-lock. Currently each
> > FS is to itself because in VFS it would force the behaviour on FSs that it does
> > not make sense to.
> Yes, it's easy to fix but at a performance cost for any application doing
> frequent rewrites regardless whether integrity features are used or not.
> And I don't think that's a good thing. I even remember someone measured the
> hit last time this came up and it was rather noticeable.
Do you remember which workload this was? I do remember someone
mentioning a specific workload, but can't recall which one now. fsx is
definitely slower when we wait for writeback, but that's because it's
all evil inside.
>
> > Note that the proper solution does not copy any data, just forces the app to
> > wait before changing write-out pages.
> I think that's up for discussion. In fact what is going to be faster
> depends pretty much on your system config. If you have enough CPU/RAM
> bandwidth compared to storage speed, you're better of doing copying. If
> you can barely saturate storage with your CPU/RAM, waiting is probably
> better for you.
>
> Moreover if you do data copyout, you push the performance cost only on
> users of the integrity feature which is nice. But on the other hand users
> of integrity take the cost even if they are not doing rewrites.
>
> A solution which is technically plausible and penalizing only rewrites
> of data-integrity protected pages would be a use of shadow pages as Darrick
> describes below. So I'd lean towards that long term. But for now I think
> Darrick's solution is OK to make the integrity feature actually useful and
> later someone can try something more clever.
Rewrites in flight should be very rare though, and I think the bouncing
is going to have a big impact on the intended workloads. It's not just
the cost of the copy, it's also the increased time as we beat on the
page allocator.
We're working on adding stable pages to ext34 and other filesystems
missing it. When the work is done we can benchmark and decide on the
tradeoffs.
-chris
On 2011-02-21, at 19:00, "Darrick J. Wong" <[email protected]> wrote:
> Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> discussion about how to deal with the situation where one program tells the
> kernel to write a block to disk, the kernel computes the checksum of that data,
> and then a second program begins writing to that same block before the disk HBA
> can DMA the memory block, thereby causing the disk to complain about being sent
> invalid checksums.
>
> I was able to write a
> trivial program to trigger the write problem, I'm pretty sure that this has not
> been fixed upstream. (FYI, using O_DIRECT still seems fine.)
Can you please attach your reproducer? IIRC it needed mmap() to hit this problem? Did you measure CPU usage during your testing?
> Below is a simple if naive solution: (ab)use the bounce buffering code to copy
> the memory page just prior to calculating the checksum, and send the copy and
> the checksum to the disk controller. With this patch applied, the invalid
> guard tag messages go away. An optimization would be to perform the copy only
> when memory contents change, but I wanted to ask peoples' opinions before
> continuing. I don't imagine bounce buffering is particularly speedy, though I
> haven't noticed any corruption errors or weirdness yet.
I don't like adding a data copy in the IO path at all. We are just looking to enable T10 DIF for Lustre and this would definitely hurt performance significantly, even though it isn't needed there at all (since the server side has proper locking of the pages to prevent multiple writers to the same page).
> Anyway, I'm mostly wondering: what do people think of this as a starting point
> to fixing the DIF checksum problem?
I'd definitely prefer that the filesystem be in charge of deciding whether this is needed or not. If the use of the data copy can be constrained to only the minimum required cases (e.g. if fs checks for rewrite on page that is marked as Writeback and either copies or blocks until writeback is complete, as a mount option) that would be better. At that point we can compare on different hardware whether copying or blocking should be the default.
Cheers, Andreas
>>>>> "Andreas" == Andreas Dilger <[email protected]> writes:
Andreas> I don't like adding a data copy in the IO path at all.
No thanks!
Andreas> I'd definitely prefer that the filesystem be in charge of
Andreas> deciding whether this is needed or not.
Absolutely. At the block layer we obviously have no idea whether the
filesystem is safe or not. So in my current tree the protection is only
generated if the relevant bio flag is set (unless the application
already added it, obviously).
Andreas> If the use of the data copy can be constrained to only the
Andreas> minimum required cases (e.g. if fs checks for rewrite on page
Andreas> that is marked as Writeback and either copies or blocks until
Andreas> writeback is complete, as a mount option) that would be
Andreas> better. At that point we can compare on different hardware
Andreas> whether copying or blocking should be the default.
Agreed.
As Chris mentioned we've got somebody on our team working through the
filesystem issues now. I'm hoping we can provide a status update at
LSF2011.
--
Martin K. Petersen Oracle Linux Engineering
>>>>> "Darrick" == Darrick J Wong <[email protected]> writes:
Darrick> Unfortunately, I didn't get a sense that any sort of consensus
Darrick> had been reached
We had a session on this topic at the storage workshop in Auguest. The
consensus was that filesystems really need to handle this.
Also, DIX is only the tip of the iceberg. Many other impending
technologies feature checksums and require pages to be stable during I/O
due to checksumming, encryption and so on.
The VM is already trying to do the right thing. We just need the
relevant filesystems to catch up.
--
Martin K. Petersen Oracle Linux Engineering
On 02/22/2011 05:02 AM, Chris Mason wrote:
> [ resend sorry if you get this twice ]
>
> Excerpts from Jan Kara's message of 2011-02-22 06:42:22 -0500:
>> Hi Boaz,
>>
>> On Mon 21-02-11 21:45:51, Boaz Harrosh wrote:
>>> On 02/21/2011 06:00 PM, Darrick J. Wong wrote:
>>>> Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
>>>> write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
>>>> discussion about how to deal with the situation where one program tells the
>>>> kernel to write a block to disk, the kernel computes the checksum of that data,
>>>> and then a second program begins writing to that same block before the disk HBA
>>>> can DMA the memory block, thereby causing the disk to complain about being sent
>>>> invalid checksums.
>>>
>>> The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
>>> of ext4 it should work much better. (If you still have problems please report
>>> them, those FSs advertise stable pages write-out)
>> Do they? I've just checked ext4 and xfs and they don't seem to enforce
>> stable pages. They do lock the page (which implicitely happens in mm code
>> for any filesystem BTW) but this is not enough. You have to wait for
>> PageWriteback to get cleared and only btrfs does that.
>>
>>> This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
>>> and syncing with write-out for example by taking the page-lock. Currently each
>>> FS is to itself because in VFS it would force the behaviour on FSs that it does
>>> not make sense to.
>> Yes, it's easy to fix but at a performance cost for any application doing
>> frequent rewrites regardless whether integrity features are used or not.
>> And I don't think that's a good thing. I even remember someone measured the
>> hit last time this came up and it was rather noticeable.
>
> Do you remember which workload this was? I do remember someone
> mentioning a specific workload, but can't recall which one now. fsx is
> definitely slower when we wait for writeback, but that's because it's
> all evil inside.
>
Me too
I have been asking on many occasions on multiple mailing lists and LSF last
year, if any one did any benchmarks on this issue, and no one came forward.
So please if someone is hiding his resaults come and show us we want to see?
I've been Playing with this for my raid5 code and Actually the problem
is not that bad. I'm using tar to exercise that. What I can see that
I can actually wait on a single page every once in like 15-30 seconds
but never consecutively on multiple pages. Because usually you wait on
an IOed page but all the pages in an IO are completed together and the
reset of the modifications go through.
I could not find any measurable performance difference, it was all well inside
the margin of the test. But maybe tar is just bad test.
>>
>>> Note that the proper solution does not copy any data, just forces the app to
>>> wait before changing write-out pages.
>> I think that's up for discussion. In fact what is going to be faster
>> depends pretty much on your system config. If you have enough CPU/RAM
>> bandwidth compared to storage speed, you're better of doing copying. If
>> you can barely saturate storage with your CPU/RAM, waiting is probably
>> better for you.
>>
>> Moreover if you do data copyout, you push the performance cost only on
>> users of the integrity feature which is nice. But on the other hand users
>> of integrity take the cost even if they are not doing rewrites.
>>
>> A solution which is technically plausible and penalizing only rewrites
>> of data-integrity protected pages would be a use of shadow pages as Darrick
>> describes below. So I'd lean towards that long term. But for now I think
>> Darrick's solution is OK to make the integrity feature actually useful and
>> later someone can try something more clever.
>
> Rewrites in flight should be very rare though, and I think the bouncing
> is going to have a big impact on the intended workloads. It's not just
> the cost of the copy, it's also the increased time as we beat on the
> page allocator.
>
> We're working on adding stable pages to ext34 and other filesystems
> missing it. When the work is done we can benchmark and decide on the
> tradeoffs.
>
Thanks, that would be nice. I bet the simple wait on write-out will beat
copy-everything , any day of the week.
Why don't we put some flags on the BDI that requests stable write-out
and devices with diff enabled or the likes of raid1/4/5/6 can turn
it on? (Also iscsi integrity checks as well)
> -chris
Boaz
On Tue, Feb 22, 2011 at 09:13:49AM -0700, Andreas Dilger wrote:
> On 2011-02-21, at 19:00, "Darrick J. Wong" <[email protected]> wrote:
> > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> > discussion about how to deal with the situation where one program tells the
> > kernel to write a block to disk, the kernel computes the checksum of that data,
> > and then a second program begins writing to that same block before the disk HBA
> > can DMA the memory block, thereby causing the disk to complain about being sent
> > invalid checksums.
> >
> > I was able to write a
> > trivial program to trigger the write problem, I'm pretty sure that this has not
> > been fixed upstream. (FYI, using O_DIRECT still seems fine.)
>
> Can you please attach your reproducer? IIRC it needed mmap() to hit this
> problem? Did you measure CPU usage during your testing?
I didn't need mmap; a lot of threads using write() was enough. (The reproducer
program does have a mmap mode though). Basically it creates a lot of threads
to write small blobs to random offsets in a file, with optional mmap, dio, and
sync options.
CPU usage was 100% the entire time since there were 64 threads running on 4
CPUs. With just write() mode about half that was userspace and the other half
was kernel. With write and mmap the balance became closer to 80/20.
The program is attached below. It can be built with a simple "cc -o wac wac.c".
The invocation that I was using to produce the errors was:
./wac -l65536 -n64 -r /mnt/junk
This creates a file that is 64K (16 pages) long and starts 64 threads that will
write() a small buffer and then call sync_file_range to force IO to happen.
With that I get a steady stream of DIF checksum errors on the console. If -r
is omitted they happen about once every commit= seconds.
> > Below is a simple if naive solution: (ab)use the bounce buffering code to copy
> > the memory page just prior to calculating the checksum, and send the copy and
> > the checksum to the disk controller. With this patch applied, the invalid
> > guard tag messages go away. An optimization would be to perform the copy only
> > when memory contents change, but I wanted to ask peoples' opinions before
> > continuing. I don't imagine bounce buffering is particularly speedy, though I
> > haven't noticed any corruption errors or weirdness yet.
>
> I don't like adding a data copy in the IO path at all. We are just looking to
> enable T10 DIF for Lustre and this would definitely hurt performance
> significantly, even though it isn't needed there at all (since the server
> side has proper locking of the pages to prevent multiple writers to the same
> page).
>
> > Anyway, I'm mostly wondering: what do people think of this as a starting point
> > to fixing the DIF checksum problem?
>
> I'd definitely prefer that the filesystem be in charge of deciding whether
> this is needed or not. If the use of the data copy can be constrained to only
> the minimum required cases (e.g. if fs checks for rewrite on page that is
> marked as Writeback and either copies or blocks until writeback is complete,
> as a mount option) that would be better. At that point we can compare on
> different hardware whether copying or blocking should be the default.
Agreed. I too am curious to study which circumstances favor copying vs
blocking.
> Cheers, Andreas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
/*
* Write-After-Checksum reproducer(?) program
* Copyright (C) 2011 IBM. All rights reserved.
* This program is licensed under the GPLv2.
*/
#define _XOPEN_SOURCE 600
#define _FILE_OFFSET_BITS 64
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>
#include <unistd.h>
#include <sys/types.h>
#include <sys/stat.h>
#include <sys/wait.h>
#include <sys/mman.h>
#include <string.h>
#define __USE_GNU
#include <fcntl.h>
#include <errno.h>
#include <inttypes.h>
#define SYNC_RANGE 1
#define SYNC_FILE 2
#define DEFAULT_BUFSIZE 4096
static uint32_t bufsize = DEFAULT_BUFSIZE;
void help(const char *pname)
{
printf("Usage: %s [-n threads] [-m threads] [-d] [-b blocksize] [-r] [-f] -l filesize filename\n", pname);
printf("-b Size of a memory page.\n");
printf("-d Use direct I/O.\n");
printf("-l Desired file size.\n");
printf("-n Use this many write() threads.\n");
printf("-m Use this many mmap write threads.\n");
printf("-s Synchronous disk writes.\n");
printf("-r Use sync_file_range after write.\n");
printf("-f fsync after write.\n");
}
int seed_random(void) {
int fp;
long seed;
fp = open("/dev/urandom", O_RDONLY);
if (fp < 0) {
perror("/dev/urandom");
return 0;
}
if (read(fp, &seed, sizeof(seed)) != sizeof(seed)) {
perror("read random seed");
return 0;
}
close(fp);
srand(seed);
return 1;
}
uint64_t get_randnum(uint64_t min, uint64_t max) {
return (min + (uint64_t)((double)(max - min) * (rand() / (RAND_MAX + 1.0))));
}
static uint64_t get_randnum_align(uint64_t min, uint64_t max, uint64_t align) {
return (min + (uint64_t)((double)(max - min) * (rand() / (RAND_MAX + 1.0)))) &
~(align - 1);
}
int write_junk(const char *fname, int flags, int sync_options, uint64_t file_size)
{
int fd, len;
uint64_t offset, generation = 0;
char *buf;
len = posix_memalign((void **)&buf, bufsize, bufsize);
if (len) {
errno = len;
perror("alloc");
return 66;
}
fd = open(fname, flags | O_WRONLY);
if (fd < 1) {
perror(fname);
return 64;
}
while (1) {
len = snprintf(buf, bufsize - 1, "%d - %"PRIu64, getpid(), generation++);
if (flags & O_DIRECT) {
len = bufsize;
offset = get_randnum_align(0, file_size - len, bufsize);
} else {
offset = get_randnum(0, file_size - len);
}
if (pwrite(fd, buf, len, offset) < 0) {
perror("pwrite");
close(fd);
free(buf);
return 65;
}
if ((sync_options & SYNC_RANGE) && sync_file_range(fd, offset, len, SYNC_FILE_RANGE_WAIT_BEFORE | SYNC_FILE_RANGE_WRITE | SYNC_FILE_RANGE_WAIT_AFTER) < 0) {
perror("sync_file_range");
close(fd);
free(buf);
return 67;
}
if ((sync_options & SYNC_FILE) && fsync(fd)) {
perror("fsync");
close(fd);
free(buf);
return 68;
}
}
return 0;
}
int mmap_junk(const char *fname, int flags, int sync_options, uint64_t file_size)
{
int fd, len;
uint64_t offset, generation = 0;
char *buf, *map;
long page_size;
page_size = sysconf(_SC_PAGESIZE);
if (page_size < 0) {
perror("_SC_PAGESIZE");
return 101;
}
fd = open(fname, flags | O_RDWR);
if (fd < 1) {
perror(fname);
return 96;
}
len = posix_memalign((void **)&buf, bufsize, bufsize);
if (len) {
errno = len;
perror("alloc");
return 102;
}
map = mmap(NULL, file_size, PROT_WRITE, MAP_SHARED, fd, 0);
if (map == MAP_FAILED) {
perror(fname);
return 97;
}
while (1) {
len = snprintf(buf, bufsize - 1, "%d - %"PRIu64, getpid(), generation++);
if (flags & O_DIRECT) {
len = bufsize;
offset = get_randnum_align(0, file_size - len, bufsize);
} else {
offset = get_randnum(0, file_size - len);
}
memcpy(map + offset, buf, len);
len += offset & (page_size - 1);
offset &= ~(page_size - 1);
if ((sync_options & SYNC_RANGE) && msync(map + offset, len, MS_SYNC | MS_INVALIDATE)) {
perror("msync");
munmap(map, file_size);
close(fd);
free(buf);
return 99;
}
if ((sync_options & SYNC_FILE) && fsync(fd)) {
perror("fsync");
munmap(map, file_size);
close(fd);
free(buf);
return 100;
}
}
return 0;
}
int main(int argc, char *argv[])
{
int opt, fd;
unsigned long i, mthreads = 0, nthreads = 1;
char *fname = NULL;
int flags = 0, sync_options = 0;
uint64_t file_size = 0;
pid_t pid;
int status;
while ((opt = getopt(argc, argv, "b:dn:l:srfm:")) != -1) {
switch (opt) {
case 'd':
flags |= O_DIRECT;
break;
case 'n':
nthreads = strtoul(optarg, NULL, 0);
break;
case 'm':
mthreads = strtoul(optarg, NULL, 0);
break;
case 'l':
file_size = strtoull(optarg, NULL, 0);
break;
case 's':
flags |= O_SYNC;
break;
case 'b':
bufsize = strtoul(optarg, NULL, 0);
break;
case 'r':
sync_options |= SYNC_RANGE;
break;
case 'f':
sync_options |= SYNC_FILE;
break;
default:
help(argv[0]);
return 4;
}
}
if (optind != argc - 1 || file_size < 1) {
help(argv[0]);
return 1;
}
fname = argv[optind];
if (!seed_random())
return 2;
// truncate first
fd = open(fname, flags | O_WRONLY | O_CREAT | O_TRUNC, S_IRUSR | S_IWUSR);
if (fd < 0) {
perror(fname);
return 3;
}
close(fd);
// spawn threads and go to town
if (nthreads == 1)
return write_junk(fname, flags, sync_options, file_size);
for (i = 0; i < nthreads; i++) {
pid = fork();
if (!pid)
return write_junk(fname, flags, sync_options, file_size);
}
for (i = 0; i < mthreads; i++) {
pid = fork();
if (!pid)
return mmap_junk(fname, flags, sync_options, file_size);
}
for (i = 0; i < mthreads + nthreads; i++) {
pid = wait(&status);
if (WIFEXITED(status))
printf("Child %d exited with status %d\n", pid, WEXITSTATUS(status));
}
return 0;
}
On Tue, Feb 22, 2011 at 11:45:38AM -0800, Darrick J. Wong wrote:
> On Tue, Feb 22, 2011 at 09:13:49AM -0700, Andreas Dilger wrote:
> > On 2011-02-21, at 19:00, "Darrick J. Wong" <[email protected]> wrote:
> > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> > > discussion about how to deal with the situation where one program tells the
> > > kernel to write a block to disk, the kernel computes the checksum of that data,
> > > and then a second program begins writing to that same block before the disk HBA
> > > can DMA the memory block, thereby causing the disk to complain about being sent
> > > invalid checksums.
> > >
> > > I was able to write a
> > > trivial program to trigger the write problem, I'm pretty sure that this has not
> > > been fixed upstream. (FYI, using O_DIRECT still seems fine.)
> >
> > Can you please attach your reproducer? IIRC it needed mmap() to hit this
> > problem? Did you measure CPU usage during your testing?
>
> I didn't need mmap; a lot of threads using write() was enough. (The reproducer
> program does have a mmap mode though). Basically it creates a lot of threads
> to write small blobs to random offsets in a file, with optional mmap, dio, and
> sync options.
*nod*
Both mmap and write paths need to be block on
wait_for_page_writeback(page) once they have a locked page ready for
modification. btrfs does this in btrfs_page_mkwrite() and
prepare_pages(), so adding similar calls into block_page_mkwrite()
and grab_cache_page_write_begin() would probably fix the problem for
the other major filesystems....
> Agreed. I too am curious to study which circumstances favor copying vs
> blocking.
IMO blocking is generally preferable in high throughput threaded
workloads as there is always another thread that can do useful work
while we wait for IO to complete. Most use cases for DIF center
around high throughput environments....
Cheers,
Dave.
--
Dave Chinner
[email protected]
>>>>> "Dave" == Dave Chinner <[email protected]> writes:
>> Agreed. I too am curious to study which circumstances favor copying
>> vs blocking.
Dave> IMO blocking is generally preferable in high throughput threaded
Dave> workloads as there is always another thread that can do useful
Dave> work while we wait for IO to complete. Most use cases for DIF
Dave> center around high throughput environments....
Yeah.
A while back I did a bunch of tests with a liberal amount of
wait_on_page_writeback() calls added to (I think) ext2, ext3, and
XFS. For my regular workloads there was no measurable change (kernel
builds, random database and I/O tests). I'm sure we'll unearth some apps
that will suffer when DI is on but so far I'm not too worried about
blocking in the data path.
My main concern is wrt. metadata because that's where extN really
hurts. Simple test: Unpack a kernel tarball and watch the directory
block fireworks. Given how frequently those buffers get hit I'm sure
blocking would cause performance to tank completely. I looked into
fixing this in ext2 but I had to stop because my eyes were bleeding.
--
Martin K. Petersen Oracle Linux Engineering
On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> Also, DIX is only the tip of the iceberg. Many other impending
> technologies feature checksums and require pages to be stable during I/O
> due to checksumming, encryption and so on.
>
> The VM is already trying to do the right thing. We just need the
> relevant filesystems to catch up.
ocfs2 handles stable metadata for its checksums when feeding
things to the journal. If we're doing pagecache-based I/O, is the
pagecache going to help here for data?
Joel
--
"Gone to plant a weeping willow
On the bank's green edge it will roll, roll, roll.
Sing a lulaby beside the waters.
Lovers come and go, the river roll, roll, rolls."
http://www.jlbec.org/
[email protected]
Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > Also, DIX is only the tip of the iceberg. Many other impending
> > technologies feature checksums and require pages to be stable during I/O
> > due to checksumming, encryption and so on.
> >
> > The VM is already trying to do the right thing. We just need the
> > relevant filesystems to catch up.
>
> ocfs2 handles stable metadata for its checksums when feeding
> things to the journal. If we're doing pagecache-based I/O, is the
> pagecache going to help here for data?
Data is much easier than metadata. All you really need is to wait on
writeback in file_write, wait on writeback in page_mkwrite, and make
sure you don't free blocks back to the allocator that are actively under
IO.
I expect the hard part to be jbd and metadata in ext34.
-chris
On Wed, Feb 23, 2011 at 03:35:11PM -0500, Chris Mason wrote:
> > ocfs2 handles stable metadata for its checksums when feeding
> > things to the journal. If we're doing pagecache-based I/O, is the
> > pagecache going to help here for data?
>
> Data is much easier than metadata. All you really need is to wait on
> writeback in file_write, wait on writeback in page_mkwrite, and make
> sure you don't free blocks back to the allocator that are actively under
> IO.
>
> I expect the hard part to be jbd and metadata in ext34.
Yeah, catching use-without-access is not trivial. I keep
thinking we've found them all, and then another bug crops up ;-) At
least our checksums catch it.
Joel
--
"The doctrine of human equality reposes on this: that there is no
man really clever who has not found that he is stupid."
- Gilbert K. Chesterson
http://www.jlbec.org/
[email protected]
On Wed, Feb 23, 2011 at 11:24:50AM -0500, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <[email protected]> writes:
>
> >> Agreed. I too am curious to study which circumstances favor copying
> >> vs blocking.
>
> Dave> IMO blocking is generally preferable in high throughput threaded
> Dave> workloads as there is always another thread that can do useful
> Dave> work while we wait for IO to complete. Most use cases for DIF
> Dave> center around high throughput environments....
>
> Yeah.
>
> A while back I did a bunch of tests with a liberal amount of
> wait_on_page_writeback() calls added to (I think) ext2, ext3, and
> XFS. For my regular workloads there was no measurable change (kernel
> builds, random database and I/O tests). I'm sure we'll unearth some apps
> that will suffer when DI is on but so far I'm not too worried about
> blocking in the data path.
>
> My main concern is wrt. metadata because that's where extN really
> hurts. Simple test: Unpack a kernel tarball and watch the directory
> block fireworks. Given how frequently those buffers get hit I'm sure
> blocking would cause performance to tank completely. I looked into
> fixing this in ext2 but I had to stop because my eyes were bleeding.
Not problems with metadata modifications for XFS - all metadata IO
is done with a lock held preventing modifications while IO is in
progress.
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Wed 23-02-11 11:24:50, Martin K. Petersen wrote:
> >>>>> "Dave" == Dave Chinner <[email protected]> writes:
>
> >> Agreed. I too am curious to study which circumstances favor copying
> >> vs blocking.
>
> Dave> IMO blocking is generally preferable in high throughput threaded
> Dave> workloads as there is always another thread that can do useful
> Dave> work while we wait for IO to complete. Most use cases for DIF
> Dave> center around high throughput environments....
>
> Yeah.
>
> A while back I did a bunch of tests with a liberal amount of
> wait_on_page_writeback() calls added to (I think) ext2, ext3, and
> XFS. For my regular workloads there was no measurable change (kernel
> builds, random database and I/O tests). I'm sure we'll unearth some apps
> that will suffer when DI is on but so far I'm not too worried about
> blocking in the data path.
>
> My main concern is wrt. metadata because that's where extN really
> hurts. Simple test: Unpack a kernel tarball and watch the directory
> block fireworks. Given how frequently those buffers get hit I'm sure
> blocking would cause performance to tank completely. I looked into
> fixing this in ext2 but I had to stop because my eyes were bleeding.
Ext2 is problematic yes, but ext[34] should be OK because we do
metadata copy anyway because of journalling. So for ext[34] you shouldn't
need any additional metadata protection since JBD does it for you (apart
from nojournal mode of ext4 of course).
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Wed 23-02-11 15:35:11, Chris Mason wrote:
> Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > Also, DIX is only the tip of the iceberg. Many other impending
> > > technologies feature checksums and require pages to be stable during I/O
> > > due to checksumming, encryption and so on.
> > >
> > > The VM is already trying to do the right thing. We just need the
> > > relevant filesystems to catch up.
> >
> > ocfs2 handles stable metadata for its checksums when feeding
> > things to the journal. If we're doing pagecache-based I/O, is the
> > pagecache going to help here for data?
>
> Data is much easier than metadata. All you really need is to wait on
> writeback in file_write, wait on writeback in page_mkwrite, and make
> sure you don't free blocks back to the allocator that are actively under
> IO.
>
> I expect the hard part to be jbd and metadata in ext34.
But JBD already has to do data copy if a buffer is going to be modified
before/while it is written to the journal. So we should alredy do all that
is needed for metadata. I don't say there aren't any bugs as they could be
triggered only by crashing at the wrong moment and observing fs corruption.
But most of the work should be there...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > technologies feature checksums and require pages to be stable during I/O
> > > > due to checksumming, encryption and so on.
> > > >
> > > > The VM is already trying to do the right thing. We just need the
> > > > relevant filesystems to catch up.
> > >
> > > ocfs2 handles stable metadata for its checksums when feeding
> > > things to the journal. If we're doing pagecache-based I/O, is the
> > > pagecache going to help here for data?
> >
> > Data is much easier than metadata. All you really need is to wait on
> > writeback in file_write, wait on writeback in page_mkwrite, and make
> > sure you don't free blocks back to the allocator that are actively under
> > IO.
> >
> > I expect the hard part to be jbd and metadata in ext34.
> But JBD already has to do data copy if a buffer is going to be modified
> before/while it is written to the journal. So we should alredy do all that
> is needed for metadata. I don't say there aren't any bugs as they could be
> triggered only by crashing at the wrong moment and observing fs corruption.
> But most of the work should be there...
Most of it is there, but there are always little bits and pieces. The
ext4 journal csumming code was one semi-recent example where we found
metadata changing in flight.
A big part of testing this is getting some way to detect the bugs
without dif/dix. With btrfs I have patches to do set_memory_ro on
pages once I've don the crc, hopefully we can generalize that idea or
some up with something smarter.
-chris
On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > due to checksumming, encryption and so on.
> > > > >
> > > > > The VM is already trying to do the right thing. We just need the
> > > > > relevant filesystems to catch up.
> > > >
> > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > pagecache going to help here for data?
> > >
> > > Data is much easier than metadata. All you really need is to wait on
> > > writeback in file_write, wait on writeback in page_mkwrite, and make
> > > sure you don't free blocks back to the allocator that are actively under
> > > IO.
> > >
> > > I expect the hard part to be jbd and metadata in ext34.
> > But JBD already has to do data copy if a buffer is going to be modified
> > before/while it is written to the journal. So we should alredy do all that
> > is needed for metadata. I don't say there aren't any bugs as they could be
> > triggered only by crashing at the wrong moment and observing fs corruption.
> > But most of the work should be there...
>
> Most of it is there, but there are always little bits and pieces. The
> ext4 journal csumming code was one semi-recent example where we found
> metadata changing in flight.
>
> A big part of testing this is getting some way to detect the bugs
> without dif/dix. With btrfs I have patches to do set_memory_ro on
> pages once I've don the crc, hopefully we can generalize that idea or
> some up with something smarter.
Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
Hm, would you mind sharing those patches? I've been working on a second patch
to do the wait-on-writeback per everyone's suggestions, but I still see the
occasional corruption error as soon as I enable the mmap write case and covet
some more debugging tools. It does seem to be working for the pure pwrite()
case. :)
--D
On Tue, Feb 22, 2011 at 09:13:49AM -0700, Andreas Dilger wrote:
> Can you please attach your reproducer? IIRC it needed mmap() to hit this problem? Did you measure CPU usage during your testing?
Running XFSQA on a DIF target reproduces it. Even scsi_debug in the
right mode is enough.
Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > due to checksumming, encryption and so on.
> > > > > >
> > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > relevant filesystems to catch up.
> > > > >
> > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > pagecache going to help here for data?
> > > >
> > > > Data is much easier than metadata. All you really need is to wait on
> > > > writeback in file_write, wait on writeback in page_mkwrite, and make
> > > > sure you don't free blocks back to the allocator that are actively under
> > > > IO.
> > > >
> > > > I expect the hard part to be jbd and metadata in ext34.
> > > But JBD already has to do data copy if a buffer is going to be modified
> > > before/while it is written to the journal. So we should alredy do all that
> > > is needed for metadata. I don't say there aren't any bugs as they could be
> > > triggered only by crashing at the wrong moment and observing fs corruption.
> > > But most of the work should be there...
> >
> > Most of it is there, but there are always little bits and pieces. The
> > ext4 journal csumming code was one semi-recent example where we found
> > metadata changing in flight.
> >
> > A big part of testing this is getting some way to detect the bugs
> > without dif/dix. With btrfs I have patches to do set_memory_ro on
> > pages once I've don the crc, hopefully we can generalize that idea or
> > some up with something smarter.
>
> Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
>
> Hm, would you mind sharing those patches? I've been working on a second patch
> to do the wait-on-writeback per everyone's suggestions, but I still see the
> occasional corruption error as soon as I enable the mmap write case and covet
> some more debugging tools. It does seem to be working for the pure pwrite()
> case. :)
Here's an ext4 version of the debugging patch. It's a few years old but
it'll give you the idea. This only covers metadata pages.
Looks like I hacked the btrfs version up and didn't keep the original,
I'll have to rework it, I was trying to use it for the big corruption I
fixed recently and made a bunch of changes.
For data if mmap is giving you trouble you need to wait on writeback in
page_mkwrite, with the page locked. fs/btrfs/inode.c has our
page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
ordered write code. But for the other filesystems, waiting on writeback
should be enough.
-chris
diff --git a/fs/jbd2/commit.c b/fs/jbd2/commit.c
index d4cfd6d..9f278db 100644
--- a/fs/jbd2/commit.c
+++ b/fs/jbd2/commit.c
@@ -28,6 +28,16 @@
#include <linux/blkdev.h>
#include <trace/events/jbd2.h>
+int set_memory_ro(unsigned long addr, int numpages);
+
+static int set_page_ro(struct page *page)
+{
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_ro(addr, 1);
+}
+
/*
* Default IO end handler for temporary BJ_IO buffer_heads.
*/
@@ -639,6 +654,8 @@ void jbd2_journal_commit_transaction(journal_t *journal)
set_bit(BH_JWrite, &jh2bh(new_jh)->b_state);
wbuf[bufs++] = jh2bh(new_jh);
+ set_page_ro(jh2bh(new_jh)->b_page);
+
/* Record the new block's tag in the current descriptor
buffer */
diff --git a/fs/jbd2/transaction.c b/fs/jbd2/transaction.c
index a051270..153888e 100644
--- a/fs/jbd2/transaction.c
+++ b/fs/jbd2/transaction.c
@@ -28,6 +28,15 @@
#include <linux/hrtimer.h>
static void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh);
+int set_memory_rw(unsigned long addr, int numpages);
+static int set_page_rw(struct page *page)
+{
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_rw(addr, 1);
+}
+
/*
* jbd2_get_transaction: obtain a new transaction_t object.
@@ -1474,9 +1487,11 @@ void __jbd2_journal_temp_unlink_buffer(struct journal_head *jh)
break;
case BJ_IO:
list = &transaction->t_iobuf_list;
+ set_page_rw(bh->b_page);
break;
case BJ_Shadow:
list = &transaction->t_shadow_list;
+ set_page_rw(bh->b_page);
break;
case BJ_LogCtl:
list = &transaction->t_log_list;
On Tue, Feb 22, 2011 at 12:42:22PM +0100, Jan Kara wrote:
> Hi Boaz,
>
> On Mon 21-02-11 21:45:51, Boaz Harrosh wrote:
> > On 02/21/2011 06:00 PM, Darrick J. Wong wrote:
> > > Last summer there was a long thread entitled "Wrong DIF guard tag on ext2
> > > write" (http://marc.info/?l=linux-scsi&m=127530531808556&w=2) that started a
> > > discussion about how to deal with the situation where one program tells the
> > > kernel to write a block to disk, the kernel computes the checksum of that data,
> > > and then a second program begins writing to that same block before the disk HBA
> > > can DMA the memory block, thereby causing the disk to complain about being sent
> > > invalid checksums.
> >
> > The brokenness is in ext2/3 if you'll use btrfs, xfs and I think late versions
> > of ext4 it should work much better. (If you still have problems please report
> > them, those FSs advertise stable pages write-out)
> Do they? I've just checked ext4 and xfs and they don't seem to enforce
> stable pages. They do lock the page (which implicitely happens in mm code
> for any filesystem BTW) but this is not enough. You have to wait for
> PageWriteback to get cleared and only btrfs does that.
>
> > This problem is easily fixed at the FS layer or even at VFS, by overriding mk_write
> > and syncing with write-out for example by taking the page-lock. Currently each
> > FS is to itself because in VFS it would force the behaviour on FSs that it does
> > not make sense to.
> Yes, it's easy to fix but at a performance cost for any application doing
> frequent rewrites regardless whether integrity features are used or not.
> And I don't think that's a good thing. I even remember someone measured the
> hit last time this came up and it was rather noticeable.
>
> > Note that the proper solution does not copy any data, just forces the app to
> > wait before changing write-out pages.
> I think that's up for discussion. In fact what is going to be faster
> depends pretty much on your system config. If you have enough CPU/RAM
> bandwidth compared to storage speed, you're better of doing copying. If
> you can barely saturate storage with your CPU/RAM, waiting is probably
> better for you.
>
> Moreover if you do data copyout, you push the performance cost only on
> users of the integrity feature which is nice. But on the other hand users
> of integrity take the cost even if they are not doing rewrites.
>
> A solution which is technically plausible and penalizing only rewrites
> of data-integrity protected pages would be a use of shadow pages as Darrick
> describes below. So I'd lean towards that long term. But for now I think
> Darrick's solution is OK to make the integrity feature actually useful and
> later someone can try something more clever.
Hmm. Any interest in pushing the page copy patch as an interim solution while
I work on getting the wait-on-writeback strategy to function? I agree it's not
the fastest solution, but at least it won't be running broken while I find the
faster solution(s).
(More on that writeback patch in a short while.)
--D
On Fri, Mar 04, 2011 at 12:51:43PM -0800, Darrick J. Wong wrote:
> Hmm. Any interest in pushing the page copy patch as an interim solution while
> I work on getting the wait-on-writeback strategy to function? I agree it's not
> the fastest solution, but at least it won't be running broken while I find the
> faster solution(s).
It's not only slow but also butt ugly. NAK from me - we need to get
this completely right, not hack around it.
On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote:
> Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > > due to checksumming, encryption and so on.
> > > > > > >
> > > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > > relevant filesystems to catch up.
> > > > > >
> > > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > > pagecache going to help here for data?
> > > > >
> > > > > Data is much easier than metadata. All you really need is to wait on
> > > > > writeback in file_write, wait on writeback in page_mkwrite, and make
> > > > > sure you don't free blocks back to the allocator that are actively under
> > > > > IO.
> > > > >
> > > > > I expect the hard part to be jbd and metadata in ext34.
> > > > But JBD already has to do data copy if a buffer is going to be modified
> > > > before/while it is written to the journal. So we should alredy do all that
> > > > is needed for metadata. I don't say there aren't any bugs as they could be
> > > > triggered only by crashing at the wrong moment and observing fs corruption.
> > > > But most of the work should be there...
> > >
> > > Most of it is there, but there are always little bits and pieces. The
> > > ext4 journal csumming code was one semi-recent example where we found
> > > metadata changing in flight.
> > >
> > > A big part of testing this is getting some way to detect the bugs
> > > without dif/dix. With btrfs I have patches to do set_memory_ro on
> > > pages once I've don the crc, hopefully we can generalize that idea or
> > > some up with something smarter.
> >
> > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
> >
> > Hm, would you mind sharing those patches? I've been working on a second patch
> > to do the wait-on-writeback per everyone's suggestions, but I still see the
> > occasional corruption error as soon as I enable the mmap write case and covet
> > some more debugging tools. It does seem to be working for the pure pwrite()
> > case. :)
>
> Here's an ext4 version of the debugging patch. It's a few years old but
> it'll give you the idea. This only covers metadata pages.
>
> Looks like I hacked the btrfs version up and didn't keep the original,
> I'll have to rework it, I was trying to use it for the big corruption I
> fixed recently and made a bunch of changes.
>
> For data if mmap is giving you trouble you need to wait on writeback in
> page_mkwrite, with the page locked. fs/btrfs/inode.c has our
> page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
> ordered write code. But for the other filesystems, waiting on writeback
> should be enough.
Ok, here's what I have so far. I took everyone's suggestions of where to add
calls to wait_on_page_writeback, which seems to handle the multiple-write case
adequately. Unfortunately, it is still possible to generate checksum errors by
scribbling furiously on a mmap'd region, even after adding the writeback wait
in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
removing its wait_for_page_writeback call, so I suspect there's a bit more
going on in btrfs than I've been able to figure out.
The set_memory_ro debugging trick didn't ferret out any write paths that I
didn't catch... though it did have the effect of causing occasional fsync()
deadlocks. I suppose I could sprinkle in a few more of those write calls to
see what happens.
Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
Or: Did I miss something?
Thanks all for the feedback so far!
--
fs: Wait for page writeback when rewrite detected
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/buffer.c | 4 +++-
fs/ext4/inode.c | 3 +++
mm/filemap.c | 15 +++++++++++++--
3 files changed, 19 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..39e934c 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }
out:
return ret;
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..2364704 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page,
struct inode *inode = page->mapping->host;
trace_ext4_writepage(inode, page);
+lock_page(page);
size = i_size_read(inode);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
+wait_on_page_writeback(page);
+
/*
* If the page does not have buffers (for whatever reason),
* try to create them using __block_write_begin. If this
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..f201d80 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+struct page *__grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
{
int status;
struct page *page;
@@ -2243,6 +2243,17 @@ repeat:
}
return page;
}
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p)
+ wait_on_page_writeback(p);
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
static ssize_t generic_perform_write(struct file *file,
On 2011-03-04, at 2:07 PM, Darrick J. Wong wrote:
> Ok, here's what I have so far. I took everyone's suggestions of where to add
> calls to wait_on_page_writeback, which seems to handle the multiple-write case
> adequately. Unfortunately, it is still possible to generate checksum errors by
> scribbling furiously on a mmap'd region, even after adding the writeback wait
> in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> removing its wait_for_page_writeback call, so I suspect there's a bit more
> going on in btrfs than I've been able to figure out.
Did you add this to ext4_page_mkwrite() or only ext4_writepage()? It wasn't clear from your description above.
> The set_memory_ro debugging trick didn't ferret out any write paths that I
> didn't catch... though it did have the effect of causing occasional fsync()
> deadlocks. I suppose I could sprinkle in a few more of those write calls to
> see what happens.
>
> Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
> Or: Did I miss something?
>
> Thanks all for the feedback so far!
>
> --
> fs: Wait for page writeback when rewrite detected
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> fs/buffer.c | 4 +++-
> fs/ext4/inode.c | 3 +++
> mm/filemap.c | 15 +++++++++++++--
> 3 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 2219a76..39e934c 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> ret = VM_FAULT_OOM;
> else /* -ENOSPC, -EIO, etc */
> ret = VM_FAULT_SIGBUS;
> - } else
> + } else {
> + wait_on_page_writeback(page);
> ret = VM_FAULT_LOCKED;
> + }
>
> out:
> return ret;
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9f7f9e4..2364704 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page,
> struct inode *inode = page->mapping->host;
>
> trace_ext4_writepage(inode, page);
> +lock_page(page);
> size = i_size_read(inode);
> if (page->index == size >> PAGE_CACHE_SHIFT)
> len = size & ~PAGE_CACHE_MASK;
> else
> len = PAGE_CACHE_SIZE;
>
> +wait_on_page_writeback(page);
> +
> /*
> * If the page does not have buffers (for whatever reason),
> * try to create them using __block_write_begin. If this
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83a45d3..f201d80 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
> * Find or create a page at the given pagecache position. Return the locked
> * page. This function is specifically for buffered writes.
> */
> -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> - pgoff_t index, unsigned flags)
> +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> + pgoff_t index, unsigned flags)
> {
> int status;
> struct page *page;
> @@ -2243,6 +2243,17 @@ repeat:
> }
> return page;
> }
> +struct page *grab_cache_page_write_begin(struct address_space *mapping,
> + pgoff_t index, unsigned flags)
> +{
> + struct page *p;
> +
> + p = __grab_cache_page_write_begin(mapping, index, flags);
> + if (p)
> + wait_on_page_writeback(p);
> +
> + return p;
> +}
> EXPORT_SYMBOL(grab_cache_page_write_begin);
>
> static ssize_t generic_perform_write(struct file *file,
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas
On Fri, Mar 04, 2011 at 03:22:30PM -0700, Andreas Dilger wrote:
> On 2011-03-04, at 2:07 PM, Darrick J. Wong wrote:
> > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > adequately. Unfortunately, it is still possible to generate checksum errors by
> > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > going on in btrfs than I've been able to figure out.
>
> Did you add this to ext4_page_mkwrite() or only ext4_writepage()? It wasn't
> clear from your description above.
I added a wait_on_page_writeback to ext4_page_mkwrite which seems to have cut
down on the error message frequency, but it hasn't gone away entirely.
--D
>
> > The set_memory_ro debugging trick didn't ferret out any write paths that I
> > didn't catch... though it did have the effect of causing occasional fsync()
> > deadlocks. I suppose I could sprinkle in a few more of those write calls to
> > see what happens.
> >
> > Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
> > Or: Did I miss something?
> >
> > Thanks all for the feedback so far!
> >
> > --
> > fs: Wait for page writeback when rewrite detected
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> >
> > fs/buffer.c | 4 +++-
> > fs/ext4/inode.c | 3 +++
> > mm/filemap.c | 15 +++++++++++++--
> > 3 files changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 2219a76..39e934c 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> > ret = VM_FAULT_OOM;
> > else /* -ENOSPC, -EIO, etc */
> > ret = VM_FAULT_SIGBUS;
> > - } else
> > + } else {
> > + wait_on_page_writeback(page);
> > ret = VM_FAULT_LOCKED;
> > + }
> >
> > out:
> > return ret;
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 9f7f9e4..2364704 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -2730,12 +2730,15 @@ static int ext4_writepage(struct page *page,
> > struct inode *inode = page->mapping->host;
> >
> > trace_ext4_writepage(inode, page);
> > +lock_page(page);
> > size = i_size_read(inode);
> > if (page->index == size >> PAGE_CACHE_SHIFT)
> > len = size & ~PAGE_CACHE_MASK;
> > else
> > len = PAGE_CACHE_SIZE;
> >
> > +wait_on_page_writeback(page);
> > +
> > /*
> > * If the page does not have buffers (for whatever reason),
> > * try to create them using __block_write_begin. If this
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 83a45d3..f201d80 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
> > * Find or create a page at the given pagecache position. Return the locked
> > * page. This function is specifically for buffered writes.
> > */
> > -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > - pgoff_t index, unsigned flags)
> > +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> > + pgoff_t index, unsigned flags)
> > {
> > int status;
> > struct page *page;
> > @@ -2243,6 +2243,17 @@ repeat:
> > }
> > return page;
> > }
> > +struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > + pgoff_t index, unsigned flags)
> > +{
> > + struct page *p;
> > +
> > + p = __grab_cache_page_write_begin(mapping, index, flags);
> > + if (p)
> > + wait_on_page_writeback(p);
> > +
> > + return p;
> > +}
> > EXPORT_SYMBOL(grab_cache_page_write_begin);
> >
> > static ssize_t generic_perform_write(struct file *file,
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
>
>
> Cheers, Andreas
>
>
>
>
>
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Excerpts from Darrick J. Wong's message of 2011-03-04 16:07:24 -0500:
> On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote:
> > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > > > due to checksumming, encryption and so on.
> > > > > > > >
> > > > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > > > relevant filesystems to catch up.
> > > > > > >
> > > > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > > > pagecache going to help here for data?
> > > > > >
> > > > > > Data is much easier than metadata. All you really need is to wait on
> > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make
> > > > > > sure you don't free blocks back to the allocator that are actively under
> > > > > > IO.
> > > > > >
> > > > > > I expect the hard part to be jbd and metadata in ext34.
> > > > > But JBD already has to do data copy if a buffer is going to be modified
> > > > > before/while it is written to the journal. So we should alredy do all that
> > > > > is needed for metadata. I don't say there aren't any bugs as they could be
> > > > > triggered only by crashing at the wrong moment and observing fs corruption.
> > > > > But most of the work should be there...
> > > >
> > > > Most of it is there, but there are always little bits and pieces. The
> > > > ext4 journal csumming code was one semi-recent example where we found
> > > > metadata changing in flight.
> > > >
> > > > A big part of testing this is getting some way to detect the bugs
> > > > without dif/dix. With btrfs I have patches to do set_memory_ro on
> > > > pages once I've don the crc, hopefully we can generalize that idea or
> > > > some up with something smarter.
> > >
> > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
> > >
> > > Hm, would you mind sharing those patches? I've been working on a second patch
> > > to do the wait-on-writeback per everyone's suggestions, but I still see the
> > > occasional corruption error as soon as I enable the mmap write case and covet
> > > some more debugging tools. It does seem to be working for the pure pwrite()
> > > case. :)
> >
> > Here's an ext4 version of the debugging patch. It's a few years old but
> > it'll give you the idea. This only covers metadata pages.
> >
> > Looks like I hacked the btrfs version up and didn't keep the original,
> > I'll have to rework it, I was trying to use it for the big corruption I
> > fixed recently and made a bunch of changes.
> >
> > For data if mmap is giving you trouble you need to wait on writeback in
> > page_mkwrite, with the page locked. fs/btrfs/inode.c has our
> > page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
> > ordered write code. But for the other filesystems, waiting on writeback
> > should be enough.
>
> Ok, here's what I have so far. I took everyone's suggestions of where to add
> calls to wait_on_page_writeback, which seems to handle the multiple-write case
> adequately. Unfortunately, it is still possible to generate checksum errors by
> scribbling furiously on a mmap'd region, even after adding the writeback wait
> in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> removing its wait_for_page_writeback call, so I suspect there's a bit more
> going on in btrfs than I've been able to figure out.
Hi, thanks for working on this.
Btrfs waits for page writeback but then it also waits for its own
ordered extents, which is basically tracks writeback and some extra
btrfs accounting. If you take out the btrfs_lookup_ordered_extent bits,
and the waiting on page writeback, you'll see fireworks.
Your patch changes ext4_writepage, but by the time writepage is called
we should already have waited for PageWriteback. So that should be
BUG_ON(PageWriteback(page)). And ext4_writepage should be called with
the page locked...so your extra lock_page call should be deadlocking.
Maybe all the writes are going through writepages instead?
You really want to be changing ext4_page_mkwrite, that's where all the
magic for mmap really happens. The block_page_mkwrite call is a good
start, but ext4 doesn't use it.
-chris
On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote:
> On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote:
> > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > > > due to checksumming, encryption and so on.
> > > > > > > >
> > > > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > > > relevant filesystems to catch up.
> > > > > > >
> > > > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > > > pagecache going to help here for data?
> > > > > >
> > > > > > Data is much easier than metadata. All you really need is to wait on
> > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make
> > > > > > sure you don't free blocks back to the allocator that are actively under
> > > > > > IO.
> > > > > >
> > > > > > I expect the hard part to be jbd and metadata in ext34.
> > > > > But JBD already has to do data copy if a buffer is going to be modified
> > > > > before/while it is written to the journal. So we should alredy do all that
> > > > > is needed for metadata. I don't say there aren't any bugs as they could be
> > > > > triggered only by crashing at the wrong moment and observing fs corruption.
> > > > > But most of the work should be there...
> > > >
> > > > Most of it is there, but there are always little bits and pieces. The
> > > > ext4 journal csumming code was one semi-recent example where we found
> > > > metadata changing in flight.
> > > >
> > > > A big part of testing this is getting some way to detect the bugs
> > > > without dif/dix. With btrfs I have patches to do set_memory_ro on
> > > > pages once I've don the crc, hopefully we can generalize that idea or
> > > > some up with something smarter.
> > >
> > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
> > >
> > > Hm, would you mind sharing those patches? I've been working on a second patch
> > > to do the wait-on-writeback per everyone's suggestions, but I still see the
> > > occasional corruption error as soon as I enable the mmap write case and covet
> > > some more debugging tools. It does seem to be working for the pure pwrite()
> > > case. :)
> >
> > Here's an ext4 version of the debugging patch. It's a few years old but
> > it'll give you the idea. This only covers metadata pages.
> >
> > Looks like I hacked the btrfs version up and didn't keep the original,
> > I'll have to rework it, I was trying to use it for the big corruption I
> > fixed recently and made a bunch of changes.
> >
> > For data if mmap is giving you trouble you need to wait on writeback in
> > page_mkwrite, with the page locked. fs/btrfs/inode.c has our
> > page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
> > ordered write code. But for the other filesystems, waiting on writeback
> > should be enough.
>
> Ok, here's what I have so far. I took everyone's suggestions of where to add
> calls to wait_on_page_writeback, which seems to handle the multiple-write case
> adequately. Unfortunately, it is still possible to generate checksum errors by
> scribbling furiously on a mmap'd region, even after adding the writeback wait
> in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> removing its wait_for_page_writeback call, so I suspect there's a bit more
> going on in btrfs than I've been able to figure out.
>
> The set_memory_ro debugging trick didn't ferret out any write paths that I
> didn't catch... though it did have the effect of causing occasional fsync()
> deadlocks. I suppose I could sprinkle in a few more of those write calls to
> see what happens.
>
> Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
> Or: Did I miss something?
>
> Thanks all for the feedback so far!
>
> --
> fs: Wait for page writeback when rewrite detected
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> fs/buffer.c | 4 +++-
> fs/ext4/inode.c | 3 +++
> mm/filemap.c | 15 +++++++++++++--
> 3 files changed, 19 insertions(+), 3 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 2219a76..39e934c 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> ret = VM_FAULT_OOM;
> else /* -ENOSPC, -EIO, etc */
> ret = VM_FAULT_SIGBUS;
> - } else
> + } else {
> + wait_on_page_writeback(page);
> ret = VM_FAULT_LOCKED;
> + }
I think this needs to wait before the __block_write_begin() call,
not after it. i.e. wait before the page is mapped, not afterwards.
....
> diff --git a/mm/filemap.c b/mm/filemap.c
> index 83a45d3..f201d80 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
> * Find or create a page at the given pagecache position. Return the locked
> * page. This function is specifically for buffered writes.
> */
> -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> - pgoff_t index, unsigned flags)
> +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> + pgoff_t index, unsigned flags)
> {
> int status;
> struct page *page;
> @@ -2243,6 +2243,17 @@ repeat:
> }
> return page;
> }
> +struct page *grab_cache_page_write_begin(struct address_space *mapping,
> + pgoff_t index, unsigned flags)
> +{
> + struct page *p;
> +
> + p = __grab_cache_page_write_begin(mapping, index, flags);
> + if (p)
> + wait_on_page_writeback(p);
> +
> + return p;
> +}
> EXPORT_SYMBOL(grab_cache_page_write_begin);
Not much point in add in a wrapper when nothing else calls
__grab_cache_page_write_begin(), which should also be static....
Cheers,
Dave.
--
Dave Chinner
[email protected]
On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote:
> On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote:
> > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > > > > due to checksumming, encryption and so on.
> > > > > > > > >
> > > > > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > > > > relevant filesystems to catch up.
> > > > > > > >
> > > > > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > > > > pagecache going to help here for data?
> > > > > > >
> > > > > > > Data is much easier than metadata. All you really need is to wait on
> > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make
Hrm... I've been looking for a file_write in ext4; was the aio_write function
pointer what you had in mind here?
Adding a wait_on_page_writeback to ext4_page_mkwrite eliminated most of the DIF
checksum errors in the mmap case. Unfortunately, I still see them, usually
within the first few seconds of kicking off the not-first run. I suspect that
may be related to the test program truncating the file after each run causing
blocks to come and go, so I'll investigate that part of ext4 next.
I also noticed that testing the directio+mmap case exhibits the same symptoms.
Anyhow, just attaching the latest hack of mine in case anyone wants to have a
look.
--
fs: Wait for page writeback when rewrite detected
---
fs/buffer.c | 2 ++
fs/ext4/inode.c | 6 +++++-
mm/filemap.c | 19 +++++++++++++++++--
mm/memory.c | 3 +++
4 files changed, 27 insertions(+), 3 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index 2219a76..f3639f2 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2369,6 +2369,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
else
end = PAGE_CACHE_SIZE;
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
ret = block_commit_write(page, 0, end);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..ba45b31 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2762,6 +2762,8 @@ static int ext4_writepage(struct page *page,
*/
goto redirty_page;
}
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
if (commit_write)
/* now mark the buffer_heads as dirty and uptodate */
block_commit_write(page, 0, len);
@@ -5865,12 +5867,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (PageMappedToDisk(page))
goto out_unlock;
+ lock_page(page);
+ /* this one seems to handle mmap */
+ wait_on_page_writeback(page);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
diff --git a/mm/filemap.c b/mm/filemap.c
index 83a45d3..61af700 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2217,8 +2217,9 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+static struct page *
+__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index,
+ unsigned flags)
{
int status;
struct page *page;
@@ -2243,6 +2244,20 @@ repeat:
}
return page;
}
+
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p) {
+ WARN_ON(!PageLocked(p));
+ wait_on_page_writeback(p);
+ }
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
static ssize_t generic_perform_write(struct file *file,
Excerpts from Darrick J. Wong's message of 2011-03-10 18:57:22 -0500:
> On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote:
>
> > On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote:
> > > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote:
> > > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> > > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > > > > > due to checksumming, encryption and so on.
> > > > > > > > > >
> > > > > > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > > > > > relevant filesystems to catch up.
> > > > > > > > >
> > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > > > > > pagecache going to help here for data?
> > > > > > > >
> > > > > > > > Data is much easier than metadata. All you really need is to wait on
> > > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make
>
> Hrm... I've been looking for a file_write in ext4; was the aio_write function
> pointer what you had in mind here?
Your change to grab_cache_page_write_begin looks good to me, at least
for ext4. For ext3 you have to actually go in and wait for each of the
buffer heads in the page, since ext3 (and reiserfs) will write the buffer heads
directly without using writepage.
Have you confirmed by looking at the block mapping that your crc errors
are from data blocks?
-chris
On Fri, Mar 11, 2011 at 11:34:05AM -0500, Chris Mason wrote:
> Excerpts from Darrick J. Wong's message of 2011-03-10 18:57:22 -0500:
> > On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote:
> >
> > > On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote:
> > > > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote:
> > > > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> > > > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > > > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > > > > > > due to checksumming, encryption and so on.
> > > > > > > > > > >
> > > > > > > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > > > > > > relevant filesystems to catch up.
> > > > > > > > > >
> > > > > > > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > > > > > > pagecache going to help here for data?
> > > > > > > > >
> > > > > > > > > Data is much easier than metadata. All you really need is to wait on
> > > > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make
> >
> > Hrm... I've been looking for a file_write in ext4; was the aio_write function
> > pointer what you had in mind here?
>
> Your change to grab_cache_page_write_begin looks good to me, at least
> for ext4. For ext3 you have to actually go in and wait for each of the
> buffer heads in the page, since ext3 (and reiserfs) will write the buffer heads
> directly without using writepage.
Heh, I hadn't really given much thought to ext2/3 yet. I'm still trying to
figure out what causes the occasional ext4 failures. :)
> Have you confirmed by looking at the block mapping that your crc errors
> are from data blocks?
Yes, debugfs confirms that they are data blocks.
--D
On Tue, Mar 08, 2011 at 03:56:26PM +1100, Dave Chinner wrote:
> On Fri, Mar 04, 2011 at 01:07:24PM -0800, Darrick J. Wong wrote:
> > On Mon, Feb 28, 2011 at 07:54:05AM -0500, Chris Mason wrote:
> > > Excerpts from Darrick J. Wong's message of 2011-02-24 13:27:32 -0500:
> > > > On Thu, Feb 24, 2011 at 12:37:53PM -0500, Chris Mason wrote:
> > > > > Excerpts from Jan Kara's message of 2011-02-24 11:47:58 -0500:
> > > > > > On Wed 23-02-11 15:35:11, Chris Mason wrote:
> > > > > > > Excerpts from Joel Becker's message of 2011-02-23 15:24:47 -0500:
> > > > > > > > On Tue, Feb 22, 2011 at 11:45:44AM -0500, Martin K. Petersen wrote:
> > > > > > > > > Also, DIX is only the tip of the iceberg. Many other impending
> > > > > > > > > technologies feature checksums and require pages to be stable during I/O
> > > > > > > > > due to checksumming, encryption and so on.
> > > > > > > > >
> > > > > > > > > The VM is already trying to do the right thing. We just need the
> > > > > > > > > relevant filesystems to catch up.
> > > > > > > >
> > > > > > > > ocfs2 handles stable metadata for its checksums when feeding
> > > > > > > > things to the journal. If we're doing pagecache-based I/O, is the
> > > > > > > > pagecache going to help here for data?
> > > > > > >
> > > > > > > Data is much easier than metadata. All you really need is to wait on
> > > > > > > writeback in file_write, wait on writeback in page_mkwrite, and make
> > > > > > > sure you don't free blocks back to the allocator that are actively under
> > > > > > > IO.
> > > > > > >
> > > > > > > I expect the hard part to be jbd and metadata in ext34.
> > > > > > But JBD already has to do data copy if a buffer is going to be modified
> > > > > > before/while it is written to the journal. So we should alredy do all that
> > > > > > is needed for metadata. I don't say there aren't any bugs as they could be
> > > > > > triggered only by crashing at the wrong moment and observing fs corruption.
> > > > > > But most of the work should be there...
> > > > >
> > > > > Most of it is there, but there are always little bits and pieces. The
> > > > > ext4 journal csumming code was one semi-recent example where we found
> > > > > metadata changing in flight.
> > > > >
> > > > > A big part of testing this is getting some way to detect the bugs
> > > > > without dif/dix. With btrfs I have patches to do set_memory_ro on
> > > > > pages once I've don the crc, hopefully we can generalize that idea or
> > > > > some up with something smarter.
> > > >
> > > > Right now I'm faking it with modprobe scsi_debug ato=1 guard=1 dif=3 dix=199.
> > > >
> > > > Hm, would you mind sharing those patches? I've been working on a second patch
> > > > to do the wait-on-writeback per everyone's suggestions, but I still see the
> > > > occasional corruption error as soon as I enable the mmap write case and covet
> > > > some more debugging tools. It does seem to be working for the pure pwrite()
> > > > case. :)
> > >
> > > Here's an ext4 version of the debugging patch. It's a few years old but
> > > it'll give you the idea. This only covers metadata pages.
> > >
> > > Looks like I hacked the btrfs version up and didn't keep the original,
> > > I'll have to rework it, I was trying to use it for the big corruption I
> > > fixed recently and made a bunch of changes.
> > >
> > > For data if mmap is giving you trouble you need to wait on writeback in
> > > page_mkwrite, with the page locked. fs/btrfs/inode.c has our
> > > page_mkwrite, which uses wait_on_page_writeback() and also the btrfs
> > > ordered write code. But for the other filesystems, waiting on writeback
> > > should be enough.
> >
> > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > adequately. Unfortunately, it is still possible to generate checksum errors by
> > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > going on in btrfs than I've been able to figure out.
I wonder, is it possible for this to happen:
1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
but there's no ongoing writeback, so it returns without delay.
2. Thread A starts writing furiously to the page.
3. Thread B runs fsync() or something that results in the page being
checksummed and scheduled for writeout.
4. Thread A continues to write furiously(!) on that same page before the
controller finishes the DMA transfer.
5. Disk gets the page, which now doesn't match its checksum, and *boom*
After letting the stress tool run for a few days, I can say fairly confidently
that the write() case doesn't seem to fail regardless of the O_DIRECT setting.
However, with writes to mmap regions, failures happen about once every 20-40
minutes, even with O_DIRECT set. To me this suggests some sort of race
condition that we seem to win except once every 20 minutes.
I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps
there's something that I could do just prior to calculating the DIF checksum
that would cause any subsequent write attempts to be shuffled back into
page_mkwrite. I tried the set_memory_ro thing again, though that led to some
recursive lock errors and I noticed that those functions only seem to exist in
arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling
messy, only seemed to corrupt memory. :)
Is there a "correct" way to take a writeable page and make it so that any
process trying to write to it ends up hitting the page fault handler where we
can then wait for writeback? Or perhaps I am simply barking up the wrong tree?
(Just FYI I took the old copy-everything-to-bounce-buffers patch that few
people liked for a second spin, and the errors did not surface regardless of
what combination of write/mmap and directio/bufferedio I told it to use.)
--D
> >
> > The set_memory_ro debugging trick didn't ferret out any write paths that I
> > didn't catch... though it did have the effect of causing occasional fsync()
> > deadlocks. I suppose I could sprinkle in a few more of those write calls to
> > see what happens.
> >
> > Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
> > Or: Did I miss something?
> >
> > Thanks all for the feedback so far!
> >
> > --
> > fs: Wait for page writeback when rewrite detected
> >
> > Signed-off-by: Darrick J. Wong <[email protected]>
> > ---
> >
> > fs/buffer.c | 4 +++-
> > fs/ext4/inode.c | 3 +++
> > mm/filemap.c | 15 +++++++++++++--
> > 3 files changed, 19 insertions(+), 3 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 2219a76..39e934c 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> > ret = VM_FAULT_OOM;
> > else /* -ENOSPC, -EIO, etc */
> > ret = VM_FAULT_SIGBUS;
> > - } else
> > + } else {
> > + wait_on_page_writeback(page);
> > ret = VM_FAULT_LOCKED;
> > + }
>
> I think this needs to wait before the __block_write_begin() call,
> not after it. i.e. wait before the page is mapped, not afterwards.
>
> ....
> > diff --git a/mm/filemap.c b/mm/filemap.c
> > index 83a45d3..f201d80 100644
> > --- a/mm/filemap.c
> > +++ b/mm/filemap.c
> > @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
> > * Find or create a page at the given pagecache position. Return the locked
> > * page. This function is specifically for buffered writes.
> > */
> > -struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > - pgoff_t index, unsigned flags)
> > +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
> > + pgoff_t index, unsigned flags)
> > {
> > int status;
> > struct page *page;
> > @@ -2243,6 +2243,17 @@ repeat:
> > }
> > return page;
> > }
> > +struct page *grab_cache_page_write_begin(struct address_space *mapping,
> > + pgoff_t index, unsigned flags)
> > +{
> > + struct page *p;
> > +
> > + p = __grab_cache_page_write_begin(mapping, index, flags);
> > + if (p)
> > + wait_on_page_writeback(p);
> > +
> > + return p;
> > +}
> > EXPORT_SYMBOL(grab_cache_page_write_begin);
>
> Not much point in add in a wrapper when nothing else calls
> __grab_cache_page_write_begin(), which should also be static....
>
> Cheers,
>
> Dave.
> --
> Dave Chinner
> [email protected]
> --
> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On 2011-03-18, at 6:07 PM, Darrick J. Wong wrote:
>> Ok, here's what I have so far. I took everyone's suggestions of where to add
>> calls to wait_on_page_writeback, which seems to handle the multiple-write case
>> adequately. Unfortunately, it is still possible to generate checksum errors by
>> scribbling furiously on a mmap'd region, even after adding the writeback wait
>> in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
>> removing its wait_for_page_writeback call, so I suspect there's a bit more
>> going on in btrfs than I've been able to figure out.
> I wonder, is it possible for this to happen:
>
> 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> but there's no ongoing writeback, so it returns without delay.
> 2. Thread A starts writing furiously to the page.
> 3. Thread B runs fsync() or something that results in the page being
> checksummed and scheduled for writeout.
> 4. Thread A continues to write furiously(!) on that same page before the
> controller finishes the DMA transfer.
Right, page_mkwrite() is only called for the ro->rw transition.
> 5. Disk gets the page, which now doesn't match its checksum, and *boom*
>
> After letting the stress tool run for a few days, I can say fairly confidently
> that the write() case doesn't seem to fail regardless of the O_DIRECT setting.
> However, with writes to mmap regions, failures happen about once every 20-40
> minutes, even with O_DIRECT set. To me this suggests some sort of race
> condition that we seem to win except once every 20 minutes.
>
> I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps
> there's something that I could do just prior to calculating the DIF checksum
> that would cause any subsequent write attempts to be shuffled back into
> page_mkwrite. I tried the set_memory_ro thing again, though that led to some
> recursive lock errors and I noticed that those functions only seem to exist in
> arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling
> messy, only seemed to corrupt memory. :)
This seems like the best solution, IMHO, to ensure that mmap is blocked in page_mkwrite() before it has any chance to dirty the page undergoing checksum. The trick is that you need to set_page_writeback() before setting the page read-only, otherwise the race still exists.
> Is there a "correct" way to take a writeable page and make it so that any
> process trying to write to it ends up hitting the page fault handler where we
> can then wait for writeback? Or perhaps I am simply barking up the wrong tree?
>
> (Just FYI I took the old copy-everything-to-bounce-buffers patch that few
> people liked for a second spin, and the errors did not surface regardless of
> what combination of write/mmap and directio/bufferedio I told it to use.)
I wouldn't be so much against memcpy() for mmap pages, but it does seem kind of gross that mmap is forcing data copies when a major reason to use mmap is to AVOID data copies.
>>> The set_memory_ro debugging trick didn't ferret out any write paths that I
>>> didn't catch... though it did have the effect of causing occasional fsync()
>>> deadlocks. I suppose I could sprinkle in a few more of those write calls to
>>> see what happens.
>>>
>>> Either way, I'm emailing to ask everyone's advice since I've run out of ideas.
>>> Or: Did I miss something?
>>>
>>> Thanks all for the feedback so far!
>>>
>>> --
>>> fs: Wait for page writeback when rewrite detected
>>>
>>> Signed-off-by: Darrick J. Wong <[email protected]>
>>> ---
>>>
>>> fs/buffer.c | 4 +++-
>>> fs/ext4/inode.c | 3 +++
>>> mm/filemap.c | 15 +++++++++++++--
>>> 3 files changed, 19 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/fs/buffer.c b/fs/buffer.c
>>> index 2219a76..39e934c 100644
>>> --- a/fs/buffer.c
>>> +++ b/fs/buffer.c
>>> @@ -2379,8 +2379,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
>>> ret = VM_FAULT_OOM;
>>> else /* -ENOSPC, -EIO, etc */
>>> ret = VM_FAULT_SIGBUS;
>>> - } else
>>> + } else {
>>> + wait_on_page_writeback(page);
>>> ret = VM_FAULT_LOCKED;
>>> + }
>>
>> I think this needs to wait before the __block_write_begin() call,
>> not after it. i.e. wait before the page is mapped, not afterwards.
>>
>> ....
>>> diff --git a/mm/filemap.c b/mm/filemap.c
>>> index 83a45d3..f201d80 100644
>>> --- a/mm/filemap.c
>>> +++ b/mm/filemap.c
>>> @@ -2217,8 +2217,8 @@ EXPORT_SYMBOL(generic_file_direct_write);
>>> * Find or create a page at the given pagecache position. Return the locked
>>> * page. This function is specifically for buffered writes.
>>> */
>>> -struct page *grab_cache_page_write_begin(struct address_space *mapping,
>>> - pgoff_t index, unsigned flags)
>>> +struct page *__grab_cache_page_write_begin(struct address_space *mapping,
>>> + pgoff_t index, unsigned flags)
>>> {
>>> int status;
>>> struct page *page;
>>> @@ -2243,6 +2243,17 @@ repeat:
>>> }
>>> return page;
>>> }
>>> +struct page *grab_cache_page_write_begin(struct address_space *mapping,
>>> + pgoff_t index, unsigned flags)
>>> +{
>>> + struct page *p;
>>> +
>>> + p = __grab_cache_page_write_begin(mapping, index, flags);
>>> + if (p)
>>> + wait_on_page_writeback(p);
>>> +
>>> + return p;
>>> +}
>>> EXPORT_SYMBOL(grab_cache_page_write_begin);
>>
>> Not much point in add in a wrapper when nothing else calls
>> __grab_cache_page_write_begin(), which should also be static....
>>
>> Cheers,
>>
>> Dave.
>> --
>> Dave Chinner
>> [email protected]
>> --
>> To unsubscribe from this list: send the line "unsubscribe linux-scsi" in
>> the body of a message to [email protected]
>> More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
Cheers, Andreas
On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > going on in btrfs than I've been able to figure out.
>
> I wonder, is it possible for this to happen:
>
> 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> but there's no ongoing writeback, so it returns without delay.
> 2. Thread A starts writing furiously to the page.
> 3. Thread B runs fsync() or something that results in the page being
> checksummed and scheduled for writeout.
> 4. Thread A continues to write furiously(!) on that same page before the
> controller finishes the DMA transfer.
> 5. Disk gets the page, which now doesn't match its checksum, and *boom*
What happens on writepage (see mm/page-writeback.c:write_cache_pages())
is:
lock_page(page)
...
clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
PTE
...
set_page_writeback() (happens e.g. in __block_write_full_page() called
from filesystem's writepage implementation).
unlock_page(page)
So if you compute the checksum after set_page_writeback() is done in the
writepage() implementation (you cannot use __block_write_full_page() in
that case) and you call wait_on_page_writeback() in ext4_page_mkwrite()
under page lock, you should be safe. If you do all this and still see
errors, something is broken I'd say...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400:
> On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > > going on in btrfs than I've been able to figure out.
> >
> > I wonder, is it possible for this to happen:
> >
> > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > but there's no ongoing writeback, so it returns without delay.
> > 2. Thread A starts writing furiously to the page.
> > 3. Thread B runs fsync() or something that results in the page being
> > checksummed and scheduled for writeout.
> > 4. Thread A continues to write furiously(!) on that same page before the
> > controller finishes the DMA transfer.
> > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> What happens on writepage (see mm/page-writeback.c:write_cache_pages())
> is:
> lock_page(page)
> ...
> clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
> PTE
> ...
> set_page_writeback() (happens e.g. in __block_write_full_page() called
> from filesystem's writepage implementation).
> unlock_page(page)
>
> So if you compute the checksum after set_page_writeback() is done in the
> writepage() implementation (you cannot use __block_write_full_page() in
> that case) and you call wait_on_page_writeback() in ext4_page_mkwrite()
> under page lock, you should be safe. If you do all this and still see
> errors, something is broken I'd say...
Looking at the ext4_page_mkwrite, it does this:
lock the page
check for holes
unlock the page
if (no_holes)
return;
write_begin/write_end
return
So, to have page_mkwrite work, you need to wait for writeback with the
page locked in both the no holes case and after the
write_begin/write_end. write_begin will dirty the page, so someone can
wander in and start the IO while we are still in page_mkwrite.
This is untested and uncompiled, but it should
do the trick.
Jan, did you get rid of all the buffer head based writeback for
data=ordered in ext4? That's my only other idea, that someone is doing
writeback directly without taking the page lock.
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 9f7f9e4..8a75e12 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
+ wait_on_page_writeback(page);
unlock_page(page);
goto out_unlock;
}
@@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
+
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
On Mon 21-03-11 10:24:41, Chris Mason wrote:
> Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400:
> > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > > > going on in btrfs than I've been able to figure out.
> > >
> > > I wonder, is it possible for this to happen:
> > >
> > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > > but there's no ongoing writeback, so it returns without delay.
> > > 2. Thread A starts writing furiously to the page.
> > > 3. Thread B runs fsync() or something that results in the page being
> > > checksummed and scheduled for writeout.
> > > 4. Thread A continues to write furiously(!) on that same page before the
> > > controller finishes the DMA transfer.
> > > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> > What happens on writepage (see mm/page-writeback.c:write_cache_pages())
> > is:
> > lock_page(page)
> > ...
> > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
> > PTE
> > ...
> > set_page_writeback() (happens e.g. in __block_write_full_page() called
> > from filesystem's writepage implementation).
> > unlock_page(page)
> >
> > So if you compute the checksum after set_page_writeback() is done in the
> > writepage() implementation (you cannot use __block_write_full_page() in
> > that case)
I should add that if you are computing the checksum in the block layer
once the bio is submitted, you obviously are computing it after the page is
marked as writeback. So that should be fine...
> > and you call wait_on_page_writeback() in ext4_page_mkwrite()
> > under page lock, you should be safe. If you do all this and still see
> > errors, something is broken I'd say...
>
> Looking at the ext4_page_mkwrite, it does this:
>
> lock the page
> check for holes
> unlock the page
> if (no_holes)
> return;
>
> write_begin/write_end
> return
>
> So, to have page_mkwrite work, you need to wait for writeback with the
> page locked in both the no holes case and after the
> write_begin/write_end. write_begin will dirty the page, so someone can
> wander in and start the IO while we are still in page_mkwrite.
Oh right, that's a good point.
> This is untested and uncompiled, but it should
> do the trick.
>
> Jan, did you get rid of all the buffer head based writeback for
> data=ordered in ext4? That's my only other idea, that someone is doing
> writeback directly without taking the page lock.
Yes, ext4 shouldn't do any buffer based writeback.
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 9f7f9e4..8a75e12 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (page_has_buffers(page)) {
> if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> ext4_bh_unmapped)) {
> + wait_on_page_writeback(page);
> unlock_page(page);
> goto out_unlock;
> }
> @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (ret < 0)
> goto out_unlock;
> ret = 0;
> +
> + /*
> + * write_begin/end might have created a dirty page and someone
> + * could wander in and start the IO. Make sure that hasn't
> + * happened
> + */
> + lock_page(page);
> + wait_on_page_writeback(page);
> + unlock_page(page);
> +
> out_unlock:
> if (ret)
> ret = VM_FAULT_SIGBUS;
>
This looks good AFAICT.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Mar 18, 2011 at 08:28:26PM -0600, Andreas Dilger wrote:
> On 2011-03-18, at 6:07 PM, Darrick J. Wong wrote:
> >> Ok, here's what I have so far. I took everyone's suggestions of where to add
> >> calls to wait_on_page_writeback, which seems to handle the multiple-write case
> >> adequately. Unfortunately, it is still possible to generate checksum errors by
> >> scribbling furiously on a mmap'd region, even after adding the writeback wait
> >> in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> >> removing its wait_for_page_writeback call, so I suspect there's a bit more
> >> going on in btrfs than I've been able to figure out.
>
> > I wonder, is it possible for this to happen:
> >
> > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > but there's no ongoing writeback, so it returns without delay.
> > 2. Thread A starts writing furiously to the page.
> > 3. Thread B runs fsync() or something that results in the page being
> > checksummed and scheduled for writeout.
> > 4. Thread A continues to write furiously(!) on that same page before the
> > controller finishes the DMA transfer.
>
> Right, page_mkwrite() is only called for the ro->rw transition.
>
> > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> >
> > After letting the stress tool run for a few days, I can say fairly confidently
> > that the write() case doesn't seem to fail regardless of the O_DIRECT setting.
> > However, with writes to mmap regions, failures happen about once every 20-40
> > minutes, even with O_DIRECT set. To me this suggests some sort of race
> > condition that we seem to win except once every 20 minutes.
> >
> > I then thought, if page_mkwrite contains a wait_on_page_writeback, then perhaps
> > there's something that I could do just prior to calculating the DIF checksum
> > that would cause any subsequent write attempts to be shuffled back into
> > page_mkwrite. I tried the set_memory_ro thing again, though that led to some
> > recursive lock errors and I noticed that those functions only seem to exist in
> > arch/x86/. Next I tried directly mucking with PTEs, in addition to feeling
> > messy, only seemed to corrupt memory. :)
>
> This seems like the best solution, IMHO, to ensure that mmap is blocked in
> page_mkwrite() before it has any chance to dirty the page undergoing
> checksum. The trick is that you need to set_page_writeback() before setting
> the page read-only, otherwise the race still exists.
I figured out that the recursive locking errors only happened in the
set_memory_rw half of the ro/rw memory pair, and that I could make them go away
(for now) by do set_memory_rw in the kintegrityd workqueue. Then I added a
call to set_page_writeback just prior to the set_memory_ro call, though that
resulted in a lot of complaints about invalid page states and the like. It
would seem that the memory pages that arrive in bio_integrity_prep from jbd2
don't have the writeback flag set, and setting it causes problems for it. The
writeback flag is set on all the pages that are associated with a checksum
failure, I noticed.
As for changing pte's around... does that set_memory_ro change the pte flags
for all running processes? I'm not so sure it does for anything other than the
current process. I think I saw a flush_tlb call in there... though I don't
think it helps me much.
If I /don't/ set the flag, the frequency of the errors decreases further to
about once an hour, but I still see the occasional error. :/ Currently I'm
trying to figure out how one might distinguish dirty pages that shouldn't have
writeback set vs. pages that ought to have it but don't.
I suppose I could pull out the 're-checksum and resubmit' patch I made a while
back, though it seems like a bandaid.
> > Is there a "correct" way to take a writeable page and make it so that any
> > process trying to write to it ends up hitting the page fault handler where we
> > can then wait for writeback? Or perhaps I am simply barking up the wrong tree?
> >
> > (Just FYI I took the old copy-everything-to-bounce-buffers patch that few
> > people liked for a second spin, and the errors did not surface regardless of
> > what combination of write/mmap and directio/bufferedio I told it to use.)
>
> I wouldn't be so much against memcpy() for mmap pages, but it does seem kind
> of gross that mmap is forcing data copies when a major reason to use mmap is
> to AVOID data copies.
Yeah. We probably want to avoid having to find extra pages too. :(
--D
On Tue 22-03-11 12:23:01, Darrick J. Wong wrote:
> On Fri, Mar 18, 2011 at 08:28:26PM -0600, Andreas Dilger wrote:
> > This seems like the best solution, IMHO, to ensure that mmap is blocked in
> > page_mkwrite() before it has any chance to dirty the page undergoing
> > checksum. The trick is that you need to set_page_writeback() before setting
> > the page read-only, otherwise the race still exists.
>
> I figured out that the recursive locking errors only happened in the
> set_memory_rw half of the ro/rw memory pair, and that I could make them go away
> (for now) by do set_memory_rw in the kintegrityd workqueue. Then I added a
> call to set_page_writeback just prior to the set_memory_ro call, though that
> resulted in a lot of complaints about invalid page states and the like. It
> would seem that the memory pages that arrive in bio_integrity_prep from jbd2
> don't have the writeback flag set, and setting it causes problems for it. The
> writeback flag is set on all the pages that are associated with a checksum
> failure, I noticed.
Yeah, pages submitted by jbd2 don't have writeback flag set because they
are metadata blocks written directly via buffer heads. But as you noted,
these are protected in a different way by the journalling layer so
shouldn't need to worry.
> As for changing pte's around... does that set_memory_ro change the pte flags
> for all running processes? I'm not so sure it does for anything other than the
> current process. I think I saw a flush_tlb call in there... though I don't
> think it helps me much.
>
> If I /don't/ set the flag, the frequency of the errors decreases further to
> about once an hour, but I still see the occasional error. :/ Currently I'm
> trying to figure out how one might distinguish dirty pages that shouldn't have
> writeback set vs. pages that ought to have it but don't.
It's difficult at the block layer... If page->mapping->host is not a
device inode, the page should have PageWriteback set. If it is a device
inode, you don't know - JBD2 will submit pages without PageWriteback set,
flusher thread will submit pages with PageWriteback set. And both is OK
since we use buffer state for synchronization.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote:
> On Mon 21-03-11 10:24:41, Chris Mason wrote:
> > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400:
> > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > > > > going on in btrfs than I've been able to figure out.
> > > >
> > > > I wonder, is it possible for this to happen:
> > > >
> > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > > > but there's no ongoing writeback, so it returns without delay.
> > > > 2. Thread A starts writing furiously to the page.
> > > > 3. Thread B runs fsync() or something that results in the page being
> > > > checksummed and scheduled for writeout.
> > > > 4. Thread A continues to write furiously(!) on that same page before the
> > > > controller finishes the DMA transfer.
> > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> > > What happens on writepage (see mm/page-writeback.c:write_cache_pages())
> > > is:
> > > lock_page(page)
> > > ...
> > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
> > > PTE
> > > ...
> > > set_page_writeback() (happens e.g. in __block_write_full_page() called
> > > from filesystem's writepage implementation).
> > > unlock_page(page)
> > >
> > > So if you compute the checksum after set_page_writeback() is done in the
> > > writepage() implementation (you cannot use __block_write_full_page() in
> > > that case)
> I should add that if you are computing the checksum in the block layer
> once the bio is submitted, you obviously are computing it after the page is
> marked as writeback. So that should be fine...
>
> > > and you call wait_on_page_writeback() in ext4_page_mkwrite()
> > > under page lock, you should be safe. If you do all this and still see
> > > errors, something is broken I'd say...
> >
> > Looking at the ext4_page_mkwrite, it does this:
> >
> > lock the page
> > check for holes
> > unlock the page
> > if (no_holes)
> > return;
> >
> > write_begin/write_end
> > return
> >
> > So, to have page_mkwrite work, you need to wait for writeback with the
> > page locked in both the no holes case and after the
> > write_begin/write_end. write_begin will dirty the page, so someone can
> > wander in and start the IO while we are still in page_mkwrite.
> Oh right, that's a good point.
>
> > This is untested and uncompiled, but it should
> > do the trick.
> >
> > Jan, did you get rid of all the buffer head based writeback for
> > data=ordered in ext4? That's my only other idea, that someone is doing
> > writeback directly without taking the page lock.
> Yes, ext4 shouldn't do any buffer based writeback.
>
> > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > index 9f7f9e4..8a75e12 100644
> > --- a/fs/ext4/inode.c
> > +++ b/fs/ext4/inode.c
> > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > if (page_has_buffers(page)) {
> > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > ext4_bh_unmapped)) {
> > + wait_on_page_writeback(page);
> > unlock_page(page);
> > goto out_unlock;
> > }
> > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > if (ret < 0)
> > goto out_unlock;
> > ret = 0;
> > +
> > + /*
> > + * write_begin/end might have created a dirty page and someone
> > + * could wander in and start the IO. Make sure that hasn't
> > + * happened
> > + */
> > + lock_page(page);
> > + wait_on_page_writeback(page);
> > + unlock_page(page);
> > +
> > out_unlock:
> > if (ret)
> > ret = VM_FAULT_SIGBUS;
> >
> This looks good AFAICT.
I gave this a spin a couple of weeks ago (and accidentally left the test
machines running for a full week!) From what I can tell, with all the various
wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
down to about 7-8 per day. Not bad, but not fixed.
On the odd chance that jbd2 really can provide stable pages during writeback, I
am now rerunning the test with no patches and data=journal, while noting that
(a) DIO mode doesn't work with data=journal and (b) the first write failure
will probably cause the journal to abort == game over. When that's done I'll
give the wait-for-writeback patches a whirl with 2.6.39-rc.
Enjoy the warm SF weather!
--D
On Wed, Apr 06, 2011 at 04:29:38PM -0700, Darrick J. Wong wrote:
> On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote:
> > On Mon 21-03-11 10:24:41, Chris Mason wrote:
> > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400:
> > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > > > > > going on in btrfs than I've been able to figure out.
> > > > >
> > > > > I wonder, is it possible for this to happen:
> > > > >
> > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > > > > but there's no ongoing writeback, so it returns without delay.
> > > > > 2. Thread A starts writing furiously to the page.
> > > > > 3. Thread B runs fsync() or something that results in the page being
> > > > > checksummed and scheduled for writeout.
> > > > > 4. Thread A continues to write furiously(!) on that same page before the
> > > > > controller finishes the DMA transfer.
> > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages())
> > > > is:
> > > > lock_page(page)
> > > > ...
> > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
> > > > PTE
> > > > ...
> > > > set_page_writeback() (happens e.g. in __block_write_full_page() called
> > > > from filesystem's writepage implementation).
> > > > unlock_page(page)
> > > >
> > > > So if you compute the checksum after set_page_writeback() is done in the
> > > > writepage() implementation (you cannot use __block_write_full_page() in
> > > > that case)
> > I should add that if you are computing the checksum in the block layer
> > once the bio is submitted, you obviously are computing it after the page is
> > marked as writeback. So that should be fine...
> >
> > > > and you call wait_on_page_writeback() in ext4_page_mkwrite()
> > > > under page lock, you should be safe. If you do all this and still see
> > > > errors, something is broken I'd say...
> > >
> > > Looking at the ext4_page_mkwrite, it does this:
> > >
> > > lock the page
> > > check for holes
> > > unlock the page
> > > if (no_holes)
> > > return;
> > >
> > > write_begin/write_end
> > > return
> > >
> > > So, to have page_mkwrite work, you need to wait for writeback with the
> > > page locked in both the no holes case and after the
> > > write_begin/write_end. write_begin will dirty the page, so someone can
> > > wander in and start the IO while we are still in page_mkwrite.
> > Oh right, that's a good point.
> >
> > > This is untested and uncompiled, but it should
> > > do the trick.
> > >
> > > Jan, did you get rid of all the buffer head based writeback for
> > > data=ordered in ext4? That's my only other idea, that someone is doing
> > > writeback directly without taking the page lock.
> > Yes, ext4 shouldn't do any buffer based writeback.
> >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 9f7f9e4..8a75e12 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > if (page_has_buffers(page)) {
> > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > > ext4_bh_unmapped)) {
> > > + wait_on_page_writeback(page);
> > > unlock_page(page);
> > > goto out_unlock;
> > > }
> > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > if (ret < 0)
> > > goto out_unlock;
> > > ret = 0;
> > > +
> > > + /*
> > > + * write_begin/end might have created a dirty page and someone
> > > + * could wander in and start the IO. Make sure that hasn't
> > > + * happened
> > > + */
> > > + lock_page(page);
> > > + wait_on_page_writeback(page);
> > > + unlock_page(page);
> > > +
> > > out_unlock:
> > > if (ret)
> > > ret = VM_FAULT_SIGBUS;
> > >
> > This looks good AFAICT.
>
> I gave this a spin a couple of weeks ago (and accidentally left the test
> machines running for a full week!) From what I can tell, with all the various
> wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
> down to about 7-8 per day. Not bad, but not fixed.
>
> On the odd chance that jbd2 really can provide stable pages during writeback, I
> am now rerunning the test with no patches and data=journal, while noting that
> (a) DIO mode doesn't work with data=journal and (b) the first write failure
> will probably cause the journal to abort == game over. When that's done I'll
Heh, nope, even with data=journal I still see checksum errors. So much for
that hare-brained theory. :(
--D
On Wed 06-04-11 16:29:38, Darrick J. Wong wrote:
> On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote:
> > On Mon 21-03-11 10:24:41, Chris Mason wrote:
> > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400:
> > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > > > > > going on in btrfs than I've been able to figure out.
> > > > >
> > > > > I wonder, is it possible for this to happen:
> > > > >
> > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > > > > but there's no ongoing writeback, so it returns without delay.
> > > > > 2. Thread A starts writing furiously to the page.
> > > > > 3. Thread B runs fsync() or something that results in the page being
> > > > > checksummed and scheduled for writeout.
> > > > > 4. Thread A continues to write furiously(!) on that same page before the
> > > > > controller finishes the DMA transfer.
> > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages())
> > > > is:
> > > > lock_page(page)
> > > > ...
> > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
> > > > PTE
> > > > ...
> > > > set_page_writeback() (happens e.g. in __block_write_full_page() called
> > > > from filesystem's writepage implementation).
> > > > unlock_page(page)
> > > >
> > > > So if you compute the checksum after set_page_writeback() is done in the
> > > > writepage() implementation (you cannot use __block_write_full_page() in
> > > > that case)
> > I should add that if you are computing the checksum in the block layer
> > once the bio is submitted, you obviously are computing it after the page is
> > marked as writeback. So that should be fine...
> >
> > > > and you call wait_on_page_writeback() in ext4_page_mkwrite()
> > > > under page lock, you should be safe. If you do all this and still see
> > > > errors, something is broken I'd say...
> > >
> > > Looking at the ext4_page_mkwrite, it does this:
> > >
> > > lock the page
> > > check for holes
> > > unlock the page
> > > if (no_holes)
> > > return;
> > >
> > > write_begin/write_end
> > > return
> > >
> > > So, to have page_mkwrite work, you need to wait for writeback with the
> > > page locked in both the no holes case and after the
> > > write_begin/write_end. write_begin will dirty the page, so someone can
> > > wander in and start the IO while we are still in page_mkwrite.
> > Oh right, that's a good point.
> >
> > > This is untested and uncompiled, but it should
> > > do the trick.
> > >
> > > Jan, did you get rid of all the buffer head based writeback for
> > > data=ordered in ext4? That's my only other idea, that someone is doing
> > > writeback directly without taking the page lock.
> > Yes, ext4 shouldn't do any buffer based writeback.
> >
> > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > index 9f7f9e4..8a75e12 100644
> > > --- a/fs/ext4/inode.c
> > > +++ b/fs/ext4/inode.c
> > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > if (page_has_buffers(page)) {
> > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > > ext4_bh_unmapped)) {
> > > + wait_on_page_writeback(page);
> > > unlock_page(page);
> > > goto out_unlock;
> > > }
> > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > if (ret < 0)
> > > goto out_unlock;
> > > ret = 0;
> > > +
> > > + /*
> > > + * write_begin/end might have created a dirty page and someone
> > > + * could wander in and start the IO. Make sure that hasn't
> > > + * happened
> > > + */
> > > + lock_page(page);
> > > + wait_on_page_writeback(page);
> > > + unlock_page(page);
> > > +
> > > out_unlock:
> > > if (ret)
> > > ret = VM_FAULT_SIGBUS;
> > >
> > This looks good AFAICT.
>
> I gave this a spin a couple of weeks ago (and accidentally left the test
> machines running for a full week!) From what I can tell, with all the various
> wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
> down to about 7-8 per day. Not bad, but not fixed.
Ugh, strange. Can you post the full patch you are currently using? I've
already lost track of all the proposed changes... Thanks.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Thu, Apr 07, 2011 at 06:57:00PM +0200, Jan Kara wrote:
> On Wed 06-04-11 16:29:38, Darrick J. Wong wrote:
> > On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote:
> > > On Mon 21-03-11 10:24:41, Chris Mason wrote:
> > > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400:
> > > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > > > > > > going on in btrfs than I've been able to figure out.
> > > > > >
> > > > > > I wonder, is it possible for this to happen:
> > > > > >
> > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > > > > > but there's no ongoing writeback, so it returns without delay.
> > > > > > 2. Thread A starts writing furiously to the page.
> > > > > > 3. Thread B runs fsync() or something that results in the page being
> > > > > > checksummed and scheduled for writeout.
> > > > > > 4. Thread A continues to write furiously(!) on that same page before the
> > > > > > controller finishes the DMA transfer.
> > > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> > > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages())
> > > > > is:
> > > > > lock_page(page)
> > > > > ...
> > > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
> > > > > PTE
> > > > > ...
> > > > > set_page_writeback() (happens e.g. in __block_write_full_page() called
> > > > > from filesystem's writepage implementation).
> > > > > unlock_page(page)
> > > > >
> > > > > So if you compute the checksum after set_page_writeback() is done in the
> > > > > writepage() implementation (you cannot use __block_write_full_page() in
> > > > > that case)
> > > I should add that if you are computing the checksum in the block layer
> > > once the bio is submitted, you obviously are computing it after the page is
> > > marked as writeback. So that should be fine...
> > >
> > > > > and you call wait_on_page_writeback() in ext4_page_mkwrite()
> > > > > under page lock, you should be safe. If you do all this and still see
> > > > > errors, something is broken I'd say...
> > > >
> > > > Looking at the ext4_page_mkwrite, it does this:
> > > >
> > > > lock the page
> > > > check for holes
> > > > unlock the page
> > > > if (no_holes)
> > > > return;
> > > >
> > > > write_begin/write_end
> > > > return
> > > >
> > > > So, to have page_mkwrite work, you need to wait for writeback with the
> > > > page locked in both the no holes case and after the
> > > > write_begin/write_end. write_begin will dirty the page, so someone can
> > > > wander in and start the IO while we are still in page_mkwrite.
> > > Oh right, that's a good point.
> > >
> > > > This is untested and uncompiled, but it should
> > > > do the trick.
> > > >
> > > > Jan, did you get rid of all the buffer head based writeback for
> > > > data=ordered in ext4? That's my only other idea, that someone is doing
> > > > writeback directly without taking the page lock.
> > > Yes, ext4 shouldn't do any buffer based writeback.
> > >
> > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > index 9f7f9e4..8a75e12 100644
> > > > --- a/fs/ext4/inode.c
> > > > +++ b/fs/ext4/inode.c
> > > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > if (page_has_buffers(page)) {
> > > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > > > ext4_bh_unmapped)) {
> > > > + wait_on_page_writeback(page);
> > > > unlock_page(page);
> > > > goto out_unlock;
> > > > }
> > > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > if (ret < 0)
> > > > goto out_unlock;
> > > > ret = 0;
> > > > +
> > > > + /*
> > > > + * write_begin/end might have created a dirty page and someone
> > > > + * could wander in and start the IO. Make sure that hasn't
> > > > + * happened
> > > > + */
> > > > + lock_page(page);
> > > > + wait_on_page_writeback(page);
> > > > + unlock_page(page);
> > > > +
> > > > out_unlock:
> > > > if (ret)
> > > > ret = VM_FAULT_SIGBUS;
> > > >
> > > This looks good AFAICT.
> >
> > I gave this a spin a couple of weeks ago (and accidentally left the test
> > machines running for a full week!) From what I can tell, with all the various
> > wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
> > down to about 7-8 per day. Not bad, but not fixed.
> Ugh, strange. Can you post the full patch you are currently using? I've
> already lost track of all the proposed changes... Thanks.
Yes, me too. Attached is the giant patch I've been working on.
--D
fs: Wait for page writeback when rewrite detected, and mark pages ro during writeback
Signed-off-by: Darrick J. Wong <[email protected]>
---
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..dd429fe 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2357,6 +2357,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
else
end = PAGE_CACHE_SIZE;
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
ret = block_commit_write(page, 0, end);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 1a86282..57cd028 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2666,6 +2666,8 @@ static int ext4_writepage(struct page *page,
*/
goto redirty_page;
}
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
if (commit_write)
/* now mark the buffer_heads as dirty and uptodate */
block_commit_write(page, 0, len);
@@ -2778,7 +2780,8 @@ static int write_cache_pages_da(struct address_space *mapping,
}
lock_page(page);
-
+ if (PageWriteback(page))
+ wait_on_page_writeback(page);
/*
* If the page is no longer dirty, or its
* mapping no longer corresponds to inode we
@@ -5803,12 +5806,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (PageMappedToDisk(page))
goto out_unlock;
+ lock_page(page);
+ /* this one seems to handle mmap */
+ wait_on_page_writeback(page);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
@@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened.
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ unlock_page(page);
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..3ed13a3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2277,8 +2277,9 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+static struct page *
+__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index,
+ unsigned flags)
{
int status;
struct page *page;
@@ -2303,6 +2304,20 @@ repeat:
}
return page;
}
+
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p) {
+ WARN_ON(!PageLocked(p));
+ wait_on_page_writeback(p);
+ }
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
static ssize_t generic_perform_write(struct file *file,
diff --git a/mm/memory.c b/mm/memory.c
index 9da8cab..17fb560 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3146,6 +3146,9 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else
VM_BUG_ON(!PageLocked(page));
page_mkwrite = 1;
+ } else {
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
}
}
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 9c5e6b2..7d2c57d 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -25,6 +25,7 @@
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
+#include <asm/tlbflush.h>
struct integrity_slab {
struct kmem_cache *slab;
@@ -382,6 +383,52 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
return 0;
}
+static int set_page_ro(struct page *page)
+{
+ int set_memory_ro(unsigned long addr, int numpages);
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_ro(addr, 1);
+}
+
+static int set_page_rw(struct page *page)
+{
+ int set_memory_rw(unsigned long addr, int numpages);
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_rw(addr, 1);
+}
+
+static void bio_integrity_write_fn(struct work_struct *work)
+{
+ struct bio_integrity_payload *bip =
+ container_of(work, struct bio_integrity_payload, bip_work);
+ struct bio *bio = bip->bip_bio;
+ struct bio_vec *from;
+ int i;
+
+ __bio_for_each_segment(from, bio, i, 0) {
+ set_page_rw(from->bv_page);
+ }
+
+ /* Restore original bio completion handler */
+ bio->bi_end_io = bip->bip_end_io;
+ bio_endio(bio, bip->bip_error);
+}
+
+static void bio_integrity_endio_write(struct bio *bio, int error)
+{
+ struct bio_integrity_payload *bip = bio->bi_integrity;
+
+ BUG_ON(bip->bip_bio != bio);
+
+ bip->bip_error = error;
+ INIT_WORK(&bip->bip_work, bio_integrity_write_fn);
+ queue_work(kintegrityd_wq, &bip->bip_work);
+}
+
/**
* bio_integrity_prep - Prepare bio for integrity I/O
* @bio: bio to prepare
@@ -468,8 +515,30 @@ int bio_integrity_prep(struct bio *bio)
}
/* Auto-generate integrity metadata if this is a write */
- if (bio_data_dir(bio) == WRITE)
+ if (bio_data_dir(bio) == WRITE) {
+ struct bio_vec *from;
+ int i;
+
+ bip->bip_end_io = bio->bi_end_io;
+ bio->bi_end_io = bio_integrity_endio_write;
+ __bio_for_each_segment(from, bio, i, 0) {
+ set_page_writeback(from->bv_page);
+#if 0
+ if (!PagePrivate(from->bv_page) &&
+ !PageWriteback(from->bv_page) &&
+ from->bv_page->mapping &&
+ from->bv_page->mapping->host &&
+ !from->bv_page->mapping->host->i_bdev)
+ {
+ printk(KERN_ERR "%s: writebacking file(?) page=%p flags=%lx mode=%x...\n", __FUNCTION__, from->bv_page, from->bv_page->flags, from->bv_page->mapping->host->i_mode);
+ set_page_writeback(from->bv_page);
+ }
+#endif
+ set_page_ro(from->bv_page);
+ flush_tlb_all();
+ }
bio_integrity_generate(bio);
+ }
return 0;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index dd429fe..02156c5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -376,8 +376,10 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
if (!quiet_error(bh)) {
buffer_io_error(bh);
printk(KERN_WARNING "lost page write due to "
- "I/O error on %s\n",
- bdevname(bh->b_bdev, b));
+ "I/O error on %s, flags=%lx\n",
+ bdevname(bh->b_bdev, b), page->flags);
+if (page->mapping && page->mapping->host)
+printk(KERN_ERR "mode=%x inode=%d dev=%d\n", page->mapping->host->i_mode, page->mapping->host->i_ino, (page->mapping->host->i_rdev));
}
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ce33e68..ea36e89 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -180,6 +180,7 @@ struct bio_integrity_payload {
unsigned short bip_vcnt; /* # of integrity bio_vecs */
unsigned short bip_idx; /* current bip_vec index */
+ int bip_error; /* bio completion status? */
struct work_struct bip_work; /* I/O completion */
struct bio_vec bip_vec[0]; /* embedded bvec array */
};
On Fri, 8 Apr 2011 13:31:35 -0700
"Darrick J. Wong" <[email protected]> wrote:
> On Thu, Apr 07, 2011 at 06:57:00PM +0200, Jan Kara wrote:
> > On Wed 06-04-11 16:29:38, Darrick J. Wong wrote:
> > > On Mon, Mar 21, 2011 at 05:43:05PM +0100, Jan Kara wrote:
> > > > On Mon 21-03-11 10:24:41, Chris Mason wrote:
> > > > > Excerpts from Jan Kara's message of 2011-03-21 10:04:51 -0400:
> > > > > > On Fri 18-03-11 17:07:55, Darrick J. Wong wrote:
> > > > > > > > > Ok, here's what I have so far. I took everyone's suggestions of where to add
> > > > > > > > > calls to wait_on_page_writeback, which seems to handle the multiple-write case
> > > > > > > > > adequately. Unfortunately, it is still possible to generate checksum errors by
> > > > > > > > > scribbling furiously on a mmap'd region, even after adding the writeback wait
> > > > > > > > > in the ext4 writepage function. Oddly, I couldn't break btrfs with mmap by
> > > > > > > > > removing its wait_for_page_writeback call, so I suspect there's a bit more
> > > > > > > > > going on in btrfs than I've been able to figure out.
> > > > > > >
> > > > > > > I wonder, is it possible for this to happen:
> > > > > > >
> > > > > > > 1. Thread A mmaps a page and tries to write to it. ext4_page_mkwrite executes,
> > > > > > > but there's no ongoing writeback, so it returns without delay.
> > > > > > > 2. Thread A starts writing furiously to the page.
> > > > > > > 3. Thread B runs fsync() or something that results in the page being
> > > > > > > checksummed and scheduled for writeout.
> > > > > > > 4. Thread A continues to write furiously(!) on that same page before the
> > > > > > > controller finishes the DMA transfer.
> > > > > > > 5. Disk gets the page, which now doesn't match its checksum, and *boom*
> > > > > > What happens on writepage (see mm/page-writeback.c:write_cache_pages())
> > > > > > is:
> > > > > > lock_page(page)
> > > > > > ...
> > > > > > clear_page_dirty_for_io() - removes PageDirty, marks page as read-only in
> > > > > > PTE
> > > > > > ...
> > > > > > set_page_writeback() (happens e.g. in __block_write_full_page() called
> > > > > > from filesystem's writepage implementation).
> > > > > > unlock_page(page)
> > > > > >
> > > > > > So if you compute the checksum after set_page_writeback() is done in the
> > > > > > writepage() implementation (you cannot use __block_write_full_page() in
> > > > > > that case)
> > > > I should add that if you are computing the checksum in the block layer
> > > > once the bio is submitted, you obviously are computing it after the page is
> > > > marked as writeback. So that should be fine...
> > > >
> > > > > > and you call wait_on_page_writeback() in ext4_page_mkwrite()
> > > > > > under page lock, you should be safe. If you do all this and still see
> > > > > > errors, something is broken I'd say...
> > > > >
> > > > > Looking at the ext4_page_mkwrite, it does this:
> > > > >
> > > > > lock the page
> > > > > check for holes
> > > > > unlock the page
> > > > > if (no_holes)
> > > > > return;
> > > > >
> > > > > write_begin/write_end
> > > > > return
> > > > >
> > > > > So, to have page_mkwrite work, you need to wait for writeback with the
> > > > > page locked in both the no holes case and after the
> > > > > write_begin/write_end. write_begin will dirty the page, so someone can
> > > > > wander in and start the IO while we are still in page_mkwrite.
> > > > Oh right, that's a good point.
> > > >
> > > > > This is untested and uncompiled, but it should
> > > > > do the trick.
> > > > >
> > > > > Jan, did you get rid of all the buffer head based writeback for
> > > > > data=ordered in ext4? That's my only other idea, that someone is doing
> > > > > writeback directly without taking the page lock.
> > > > Yes, ext4 shouldn't do any buffer based writeback.
> > > >
> > > > > diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> > > > > index 9f7f9e4..8a75e12 100644
> > > > > --- a/fs/ext4/inode.c
> > > > > +++ b/fs/ext4/inode.c
> > > > > @@ -5880,6 +5880,7 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > > if (page_has_buffers(page)) {
> > > > > if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
> > > > > ext4_bh_unmapped)) {
> > > > > + wait_on_page_writeback(page);
> > > > > unlock_page(page);
> > > > > goto out_unlock;
> > > > > }
> > > > > @@ -5901,6 +5902,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > > > if (ret < 0)
> > > > > goto out_unlock;
> > > > > ret = 0;
> > > > > +
> > > > > + /*
> > > > > + * write_begin/end might have created a dirty page and someone
> > > > > + * could wander in and start the IO. Make sure that hasn't
> > > > > + * happened
> > > > > + */
> > > > > + lock_page(page);
> > > > > + wait_on_page_writeback(page);
> > > > > + unlock_page(page);
> > > > > +
> > > > > out_unlock:
> > > > > if (ret)
> > > > > ret = VM_FAULT_SIGBUS;
> > > > >
> > > > This looks good AFAICT.
> > >
> > > I gave this a spin a couple of weeks ago (and accidentally left the test
> > > machines running for a full week!) From what I can tell, with all the various
> > > wait_for_page_writeback stuff-ins, we've cut the frequency of writeback errors
> > > down to about 7-8 per day. Not bad, but not fixed.
> > Ugh, strange. Can you post the full patch you are currently using? I've
> > already lost track of all the proposed changes... Thanks.
>
> Yes, me too. Attached is the giant patch I've been working on.
>
> --D
>
> fs: Wait for page writeback when rewrite detected, and mark pages ro during writeback
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a08bb8e..dd429fe 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2357,6 +2357,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> else
> end = PAGE_CACHE_SIZE;
>
> + WARN_ON(!PageLocked(page));
> + wait_on_page_writeback(page);
> ret = __block_write_begin(page, 0, end, get_block);
> if (!ret)
> ret = block_commit_write(page, 0, end);
> diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
> index 1a86282..57cd028 100644
> --- a/fs/ext4/inode.c
> +++ b/fs/ext4/inode.c
> @@ -2666,6 +2666,8 @@ static int ext4_writepage(struct page *page,
> */
> goto redirty_page;
> }
> + WARN_ON(!PageLocked(page));
> + wait_on_page_writeback(page);
> if (commit_write)
> /* now mark the buffer_heads as dirty and uptodate */
> block_commit_write(page, 0, len);
> @@ -2778,7 +2780,8 @@ static int write_cache_pages_da(struct address_space *mapping,
> }
>
> lock_page(page);
> -
> + if (PageWriteback(page))
> + wait_on_page_writeback(page);
> /*
> * If the page is no longer dirty, or its
> * mapping no longer corresponds to inode we
> @@ -5803,12 +5806,14 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (PageMappedToDisk(page))
> goto out_unlock;
>
> + lock_page(page);
> + /* this one seems to handle mmap */
> + wait_on_page_writeback(page);
> if (page->index == size >> PAGE_CACHE_SHIFT)
> len = size & ~PAGE_CACHE_MASK;
> else
> len = PAGE_CACHE_SIZE;
>
> - lock_page(page);
> /*
> * return if we have all the buffers mapped. This avoid
> * the need to call write_begin/write_end which does a
> @@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> if (ret < 0)
> goto out_unlock;
> ret = 0;
> +
> + /*
> + * write_begin/end might have created a dirty page and someone
> + * could wander in and start the IO. Make sure that hasn't
> + * happened.
> + */
> + lock_page(page);
> + wait_on_page_writeback(page);
> + unlock_page(page);
nit:
The callers of page_mkwrite always lock the page afterward if you
return from page_mkwrite with it unlocked. If you plan to take page
lock anyway, it's probably slightly more efficient not to unlock it and
instead return VM_FAULT_LOCKED.
--
Jeff Layton <[email protected]>
Excerpts from Jeff Layton's message of 2011-04-11 12:42:29 -0400:
> > @@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > if (ret < 0)
> > goto out_unlock;
> > ret = 0;
> > +
> > + /*
> > + * write_begin/end might have created a dirty page and someone
> > + * could wander in and start the IO. Make sure that hasn't
> > + * happened.
> > + */
> > + lock_page(page);
> > + wait_on_page_writeback(page);
> > + unlock_page(page);
>
> nit:
>
> The callers of page_mkwrite always lock the page afterward if you
> return from page_mkwrite with it unlocked. If you plan to take page
> lock anyway, it's probably slightly more efficient not to unlock it and
> instead return VM_FAULT_LOCKED.
>
Actually this isn't a nit. Keeping the page locked closes an important
hole where it can become writeback again. It might fix the last
remaining problem.
-chris
On Mon, Apr 11, 2011 at 01:41:16PM -0400, Chris Mason wrote:
> Actually this isn't a nit. Keeping the page locked closes an important
> hole where it can become writeback again. It might fix the last
> remaining problem.
Note that the generic block_page_mkwrite atually does that. But only
xfs and nilfs2 actually make use of it.
Excerpts from Christoph Hellwig's message of 2011-04-11 14:25:41 -0400:
> On Mon, Apr 11, 2011 at 01:41:16PM -0400, Chris Mason wrote:
> > Actually this isn't a nit. Keeping the page locked closes an important
> > hole where it can become writeback again. It might fix the last
> > remaining problem.
>
> Note that the generic block_page_mkwrite atually does that. But only
> xfs and nilfs2 actually make use of it.
>
Btrfs does return VM_FAULT_LOCKED though. I had forgotten to look for
this in the ext4 page_mkwrite call.
-chris
On Mon, 2011-04-11 at 13:41 -0400, Chris Mason wrote:
> Excerpts from Jeff Layton's message of 2011-04-11 12:42:29 -0400:
> > > @@ -5839,6 +5844,15 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> > > if (ret < 0)
> > > goto out_unlock;
> > > ret = 0;
> > > +
> > > + /*
> > > + * write_begin/end might have created a dirty page and someone
> > > + * could wander in and start the IO. Make sure that hasn't
> > > + * happened.
> > > + */
> > > + lock_page(page);
> > > + wait_on_page_writeback(page);
> > > + unlock_page(page);
> >
I am little puzzled here. if someone wander in and start the IO, the
page is up-to-date (dirtied by this page_mkwrite). We shouldn't see the
checksum inconsistancy, right?
> > nit:
> >
> > The callers of page_mkwrite always lock the page afterward if you
> > return from page_mkwrite with it unlocked. If you plan to take page
> > lock anyway, it's probably slightly more efficient not to unlock it and
> > instead return VM_FAULT_LOCKED.
> >
>
> Actually this isn't a nit. Keeping the page locked closes an important
> hole where it can become writeback again. It might fix the last
> remaining problem.
>
Oh, right. Currently ext4_page_mkwrite drops the page lock before
calling it's dirty the page (by write_begin() and write_end(). I
suspect regrab the lock() after write_end() (with your proposed change)
and returning with locked still leave the dirty by ext4_page_mkwrite
unlocked. We probably should to keep the page locked the page during
the entire ext4_page_mkwrite() call. Any reason to drop the page lock()
before calling aops->write_begin()?
Mingming
On Mon, Apr 11, 2011 at 05:46:52PM -0700, Mingming Cao wrote:
> Oh, right. Currently ext4_page_mkwrite drops the page lock before
> calling it's dirty the page (by write_begin() and write_end(). I
> suspect regrab the lock() after write_end() (with your proposed change)
> and returning with locked still leave the dirty by ext4_page_mkwrite
> unlocked. We probably should to keep the page locked the page during
> the entire ext4_page_mkwrite() call. Any reason to drop the page lock()
> before calling aops->write_begin()?
write_begin takes the page lock by itself. That's one of the reasons why
block_page_mkwrite doesn't use plain ->write_begin / write_end, the
other beeing that we already get a page passed to use, so there's no
need to do the pagecache lookup or allocation done by
grab_cache_page_write_begin.
The best thing would be to completely drop ext4's current version
of page_mkwrite and start out with a copy of block_page_mkwrite which
has the journalling calls added back into it.
On Mon, 2011-04-11 at 20:57 -0400, Christoph Hellwig wrote:
> On Mon, Apr 11, 2011 at 05:46:52PM -0700, Mingming Cao wrote:
> > Oh, right. Currently ext4_page_mkwrite drops the page lock before
> > calling it's dirty the page (by write_begin() and write_end(). I
> > suspect regrab the lock() after write_end() (with your proposed change)
> > and returning with locked still leave the dirty by ext4_page_mkwrite
> > unlocked. We probably should to keep the page locked the page during
> > the entire ext4_page_mkwrite() call. Any reason to drop the page lock()
> > before calling aops->write_begin()?
>
> write_begin takes the page lock by itself. That's one of the reasons why
> block_page_mkwrite doesn't use plain ->write_begin / write_end, the
> other beeing that we already get a page passed to use, so there's no
> need to do the pagecache lookup or allocation done by
> grab_cache_page_write_begin.
>
> The best thing would be to completely drop ext4's current version
> of page_mkwrite and start out with a copy of block_page_mkwrite which
> has the journalling calls added back into it.
The problem is the locking order, we can't hold page lock then start the
journal lock. Kjournald will need to hold the journal lock first, then
commit, commit may need to callback writepages, which need to hold the
page lock.
I looked at ext3, in that case, ext3 even don't have ext3_page_mkwrite()
to do the stable page yet. It requires some block reservation/delayed
allocation for filling holes in mmaped IO case. Jan tried that before I
don't think the proposal get far.
Now looking back ext4_page_mkwrite(), it calls write_begin(), as long as
we do wait in write_begin() things would be fine, right? It seems
Darrick already did that wait per Dave Chinner's suggestion when grab
the page and lock that page. But still a puzzle to me why that's not
sufficient.
Mingming
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Wed, Apr 13, 2011 at 05:48:48PM -0700, Mingming Cao wrote:
> On Mon, 2011-04-11 at 20:57 -0400, Christoph Hellwig wrote:
> > On Mon, Apr 11, 2011 at 05:46:52PM -0700, Mingming Cao wrote:
> > > Oh, right. Currently ext4_page_mkwrite drops the page lock before
> > > calling it's dirty the page (by write_begin() and write_end(). I
> > > suspect regrab the lock() after write_end() (with your proposed change)
> > > and returning with locked still leave the dirty by ext4_page_mkwrite
> > > unlocked. We probably should to keep the page locked the page during
> > > the entire ext4_page_mkwrite() call. Any reason to drop the page lock()
> > > before calling aops->write_begin()?
> >
> > write_begin takes the page lock by itself. That's one of the reasons why
> > block_page_mkwrite doesn't use plain ->write_begin / write_end, the
> > other beeing that we already get a page passed to use, so there's no
> > need to do the pagecache lookup or allocation done by
> > grab_cache_page_write_begin.
> >
> > The best thing would be to completely drop ext4's current version
> > of page_mkwrite and start out with a copy of block_page_mkwrite which
> > has the journalling calls added back into it.
>
> The problem is the locking order, we can't hold page lock then start the
> journal lock. Kjournald will need to hold the journal lock first, then
> commit, commit may need to callback writepages, which need to hold the
> page lock.
>
>
> I looked at ext3, in that case, ext3 even don't have ext3_page_mkwrite()
> to do the stable page yet. It requires some block reservation/delayed
> allocation for filling holes in mmaped IO case. Jan tried that before I
> don't think the proposal get far.
>
>
> Now looking back ext4_page_mkwrite(), it calls write_begin(), as long as
> we do wait in write_begin() things would be fine, right? It seems
> Darrick already did that wait per Dave Chinner's suggestion when grab
> the page and lock that page. But still a puzzle to me why that's not
> sufficient.
Hi everyone,
I've finally managed to get together a patch that seems to provide stable pages
during writeback, or at least gets us to the point that after several days of
running tests I don't see DIF checksum errors anymore. :)
The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
walk the process tree to find all userland ptes that map to a particular memory
page and revoke write access, and (b) ext4_page_mkwrite should always lock (and
wait on) a page before returning. There are still some huge warts with the
patch; in particular, the x86-specific set_memory_r[ow] needs to go away, which
is what I was trying to do with the #ifdef USE_X86 case. Also, there are
probably more wait_for_page_writeback()s than are really necessary.
Going forward, however, it might be easier to take all those waits out and
simply copy-migrate the page to another location when the page fault handler
sees the write-during-io case, so I'll look into that tomorrow. I'm also not
particularly sure what happens when huge pages get involved.
Debug-quality patch follows, before my kernel-build disks crash again.
--D
---
diff --git a/fs/bio-integrity.c b/fs/bio-integrity.c
index 9c5e6b2..8031bd7 100644
--- a/fs/bio-integrity.c
+++ b/fs/bio-integrity.c
@@ -25,6 +25,7 @@
#include <linux/bio.h>
#include <linux/workqueue.h>
#include <linux/slab.h>
+#include <asm/tlbflush.h>
struct integrity_slab {
struct kmem_cache *slab;
@@ -382,6 +383,110 @@ static inline unsigned short blk_integrity_tuple_size(struct blk_integrity *bi)
return 0;
}
+#define USE_X86
+
+#ifdef USE_X86
+static int set_page_ro(struct page *page)
+{
+ int set_memory_ro(unsigned long addr, int numpages);
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_ro(addr, 1);
+}
+
+static int set_page_rw(struct page *page)
+{
+ int set_memory_rw(unsigned long addr, int numpages);
+ unsigned long addr = (unsigned long)page_address(page);
+ if (PageHighMem(page))
+ return 0;
+ return set_memory_rw(addr, 1);
+}
+#endif
+
+#include <linux/rmap.h>
+static void set_bio_page_access(struct mm_struct *mm, struct bio *bio, int rw)
+{
+ struct bio_vec *from;
+ int i;
+
+ __bio_for_each_segment(from, bio, i, 0) {
+ struct vm_area_struct *vma;
+ int modified = 0;
+
+ for (vma = mm->mmap; vma; vma = vma->vm_next) {
+ unsigned long address;
+ pte_t *pte;
+ spinlock_t *ptl;
+
+ // grab pte, change to ro?
+ address = page_address_in_vma(from->bv_page, vma);
+ if (address == -EFAULT) /* out of vma range */
+ continue;
+ pte = page_check_address(from->bv_page, vma->vm_mm, address, &ptl, 1);
+ if (!pte) /* the page is not in this mm */
+ continue;
+ modified = 1;
+ if (rw) {
+ pte_t old_pte = *pte;
+ set_pte_at(mm, address, pte, pte_mkwrite(old_pte));
+ } else
+ ptep_set_wrprotect(mm, address, pte);
+ pte_unmap_unlock(pte, ptl);
+ }
+ if (modified)
+ flush_tlb_mm(mm);
+ }
+}
+
+static void set_bio_page_rw(struct mm_struct *mm, struct bio *bio)
+{
+ set_bio_page_access(mm, bio, 1);
+}
+
+static void set_bio_page_ro(struct mm_struct *mm, struct bio *bio)
+{
+ set_bio_page_access(mm, bio, 0);
+}
+
+#ifdef USE_X86
+static void bio_integrity_write_fn(struct work_struct *work)
+{
+ struct bio_integrity_payload *bip =
+ container_of(work, struct bio_integrity_payload, bip_work);
+ struct bio *bio = bip->bip_bio;
+ struct bio_vec *from;
+ int i;
+
+ __bio_for_each_segment(from, bio, i, 0) {
+ set_page_rw(from->bv_page);
+ }
+
+ /* Restore original bio completion handler */
+ bio->bi_end_io = bip->bip_end_io;
+ bio_endio(bio, bip->bip_error);
+}
+#endif
+
+static void bio_integrity_endio_write(struct bio *bio, int error)
+{
+ struct bio_integrity_payload *bip = bio->bi_integrity;
+
+ BUG_ON(bip->bip_bio != bio);
+
+#ifndef USE_X86
+ set_bio_page_rw(&init_mm, bio);
+ bio->bi_end_io = bip->bip_end_io;
+ bio_endio(bio, bip->bip_error);
+ return;
+#else
+ bip->bip_error = error;
+ INIT_WORK(&bip->bip_work, bio_integrity_write_fn);
+ queue_work(kintegrityd_wq, &bip->bip_work);
+#endif
+}
+
/**
* bio_integrity_prep - Prepare bio for integrity I/O
* @bio: bio to prepare
@@ -468,8 +573,42 @@ int bio_integrity_prep(struct bio *bio)
}
/* Auto-generate integrity metadata if this is a write */
- if (bio_data_dir(bio) == WRITE)
+ if (bio_data_dir(bio) == WRITE) {
+ struct task_struct *p;
+ struct mm_struct *mm;
+
+ bip->bip_end_io = bio->bi_end_io;
+ bio->bi_end_io = bio_integrity_endio_write;
+
+#ifndef USE_X86
+// kmap_flush_unused();
+// vm_unmap_aliases();
+ set_bio_page_ro(&init_mm, bio);
+ flush_tlb_all();
+#else
+ {
+ struct bio_vec *from;
+ int i;
+
+ __bio_for_each_segment(from, bio, i, 0) {
+ set_page_ro(from->bv_page);
+ }
+ }
+#endif
+
+ for_each_process(p) {
+ task_lock(p);
+ mm = p->mm;
+ if (!mm) {
+ task_unlock(p);
+ continue;
+ }
+
+ set_bio_page_ro(mm, bio);
+ task_unlock(p);
+ }
bio_integrity_generate(bio);
+ }
return 0;
}
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..02156c5 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -376,8 +376,10 @@ void end_buffer_async_write(struct buffer_head *bh, int uptodate)
if (!quiet_error(bh)) {
buffer_io_error(bh);
printk(KERN_WARNING "lost page write due to "
- "I/O error on %s\n",
- bdevname(bh->b_bdev, b));
+ "I/O error on %s, flags=%lx\n",
+ bdevname(bh->b_bdev, b), page->flags);
+if (page->mapping && page->mapping->host)
+printk(KERN_ERR "mode=%x inode=%d dev=%d\n", page->mapping->host->i_mode, page->mapping->host->i_ino, (page->mapping->host->i_rdev));
}
set_bit(AS_EIO, &page->mapping->flags);
set_buffer_write_io_error(bh);
@@ -2357,6 +2359,8 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
else
end = PAGE_CACHE_SIZE;
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
ret = __block_write_begin(page, 0, end, get_block);
if (!ret)
ret = block_commit_write(page, 0, end);
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..228b4aa 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2667,6 +2667,8 @@ static int ext4_writepage(struct page *page,
*/
goto redirty_page;
}
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
if (commit_write)
/* now mark the buffer_heads as dirty and uptodate */
block_commit_write(page, 0, len);
@@ -2779,7 +2781,8 @@ static int write_cache_pages_da(struct address_space *mapping,
}
lock_page(page);
-
+ if (PageWriteback(page))
+ wait_on_page_writeback(page);
/*
* If the page is no longer dirty, or its
* mapping no longer corresponds to inode we
@@ -5811,15 +5814,21 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
goto out_unlock;
}
ret = 0;
- if (PageMappedToDisk(page))
- goto out_unlock;
+ lock_page(page);
+ if (PageWriteback(page))
+ wait_on_page_writeback(page);
+ if (PageMappedToDisk(page)) {
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
+ }
+ /* this one seems to handle mmap */
+ wait_on_page_writeback(page);
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
@@ -5829,8 +5838,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
- unlock_page(page);
- goto out_unlock;
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
}
}
unlock_page(page);
@@ -5850,6 +5859,17 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened.
+ */
+ lock_page(page);
+ if (PageWriteback(page))
+ wait_on_page_writeback(page);
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
diff --git a/include/linux/bio.h b/include/linux/bio.h
index ce33e68..ea36e89 100644
--- a/include/linux/bio.h
+++ b/include/linux/bio.h
@@ -180,6 +180,7 @@ struct bio_integrity_payload {
unsigned short bip_vcnt; /* # of integrity bio_vecs */
unsigned short bip_idx; /* current bip_vec index */
+ int bip_error; /* bio completion status? */
struct work_struct bip_work; /* I/O completion */
struct bio_vec bip_vec[0]; /* embedded bvec array */
};
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..3ed13a3 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2277,8 +2277,9 @@ EXPORT_SYMBOL(generic_file_direct_write);
* Find or create a page at the given pagecache position. Return the locked
* page. This function is specifically for buffered writes.
*/
-struct page *grab_cache_page_write_begin(struct address_space *mapping,
- pgoff_t index, unsigned flags)
+static struct page *
+__grab_cache_page_write_begin(struct address_space *mapping, pgoff_t index,
+ unsigned flags)
{
int status;
struct page *page;
@@ -2303,6 +2304,20 @@ repeat:
}
return page;
}
+
+struct page *grab_cache_page_write_begin(struct address_space *mapping,
+ pgoff_t index, unsigned flags)
+{
+ struct page *p;
+
+ p = __grab_cache_page_write_begin(mapping, index, flags);
+ if (p) {
+ WARN_ON(!PageLocked(p));
+ wait_on_page_writeback(p);
+ }
+
+ return p;
+}
EXPORT_SYMBOL(grab_cache_page_write_begin);
static ssize_t generic_perform_write(struct file *file,
diff --git a/mm/memory.c b/mm/memory.c
index ce22a25..9d3dc5d 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -3156,6 +3156,9 @@ static int __do_fault(struct mm_struct *mm, struct vm_area_struct *vma,
} else
VM_BUG_ON(!PageLocked(page));
page_mkwrite = 1;
+ } else {
+ WARN_ON(!PageLocked(page));
+ wait_on_page_writeback(page);
}
}
Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400:
> Hi everyone,
>
> I've finally managed to get together a patch that seems to provide stable pages
> during writeback, or at least gets us to the point that after several days of
> running tests I don't see DIF checksum errors anymore. :)
>
> The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
> walk the process tree to find all userland ptes that map to a particular memory
> page and revoke write access, and
Hmm, did you need the bio_integrity_prep change for all the filesystems?
This should be happening already as part of using page_mkwrite.
-chris
On Fri 22-04-11 08:50:01, Chris Mason wrote:
> Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400:
> > Hi everyone,
> >
> > I've finally managed to get together a patch that seems to provide stable pages
> > during writeback, or at least gets us to the point that after several days of
> > running tests I don't see DIF checksum errors anymore. :)
> >
> > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
> > walk the process tree to find all userland ptes that map to a particular memory
> > page and revoke write access, and
>
> Hmm, did you need the bio_integrity_prep change for all the filesystems?
> This should be happening already as part of using page_mkwrite.
Or more precisely page_mkclean() should do what you try to do in
bio_integrity_prep()... It would certainly be interesting (bug) if you
could write to the page after calling page_mkclean() without page_mkwrite()
being called.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote:
> On Fri 22-04-11 08:50:01, Chris Mason wrote:
> > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400:
> > > Hi everyone,
> > >
> > > I've finally managed to get together a patch that seems to provide stable pages
> > > during writeback, or at least gets us to the point that after several days of
> > > running tests I don't see DIF checksum errors anymore. :)
> > >
> > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
> > > walk the process tree to find all userland ptes that map to a particular memory
> > > page and revoke write access, and
> >
> > Hmm, did you need the bio_integrity_prep change for all the filesystems?
> > This should be happening already as part of using page_mkwrite.
> Or more precisely page_mkclean() should do what you try to do in
> bio_integrity_prep()... It would certainly be interesting (bug) if you
> could write to the page after calling page_mkclean() without page_mkwrite()
> being called.
Hm... in mpage_da_submit_io I see the following sequence of calls:
1. clear_page_dirty_for_io
2. possibly one of: ext4_bio_write_page or block_write_full_page.
If ext4_bio_write_page,
2a. kmem_cache_alloc
2b. set_page_writeback
Before and after #1, the page is locked but writeback is not set.
Before #2, the page must be locked and writeback must not be set, because both
of those two functions want to set the writeback bit themselves. However,
ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
sleep (I think).
Unfortunately, ext4_page_mkwrite will check for page locked, wait for page
writeback, and then return the page. I think it is theoretically possible for
#1 to trigger a page_mkwrite which completes before #2b, right? In which case
the thread that called mkwrite will think that the page isn't being written
out, and happily scribble on it during writeback. I could be wrong, but it
seems to me that one has to write-protect the page after setting the writeback
bit.
I guess we could call page_mkclean from bio_integrity_prep, though.
--D
Excerpts from Darrick J. Wong's message of 2011-04-25 20:37:38 -0400:
> On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote:
> > On Fri 22-04-11 08:50:01, Chris Mason wrote:
> > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400:
> > > > Hi everyone,
> > > >
> > > > I've finally managed to get together a patch that seems to provide stable pages
> > > > during writeback, or at least gets us to the point that after several days of
> > > > running tests I don't see DIF checksum errors anymore. :)
> > > >
> > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
> > > > walk the process tree to find all userland ptes that map to a particular memory
> > > > page and revoke write access, and
> > >
> > > Hmm, did you need the bio_integrity_prep change for all the filesystems?
> > > This should be happening already as part of using page_mkwrite.
> > Or more precisely page_mkclean() should do what you try to do in
> > bio_integrity_prep()... It would certainly be interesting (bug) if you
> > could write to the page after calling page_mkclean() without page_mkwrite()
> > being called.
>
> Hm... in mpage_da_submit_io I see the following sequence of calls:
>
> 1. clear_page_dirty_for_io
> 2. possibly one of: ext4_bio_write_page or block_write_full_page.
> If ext4_bio_write_page,
> 2a. kmem_cache_alloc
> 2b. set_page_writeback
>
> Before and after #1, the page is locked but writeback is not set.
>
> Before #2, the page must be locked and writeback must not be set, because both
> of those two functions want to set the writeback bit themselves. However,
> ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
> sleep (I think).
Sleeping isn't the problem as long as you sleep with the page locked.
The idea is that writepage will:
1) lock the page
2) clear_page_dirty_for_io (which calls page_mkclean)
3) set_page_writeback()
4) unlock the page
5) start the IO
page_mkwrite will:
1) lock the page
2) wait on page writeback
3) do other stuff
So if ext is calling set_page_writeback() on an unlocked page, that's a
problem. Otherwise it should be working.
-chris
On Mon 25-04-11 17:37:38, Darrick J. Wong wrote:
> On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote:
> > On Fri 22-04-11 08:50:01, Chris Mason wrote:
> > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400:
> > > > Hi everyone,
> > > >
> > > > I've finally managed to get together a patch that seems to provide stable pages
> > > > during writeback, or at least gets us to the point that after several days of
> > > > running tests I don't see DIF checksum errors anymore. :)
> > > >
> > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
> > > > walk the process tree to find all userland ptes that map to a particular memory
> > > > page and revoke write access, and
> > >
> > > Hmm, did you need the bio_integrity_prep change for all the filesystems?
> > > This should be happening already as part of using page_mkwrite.
> > Or more precisely page_mkclean() should do what you try to do in
> > bio_integrity_prep()... It would certainly be interesting (bug) if you
> > could write to the page after calling page_mkclean() without page_mkwrite()
> > being called.
>
> Hm... in mpage_da_submit_io I see the following sequence of calls:
>
> 1. clear_page_dirty_for_io
> 2. possibly one of: ext4_bio_write_page or block_write_full_page.
> If ext4_bio_write_page,
> 2a. kmem_cache_alloc
> 2b. set_page_writeback
>
> Before and after #1, the page is locked but writeback is not set.
>
> Before #2, the page must be locked and writeback must not be set, because both
> of those two functions want to set the writeback bit themselves. However,
> ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
> sleep (I think).
Yes, it can sleep. But the page remains locked until we set page as
writeback in both cases.
> Unfortunately, ext4_page_mkwrite will check for page locked, wait for
> page writeback, and then return the page. I think it is theoretically
> possible for #1 to trigger a page_mkwrite which completes before #2b,
> right?
I'm not sure I understand but once the page is locked by ext4_writepages()
before #1, ext4_page_mkwrite() will block until it can get the page lock -
which can happen only after set_page_writeback() in #2 is done. So then
ext4_page_mkwrite() will block waiting for PageWriteback which gets cleared
after the IO is finished... Or did you mean something else?
> In which case the thread that called mkwrite will think that the
> page isn't being written out, and happily scribble on it during
> writeback. I could be wrong, but it seems to me that one has to
> write-protect the page after setting the writeback bit.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, Apr 26, 2011 at 07:33:18AM -0400, Chris Mason wrote:
> Excerpts from Darrick J. Wong's message of 2011-04-25 20:37:38 -0400:
> > On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote:
> > > On Fri 22-04-11 08:50:01, Chris Mason wrote:
> > > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400:
> > > > > Hi everyone,
> > > > >
> > > > > I've finally managed to get together a patch that seems to provide stable pages
> > > > > during writeback, or at least gets us to the point that after several days of
> > > > > running tests I don't see DIF checksum errors anymore. :)
> > > > >
> > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
> > > > > walk the process tree to find all userland ptes that map to a particular memory
> > > > > page and revoke write access, and
> > > >
> > > > Hmm, did you need the bio_integrity_prep change for all the filesystems?
> > > > This should be happening already as part of using page_mkwrite.
> > > Or more precisely page_mkclean() should do what you try to do in
> > > bio_integrity_prep()... It would certainly be interesting (bug) if you
> > > could write to the page after calling page_mkclean() without page_mkwrite()
> > > being called.
> >
> > Hm... in mpage_da_submit_io I see the following sequence of calls:
> >
> > 1. clear_page_dirty_for_io
> > 2. possibly one of: ext4_bio_write_page or block_write_full_page.
> > If ext4_bio_write_page,
> > 2a. kmem_cache_alloc
> > 2b. set_page_writeback
> >
> > Before and after #1, the page is locked but writeback is not set.
> >
> > Before #2, the page must be locked and writeback must not be set, because both
> > of those two functions want to set the writeback bit themselves. However,
> > ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
> > sleep (I think).
>
> Sleeping isn't the problem as long as you sleep with the page locked.
> The idea is that writepage will:
>
> 1) lock the page
> 2) clear_page_dirty_for_io (which calls page_mkclean)
> 3) set_page_writeback()
> 4) unlock the page
> 5) start the IO
>
> page_mkwrite will:
>
> 1) lock the page
> 2) wait on page writeback
> 3) do other stuff
>
> So if ext is calling set_page_writeback() on an unlocked page, that's a
> problem. Otherwise it should be working.
You're right, at this point in time writepage and page_mkwrite in ext4 both
behave as you describe. I began backing out parts of my patches to
bio-integrity.c and discovered that with the current kernel (2.6.39-rc5) the
only part that seems useful is the set_memory_ro/rw pair from that old
debugging patch. Unfortunately, those two functions only seem to exist on x86;
I suppose I could port them to others. If that's even a sane idea.
--D
>
> -chris
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
On Mon, May 02, 2011 at 06:59:31PM -0700, Darrick J. Wong wrote:
> On Tue, Apr 26, 2011 at 07:33:18AM -0400, Chris Mason wrote:
> > Excerpts from Darrick J. Wong's message of 2011-04-25 20:37:38 -0400:
> > > On Fri, Apr 22, 2011 at 10:34:34PM +0200, Jan Kara wrote:
> > > > On Fri 22-04-11 08:50:01, Chris Mason wrote:
> > > > > Excerpts from Darrick J. Wong's message of 2011-04-21 20:02:26 -0400:
> > > > > > Hi everyone,
> > > > > >
> > > > > > I've finally managed to get together a patch that seems to provide stable pages
> > > > > > during writeback, or at least gets us to the point that after several days of
> > > > > > running tests I don't see DIF checksum errors anymore. :)
> > > > > >
> > > > > > The last two pieces to go into this puzzle were (a) bio_integrity_prep needs to
> > > > > > walk the process tree to find all userland ptes that map to a particular memory
> > > > > > page and revoke write access, and
> > > > >
> > > > > Hmm, did you need the bio_integrity_prep change for all the filesystems?
> > > > > This should be happening already as part of using page_mkwrite.
> > > > Or more precisely page_mkclean() should do what you try to do in
> > > > bio_integrity_prep()... It would certainly be interesting (bug) if you
> > > > could write to the page after calling page_mkclean() without page_mkwrite()
> > > > being called.
> > >
> > > Hm... in mpage_da_submit_io I see the following sequence of calls:
> > >
> > > 1. clear_page_dirty_for_io
> > > 2. possibly one of: ext4_bio_write_page or block_write_full_page.
> > > If ext4_bio_write_page,
> > > 2a. kmem_cache_alloc
> > > 2b. set_page_writeback
> > >
> > > Before and after #1, the page is locked but writeback is not set.
> > >
> > > Before #2, the page must be locked and writeback must not be set, because both
> > > of those two functions want to set the writeback bit themselves. However,
> > > ext4_bio_write_page tries to allocate memory with GFP_NOFS, which means it can
> > > sleep (I think).
> >
> > Sleeping isn't the problem as long as you sleep with the page locked.
> > The idea is that writepage will:
> >
> > 1) lock the page
> > 2) clear_page_dirty_for_io (which calls page_mkclean)
> > 3) set_page_writeback()
> > 4) unlock the page
> > 5) start the IO
> >
> > page_mkwrite will:
> >
> > 1) lock the page
> > 2) wait on page writeback
> > 3) do other stuff
> >
> > So if ext is calling set_page_writeback() on an unlocked page, that's a
> > problem. Otherwise it should be working.
>
> You're right, at this point in time writepage and page_mkwrite in ext4 both
> behave as you describe. I began backing out parts of my patches to
> bio-integrity.c and discovered that with the current kernel (2.6.39-rc5) the
> only part that seems useful is the set_memory_ro/rw pair from that old
> debugging patch. Unfortunately, those two functions only seem to exist on x86;
> I suppose I could port them to others. If that's even a sane idea.
Never mind, I looked around for anything that would result in the kernel trying
to map a page for the purpose of writing, and I think adding a
wait_on_writeback to get_cache_page_for_write will solve that last hole without
having to port set_memory_* to other arches. With that, a whole lot of code
falls out of the patches, which I will post shortly.
--D
>
> --D
> >
> > -chris
> > --
> > To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> > the body of a message to [email protected]
> > More majordomo info at http://vger.kernel.org/majordomo-info.html
> --
> To unsubscribe from this list: send the line "unsubscribe linux-fsdevel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
Hi all,
This is v3 of the stable-page-writes patchset for ext4. A lot of code has been
cut out since v2 of this patch set. For v3, the large hairy function to walk
the page tables of every process is gone since Chris Mason pointed out that
page_mkclean does what I need. The set_memory_* hack is also gone, since (I
think) the only time the kernel maps a file data blocks for writing is in the
buffered IO case. That leaves us with some surgery to ext4_page_mkwrite to
return locked pages and to be careful about re-checking the writeback status
after dropping and re-grabbing the page lock; and a slight modification to the
mm code to wait for page writeback when grabbing pages for buffered writes.
There are also some cleanups for wait_on_page_writeback use in ext4.
I ran my write-after-checksum ("wac") reproducer program to try to create the
DIF checksum errors by madly rewriting the same memory pages. In fact, I tried
the following combinations:
a. 64 write() threads + sync_file_range
b. 64 mmap write threads + msync
c. 32 write() threads + sync_file_range + 32 mmap write threads + msync
d. Same as C, but with all threads in directio mode
e. Same as A, but with all threads in directio mode
f. Same as B, but with all threads in directio mode
After some 44 hours of safety testing across 4 machines, I saw zero errors.
Before the patchset, I could run any of A-F for 10 seconds or less and have a
screen full of errors.
To assess the performance impact of stable page writes, I moved to a disk that
doesn't have DIF support so that I could measure just the impact of waiting for
writeback. I first ran wac with 64 threads madly scribbling on a 64k file and
saw about a 12% performance decrease. I then reran the wac program with 64
threads and a 64MB file and saw about the same performance numbers. I will of
course be testing a wider range of hardware now that I have a functioning patch
set, though as I suspected the patchset only seems to impact workloads that
rewrite the same memory page frequently.
As always, questions and comments are welcome; and thank you to all the
previous reviewers of this patchset!
--D
wait_on_page_writeback already checks the writeback bit, so callers of it
needn't do that test.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/inode.c | 4 +---
fs/ext4/move_extent.c | 3 +--
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..3db34b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2796,9 +2796,7 @@ static int write_cache_pages_da(struct address_space *mapping,
continue;
}
- if (PageWriteback(page))
- wait_on_page_writeback(page);
-
+ wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));
if (mpd->next_page != page->index)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index d5c5783..d1548b1 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -876,8 +876,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
* It needs to call wait_on_page_writeback() to wait for the
* writeback of the page.
*/
- if (PageWriteback(page))
- wait_on_page_writeback(page);
+ wait_on_page_writeback(page);
/* Release old bh and drop refs */
try_to_release_page(page, 0);
In order to stabilize pages during disk writes, ext4_page_mkwrite must wait for
writeback operations to complete before making a page writable. Furthermore,
the function must return locked pages, and recheck the writeback status if the
page lock is ever dropped.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/inode.c | 24 +++++++++++++++++++-----
1 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3db34b2..1d162a2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5809,15 +5809,19 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
goto out_unlock;
}
ret = 0;
- if (PageMappedToDisk(page))
- goto out_unlock;
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+ if (PageMappedToDisk(page)) {
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
+ }
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
@@ -5827,8 +5831,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
- unlock_page(page);
- goto out_unlock;
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
}
}
unlock_page(page);
@@ -5848,6 +5852,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened.
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
When grabbing a page for a buffered IO write, the mm should wait for writeback
on the page to complete so that the page does not become writable during the IO
operation.
Signed-off-by: Darrick J. Wong <[email protected]>
---
mm/filemap.c | 5 ++++-
1 files changed, 4 insertions(+), 1 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..c22675f 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2287,8 +2287,10 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
gfp_notmask = __GFP_FS;
repeat:
page = find_lock_page(mapping, index);
- if (page)
+ if (page) {
+ wait_on_page_writeback(page);
return page;
+ }
page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
if (!page)
@@ -2301,6 +2303,7 @@ repeat:
goto repeat;
return NULL;
}
+ wait_on_page_writeback(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
This seems to miss out on a lot of the generic functionality like
write_cache_pages and block_page_mkwrite and just patch it into
the ext4 copy & paste variants. Please make sure your patches also
work for filesystem that use more of the generic functionality like
xfs or ext2 (the latter one might be fun for the mmap case).
Also what's the status of btrfs? I remembered there was one or two
bits missing despite doing the right thing in most areas.
On Wed, May 04, 2011 at 10:42:27AM -0700, Darrick J. Wong wrote:
> When grabbing a page for a buffered IO write, the mm should wait for writeback
> on the page to complete so that the page does not become writable during the IO
> operation.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> mm/filemap.c | 5 ++++-
> 1 files changed, 4 insertions(+), 1 deletions(-)
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c641edf..c22675f 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2287,8 +2287,10 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
> gfp_notmask = __GFP_FS;
> repeat:
> page = find_lock_page(mapping, index);
> - if (page)
> + if (page) {
> + wait_on_page_writeback(page);
> return page;
> + }
goto found;
>
> page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
> if (!page)
> @@ -2301,6 +2303,7 @@ repeat:
> goto repeat;
> return NULL;
> }
found:
> + wait_on_page_writeback(page);
> return page;
> }
> EXPORT_SYMBOL(grab_cache_page_write_begin);
Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400:
> This seems to miss out on a lot of the generic functionality like
> write_cache_pages and block_page_mkwrite and just patch it into
> the ext4 copy & paste variants. Please make sure your patches also
> work for filesystem that use more of the generic functionality like
> xfs or ext2 (the latter one might be fun for the mmap case).
Probably after the block_commit_write in block_page_mkwrite()
Another question is, do we want to introduce a wait_on_stable_page_writeback()?
This would allow us to add a check against the bdi requesting stable
pages.
>
> Also what's the status of btrfs? I remembered there was one or two
> bits missing despite doing the right thing in most areas.
As far as I know btrfs is getting it right. The only bit missing is the
one Nick Piggin pointed out where it is possible to change mmap'd O_DIRECT
memory in flight while a DIO is in progress. Josef has a test case that
demonstrates this.
Nick had a plan to fix it, but it involved redoing the get_user_pages
api.
-chris
On Wed, May 04, 2011 at 03:21:55PM -0400, Chris Mason wrote:
> Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400:
> > This seems to miss out on a lot of the generic functionality like
> > write_cache_pages and block_page_mkwrite and just patch it into
> > the ext4 copy & paste variants. Please make sure your patches also
> > work for filesystem that use more of the generic functionality like
> > xfs or ext2 (the latter one might be fun for the mmap case).
>
> Probably after the block_commit_write in block_page_mkwrite()
Yes, I'm working on providing more generic fixes for ext3 & friends too, but
they're not really working yet, so I was posting the parts that fix ext4, since
they seem usable.
> Another question is, do we want to introduce a wait_on_stable_page_writeback()?
>
> This would allow us to add a check against the bdi requesting stable
> pages.
Sounds like a good idea.
> > Also what's the status of btrfs? I remembered there was one or two
> > bits missing despite doing the right thing in most areas.
>
> As far as I know btrfs is getting it right. The only bit missing is the
> one Nick Piggin pointed out where it is possible to change mmap'd O_DIRECT
> memory in flight while a DIO is in progress. Josef has a test case that
> demonstrates this.
>
> Nick had a plan to fix it, but it involved redoing the get_user_pages
> api.
I ran the same six tests A-F on btrfs and it reported -ENOSPC with 1% of the
space used, though until it did that I didn't see any checksum errors.
--D
On Wed, May 04, 2011 at 03:21:55PM -0400, Chris Mason wrote:
> Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400:
> > This seems to miss out on a lot of the generic functionality like
> > write_cache_pages and block_page_mkwrite and just patch it into
> > the ext4 copy & paste variants. Please make sure your patches also
> > work for filesystem that use more of the generic functionality like
> > xfs or ext2 (the latter one might be fun for the mmap case).
>
> Probably after the block_commit_write in block_page_mkwrite()
> Another question is, do we want to introduce a wait_on_stable_page_writeback()?
Something like this here? It fixes block_page_mkwrite users and sticks in a
simple page_mkwrite for fses that don't provide one at all. From a quick wac
run it seems to make xfs work. ext2 seems to have some issues with modifying a
buffer_head's bh_data without locking the bh during the update, so I guess it
needs some review.
--D
--
fs: Modify/provide generic writepage/page_mkwrite functions to wait for writeback
Modify the generic writepage function, and add an empty page_mkwrite function,
to wait for page writeback to finish before allowing writes. This is so that
simple filesystems have stable pages during write operations.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/buffer.c | 1 +
mm/filemap.c | 10 ++++++++++
2 files changed, 11 insertions(+), 0 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..cf9a795 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
if (!ret)
ret = block_commit_write(page, 0, end);
+ wait_on_page_writeback(page);
if (unlikely(ret)) {
unlock_page(page);
if (ret == -ENOMEM)
diff --git a/mm/filemap.c b/mm/filemap.c
index c22675f..9cb4e51 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1713,8 +1713,18 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);
+static int empty_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+ return VM_FAULT_LOCKED;
+}
+
const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .page_mkwrite = empty_page_mkwrite,
};
/* This is used for a general mmap of a disk file */
--
Here's the beginning of a patch against ext2 also. It fixes most of the
checksum errors when metadata writes are in progress, though I suspect there
are more spots that I haven't caught yet.
--
ext2: Lock buffer_heads during metadata update
ext2 does not protect buffer head that represent metadata against modifications
during write operations. This is a problem if the memory buffer needs to be
stable during writes; I think there are a few more spots in ext2 that need this
treatment. :)
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext2/inode.c | 3 +++
1 files changed, 3 insertions(+), 0 deletions(-)
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 788e09a..5314b0b 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -1426,6 +1426,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
if (IS_ERR(raw_inode))
return -EIO;
+ lock_buffer(bh);
/* For fields not not tracking in the in-memory inode,
* initialise them to zero for new inodes. */
if (ei->i_state & EXT2_STATE_NEW)
@@ -1502,6 +1503,8 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
}
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
raw_inode->i_block[n] = ei->i_data[n];
+ unlock_buffer(bh);
+
mark_buffer_dirty(bh);
if (do_sync) {
sync_dirty_buffer(bh);
Hello,
On Wed 04-05-11 16:57:06, Darrick J. Wong wrote:
> On Wed, May 04, 2011 at 03:21:55PM -0400, Chris Mason wrote:
> > Excerpts from Christoph Hellwig's message of 2011-05-04 14:46:44 -0400:
> > > This seems to miss out on a lot of the generic functionality like
> > > write_cache_pages and block_page_mkwrite and just patch it into
> > > the ext4 copy & paste variants. Please make sure your patches also
> > > work for filesystem that use more of the generic functionality like
> > > xfs or ext2 (the latter one might be fun for the mmap case).
> >
> > Probably after the block_commit_write in block_page_mkwrite()
> > Another question is, do we want to introduce a wait_on_stable_page_writeback()?
>
> Something like this here? It fixes block_page_mkwrite users and sticks in a
> simple page_mkwrite for fses that don't provide one at all. From a quick wac
> run it seems to make xfs work. ext2 seems to have some issues with modifying a
> buffer_head's bh_data without locking the bh during the update, so I guess it
> needs some review.
Yes, ext2 is rather difficult because of all the metadata updates to
buffers happening. That would need a serious work I suspect.
> fs: Modify/provide generic writepage/page_mkwrite functions to wait for writeback
>
> Modify the generic writepage function, and add an empty page_mkwrite function,
> to wait for page writeback to finish before allowing writes. This is so that
> simple filesystems have stable pages during write operations.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
>
> fs/buffer.c | 1 +
> mm/filemap.c | 10 ++++++++++
> 2 files changed, 11 insertions(+), 0 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a08bb8e..cf9a795 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> if (!ret)
> ret = block_commit_write(page, 0, end);
>
> + wait_on_page_writeback(page);
> if (unlikely(ret)) {
> unlock_page(page);
> if (ret == -ENOMEM)
> diff --git a/mm/filemap.c b/mm/filemap.c
> index c22675f..9cb4e51 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -1713,8 +1713,18 @@ page_not_uptodate:
> }
> EXPORT_SYMBOL(filemap_fault);
>
> +static int empty_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
> +{
> + struct page *page = vmf->page;
> +
> + lock_page(page);
> + wait_on_page_writeback(page);
> + return VM_FAULT_LOCKED;
> +}
> +
I guess you miss the whether the page has been truncated here (in which
case you should return VM_FAULT_NOPAGE).
> const struct vm_operations_struct generic_file_vm_ops = {
> .fault = filemap_fault,
> + .page_mkwrite = empty_page_mkwrite,
> };
>
> /* This is used for a general mmap of a disk file */
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR