Hi all,
This is v3.1 of the stable-page-writes patchset for ext4/3/2, xfs, and fat.
The purpose of this patchset is to prohibit processes from writing on memory
pages that are currently being written to disk because certain storage setups
(e.g. SCSI disks with DIF integrity checksums) will fail a write if the page
contents don't match the checksum. btrfs already guarantees page stability, so
it does not use these changes.
The technique used is fairly simple -- whenever a page is about to become
writable (either because of a write fault to a mapped page, or a buffered write
is in progress), wait for the page writeback flag to be clear, indicating that
the page is not being written to disk. This means that it is necessary (1) to
add wait for writeback code to grab_cache_page_write_begin to take care of
buffered writes, and (2) all filesystems must have a page_mkwrite that locks a
page, waits for writeback, and returns the locked page. For filesystems that
piggyback on the generic block_page_mkwrite, the patchset adds the writeback
wait to that function; for filesystems that do not use the page_mkwrite hook at
all, the patchset provides a stub page_mkwrite.
I ran my write-after-checksum ("wac") reproducer program to try to create the
DIF checksum errors by madly rewriting the same memory pages. In fact, I tried
the following combinations against ext2/3/4, xfs, btrfs, and vfat:
a. 64 write() threads + sync_file_range
b. 64 mmap write threads + msync
c. 32 write() threads + sync_file_range + 32 mmap write threads + msync
d. Same as C, but with all threads in directio mode
e. Same as A, but with all threads in directio mode
f. Same as B, but with all threads in directio mode
After running profiles A-F for 30 minutes each on 6 different machines, ext2,
ext4, xfs, and vfat reported no errors. ext3 still has a lingering failure
case (which I will touch on briefly later) and btrfs eventually reports -ENOSPC
and fails the test, though it does that even without any of the patches applied.
To assess the performance impact of stable page writes, I moved to a disk that
doesn't have DIF support so that I could measure just the impact of waiting for
writeback. I first ran wac with 64 threads madly scribbling on a 64k file and
saw about a 12 percent performance decrease. I then reran the wac program with
64 threads and a 64MB file and saw about the same performance numbers. As I
suspected, the patchset only seems to impact workloads that rewrite the same
memory page frequently.
I am still chasing down what exactly is broken in ext3. data=writeback mode
passes with no failures. data=ordered, however, does not pass; my current
suspicion is that jbd is calling submit_bh on data buffers but doesn't call
page_mkclean to kick the userspace programs off the page before writing it.
Per various comments regarding v3 of this patchset, I've integrated his
suggestions, reworked the patch descriptions to make it clearer which ones
touch all the filesystems and which ones are to fix remaining holes in specific
filesystems, and expanded the scope of filesystems that got fixed.
As always, questions and comments are welcome; and thank you to all the
previous reviewers of this patchset. I am also soliciting people's opinions on
whether or not these patches could go upstream for .40.
--D
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
For filesystems such as nilfs2 and xfs that use block_page_mkwrite, modify that
function to wait for pending writeback before allowing the page to become
writable. This is needed to stabilize pages during writeback for those two
filesystems.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/buffer.c | 1 +
1 files changed, 1 insertions(+), 0 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..cf9a795 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
if (!ret)
ret = block_commit_write(page, 0, end);
+ wait_on_page_writeback(page);
if (unlikely(ret)) {
unlock_page(page);
if (ret == -ENOMEM)
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
When grabbing a page for a buffered IO write, the mm should wait for writeback
on the page to complete so that the page does not become writable during the IO
operation. This change is needed to provide page stability during writes for
all filesystems.
Signed-off-by: Darrick J. Wong <[email protected]>
---
mm/filemap.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index c641edf..fd0e7f2 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2288,7 +2288,7 @@ struct page *grab_cache_page_write_begin(struct address_space *mapping,
repeat:
page = find_lock_page(mapping, index);
if (page)
- return page;
+ goto found;
page = __page_cache_alloc(mapping_gfp_mask(mapping) & ~gfp_notmask);
if (!page)
@@ -2301,6 +2301,8 @@ repeat:
goto repeat;
return NULL;
}
+found:
+ wait_on_page_writeback(page);
return page;
}
EXPORT_SYMBOL(grab_cache_page_write_begin);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
wait_on_page_writeback already checks the writeback bit, so callers of it
needn't do that test.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/inode.c | 4 +---
fs/ext4/move_extent.c | 3 +--
2 files changed, 2 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index f2fa5e8..3db34b2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -2796,9 +2796,7 @@ static int write_cache_pages_da(struct address_space *mapping,
continue;
}
- if (PageWriteback(page))
- wait_on_page_writeback(page);
-
+ wait_on_page_writeback(page);
BUG_ON(PageWriteback(page));
if (mpd->next_page != page->index)
diff --git a/fs/ext4/move_extent.c b/fs/ext4/move_extent.c
index b9f3e78..2b8304b 100644
--- a/fs/ext4/move_extent.c
+++ b/fs/ext4/move_extent.c
@@ -876,8 +876,7 @@ move_extent_per_page(struct file *o_filp, struct inode *donor_inode,
* It needs to call wait_on_page_writeback() to wait for the
* writeback of the page.
*/
- if (PageWriteback(page))
- wait_on_page_writeback(page);
+ wait_on_page_writeback(page);
/* Release old bh and drop refs */
try_to_release_page(page, 0);
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
ext2 does not protect memory pages containing metadata against writes during
disk write operations. To stabilize the page during a write, lock the
buffer_head while updating metadata pages.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext2/balloc.c | 6 ++++++
fs/ext2/ialloc.c | 7 +++++++
fs/ext2/inode.c | 7 +++++++
fs/ext2/super.c | 2 ++
fs/ext2/xattr.c | 2 ++
5 files changed, 24 insertions(+), 0 deletions(-)
diff --git a/fs/ext2/balloc.c b/fs/ext2/balloc.c
index 8f44cef..50d7d4c 100644
--- a/fs/ext2/balloc.c
+++ b/fs/ext2/balloc.c
@@ -176,10 +176,12 @@ static void group_adjust_blocks(struct super_block *sb, int group_no,
struct ext2_sb_info *sbi = EXT2_SB(sb);
unsigned free_blocks;
+ lock_buffer(bh);
spin_lock(sb_bgl_lock(sbi, group_no));
free_blocks = le16_to_cpu(desc->bg_free_blocks_count);
desc->bg_free_blocks_count = cpu_to_le16(free_blocks + count);
spin_unlock(sb_bgl_lock(sbi, group_no));
+ unlock_buffer(bh);
sb->s_dirt = 1;
mark_buffer_dirty(bh);
}
@@ -546,6 +548,7 @@ do_more:
goto error_return;
}
+ lock_buffer(bitmap_bh);
for (i = 0, group_freed = 0; i < count; i++) {
if (!ext2_clear_bit_atomic(sb_bgl_lock(sbi, block_group),
bit + i, bitmap_bh->b_data)) {
@@ -555,6 +558,7 @@ do_more:
group_freed++;
}
}
+ unlock_buffer(bitmap_bh);
mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & MS_SYNCHRONOUS)
@@ -1351,8 +1355,10 @@ retry_alloc:
/*
* try to allocate block(s) from this group, without a goal(-1).
*/
+ lock_buffer(bitmap_bh);
grp_alloc_blk = ext2_try_to_allocate_with_rsv(sb, group_no,
bitmap_bh, -1, my_rsv, &num);
+ unlock_buffer(bitmap_bh);
if (grp_alloc_blk >= 0)
goto allocated;
}
diff --git a/fs/ext2/ialloc.c b/fs/ext2/ialloc.c
index ee9ed31..3bf624f 100644
--- a/fs/ext2/ialloc.c
+++ b/fs/ext2/ialloc.c
@@ -74,11 +74,13 @@ static void ext2_release_inode(struct super_block *sb, int group, int dir)
return;
}
+ lock_buffer(bh);
spin_lock(sb_bgl_lock(EXT2_SB(sb), group));
le16_add_cpu(&desc->bg_free_inodes_count, 1);
if (dir)
le16_add_cpu(&desc->bg_used_dirs_count, -1);
spin_unlock(sb_bgl_lock(EXT2_SB(sb), group));
+ unlock_buffer(bh);
if (dir)
percpu_counter_dec(&EXT2_SB(sb)->s_dirs_counter);
sb->s_dirt = 1;
@@ -139,12 +141,14 @@ void ext2_free_inode (struct inode * inode)
return;
/* Ok, now we can actually update the inode bitmaps.. */
+ lock_buffer(bitmap_bh);
if (!ext2_clear_bit_atomic(sb_bgl_lock(EXT2_SB(sb), block_group),
bit, (void *) bitmap_bh->b_data))
ext2_error (sb, "ext2_free_inode",
"bit already cleared for inode %lu", ino);
else
ext2_release_inode(sb, block_group, is_directory);
+ unlock_buffer(bitmap_bh);
mark_buffer_dirty(bitmap_bh);
if (sb->s_flags & MS_SYNCHRONOUS)
sync_dirty_buffer(bitmap_bh);
@@ -491,8 +495,10 @@ repeat_in_this_group:
group = 0;
continue;
}
+ lock_buffer(bitmap_bh);
if (ext2_set_bit_atomic(sb_bgl_lock(sbi, group),
ino, bitmap_bh->b_data)) {
+ unlock_buffer(bitmap_bh);
/* we lost this inode */
if (++ino >= EXT2_INODES_PER_GROUP(sb)) {
/* this group is exhausted, try next group */
@@ -503,6 +509,7 @@ repeat_in_this_group:
/* try to find free inode in the same group */
goto repeat_in_this_group;
}
+ unlock_buffer(bitmap_bh);
goto got;
}
diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 788e09a..f102b82 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -557,11 +557,15 @@ static void ext2_splice_branch(struct inode *inode,
* Update the host buffer_head or inode to point to more just allocated
* direct blocks blocks
*/
+ if (where->bh)
+ lock_buffer(where->bh);
if (num == 0 && blks > 1) {
current_block = le32_to_cpu(where->key) + 1;
for (i = 1; i < blks; i++)
*(where->p + i ) = cpu_to_le32(current_block++);
}
+ if (where->bh)
+ unlock_buffer(where->bh);
/*
* update the most recently allocated logical & physical block
@@ -1426,6 +1430,7 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
if (IS_ERR(raw_inode))
return -EIO;
+ lock_buffer(bh);
/* For fields not not tracking in the in-memory inode,
* initialise them to zero for new inodes. */
if (ei->i_state & EXT2_STATE_NEW)
@@ -1502,6 +1507,8 @@ static int __ext2_write_inode(struct inode *inode, int do_sync)
}
} else for (n = 0; n < EXT2_N_BLOCKS; n++)
raw_inode->i_block[n] = ei->i_data[n];
+ unlock_buffer(bh);
+
mark_buffer_dirty(bh);
if (do_sync) {
sync_dirty_buffer(bh);
diff --git a/fs/ext2/super.c b/fs/ext2/super.c
index 0a78dae..be2c9ca 100644
--- a/fs/ext2/super.c
+++ b/fs/ext2/super.c
@@ -1152,12 +1152,14 @@ static void ext2_sync_super(struct super_block *sb, struct ext2_super_block *es,
int wait)
{
ext2_clear_super_error(sb);
+ lock_buffer(EXT2_SB(sb)->s_sbh);
spin_lock(&EXT2_SB(sb)->s_lock);
es->s_free_blocks_count = cpu_to_le32(ext2_count_free_blocks(sb));
es->s_free_inodes_count = cpu_to_le32(ext2_count_free_inodes(sb));
es->s_wtime = cpu_to_le32(get_seconds());
/* unlock before we do IO */
spin_unlock(&EXT2_SB(sb)->s_lock);
+ unlock_buffer(EXT2_SB(sb)->s_sbh);
mark_buffer_dirty(EXT2_SB(sb)->s_sbh);
if (wait)
sync_dirty_buffer(EXT2_SB(sb)->s_sbh);
diff --git a/fs/ext2/xattr.c b/fs/ext2/xattr.c
index 5299706..2e0652d 100644
--- a/fs/ext2/xattr.c
+++ b/fs/ext2/xattr.c
@@ -337,9 +337,11 @@ static void ext2_xattr_update_super_block(struct super_block *sb)
if (EXT2_HAS_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_EXT_ATTR))
return;
+ lock_buffer(EXT2_SB(sb)->s_sbh);
spin_lock(&EXT2_SB(sb)->s_lock);
EXT2_SET_COMPAT_FEATURE(sb, EXT2_FEATURE_COMPAT_EXT_ATTR);
spin_unlock(&EXT2_SB(sb)->s_lock);
+ unlock_buffer(EXT2_SB(sb)->s_sbh);
sb->s_dirt = 1;
mark_buffer_dirty(EXT2_SB(sb)->s_sbh);
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
In order to stabilize page writes during writeback operations, it is necessary
to enforce a rule that writes to memory pages containing metadata cannot happen
at the same time that the page is being written to disk. To provide this, lock
the buffer_head representing a piece of metadata while updating memory.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/fat/dir.c | 14 ++++++++++++++
fs/fat/fatent.c | 12 ++++++++++++
fs/fat/inode.c | 2 ++
fs/fat/misc.c | 2 ++
fs/fat/namei_msdos.c | 4 ++++
fs/fat/namei_vfat.c | 4 ++++
6 files changed, 38 insertions(+), 0 deletions(-)
diff --git a/fs/fat/dir.c b/fs/fat/dir.c
index ee42b9e..9efea13 100644
--- a/fs/fat/dir.c
+++ b/fs/fat/dir.c
@@ -958,11 +958,13 @@ static int __fat_remove_entries(struct inode *dir, loff_t pos, int nr_slots)
orig_slots = nr_slots;
endp = (struct msdos_dir_entry *)(bh->b_data + sb->s_blocksize);
+ lock_buffer(bh);
while (nr_slots && de < endp) {
de->name[0] = DELETED_FLAG;
de++;
nr_slots--;
}
+ unlock_buffer(bh);
mark_buffer_dirty_inode(bh, dir);
if (IS_DIRSYNC(dir))
err = sync_dirty_buffer(bh);
@@ -992,11 +994,13 @@ int fat_remove_entries(struct inode *dir, struct fat_slot_info *sinfo)
sinfo->de = NULL;
bh = sinfo->bh;
sinfo->bh = NULL;
+ lock_buffer(bh);
while (nr_slots && de >= (struct msdos_dir_entry *)bh->b_data) {
de->name[0] = DELETED_FLAG;
de--;
nr_slots--;
}
+ unlock_buffer(bh);
mark_buffer_dirty_inode(bh, dir);
if (IS_DIRSYNC(dir))
err = sync_dirty_buffer(bh);
@@ -1045,7 +1049,9 @@ static int fat_zeroed_cluster(struct inode *dir, sector_t blknr, int nr_used,
err = -ENOMEM;
goto error;
}
+ lock_buffer(bhs[n]);
memset(bhs[n]->b_data, 0, sb->s_blocksize);
+ unlock_buffer(bhs[n]);
set_buffer_uptodate(bhs[n]);
mark_buffer_dirty_inode(bhs[n], dir);
@@ -1103,6 +1109,7 @@ int fat_alloc_new_dir(struct inode *dir, struct timespec *ts)
fat_time_unix2fat(sbi, ts, &time, &date, &time_cs);
de = (struct msdos_dir_entry *)bhs[0]->b_data;
+ lock_buffer(bhs[0]);
/* filling the new directory slots ("." and ".." entries) */
memcpy(de[0].name, MSDOS_DOT, MSDOS_NAME);
memcpy(de[1].name, MSDOS_DOTDOT, MSDOS_NAME);
@@ -1126,6 +1133,7 @@ int fat_alloc_new_dir(struct inode *dir, struct timespec *ts)
de[1].starthi = cpu_to_le16(MSDOS_I(dir)->i_logstart >> 16);
de[0].size = de[1].size = 0;
memset(de + 2, 0, sb->s_blocksize - 2 * sizeof(*de));
+ unlock_buffer(bhs[0]);
set_buffer_uptodate(bhs[0]);
mark_buffer_dirty_inode(bhs[0], dir);
@@ -1185,7 +1193,9 @@ static int fat_add_new_entries(struct inode *dir, void *slots, int nr_slots,
/* fill the directory entry */
copy = min(size, sb->s_blocksize);
+ lock_buffer(bhs[n]);
memcpy(bhs[n]->b_data, slots, copy);
+ unlock_buffer(bhs[n]);
slots += copy;
size -= copy;
set_buffer_uptodate(bhs[n]);
@@ -1288,7 +1298,9 @@ found:
/* Fill the long name slots. */
for (i = 0; i < long_bhs; i++) {
int copy = min_t(int, sb->s_blocksize - offset, size);
+ lock_buffer(bhs[i]);
memcpy(bhs[i]->b_data + offset, slots, copy);
+ unlock_buffer(bhs[i]);
mark_buffer_dirty_inode(bhs[i], dir);
offset = 0;
slots += copy;
@@ -1299,7 +1311,9 @@ found:
if (!err && i < nr_bhs) {
/* Fill the short name slot. */
int copy = min_t(int, sb->s_blocksize - offset, size);
+ lock_buffer(bhs[i]);
memcpy(bhs[i]->b_data + offset, slots, copy);
+ unlock_buffer(bhs[i]);
mark_buffer_dirty_inode(bhs[i], dir);
if (IS_DIRSYNC(dir))
err = sync_dirty_buffer(bhs[i]);
diff --git a/fs/fat/fatent.c b/fs/fat/fatent.c
index b47d2c9..e49a9dd 100644
--- a/fs/fat/fatent.c
+++ b/fs/fat/fatent.c
@@ -160,6 +160,9 @@ static void fat12_ent_put(struct fat_entry *fatent, int new)
if (new == FAT_ENT_EOF)
new = EOF_FAT12;
+ lock_buffer(fatent->bhs[0]);
+ if (fatent->nr_bhs == 2)
+ lock_buffer(fatent->bhs[1]);
spin_lock(&fat12_entry_lock);
if (fatent->entry & 1) {
*ent12_p[0] = (new << 4) | (*ent12_p[0] & 0x0f);
@@ -169,6 +172,9 @@ static void fat12_ent_put(struct fat_entry *fatent, int new)
*ent12_p[1] = (*ent12_p[1] & 0xf0) | (new >> 8);
}
spin_unlock(&fat12_entry_lock);
+ if (fatent->nr_bhs == 2)
+ unlock_buffer(fatent->bhs[1]);
+ unlock_buffer(fatent->bhs[0]);
mark_buffer_dirty_inode(fatent->bhs[0], fatent->fat_inode);
if (fatent->nr_bhs == 2)
@@ -180,7 +186,9 @@ static void fat16_ent_put(struct fat_entry *fatent, int new)
if (new == FAT_ENT_EOF)
new = EOF_FAT16;
+ lock_buffer(fatent->bhs[0]);
*fatent->u.ent16_p = cpu_to_le16(new);
+ unlock_buffer(fatent->bhs[0]);
mark_buffer_dirty_inode(fatent->bhs[0], fatent->fat_inode);
}
@@ -191,7 +199,9 @@ static void fat32_ent_put(struct fat_entry *fatent, int new)
WARN_ON(new & 0xf0000000);
new |= le32_to_cpu(*fatent->u.ent32_p) & ~0x0fffffff;
+ lock_buffer(fatent->bhs[0]);
*fatent->u.ent32_p = cpu_to_le32(new);
+ unlock_buffer(fatent->bhs[0]);
mark_buffer_dirty_inode(fatent->bhs[0], fatent->fat_inode);
}
@@ -382,7 +392,9 @@ static int fat_mirror_bhs(struct super_block *sb, struct buffer_head **bhs,
err = -ENOMEM;
goto error;
}
+ lock_buffer(c_bh);
memcpy(c_bh->b_data, bhs[n]->b_data, sb->s_blocksize);
+ unlock_buffer(c_bh);
set_buffer_uptodate(c_bh);
mark_buffer_dirty_inode(c_bh, sbi->fat_inode);
if (sb->s_flags & MS_SYNCHRONOUS)
diff --git a/fs/fat/inode.c b/fs/fat/inode.c
index 8d68690..96da554 100644
--- a/fs/fat/inode.c
+++ b/fs/fat/inode.c
@@ -623,6 +623,7 @@ retry:
"for updating (i_pos %lld)\n", i_pos);
return -EIO;
}
+ lock_buffer(bh);
spin_lock(&sbi->inode_hash_lock);
if (i_pos != MSDOS_I(inode)->i_pos) {
spin_unlock(&sbi->inode_hash_lock);
@@ -649,6 +650,7 @@ retry:
&raw_entry->adate, NULL);
}
spin_unlock(&sbi->inode_hash_lock);
+ unlock_buffer(bh);
mark_buffer_dirty(bh);
err = 0;
if (wait)
diff --git a/fs/fat/misc.c b/fs/fat/misc.c
index 970e682..3386d81 100644
--- a/fs/fat/misc.c
+++ b/fs/fat/misc.c
@@ -70,10 +70,12 @@ int fat_clusters_flush(struct super_block *sb)
le32_to_cpu(fsinfo->signature2),
sbi->fsinfo_sector);
} else {
+ lock_buffer(bh);
if (sbi->free_clusters != -1)
fsinfo->free_clusters = cpu_to_le32(sbi->free_clusters);
if (sbi->prev_free != -1)
fsinfo->next_cluster = cpu_to_le32(sbi->prev_free);
+ unlock_buffer(bh);
mark_buffer_dirty(bh);
}
brelse(bh);
diff --git a/fs/fat/namei_msdos.c b/fs/fat/namei_msdos.c
index 7114990..cb53c8f 100644
--- a/fs/fat/namei_msdos.c
+++ b/fs/fat/namei_msdos.c
@@ -540,8 +540,10 @@ static int do_msdos_rename(struct inode *old_dir, unsigned char *old_name,
if (update_dotdot) {
int start = MSDOS_I(new_dir)->i_logstart;
+ lock_buffer(dotdot_bh);
dotdot_de->start = cpu_to_le16(start);
dotdot_de->starthi = cpu_to_le16(start >> 16);
+ unlock_buffer(dotdot_bh);
mark_buffer_dirty_inode(dotdot_bh, old_inode);
if (IS_DIRSYNC(new_dir)) {
err = sync_dirty_buffer(dotdot_bh);
@@ -582,8 +584,10 @@ error_dotdot:
if (update_dotdot) {
int start = MSDOS_I(old_dir)->i_logstart;
+ lock_buffer(dotdot_bh);
dotdot_de->start = cpu_to_le16(start);
dotdot_de->starthi = cpu_to_le16(start >> 16);
+ unlock_buffer(dotdot_bh);
mark_buffer_dirty_inode(dotdot_bh, old_inode);
corrupt |= sync_dirty_buffer(dotdot_bh);
}
diff --git a/fs/fat/namei_vfat.c b/fs/fat/namei_vfat.c
index adae3fb..f7e43f4 100644
--- a/fs/fat/namei_vfat.c
+++ b/fs/fat/namei_vfat.c
@@ -978,8 +978,10 @@ static int vfat_rename(struct inode *old_dir, struct dentry *old_dentry,
if (update_dotdot) {
int start = MSDOS_I(new_dir)->i_logstart;
+ lock_buffer(dotdot_bh);
dotdot_de->start = cpu_to_le16(start);
dotdot_de->starthi = cpu_to_le16(start >> 16);
+ unlock_buffer(dotdot_bh);
mark_buffer_dirty_inode(dotdot_bh, old_inode);
if (IS_DIRSYNC(new_dir)) {
err = sync_dirty_buffer(dotdot_bh);
@@ -1022,8 +1024,10 @@ error_dotdot:
if (update_dotdot) {
int start = MSDOS_I(old_dir)->i_logstart;
+ lock_buffer(dotdot_bh);
dotdot_de->start = cpu_to_le16(start);
dotdot_de->starthi = cpu_to_le16(start >> 16);
+ unlock_buffer(dotdot_bh);
mark_buffer_dirty_inode(dotdot_bh, old_inode);
corrupt |= sync_dirty_buffer(dotdot_bh);
}
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
For filesystems that do not provide any page_mkwrite handler, provide a stub
page_mkwrite function that locks the page and waits for pending writeback to
complete. This is needed to stabilize pages during writes for a large variety
of filesystem drivers (ext2, ext3, vfat, hfs...).
Signed-off-by: Darrick J. Wong <[email protected]>
---
mm/filemap.c | 19 +++++++++++++++++++
1 files changed, 19 insertions(+), 0 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index fd0e7f2..1e096a0 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -1713,8 +1713,27 @@ page_not_uptodate:
}
EXPORT_SYMBOL(filemap_fault);
+static int empty_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
+{
+ struct page *page = vmf->page;
+ struct inode *inode = vma->vm_file->f_path.dentry->d_inode;
+ loff_t size;
+
+ lock_page(page);
+ size = i_size_read(inode);
+ if ((page->mapping != inode->i_mapping) ||
+ (page_offset(page) > size)) {
+ /* page got truncated out from underneath us */
+ unlock_page(page);
+ return VM_FAULT_NOPAGE;
+ }
+ wait_on_page_writeback(page);
+ return VM_FAULT_LOCKED;
+}
+
const struct vm_operations_struct generic_file_vm_ops = {
.fault = filemap_fault,
+ .page_mkwrite = empty_page_mkwrite,
};
/* This is used for a general mmap of a disk file */
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
In order to stabilize pages during disk writes, ext4_page_mkwrite must wait for
writeback operations to complete before making a page writable. Furthermore,
the function must return locked pages, and recheck the writeback status if the
page lock is ever dropped. The "someone could wander in" part of this patch
was suggested by Chris Mason.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/ext4/inode.c | 24 +++++++++++++++++++-----
1 files changed, 19 insertions(+), 5 deletions(-)
diff --git a/fs/ext4/inode.c b/fs/ext4/inode.c
index 3db34b2..1d162a2 100644
--- a/fs/ext4/inode.c
+++ b/fs/ext4/inode.c
@@ -5809,15 +5809,19 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
goto out_unlock;
}
ret = 0;
- if (PageMappedToDisk(page))
- goto out_unlock;
+
+ lock_page(page);
+ wait_on_page_writeback(page);
+ if (PageMappedToDisk(page)) {
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
+ }
if (page->index == size >> PAGE_CACHE_SHIFT)
len = size & ~PAGE_CACHE_MASK;
else
len = PAGE_CACHE_SIZE;
- lock_page(page);
/*
* return if we have all the buffers mapped. This avoid
* the need to call write_begin/write_end which does a
@@ -5827,8 +5831,8 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (page_has_buffers(page)) {
if (!walk_page_buffers(NULL, page_buffers(page), 0, len, NULL,
ext4_bh_unmapped)) {
- unlock_page(page);
- goto out_unlock;
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
}
}
unlock_page(page);
@@ -5848,6 +5852,16 @@ int ext4_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf)
if (ret < 0)
goto out_unlock;
ret = 0;
+
+ /*
+ * write_begin/end might have created a dirty page and someone
+ * could wander in and start the IO. Make sure that hasn't
+ * happened.
+ */
+ lock_page(page);
+ wait_on_page_writeback(page);
+ up_read(&inode->i_alloc_sem);
+ return VM_FAULT_LOCKED;
out_unlock:
if (ret)
ret = VM_FAULT_SIGBUS;
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon, May 09, 2011 at 04:03:18PM -0700, Darrick J. Wong wrote:
> Hi all,
>
> This is v3.1 of the stable-page-writes patchset for ext4/3/2, xfs, and fat.
> The purpose of this patchset is to prohibit processes from writing on memory
> pages that are currently being written to disk because certain storage setups
> (e.g. SCSI disks with DIF integrity checksums) will fail a write if the page
> contents don't match the checksum. btrfs already guarantees page stability, so
> it does not use these changes.
>
> The technique used is fairly simple -- whenever a page is about to become
> writable (either because of a write fault to a mapped page, or a buffered write
> is in progress), wait for the page writeback flag to be clear, indicating that
> the page is not being written to disk. This means that it is necessary (1) to
> add wait for writeback code to grab_cache_page_write_begin to take care of
> buffered writes, and (2) all filesystems must have a page_mkwrite that locks a
> page, waits for writeback, and returns the locked page. For filesystems that
> piggyback on the generic block_page_mkwrite, the patchset adds the writeback
> wait to that function; for filesystems that do not use the page_mkwrite hook at
> all, the patchset provides a stub page_mkwrite.
>
> I ran my write-after-checksum ("wac") reproducer program to try to create the
> DIF checksum errors by madly rewriting the same memory pages. In fact, I tried
> the following combinations against ext2/3/4, xfs, btrfs, and vfat:
>
> a. 64 write() threads + sync_file_range
> b. 64 mmap write threads + msync
> c. 32 write() threads + sync_file_range + 32 mmap write threads + msync
> d. Same as C, but with all threads in directio mode
> e. Same as A, but with all threads in directio mode
> f. Same as B, but with all threads in directio mode
>
> After running profiles A-F for 30 minutes each on 6 different machines, ext2,
> ext4, xfs, and vfat reported no errors. ext3 still has a lingering failure
> case (which I will touch on briefly later) and btrfs eventually reports -ENOSPC
> and fails the test, though it does that even without any of the patches applied.
Can you turn this into an new test for xfstests? That was the test
is published and can be run by anyone and (hopefully) prevent
regressions from being introduced...
Cheers,
Dave.
--
Dave Chinner
[email protected]
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
"Darrick J. Wong" <[email protected]> writes:
> To assess the performance impact of stable page writes, I moved to a disk that
> doesn't have DIF support so that I could measure just the impact of waiting for
> writeback. I first ran wac with 64 threads madly scribbling on a 64k file and
> saw about a 12 percent performance decrease. I then reran the wac program with
> 64 threads and a 64MB file and saw about the same performance numbers. As I
> suspected, the patchset only seems to impact workloads that rewrite the same
> memory page frequently.
>
> I am still chasing down what exactly is broken in ext3. data=writeback mode
> passes with no failures. data=ordered, however, does not pass; my current
> suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> page_mkclean to kick the userspace programs off the page before writing it.
>
> Per various comments regarding v3 of this patchset, I've integrated his
> suggestions, reworked the patch descriptions to make it clearer which ones
> touch all the filesystems and which ones are to fix remaining holes in specific
> filesystems, and expanded the scope of filesystems that got fixed.
>
> As always, questions and comments are welcome; and thank you to all the
> previous reviewers of this patchset. I am also soliciting people's opinions on
> whether or not these patches could go upstream for .40.
I'd like to know those patches are on what state. Waiting in writeback
page makes slower, like you mentioned it (I guess it would more
noticeable if device was slower that like FAT uses). And I think
currently it doesn't help anything others for blk-integrity stuff
(without other technic, it doesn't help FS consistency)?
So, why is this locking stuff enabled always? I think it would be better
to enable only if blk-integrity stuff was enabled.
If it was more sophisticate but more complex stuff (e.g. use
copy-on-write technic for it), I would agree always enable though.
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Tue 10-05-11 10:59:15, OGAWA Hirofumi wrote:
> "Darrick J. Wong" <[email protected]> writes:
>
> > To assess the performance impact of stable page writes, I moved to a disk that
> > doesn't have DIF support so that I could measure just the impact of waiting for
> > writeback. I first ran wac with 64 threads madly scribbling on a 64k file and
> > saw about a 12 percent performance decrease. I then reran the wac program with
> > 64 threads and a 64MB file and saw about the same performance numbers. As I
> > suspected, the patchset only seems to impact workloads that rewrite the same
> > memory page frequently.
> >
> > I am still chasing down what exactly is broken in ext3. data=writeback mode
> > passes with no failures. data=ordered, however, does not pass; my current
> > suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> > page_mkclean to kick the userspace programs off the page before writing it.
> >
> > Per various comments regarding v3 of this patchset, I've integrated his
> > suggestions, reworked the patch descriptions to make it clearer which ones
> > touch all the filesystems and which ones are to fix remaining holes in specific
> > filesystems, and expanded the scope of filesystems that got fixed.
> >
> > As always, questions and comments are welcome; and thank you to all the
> > previous reviewers of this patchset. I am also soliciting people's opinions on
> > whether or not these patches could go upstream for .40.
>
> I'd like to know those patches are on what state. Waiting in writeback
> page makes slower, like you mentioned it (I guess it would more
> noticeable if device was slower that like FAT uses). And I think
> currently it doesn't help anything others for blk-integrity stuff
> (without other technic, it doesn't help FS consistency)?
>
> So, why is this locking stuff enabled always? I think it would be better
> to enable only if blk-integrity stuff was enabled.
>
> If it was more sophisticate but more complex stuff (e.g. use
> copy-on-write technic for it), I would agree always enable though.
Well, also software RAID generally needs this feature (so that parity
information / mirror can be properly kept in sync). Not that I'd advocate
that this feature must be always enabled, it's just that there are also
other users besides blk-integrity.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon 09-05-11 16:03:34, Darrick J. Wong wrote:
> For filesystems such as nilfs2 and xfs that use block_page_mkwrite, modify that
> function to wait for pending writeback before allowing the page to become
> writable. This is needed to stabilize pages during writeback for those two
> filesystems.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
> ---
> fs/buffer.c | 1 +
> 1 files changed, 1 insertions(+), 0 deletions(-)
>
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index a08bb8e..cf9a795 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -2361,6 +2361,7 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
> if (!ret)
> ret = block_commit_write(page, 0, end);
>
> + wait_on_page_writeback(page);
Not that it matters much but it would seem more logical to me if we
waited only in not-error case (i.e. after the error handling below).
Honza
> if (unlikely(ret)) {
> unlock_page(page);
> if (ret == -ENOMEM)
>
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon 09-05-11 16:03:18, Darrick J. Wong wrote:
> I am still chasing down what exactly is broken in ext3. data=writeback mode
> passes with no failures. data=ordered, however, does not pass; my current
> suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> page_mkclean to kick the userspace programs off the page before writing it.
Yes, ext3 in data=ordered mode writes pages from
journal_commit_transaction() via submit_bh() without clearing page dirty
bits thus page_mkclean() is not called for these pages. Frankly, do you
really want to bother with adding support for ext2 and ext3? People can use
ext4 as a fs driver when they want to start using blk-integrity support.
Especially ext2 patch looks really painful and just from a quick look I can
see code e.g. in fs/ext2/namei.c which isn't handled by your patch yet.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Jan Kara <[email protected]> writes:
>> I'd like to know those patches are on what state. Waiting in writeback
>> page makes slower, like you mentioned it (I guess it would more
>> noticeable if device was slower that like FAT uses). And I think
>> currently it doesn't help anything others for blk-integrity stuff
>> (without other technic, it doesn't help FS consistency)?
>>
>> So, why is this locking stuff enabled always? I think it would be better
>> to enable only if blk-integrity stuff was enabled.
>>
>> If it was more sophisticate but more complex stuff (e.g. use
>> copy-on-write technic for it), I would agree always enable though.
> Well, also software RAID generally needs this feature (so that parity
> information / mirror can be properly kept in sync). Not that I'd advocate
> that this feature must be always enabled, it's just that there are also
> other users besides blk-integrity.
I see. So many block layer stuff sounds like broken on corner case? If
so, I more feel this approach should be temporary workaround, and should
use another less-blocking approach.
Thanks.
--
OGAWA Hirofumi <[email protected]>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue 10-05-11 22:12:54, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> >> I'd like to know those patches are on what state. Waiting in writeback
> >> page makes slower, like you mentioned it (I guess it would more
> >> noticeable if device was slower that like FAT uses). And I think
> >> currently it doesn't help anything others for blk-integrity stuff
> >> (without other technic, it doesn't help FS consistency)?
> >>
> >> So, why is this locking stuff enabled always? I think it would be better
> >> to enable only if blk-integrity stuff was enabled.
> >>
> >> If it was more sophisticate but more complex stuff (e.g. use
> >> copy-on-write technic for it), I would agree always enable though.
> > Well, also software RAID generally needs this feature (so that parity
> > information / mirror can be properly kept in sync). Not that I'd advocate
> > that this feature must be always enabled, it's just that there are also
> > other users besides blk-integrity.
>
> I see. So many block layer stuff sounds like broken on corner case? If
> so, I more feel this approach should be temporary workaround, and should
> use another less-blocking approach.
Not many but some... The alternative to less blocking approach is to do
copy-out before a page is submitted for IO (or various middle ground
alternatives of doing sometimes copyout, sometimes blocking...). That costs
some performance as well. We talked about it at LSF and the approach
Darrick is implementing was considered the least intrusive. There's really
no way to fix these corner cases and keep performance. But indeed a plain
SATA drive or a USB stick don't need stable pages so they wouldn't need to
pay the cost. So it would be beneficial if the underlying block device
propagated whether it needs stable writes or not and filesystem could turn
on stable pages accordingly.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
On Tue, May 10, 2011 at 10:59:15AM +0900, OGAWA Hirofumi wrote:
> I'd like to know those patches are on what state. Waiting in writeback
> page makes slower, like you mentioned it (I guess it would more
> noticeable if device was slower that like FAT uses). And I think
> currently it doesn't help anything others for blk-integrity stuff
> (without other technic, it doesn't help FS consistency)?
It only makes things slower if we rewrite a region in a file that
is currently undergoing writeback. I'd be interested to know about
real life applications doing that, and if they really are badly affect
we should help them to work around that in userspace, e.g. by adding
a fadvice will rewrite call that might be used to never write back
that regions without an explicit fsync call.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Jan Kara <[email protected]> writes:
>> I see. So many block layer stuff sounds like broken on corner case? If
>> so, I more feel this approach should be temporary workaround, and should
>> use another less-blocking approach.
> Not many but some... The alternative to less blocking approach is to do
> copy-out before a page is submitted for IO (or various middle ground
> alternatives of doing sometimes copyout, sometimes blocking...). That costs
> some performance as well. We talked about it at LSF and the approach
> Darrick is implementing was considered the least intrusive. There's really
> no way to fix these corner cases and keep performance.
You already considered, to copy only if page was writeback (like
copy-on-write). I.e. if page is on I/O, copy, then switch the page for
writing new data.
Yes, it is complex. But I think blocking and overhead is minimum, and
this can be used as infrastructure for copy-on-write FS.
Thanks.
--
OGAWA Hirofumi <[email protected]>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Christoph Hellwig <[email protected]> writes:
> On Tue, May 10, 2011 at 10:59:15AM +0900, OGAWA Hirofumi wrote:
>> I'd like to know those patches are on what state. Waiting in writeback
>> page makes slower, like you mentioned it (I guess it would more
>> noticeable if device was slower that like FAT uses). And I think
>> currently it doesn't help anything others for blk-integrity stuff
>> (without other technic, it doesn't help FS consistency)?
>
> It only makes things slower if we rewrite a region in a file that is
> currently undergoing writeback. I'd be interested to know about real
> life applications doing that, and if they really are badly affect we
> should help them to work around that in userspace, e.g. by adding a
> fadvice will rewrite call that might be used to never write back that
> regions without an explicit fsync call.
Isn't it reallocated blocks too, and metadata too?
Thanks.
--
OGAWA Hirofumi <[email protected]>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
OGAWA Hirofumi <[email protected]> writes:
> Jan Kara <[email protected]> writes:
>
>>> I see. So many block layer stuff sounds like broken on corner case? If
>>> so, I more feel this approach should be temporary workaround, and should
>>> use another less-blocking approach.
>> Not many but some... The alternative to less blocking approach is to do
>> copy-out before a page is submitted for IO (or various middle ground
>> alternatives of doing sometimes copyout, sometimes blocking...). That costs
>> some performance as well. We talked about it at LSF and the approach
>> Darrick is implementing was considered the least intrusive. There's really
>> no way to fix these corner cases and keep performance.
>
> You already considered, to copy only if page was writeback (like
> copy-on-write). I.e. if page is on I/O, copy, then switch the page for
> writing new data.
missed question mark in here.
Did you already consider, to copy only if page was writeback (like
copy-on-write)? I.e. if page is on I/O, copy, then switch the page for
writing new data.
> Yes, it is complex. But I think blocking and overhead is minimum, and
> this can be used as infrastructure for copy-on-write FS.
--
OGAWA Hirofumi <[email protected]>
On Tue 10-05-11 22:52:10, OGAWA Hirofumi wrote:
> Christoph Hellwig <[email protected]> writes:
>
> > On Tue, May 10, 2011 at 10:59:15AM +0900, OGAWA Hirofumi wrote:
> >> I'd like to know those patches are on what state. Waiting in writeback
> >> page makes slower, like you mentioned it (I guess it would more
> >> noticeable if device was slower that like FAT uses). And I think
> >> currently it doesn't help anything others for blk-integrity stuff
> >> (without other technic, it doesn't help FS consistency)?
> >
> > It only makes things slower if we rewrite a region in a file that is
> > currently undergoing writeback. I'd be interested to know about real
> > life applications doing that, and if they really are badly affect we
> > should help them to work around that in userspace, e.g. by adding a
> > fadvice will rewrite call that might be used to never write back that
> > regions without an explicit fsync call.
>
> Isn't it reallocated blocks too, and metadata too?
Reallocated blocks - not really. For a block to be freed it cannot be
under writeback and when it's freed no writeback is started. For metadata -
yes. But ext3, ext4, xfs, btrfs have to avoid modifying metadata under
writeback anyway (because of journalling / COW constraints) and thus they
don't care. For ext2 or vfat it's a different story. But as I wrote to
Darrick, I'm not sure about vfat but for ext2 and similar legacy
filesystems, I'd rather let them live with their unstable pages under IO ;)
because I see a limited use for that.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue 10-05-11 23:05:41, OGAWA Hirofumi wrote:
> OGAWA Hirofumi <[email protected]> writes:
>
> > Jan Kara <[email protected]> writes:
> >
> >>> I see. So many block layer stuff sounds like broken on corner case? If
> >>> so, I more feel this approach should be temporary workaround, and should
> >>> use another less-blocking approach.
> >> Not many but some... The alternative to less blocking approach is to do
> >> copy-out before a page is submitted for IO (or various middle ground
> >> alternatives of doing sometimes copyout, sometimes blocking...). That costs
> >> some performance as well. We talked about it at LSF and the approach
> >> Darrick is implementing was considered the least intrusive. There's really
> >> no way to fix these corner cases and keep performance.
> >
> > You already considered, to copy only if page was writeback (like
> > copy-on-write). I.e. if page is on I/O, copy, then switch the page for
> > writing new data.
>
> missed question mark in here.
>
> Did you already consider, to copy only if page was writeback (like
> copy-on-write)? I.e. if page is on I/O, copy, then switch the page for
> writing new data.
Yes, that was considered as well. We'd have to essentially migrate the
page that is under writeback and should be written to. You are going to pay
the cost of page allocation, copy, increased memory & cache pressure.
Depending on your backing storage and workload this may or may not be better
than waiting for IO...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Jan Kara <[email protected]> writes:
>> Isn't it reallocated blocks too, and metadata too?
> Reallocated blocks - not really. For a block to be freed it cannot be
> under writeback and when it's freed no writeback is started.
Sure for data -> data reallocated case. metadata -> data/metadata is
still there.
> For metadata - yes. But ext3, ext4, xfs, btrfs have to avoid modifying
> metadata under writeback anyway (because of journalling / COW
> constraints) and thus they don't care.
Yes. Those would use better way than just blocking.
> For ext2 or vfat it's a different story. But as I wrote to Darrick,
> I'm not sure about vfat but for ext2 and similar legacy filesystems,
> I'd rather let them live with their unstable pages under IO ;) because
> I see a limited use for that.
If this patches was not going to tackle it, I have no argument here ;)
It would be simply FS specific approach/fixes anymore like journal.
Thanks.
--
OGAWA Hirofumi <[email protected]>
Jan Kara <[email protected]> writes:
>> Did you already consider, to copy only if page was writeback (like
>> copy-on-write)? I.e. if page is on I/O, copy, then switch the page for
>> writing new data.
> Yes, that was considered as well. We'd have to essentially migrate the
> page that is under writeback and should be written to. You are going to pay
> the cost of page allocation, copy, increased memory & cache pressure.
> Depending on your backing storage and workload this may or may not be better
> than waiting for IO...
Maybe possible, but you really think on usual case just blocking is
better?
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Wed 11-05-11 00:24:58, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> >> Isn't it reallocated blocks too, and metadata too?
> > Reallocated blocks - not really. For a block to be freed it cannot be
> > under writeback and when it's freed no writeback is started.
>
> Sure for data -> data reallocated case. metadata -> data/metadata is
> still there.
Unless you properly use bforget() (which you should I think). But I have
not really checked this in detail for a while.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed 11-05-11 01:12:13, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> >> Did you already consider, to copy only if page was writeback (like
> >> copy-on-write)? I.e. if page is on I/O, copy, then switch the page for
> >> writing new data.
> > Yes, that was considered as well. We'd have to essentially migrate the
> > page that is under writeback and should be written to. You are going to pay
> > the cost of page allocation, copy, increased memory & cache pressure.
> > Depending on your backing storage and workload this may or may not be better
> > than waiting for IO...
>
> Maybe possible, but you really think on usual case just blocking is
> better?
Define usual case... As Christoph noted, we don't currently have a real
practical case where blocking would matter (since frequent rewrites are
rather rare). So defining what is usual when we don't have a single real
case is kind of tough ;)
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
Excerpts from Jan Kara's message of 2011-05-10 08:51:24 -0400:
> On Mon 09-05-11 16:03:18, Darrick J. Wong wrote:
> > I am still chasing down what exactly is broken in ext3. data=writeback mode
> > passes with no failures. data=ordered, however, does not pass; my current
> > suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> > page_mkclean to kick the userspace programs off the page before writing it.
> Yes, ext3 in data=ordered mode writes pages from
> journal_commit_transaction() via submit_bh() without clearing page dirty
> bits thus page_mkclean() is not called for these pages. Frankly, do you
> really want to bother with adding support for ext2 and ext3? People can use
> ext4 as a fs driver when they want to start using blk-integrity support.
> Especially ext2 patch looks really painful and just from a quick look I can
> see code e.g. in fs/ext2/namei.c which isn't handled by your patch yet.
I think ext23 are going to be pretty big changes, we're best off just
going with ext4.
-chris
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Jan Kara <[email protected]> writes:
>> Maybe possible, but you really think on usual case just blocking is
>> better?
> Define usual case... As Christoph noted, we don't currently have a real
> practical case where blocking would matter (since frequent rewrites are
> rather rare). So defining what is usual when we don't have a single real
> case is kind of tough ;)
OK. E.g. usual workload on desktop, but FS like ext2/fat.
--
OGAWA Hirofumi <[email protected]>
Jan Kara <[email protected]> writes:
> On Wed 11-05-11 00:24:58, OGAWA Hirofumi wrote:
>> Jan Kara <[email protected]> writes:
>>
>> >> Isn't it reallocated blocks too, and metadata too?
>> > Reallocated blocks - not really. For a block to be freed it cannot be
>> > under writeback and when it's freed no writeback is started.
>>
>> Sure for data -> data reallocated case. metadata -> data/metadata is
>> still there.
> Unless you properly use bforget() (which you should I think). But I have
> not really checked this in detail for a while.
bforget() doesn't wait IO, right?
--
OGAWA Hirofumi <[email protected]>
On Wed 11-05-11 01:29:28, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> > On Wed 11-05-11 00:24:58, OGAWA Hirofumi wrote:
> >> Jan Kara <[email protected]> writes:
> >>
> >> >> Isn't it reallocated blocks too, and metadata too?
> >> > Reallocated blocks - not really. For a block to be freed it cannot be
> >> > under writeback and when it's freed no writeback is started.
> >>
> >> Sure for data -> data reallocated case. metadata -> data/metadata is
> >> still there.
> > Unless you properly use bforget() (which you should I think). But I have
> > not really checked this in detail for a while.
>
> bforget() doesn't wait IO, right?
Right, my fault. Sorry. So you were right, reallocated blocks will be hit
as well.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed, May 11, 2011 at 12:24:58AM +0900, OGAWA Hirofumi wrote:
> > under writeback and when it's freed no writeback is started.
>
> Sure for data -> data reallocated case. metadata -> data/metadata is
> still there.
That's usually handled differently. For XFS take a look at the
xfs_alloc_busy_* function. For 2.6.40 they've been mostly rewritten
to rarely wait for the reuse but instead avoid busy blocks. But that's
a real data integrity issue even without stable pages for I/O.
For filesystems such as nilfs2 and xfs that use block_page_mkwrite, modify that
function to wait for pending writeback before allowing the page to become
writable. This is needed to stabilize pages during writeback for those two
filesystems.
Slight rework based on Jan Kara's suggestion.
Signed-off-by: Darrick J. Wong <[email protected]>
---
fs/buffer.c | 4 +++-
1 files changed, 3 insertions(+), 1 deletions(-)
diff --git a/fs/buffer.c b/fs/buffer.c
index a08bb8e..0e7fa16 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -2367,8 +2367,10 @@ block_page_mkwrite(struct vm_area_struct *vma, struct vm_fault *vmf,
ret = VM_FAULT_OOM;
else /* -ENOSPC, -EIO, etc */
ret = VM_FAULT_SIGBUS;
- } else
+ } else {
+ wait_on_page_writeback(page);
ret = VM_FAULT_LOCKED;
+ }
out:
return ret;
Christoph Hellwig <[email protected]> writes:
> On Wed, May 11, 2011 at 12:24:58AM +0900, OGAWA Hirofumi wrote:
>> > under writeback and when it's freed no writeback is started.
>>
>> Sure for data -> data reallocated case. metadata -> data/metadata is
>> still there.
>
> That's usually handled differently. For XFS take a look at the
> xfs_alloc_busy_* function. For 2.6.40 they've been mostly rewritten
> to rarely wait for the reuse but instead avoid busy blocks. But that's
> a real data integrity issue even without stable pages for I/O.
Sounds good. So... Are you suggesting this series should use better
approach than just blocking?
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Wed, May 11, 2011 at 05:50:07AM +0900, OGAWA Hirofumi wrote:
> Sounds good. So... Are you suggesting this series should use better
> approach than just blocking?
No, block reuse is a problem independent of stable pages.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Christoph Hellwig <[email protected]> writes:
> On Wed, May 11, 2011 at 05:50:07AM +0900, OGAWA Hirofumi wrote:
>> Sounds good. So... Are you suggesting this series should use better
>> approach than just blocking?
>
> No, block reuse is a problem independent of stable pages.
OK. So, sounds like we are talking different points. I was generic stuff
(whole of patches). You were only some patches (guess it's only data page).
--
OGAWA Hirofumi <[email protected]>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Tue, May 10, 2011 at 02:51:24PM +0200, Jan Kara wrote:
> On Mon 09-05-11 16:03:18, Darrick J. Wong wrote:
> > I am still chasing down what exactly is broken in ext3. data=writeback mode
> > passes with no failures. data=ordered, however, does not pass; my current
> > suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> > page_mkclean to kick the userspace programs off the page before writing it.
> Yes, ext3 in data=ordered mode writes pages from
> journal_commit_transaction() via submit_bh() without clearing page dirty
> bits thus page_mkclean() is not called for these pages. Frankly, do you
> really want to bother with adding support for ext2 and ext3? People can use
> ext4 as a fs driver when they want to start using blk-integrity support.
> Especially ext2 patch looks really painful and just from a quick look I can
> see code e.g. in fs/ext2/namei.c which isn't handled by your patch yet.
Yeah, I agree that ext2 is ugly and ext3/jbd might be more painful. Are there
any other code that wants stable pages that's already running with ext3? In
this months-long discussion I've heard that encryption and raid also like
stable pages during writes. Have those users been broken this whole time, or
have they been stabilizing pages themselves?
I suppose we can cross the "ext3 fails horribly on DIF" bridge when someone
complains about it. Possibly we could try to steer them to btrfs.
--D
On Wed 11-05-11 11:19:01, Darrick J. Wong wrote:
> On Tue, May 10, 2011 at 02:51:24PM +0200, Jan Kara wrote:
> > On Mon 09-05-11 16:03:18, Darrick J. Wong wrote:
> > > I am still chasing down what exactly is broken in ext3. data=writeback mode
> > > passes with no failures. data=ordered, however, does not pass; my current
> > > suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> > > page_mkclean to kick the userspace programs off the page before writing it.
> > Yes, ext3 in data=ordered mode writes pages from
> > journal_commit_transaction() via submit_bh() without clearing page dirty
> > bits thus page_mkclean() is not called for these pages. Frankly, do you
> > really want to bother with adding support for ext2 and ext3? People can use
> > ext4 as a fs driver when they want to start using blk-integrity support.
> > Especially ext2 patch looks really painful and just from a quick look I can
> > see code e.g. in fs/ext2/namei.c which isn't handled by your patch yet.
>
> Yeah, I agree that ext2 is ugly and ext3/jbd might be more painful. Are there
> any other code that wants stable pages that's already running with ext3? In
> this months-long discussion I've heard that encryption and raid also like
> stable pages during writes. Have those users been broken this whole time, or
> have they been stabilizing pages themselves?
I believe part of them has been broken (e.g. raid) and part of them do
copy-out so they were OK.
> I suppose we can cross the "ext3 fails horribly on DIF" bridge when someone
> complains about it. Possibly we could try to steer them to btrfs.
Well, btrfs might be a bit too advantageous for production servers but
ext4 would be definitely viable for them.
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Wed, May 11, 2011 at 01:28:32AM +0900, OGAWA Hirofumi wrote:
> Jan Kara <[email protected]> writes:
>
> >> Maybe possible, but you really think on usual case just blocking is
> >> better?
> > Define usual case... As Christoph noted, we don't currently have a real
> > practical case where blocking would matter (since frequent rewrites are
> > rather rare). So defining what is usual when we don't have a single real
> > case is kind of tough ;)
>
> OK. E.g. usual workload on desktop, but FS like ext2/fat.
In the frequent rewrite case, here's what you get:
Regular disk: (possibly garbage) write, followed by a second write to make the
disk reflect memory contents.
RAID w/ shadow pages: two writes, both consistent. Higher memory consumption.
T10 DIF disk: disk error any time the CPU modifies a page that the disk
controller is DMA'ing out of memory. I suppose one could simply retry the
operation if the page is dirty, but supposing memory writes are happening fast
enough that the retries also produce disk errors, _nothing_ ever gets written.
With the new stable-page-writes patchset, the garbage write/disk error symptoms
go away since the processes block instead of creating this window where it's
not clear whether the disk's copy of the data is consistent. I could turn the
wait_on_page_writeback calls into some sort of page migration if the
performance turns out to be terrible, though I'm still working on quantifying
the impact. Some people pointed out that sqlite tends to write the same blocks
frequently, though I wonder if sqlite actually tries to write memory pages
while syncing them?
One use case where I could see a serious performance hit happening is the case
where some app writes a bunch of memory pages, calls sync to force the dirty
pages to disk, and /must/ resume writing those memory pages before the sync
completes. The page migration would of course help there, provided a memory
page can be found in less time than an I/O operation.
Someone commented on the LWN article about this topic, claiming that he had a
program that couldn't afford to block on writes to mlock()'d memory. I'm not
sure how to fix that program, because if memory writes never coordinate with
disk writes and the other threads are always writing memory, I wonder how the
copy on disk isn't always indeterminate.
--D
On Thu, May 12, 2011 at 11:42:55AM +0200, Jan Kara wrote:
> On Wed 11-05-11 11:19:01, Darrick J. Wong wrote:
> > On Tue, May 10, 2011 at 02:51:24PM +0200, Jan Kara wrote:
> > > On Mon 09-05-11 16:03:18, Darrick J. Wong wrote:
> > > > I am still chasing down what exactly is broken in ext3. data=writeback mode
> > > > passes with no failures. data=ordered, however, does not pass; my current
> > > > suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> > > > page_mkclean to kick the userspace programs off the page before writing it.
> > > Yes, ext3 in data=ordered mode writes pages from
> > > journal_commit_transaction() via submit_bh() without clearing page dirty
> > > bits thus page_mkclean() is not called for these pages. Frankly, do you
> > > really want to bother with adding support for ext2 and ext3? People can use
> > > ext4 as a fs driver when they want to start using blk-integrity support.
> > > Especially ext2 patch looks really painful and just from a quick look I can
> > > see code e.g. in fs/ext2/namei.c which isn't handled by your patch yet.
> >
> > Yeah, I agree that ext2 is ugly and ext3/jbd might be more painful. Are there
> > any other code that wants stable pages that's already running with ext3? In
> > this months-long discussion I've heard that encryption and raid also like
> > stable pages during writes. Have those users been broken this whole time, or
> > have they been stabilizing pages themselves?
> I believe part of them has been broken (e.g. raid) and part of them do
> copy-out so they were OK.
A future step might be to undo all these homegrown copy-outs?
> > I suppose we can cross the "ext3 fails horribly on DIF" bridge when someone
> > complains about it. Possibly we could try to steer them to btrfs.
> Well, btrfs might be a bit too advantageous for production servers but
> ext4 would be definitely viable for them.
Are there any distros that are going straight from ext3 to btrfs?
--D
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon 16-05-11 11:49:27, Darrick J. Wong wrote:
> On Thu, May 12, 2011 at 11:42:55AM +0200, Jan Kara wrote:
> > On Wed 11-05-11 11:19:01, Darrick J. Wong wrote:
> > > On Tue, May 10, 2011 at 02:51:24PM +0200, Jan Kara wrote:
> > > > On Mon 09-05-11 16:03:18, Darrick J. Wong wrote:
> > > > > I am still chasing down what exactly is broken in ext3. data=writeback mode
> > > > > passes with no failures. data=ordered, however, does not pass; my current
> > > > > suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> > > > > page_mkclean to kick the userspace programs off the page before writing it.
> > > > Yes, ext3 in data=ordered mode writes pages from
> > > > journal_commit_transaction() via submit_bh() without clearing page dirty
> > > > bits thus page_mkclean() is not called for these pages. Frankly, do you
> > > > really want to bother with adding support for ext2 and ext3? People can use
> > > > ext4 as a fs driver when they want to start using blk-integrity support.
> > > > Especially ext2 patch looks really painful and just from a quick look I can
> > > > see code e.g. in fs/ext2/namei.c which isn't handled by your patch yet.
> > >
> > > Yeah, I agree that ext2 is ugly and ext3/jbd might be more painful. Are there
> > > any other code that wants stable pages that's already running with ext3? In
> > > this months-long discussion I've heard that encryption and raid also like
> > > stable pages during writes. Have those users been broken this whole time, or
> > > have they been stabilizing pages themselves?
> > I believe part of them has been broken (e.g. raid) and part of them do
> > copy-out so they were OK.
>
> A future step might be to undo all these homegrown copy-outs?
Sure but I'm not the right one to tell you where these are ;).
> > > I suppose we can cross the "ext3 fails horribly on DIF" bridge when someone
> > > complains about it. Possibly we could try to steer them to btrfs.
> > Well, btrfs might be a bit too advantageous for production servers but
> > ext4 would be definitely viable for them.
>
> Are there any distros that are going straight from ext3 to btrfs?
Most distros currently offer users a choice of xfs, ext3, ext4, btrfs
with ext4 being the default. I'm not sure if that's what you are asking
about...
Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon, May 09, 2011 at 04:03:18PM -0700, Darrick J. Wong wrote:
> Hi all,
>
> This is v3.1 of the stable-page-writes patchset for ext4/3/2, xfs, and fat.
> The purpose of this patchset is to prohibit processes from writing on memory
> pages that are currently being written to disk because certain storage setups
> (e.g. SCSI disks with DIF integrity checksums) will fail a write if the page
> contents don't match the checksum. btrfs already guarantees page stability, so
> it does not use these changes.
>
> The technique used is fairly simple -- whenever a page is about to become
> writable (either because of a write fault to a mapped page, or a buffered write
> is in progress), wait for the page writeback flag to be clear, indicating that
> the page is not being written to disk. This means that it is necessary (1) to
> add wait for writeback code to grab_cache_page_write_begin to take care of
> buffered writes, and (2) all filesystems must have a page_mkwrite that locks a
> page, waits for writeback, and returns the locked page. For filesystems that
> piggyback on the generic block_page_mkwrite, the patchset adds the writeback
> wait to that function; for filesystems that do not use the page_mkwrite hook at
> all, the patchset provides a stub page_mkwrite.
>
> I ran my write-after-checksum ("wac") reproducer program to try to create the
> DIF checksum errors by madly rewriting the same memory pages. In fact, I tried
> the following combinations against ext2/3/4, xfs, btrfs, and vfat:
>
> a. 64 write() threads + sync_file_range
> b. 64 mmap write threads + msync
> c. 32 write() threads + sync_file_range + 32 mmap write threads + msync
> d. Same as C, but with all threads in directio mode
> e. Same as A, but with all threads in directio mode
> f. Same as B, but with all threads in directio mode
>
> After running profiles A-F for 30 minutes each on 6 different machines, ext2,
> ext4, xfs, and vfat reported no errors. ext3 still has a lingering failure
> case (which I will touch on briefly later) and btrfs eventually reports -ENOSPC
> and fails the test, though it does that even without any of the patches applied.
>
> To assess the performance impact of stable page writes, I moved to a disk that
> doesn't have DIF support so that I could measure just the impact of waiting for
> writeback. I first ran wac with 64 threads madly scribbling on a 64k file and
> saw about a 12 percent performance decrease. I then reran the wac program with
> 64 threads and a 64MB file and saw about the same performance numbers. As I
> suspected, the patchset only seems to impact workloads that rewrite the same
> memory page frequently.
>
> I am still chasing down what exactly is broken in ext3. data=writeback mode
> passes with no failures. data=ordered, however, does not pass; my current
> suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> page_mkclean to kick the userspace programs off the page before writing it.
>
> Per various comments regarding v3 of this patchset, I've integrated his
> suggestions, reworked the patch descriptions to make it clearer which ones
> touch all the filesystems and which ones are to fix remaining holes in specific
> filesystems, and expanded the scope of filesystems that got fixed.
>
> As always, questions and comments are welcome; and thank you to all the
> previous reviewers of this patchset. I am also soliciting people's opinions on
> whether or not these patches could go upstream for .40.
[adding Andrew Morton to cc]
Ted, Mingming, and I were discussing how we might get this patchset pushed
upstream on today's ext4 community call. The ext2 patch can be dropped since
it really only was there as a proof that the generic mm/fs fixes actually
worked. I'm unsure of the vfat maintainer's feelings on the patchset, though
he seems concerned about the performance of apps that rewrite pages frequently.
Ted seemed agreeable with the ext4 changes, though I don't know if he's
reviewed them thoroughly yet.
Ted asked for clarification as to which patches are needed to fix ext4.
Patches 1 and 5 are the two that are required for ext4, and patch 4 cleans up
some ext4 code.
Patches 1 and 2 are needed to fix xfs and nilfs.
Patches 1 and 3 (and fs-specific patches such as 6 & 7) are needed to fix the
rest.
Unfortunately, Ted and I weren't sure who (if anyone) is in charge of pushing
mm and generic fs patches upstream. Ted suggested Andrew Morton for the mm
patches (1 & 3) and we weren't sure if Al Viro or Christoph (or someone else)
is in charge of generic vfs patches (patch #2).
So, who do I actually ask to take the mm and vfs patches? Andrew, can I send
you patches?
--D
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon, May 16, 2011 at 08:59:47PM +0200, Jan Kara wrote:
> On Mon 16-05-11 11:49:27, Darrick J. Wong wrote:
> > On Thu, May 12, 2011 at 11:42:55AM +0200, Jan Kara wrote:
> > > On Wed 11-05-11 11:19:01, Darrick J. Wong wrote:
> > > > On Tue, May 10, 2011 at 02:51:24PM +0200, Jan Kara wrote:
> > > > > On Mon 09-05-11 16:03:18, Darrick J. Wong wrote:
> > > > > > I am still chasing down what exactly is broken in ext3. data=writeback mode
> > > > > > passes with no failures. data=ordered, however, does not pass; my current
> > > > > > suspicion is that jbd is calling submit_bh on data buffers but doesn't call
> > > > > > page_mkclean to kick the userspace programs off the page before writing it.
> > > > > Yes, ext3 in data=ordered mode writes pages from
> > > > > journal_commit_transaction() via submit_bh() without clearing page dirty
> > > > > bits thus page_mkclean() is not called for these pages. Frankly, do you
> > > > > really want to bother with adding support for ext2 and ext3? People can use
> > > > > ext4 as a fs driver when they want to start using blk-integrity support.
> > > > > Especially ext2 patch looks really painful and just from a quick look I can
> > > > > see code e.g. in fs/ext2/namei.c which isn't handled by your patch yet.
> > > >
> > > > Yeah, I agree that ext2 is ugly and ext3/jbd might be more painful. Are there
> > > > any other code that wants stable pages that's already running with ext3? In
> > > > this months-long discussion I've heard that encryption and raid also like
> > > > stable pages during writes. Have those users been broken this whole time, or
> > > > have they been stabilizing pages themselves?
> > > I believe part of them has been broken (e.g. raid) and part of them do
> > > copy-out so they were OK.
> >
> > A future step might be to undo all these homegrown copy-outs?
> Sure but I'm not the right one to tell you where these are ;).
Yes, I've found a couple just by digging through the source tree. But maybe
I'll get this small set upstream before writing more patches.
> > > > I suppose we can cross the "ext3 fails horribly on DIF" bridge when someone
> > > > complains about it. Possibly we could try to steer them to btrfs.
> > > Well, btrfs might be a bit too advantageous for production servers but
> > > ext4 would be definitely viable for them.
> >
> > Are there any distros that are going straight from ext3 to btrfs?
> Most distros currently offer users a choice of xfs, ext3, ext4, btrfs
> with ext4 being the default. I'm not sure if that's what you are asking
> about...
Yep. I was primarily concerned that there might be some customer that would be
ok with deploying DIF hardware and rolling forward to ext4, but not to btrfs,
only to find that some distro refused to ship ext4. Looks like SLES/RHEL both
do, however. :)
--D
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
"Darrick J. Wong" <[email protected]> writes:
>> OK. E.g. usual workload on desktop, but FS like ext2/fat.
>
> In the frequent rewrite case, here's what you get:
>
> Regular disk: (possibly garbage) write, followed by a second write to make the
> disk reflect memory contents.
>
> RAID w/ shadow pages: two writes, both consistent. Higher memory consumption.
>
> T10 DIF disk: disk error any time the CPU modifies a page that the disk
> controller is DMA'ing out of memory. I suppose one could simply retry the
> operation if the page is dirty, but supposing memory writes are happening fast
> enough that the retries also produce disk errors, _nothing_ ever gets written.
>
> With the new stable-page-writes patchset, the garbage write/disk error symptoms
> go away since the processes block instead of creating this window where it's
> not clear whether the disk's copy of the data is consistent. I could turn the
> wait_on_page_writeback calls into some sort of page migration if the
> performance turns out to be terrible, though I'm still working on quantifying
> the impact. Some people pointed out that sqlite tends to write the same blocks
> frequently, though I wonder if sqlite actually tries to write memory pages
> while syncing them?
>
> One use case where I could see a serious performance hit happening is the case
> where some app writes a bunch of memory pages, calls sync to force the dirty
> pages to disk, and /must/ resume writing those memory pages before the sync
> completes. The page migration would of course help there, provided a memory
> page can be found in less time than an I/O operation.
>
> Someone commented on the LWN article about this topic, claiming that he had a
> program that couldn't afford to block on writes to mlock()'d memory. I'm not
> sure how to fix that program, because if memory writes never coordinate with
> disk writes and the other threads are always writing memory, I wonder how the
> copy on disk isn't always indeterminate.
I'm not thinking data page is special operation for doing this (at least
logically). In other word, if you are talking about only data page, you
shouldn't send patches for metadata with it.
Thanks.
--
OGAWA Hirofumi <[email protected]>
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
Whay about just sending the VFS patches to Al instead of talking about
it on a totally irrelevant call that doesn't include the important
stakeholders? FS-specific patches can go through the fs maintainers
independently.
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon, May 16, 2011 at 04:27:10PM -0400, Christoph Hellwig wrote:
> Whay about just sending the VFS patches to Al
Al was in the To: list of all 7 patches.
> instead of talking about it on a totally irrelevant call that doesn't include
> the important stakeholders? FS-specific patches can go through the fs
> maintainers independently.
The maintainers (ext4/ext2/vfat) were also in the To: list.
Trouble is, MAINTAINERS says this:
MEMORY MANAGEMENT
L: [email protected]
W: http://www.linux-mm.org
S: Maintained
F: include/linux/mm.h
F: mm/
There's a list, but no specific contact person. That's why I had to start
asking around about who actually pushes mm changes to Linus.
As for Al Viro, he's still listed as the VFS maintainer; isn't he resting?
I guess he did nominate you for the holding off of morons (like me). :)
--D
On Tue, May 17, 2011 at 04:31:37AM +0900, OGAWA Hirofumi wrote:
> "Darrick J. Wong" <[email protected]> writes:
>
> >> OK. E.g. usual workload on desktop, but FS like ext2/fat.
> >
> > In the frequent rewrite case, here's what you get:
> >
> > Regular disk: (possibly garbage) write, followed by a second write to make the
> > disk reflect memory contents.
> >
> > RAID w/ shadow pages: two writes, both consistent. Higher memory consumption.
> >
> > T10 DIF disk: disk error any time the CPU modifies a page that the disk
> > controller is DMA'ing out of memory. I suppose one could simply retry the
> > operation if the page is dirty, but supposing memory writes are happening fast
> > enough that the retries also produce disk errors, _nothing_ ever gets written.
> >
> > With the new stable-page-writes patchset, the garbage write/disk error symptoms
> > go away since the processes block instead of creating this window where it's
> > not clear whether the disk's copy of the data is consistent. I could turn the
> > wait_on_page_writeback calls into some sort of page migration if the
> > performance turns out to be terrible, though I'm still working on quantifying
> > the impact. Some people pointed out that sqlite tends to write the same blocks
> > frequently, though I wonder if sqlite actually tries to write memory pages
> > while syncing them?
> >
> > One use case where I could see a serious performance hit happening is the case
> > where some app writes a bunch of memory pages, calls sync to force the dirty
> > pages to disk, and /must/ resume writing those memory pages before the sync
> > completes. The page migration would of course help there, provided a memory
> > page can be found in less time than an I/O operation.
> >
> > Someone commented on the LWN article about this topic, claiming that he had a
> > program that couldn't afford to block on writes to mlock()'d memory. I'm not
> > sure how to fix that program, because if memory writes never coordinate with
> > disk writes and the other threads are always writing memory, I wonder how the
> > copy on disk isn't always indeterminate.
>
> I'm not thinking data page is special operation for doing this (at least
> logically). In other word, if you are talking about only data page, you
> shouldn't send patches for metadata with it.
Patch 7, which is the only patch that touches code under fs/fat/, only fixes
the case where the filesystem tries to modify its own metadata while writing
out the same metadata.
Patches 2 and 3, which affect only files mm/ and fs/, prevent data pages from
being modified while the same data pages are being written out. It is not
necessary to modify any fs/fat/ code to fix the data page case, fortunately.
That said, the intent of the patch set is to prevent writes to any memory page,
regardless of type (data or metadata), while the same page is being written out.
--D
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
"Darrick J. Wong" <[email protected]> writes:
>> > With the new stable-page-writes patchset, the garbage write/disk
>> > error symptoms
>> > go away since the processes block instead of creating this window where it's
>> > not clear whether the disk's copy of the data is consistent. I
>> > could turn the
>> > wait_on_page_writeback calls into some sort of page migration if the
>> > performance turns out to be terrible, though I'm still working on
>> > quantifying
>> > the impact. Some people pointed out that sqlite tends to write
>> > the same blocks
>> > frequently, though I wonder if sqlite actually tries to write memory pages
>> > while syncing them?
>> >
>> > One use case where I could see a serious performance hit happening
>> > is the case
>> > where some app writes a bunch of memory pages, calls sync to force the dirty
>> > pages to disk, and /must/ resume writing those memory pages before the sync
>> > completes. The page migration would of course help there, provided a memory
>> > page can be found in less time than an I/O operation.
[...]
>> I'm not thinking data page is special operation for doing this (at least
>> logically). In other word, if you are talking about only data page, you
>> shouldn't send patches for metadata with it.
>
> Patch 7, which is the only patch that touches code under fs/fat/, only fixes
> the case where the filesystem tries to modify its own metadata while writing
> out the same metadata.
>
> Patches 2 and 3, which affect only files mm/ and fs/, prevent data pages from
> being modified while the same data pages are being written out. It is not
> necessary to modify any fs/fat/ code to fix the data page case, fortunately.
>
> That said, the intent of the patch set is to prevent writes to any memory page,
> regardless of type (data or metadata), while the same page is being written out.
Yes. I understand it though, performance analysis (and looks like
approach) can be quite difference with data page. The metadata depends
on FS, and it can be much more hit depends on FS state, not simple like
data page.
And if you changed only overwrite case, I already mentioned though,
which one makes sure to prevent reallocated metadata case for FAT?
What about metadata operations performance impact? Even if FS was low
free blocks state, performance impact is small? And read can be the
cause of atime update, don't matter (now relatime is default though)?
Although FAT specific, what about fragmented case (i.e. modify multiple
FAT table blocks even if sequential write)?
Thanks.
--
OGAWA Hirofumi <[email protected]>
On Mon, May 16, 2011 at 01:55:35PM -0700, Darrick J. Wong wrote:
> As for Al Viro, he's still listed as the VFS maintainer; isn't he resting?
> I guess he did nominate you for the holding off of morons (like me). :)
Al is back in action. Anyway, the point stands, this is VFS material
and you should formally submit the bits to the maintainer. Note that
there is very little fs specific material left, with Hirofumi beeing
at least not overly exited by fat patches for now, and the only real
ext4 patch going away with Jan's series to make it use the generic
page_mkwrite code.
On Mon, May 09, 2011 at 01:03:49PM -0000, Darrick J. Wong wrote:
> wait_on_page_writeback already checks the writeback bit, so callers of it
> needn't do that test.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
Added to the ext4 tree, thanks.
- Ted
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On Mon, May 09, 2011 at 01:03:56PM -0000, Darrick J. Wong wrote:
> In order to stabilize pages during disk writes, ext4_page_mkwrite must wait for
> writeback operations to complete before making a page writable. Furthermore,
> the function must return locked pages, and recheck the writeback status if the
> page lock is ever dropped. The "someone could wander in" part of this patch
> was suggested by Chris Mason.
>
> Signed-off-by: Darrick J. Wong <[email protected]>
>
> ---
> fs/ext4/inode.c | 24 +++++++++++++++++++-----
> 1 files changed, 19 insertions(+), 5 deletions(-)
Added to the ext4 tree, thanks.
- Ted
--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Fight unfair telecom internet charges in Canada: sign http://stopthemeter.ca/
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>
On 05/10/2011 09:22 AM, Jan Kara wrote:
> On Wed 11-05-11 01:12:13, OGAWA Hirofumi wrote:
>> Jan Kara<[email protected]> writes:
>>
>>>> Did you already consider, to copy only if page was writeback (like
>>>> copy-on-write)? I.e. if page is on I/O, copy, then switch the page for
>>>> writing new data.
>>> Yes, that was considered as well. We'd have to essentially migrate the
>>> page that is under writeback and should be written to. You are going to pay
>>> the cost of page allocation, copy, increased memory& cache pressure.
>>> Depending on your backing storage and workload this may or may not be better
>>> than waiting for IO...
>>
>> Maybe possible, but you really think on usual case just blocking is
>> better?
> Define usual case... As Christoph noted, we don't currently have a real
> practical case where blocking would matter (since frequent rewrites are
> rather rare). So defining what is usual when we don't have a single real
> case is kind of tough ;)
>
I'm a bit late to the party, but I have such a use case. I have a
real-time program that generates logs. There's a thread that makes sure
that there are always mlocked, MAP_SHARED, writable pages for the logs,
and under normal (or even very heavy) load, the mlocked pages always
stay far ahead of the logs. On 2.6.39, it works great [1]. On 3.0,
it's unusable -- latencies of 30-100 ms are very common.
In this case, neither throughput nor available memory matter at all --
I'm not stressing either. So copying the pages (especially if they're
mlocked) would be more than a small percentage win -- it would be the
difference between great performance and unusability.
I wonder if we want a stronger version of mlock that says "this page
must not be swapped out and, in addition, ptes must always be mapped
with all appropriate permission bits set". (This is only possible with
hardware dirty and accessed bits, but we could come close even without
them.)
[1] file_update_time is a problem. patches coming.
--Andy