Against 2.6.20-rc2, and now the bug fix.
--
Guillaume
> I/O errors could go unnoticed when syncing, for example the following code could
> write a file bigger than 10Mib on a 10Mib filesystem. With this patch, msync()
> will report the error originally encountered by sync(). Tuning the number of
> sync may be needed to reproduce the bug.
> make_file.c:
>
> #include <unistd.h>
> #include <sys/fcntl.h>
> #include <sys/mman.h>
> #include <string.h>
> #include <stdio.h>
>
> #define NR_SYNC 3 /* Adjust me if needed */
> #define SIZE ((10 << 20) + (100 << 10))
>
> int main(void)
> {
> int i, fd;
> char *mapping;
> fd = open("mnt/file", O_RDWR | O_CREAT, 0600);
> if (fd < 0) {
> perror("open");
> return 1;
> }
>
> if (ftruncate(fd, SIZE) < 0) {
> perror("ftruncate");
> return 1;
> }
>
> mapping = mmap(NULL, SIZE, PROT_WRITE, MAP_SHARED, fd, 0);
> if (mapping == MAP_FAILED) {
> perror("mmap");
> return 1;
> }
>
> memset(mapping, 0xFF, SIZE);
>
> for (i = 0; i < NR_SYNC; i++)
> sync();
>
> if (msync(mapping, SIZE, MS_SYNC) < 0) {
> perror("msync");
> return 1;
> }
>
> if (close(fd) < 0) {
> perror("close");
> return 1;
> }
>
> puts("File written successfully => bad!\n");
> return 0;
> }
>
> #!/bin/sh
>
> dd if=/dev/zero of=fs.10M bs=10M count=0 seek=1
> mkfs.ext2 -qF fs.10M
> mkdir mnt
> mount fs.10M mnt -o loop
> ./make_file
>
> Signed-off-by: Guillaume Chazarain <[email protected]>
> ---
>
> fs-writeback.c | 4 +++-
> 1 file changed, 3 insertions(+), 1 deletion(-)
>
> diff -r 3859b1144d3a fs/fs-writeback.c
> --- a/fs/fs-writeback.c Sun Dec 24 05:00:03 2006 +0000
> +++ b/fs/fs-writeback.c Fri Dec 29 22:12:42 2006 +0100
> @@ -316,6 +316,7 @@ sync_sb_inodes(struct super_block *sb, s
> struct address_space *mapping = inode->i_mapping;
> struct backing_dev_info *bdi = mapping->backing_dev_info;
> long pages_skipped;
> + int ret;
>
> if (!bdi_cap_writeback_dirty(bdi)) {
> list_move(&inode->i_list, &sb->s_dirty);
> @@ -365,7 +366,8 @@ sync_sb_inodes(struct super_block *sb, s
> BUG_ON(inode->i_state & I_FREEING);
> __iget(inode);
> pages_skipped = wbc->pages_skipped;
> - __writeback_single_inode(inode, wbc);
> + ret = __writeback_single_inode(inode, wbc);
> + mapping_set_error(mapping, ret);
> if (wbc->sync_mode == WB_SYNC_HOLD) {
> inode->dirtied_when = jiffies;
> list_move(&inode->i_list, &sb->s_dirty);
This change is somewhat contrary to the way in which we've been handling
these issues thus far.
What the kernel does is to set the address_space error bits at the
lowest-level where the error is detected for the sync operation to later
detect. Whereas your change adopts the more conventional
propagate-it-back-then-handle-it model.
The implication from your change is that there's some piece of code
somewhere which is forgetting to propagate an error into the address_space
at the appropriate time. And that looks to be the code at "recover:" in
__block_write_full_page().
So perhaps a more consistent fix here is to teach __block_write_full_page()
to propagate that error, I think? Something like:
--- a/fs/buffer.c~a
+++ a/fs/buffer.c
@@ -1739,6 +1739,7 @@ recover:
}
} while ((bh = bh->b_this_page) != head);
SetPageError(page);
+ mapping_set_error(page->mapping, err);
BUG_ON(PageWriteback(page));
set_page_writeback(page);
unlock_page(page);
_
Andrew Morton a ?crit :
>> @@ -365,7 +366,8 @@ sync_sb_inodes(struct super_block *sb, s
>> BUG_ON(inode->i_state & I_FREEING);
>> __iget(inode);
>> pages_skipped = wbc->pages_skipped;
>> - __writeback_single_inode(inode, wbc);
>> + ret = __writeback_single_inode(inode, wbc);
>> + mapping_set_error(mapping, ret);
>> if (wbc->sync_mode == WB_SYNC_HOLD) {
>> inode->dirtied_when = jiffies;
>> list_move(&inode->i_list, &sb->s_dirty);
>>
> --- a/fs/buffer.c~a
> +++ a/fs/buffer.c
> @@ -1739,6 +1739,7 @@ recover:
> }
> } while ((bh = bh->b_this_page) != head);
> SetPageError(page);
> + mapping_set_error(page->mapping, err);
> BUG_ON(PageWriteback(page));
> set_page_writeback(page);
> unlock_page(page);
>
Unfortunately, with your patch and not mine, the problem is still
present: msync()
does not return the error. Both pieces of code (yours and mine) are
called for the
same mapping though, albeit yours more frequently.
My interpretation (based more on imagination than experience) is that
__writeback_single_inode() ends up calling __block_write_full_page() which
sets the page flags (your patch), then calls wait_on_page_writeback_range()
which clears the flags but returns the error as its return value. And this
error code is dropped by sync_sb_inodes() (my patch not applied).
With my patch, wait_on_page_writeback_range() would get the error code
by some other mean, and sync_sb_inodes() would re-put it in the flags for
msync() later.
Sorry, for the speculation, but I would need some hint to debug this ;-)
Thanks.
--
Guillaume
On Wed, 03 Jan 2007 23:30:05 +0100
Guillaume Chazarain <[email protected]> wrote:
> Unfortunately, with your patch and not mine, the problem is still
> present: msync()
> does not return the error. Both pieces of code (yours and mine) are
> called for the
> same mapping though, albeit yours more frequently.
>
> My interpretation (based more on imagination than experience) is that
> __writeback_single_inode() ends up calling __block_write_full_page() which
> sets the page flags (your patch), then calls wait_on_page_writeback_range()
> which clears the flags but returns the error as its return value. And this
> error code is dropped by sync_sb_inodes() (my patch not applied).
>
> With my patch, wait_on_page_writeback_range() would get the error code
> by some other mean, and sync_sb_inodes() would re-put it in the flags for
> msync() later.
hm, something like that.
There's a lot of sloppiness in there. Do these two patches fix things up?
From: Andrew Morton <[email protected]>
Guillame points out that sync_sb_inodes() is failing to propagate error codes
back. Fix that, and make several other void-returning functions not drop
reportable error codes.
Cc: Guillaume Chazarain <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/fs-writeback.c | 55 ++++++++++++++++++++++++++----------
include/linux/writeback.h | 6 +--
2 files changed, 44 insertions(+), 17 deletions(-)
diff -puN fs/fs-writeback.c~sync_sb_inodes-propagate-errors fs/fs-writeback.c
--- a/fs/fs-writeback.c~sync_sb_inodes-propagate-errors
+++ a/fs/fs-writeback.c
@@ -302,15 +302,18 @@ __writeback_single_inode(struct inode *i
* on the writer throttling path, and we get decent balancing between many
* throttled threads: we don't want them all piling up on __wait_on_inode.
*/
-static void
+static int
sync_sb_inodes(struct super_block *sb, struct writeback_control *wbc)
{
const unsigned long start = jiffies; /* livelock avoidance */
+ int ret = 0;
if (!wbc->for_kupdate || list_empty(&sb->s_io))
list_splice_init(&sb->s_dirty, &sb->s_io);
while (!list_empty(&sb->s_io)) {
+ int err;
+
struct inode *inode = list_entry(sb->s_io.prev,
struct inode, i_list);
struct address_space *mapping = inode->i_mapping;
@@ -365,7 +368,9 @@ sync_sb_inodes(struct super_block *sb, s
BUG_ON(inode->i_state & I_FREEING);
__iget(inode);
pages_skipped = wbc->pages_skipped;
- __writeback_single_inode(inode, wbc);
+ err = __writeback_single_inode(inode, wbc);
+ if (!ret)
+ ret = err;
if (wbc->sync_mode == WB_SYNC_HOLD) {
inode->dirtied_when = jiffies;
list_move(&inode->i_list, &sb->s_dirty);
@@ -386,7 +391,7 @@ sync_sb_inodes(struct super_block *sb, s
if (wbc->nr_to_write <= 0)
break;
}
- return; /* Leave any unwritten inodes on s_io */
+ return ret; /* Leave any unwritten inodes on s_io */
}
/*
@@ -408,10 +413,10 @@ sync_sb_inodes(struct super_block *sb, s
* sync_sb_inodes will seekout the blockdev which matches `bdi'. Maybe not
* super-efficient but we're about to do a ton of I/O...
*/
-void
-writeback_inodes(struct writeback_control *wbc)
+int writeback_inodes(struct writeback_control *wbc)
{
struct super_block *sb;
+ int ret = 0;
might_sleep();
spin_lock(&sb_lock);
@@ -429,9 +434,13 @@ restart:
*/
if (down_read_trylock(&sb->s_umount)) {
if (sb->s_root) {
+ int err;
+
spin_lock(&inode_lock);
- sync_sb_inodes(sb, wbc);
+ err = sync_sb_inodes(sb, wbc);
spin_unlock(&inode_lock);
+ if (!ret)
+ ret = err;
}
up_read(&sb->s_umount);
}
@@ -443,6 +452,7 @@ restart:
break;
}
spin_unlock(&sb_lock);
+ return ret;
}
/*
@@ -456,7 +466,7 @@ restart:
* We add in the number of potentially dirty inodes, because each inode write
* can dirty pagecache in the underlying blockdev.
*/
-void sync_inodes_sb(struct super_block *sb, int wait)
+int sync_inodes_sb(struct super_block *sb, int wait)
{
struct writeback_control wbc = {
.sync_mode = wait ? WB_SYNC_ALL : WB_SYNC_HOLD,
@@ -465,14 +475,16 @@ void sync_inodes_sb(struct super_block *
};
unsigned long nr_dirty = global_page_state(NR_FILE_DIRTY);
unsigned long nr_unstable = global_page_state(NR_UNSTABLE_NFS);
+ int ret;
wbc.nr_to_write = nr_dirty + nr_unstable +
(inodes_stat.nr_inodes - inodes_stat.nr_unused) +
nr_dirty + nr_unstable;
wbc.nr_to_write += wbc.nr_to_write / 2; /* Bit more for luck */
spin_lock(&inode_lock);
- sync_sb_inodes(sb, &wbc);
+ ret = sync_sb_inodes(sb, &wbc);
spin_unlock(&inode_lock);
+ return ret;
}
/*
@@ -508,13 +520,16 @@ static void set_sb_syncing(int val)
* outstanding dirty inodes, the writeback goes block-at-a-time within the
* filesystem's write_inode(). This is extremely slow.
*/
-static void __sync_inodes(int wait)
+static int __sync_inodes(int wait)
{
struct super_block *sb;
+ int ret = 0;
spin_lock(&sb_lock);
restart:
list_for_each_entry(sb, &super_blocks, s_list) {
+ int err;
+
if (sb->s_syncing)
continue;
sb->s_syncing = 1;
@@ -522,8 +537,12 @@ restart:
spin_unlock(&sb_lock);
down_read(&sb->s_umount);
if (sb->s_root) {
- sync_inodes_sb(sb, wait);
- sync_blockdev(sb->s_bdev);
+ err = sync_inodes_sb(sb, wait);
+ if (!ret)
+ ret = err;
+ err = sync_blockdev(sb->s_bdev);
+ if (!ret)
+ ret = err;
}
up_read(&sb->s_umount);
spin_lock(&sb_lock);
@@ -531,17 +550,25 @@ restart:
goto restart;
}
spin_unlock(&sb_lock);
+ return ret;
}
-void sync_inodes(int wait)
+int sync_inodes(int wait)
{
+ int ret;
+
set_sb_syncing(0);
- __sync_inodes(0);
+ ret = __sync_inodes(0);
if (wait) {
+ int err;
+
set_sb_syncing(0);
- __sync_inodes(1);
+ err = __sync_inodes(1);
+ if (!ret)
+ ret = err;
}
+ return ret;
}
/**
diff -puN include/linux/writeback.h~sync_sb_inodes-propagate-errors include/linux/writeback.h
--- a/include/linux/writeback.h~sync_sb_inodes-propagate-errors
+++ a/include/linux/writeback.h
@@ -64,11 +64,11 @@ struct writeback_control {
/*
* fs/fs-writeback.c
*/
-void writeback_inodes(struct writeback_control *wbc);
+int writeback_inodes(struct writeback_control *wbc);
void wake_up_inode(struct inode *inode);
int inode_wait(void *);
-void sync_inodes_sb(struct super_block *, int wait);
-void sync_inodes(int wait);
+int sync_inodes_sb(struct super_block *, int wait);
+int sync_inodes(int wait);
/* writeback.h requires fs.h; it, too, is not included from here. */
static inline void wait_on_inode(struct inode *inode)
_
From: Andrew Morton <[email protected]>
block_write_full_page() forgot to propagate ENPSOC into the address_space.
Cc: Guillaume Chazarain <[email protected]>
Signed-off-by: Andrew Morton <[email protected]>
---
fs/buffer.c | 1 +
1 files changed, 1 insertion(+)
diff -puN fs/buffer.c~block_write_full_page-handle-enospc fs/buffer.c
--- a/fs/buffer.c~block_write_full_page-handle-enospc
+++ a/fs/buffer.c
@@ -1740,6 +1740,7 @@ recover:
} while ((bh = bh->b_this_page) != head);
SetPageError(page);
BUG_ON(PageWriteback(page));
+ mapping_set_error(page->mapping, err);
set_page_writeback(page);
unlock_page(page);
do {
_
Andrew Morton a ?crit :
> There's a lot of sloppiness in there. Do these two patches fix things up?
>
Still no luck :-(
You managed to propagate the return value up to the
highest level (sync()), but as it's "void sync()" the return
value is dropped.
When msync() comes after it finds nothing in the mapping
flags.
Cheers.
--
Guillaume
Andrew Morton a ?crit :
> There's a lot of sloppiness in there. Do these two patches fix things up?
>
Still no luck :-(
It seems you made it possible for sync() to be aware of the returned
value, but there's nothing it can do with it since it's 'void sync()'.
Then msync() comes after and sees nothing in the mapping flags.
Cheers.
--
Guillaume
Guillaume Chazarain a ?crit :
> Still no luck :-(
First, sorry for the double posting.
The mapping flag is cleared by this call trace:
do_sync(1)
sync_inodes(1)
__sync_inodes(1)
sync_inodes_sb(sb, 1)
sync_sb_inodes(sb, WB_SYNC_ALL)
__writeback_single_inode(inode, WB_SYNC_ALL)
__sync_single_inode(inode, WB_SYNC_ALL)
filemap_fdatawait(mapping)
wait_on_page_writeback_range(mapping)
test_and_clear(...)
So, I can't see any solution other than:
- asking wait_on_page_writeback_range(mapping) to
leave the flag as is
or
- adding a mapping_set_error() after one of the call
in the trace.
Regards.
--
Guillaume