In the bio completion routine, we should not be setting
PageUptodate at all -- it's set at sys_write() time, and is
unaffected by success/failure of the write to disk.
This can cause a page corruption bug when

    block size < page size

if we have only written a single block -- we might end up
setting the entire PageUptodate, which will cause subsequent
reads to get bad data.
This commit also takes the opportunity to clean up error
handling in ext4_end_bio(), and remove some extraneous code:
 - Fix ext4_end_bio() to set AS_EIO in the
   page->mapping->flags on error, which was left out by
   mistake.
 - Remove the clear_buffer_dirty() call on unmapped
   buffers for each page.
 - Consolidate page/buffer error handling in a single
   section.
Signed-off-by: Curt Wohlgemuth <[email protected]>
Reported-by: Jim Meyering <[email protected]>
Reported-by: Hugh Dickins <[email protected]>
Cc: Mingming Cao <[email protected]>
---
Changelog since v2:
- Removed clear_buffer_dirty() call
- Consolidated error handling for pages and buffer heads
- Loop over BHs in a page even for page size == block size, so
we emit the correct error for such a case.
Changelog since v1:
- Added commit message text about setting AS_EIO for the
page on error.
- Continue to loop over all BHs in a page and emit unique
errors for each of them.
---
fs/ext4/page-io.c | 39 +++++++++++----------------------------
1 files changed, 11 insertions(+), 28 deletions(-)
diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
index b6dbd05..7bb8f76 100644
--- a/fs/ext4/page-io.c
+++ b/fs/ext4/page-io.c
@@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
for (i = 0; i < io_end->num_io_pages; i++) {
struct page *page = io_end->pages[i]->p_page;
struct buffer_head *bh, *head;
- int partial_write = 0;
+ loff_t offset;
+ loff_t io_end_offset;
- head = page_buffers(page);
- if (error)
+ if (error) {
SetPageError(page);
- BUG_ON(!head);
- if (head->b_size != PAGE_CACHE_SIZE) {
- loff_t offset;
- loff_t io_end_offset = io_end->offset + io_end->size;
+ set_bit(AS_EIO, &page->mapping->flags);
+ head = page_buffers(page);
+ BUG_ON(!head);
+
+ io_end_offset = io_end->offset + io_end->size;
offset = (sector_t) page->index << PAGE_CACHE_SHIFT;
bh = head;
do {
if ((offset >= io_end->offset) &&
- (offset+bh->b_size <= io_end_offset)) {
- if (error)
- buffer_io_error(bh);
-
- }
- if (buffer_delay(bh))
- partial_write = 1;
- else if (!buffer_mapped(bh))
- clear_buffer_dirty(bh);
- else if (buffer_dirty(bh))
- partial_write = 1;
+ (offset+bh->b_size <= io_end_offset))
+ buffer_io_error(bh);
+
offset += bh->b_size;
bh = bh->b_this_page;
} while (bh != head);
}
- /*
- * If this is a partial write which happened to make
- * all buffers uptodate then we can optimize away a
- * bogus readpage() for the next read(). Here we
- * 'discover' whether the page went uptodate as a
- * result of this (potentially partial) write.
- */
- if (!partial_write)
- SetPageUptodate(page);
On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
> In the bio completion routine, we should not be setting
> PageUptodate at all -- it's set at sys_write() time, and is
> unaffected by success/failure of the write to disk.
>
> This can cause a page corruption bug when
>
> block size < page size
>
> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
> - /*
> - * If this is a partial write which happened to make
> - * all buffers uptodate then we can optimize away a
> - * bogus readpage() for the next read(). Here we
> - * 'discover' whether the page went uptodate as a
> - * result of this (potentially partial) write.
> - */
> - if (!partial_write)
> - SetPageUptodate(page);
> -
I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
Cheers, Andreas
Hi Andreas:
On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <[email protected]> wrote:
> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>> In the bio completion routine, we should not be setting
>> PageUptodate at all -- it's set at sys_write() time, and is
>> unaffected by success/failure of the write to disk.
>>
>> This can cause a page corruption bug when
>>
>> block size < page size
>>
>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>> - /*
>> - * If this is a partial write which happened to make
>> - * all buffers uptodate then we can optimize away a
>> - * bogus readpage() for the next read(). Here we
>> - * 'discover' whether the page went uptodate as a
>> - * result of this (potentially partial) write.
>> - */
>> - if (!partial_write)
>> - SetPageUptodate(page);
>> -
>
> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
Hmm, that's a good question. I would kind of doubt that the page
would be marked uptodate when the final block was written, and this
might be what the code above was trying to do. It wasn't doing it
correctly :-), but it might have avoided the extra read when
there was no error.
I'll look at this some more, and see if I can't test for your scenario
above. Perhaps at least checking that all BHs in the page are mapped
+ uptodate => SetPageUptodate would not be out of line.
Thanks,
Curt
>
> Cheers, Andreas
Hi again Andreas:
On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <[email protected]> wrote:
> Hi Andreas:
>
> On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <[email protected]> wrote:
>> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>>> In the bio completion routine, we should not be setting
>>> PageUptodate at all -- it's set at sys_write() time, and is
>>> unaffected by success/failure of the write to disk.
>>>
>>> This can cause a page corruption bug when
>>>
>>> block size < page size
>>>
>>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>>> - /*
>>> - * If this is a partial write which happened to make
>>> - * all buffers uptodate then we can optimize away a
>>> - * bogus readpage() for the next read(). Here we
>>> - * 'discover' whether the page went uptodate as a
>>> - * result of this (potentially partial) write.
>>> - */
>>> - if (!partial_write)
>>> - SetPageUptodate(page);
>>> -
>>
>> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>
> Hmm, that's a good question. I would kind of doubt that the page
> would be marked uptodate when the final block was written, and this
> might be what the code above was trying to do. It wasn't doing it
> correctly :-), but it might have possibly avoided the extra read when
> it there was no error.
>
> I'll look at this some more, and see if I can't test for your scenario
> above. Perhaps at least checking that all BHs in the page are mapped
> + uptodate => SetPageUptodate would not be out of line.
My testing is now showing the read coming through after writing to the
4 blocks of a 4K file, using 1K blocksize. And it seems to me that
this is taken care of in __block_commit_write(), which is called from
all the .write_end callbacks for ext4, at least.
Thanks,
Curt
>
> Thanks,
> Curt
>
>>
>> Cheers, Andreas
On 2011-04-25, at 5:20 PM, Curt Wohlgemuth wrote:
> On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <[email protected]> wrote:
>> On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <[email protected]> wrote:
>>> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>>>> In the bio completion routine, we should not be setting
>>>> PageUptodate at all -- it's set at sys_write() time, and is
>>>> unaffected by success/failure of the write to disk.
>>>>
>>>> This can cause a page corruption bug when
>>>>
>>>> block size < page size
>>>>
>>>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>>>> - /*
>>>> - * If this is a partial write which happened to make
>>>> - * all buffers uptodate then we can optimize away a
>>>> - * bogus readpage() for the next read(). Here we
>>>> - * 'discover' whether the page went uptodate as a
>>>> - * result of this (potentially partial) write.
>>>> - */
>>>> - if (!partial_write)
>>>> - SetPageUptodate(page);
>>>> -
>>>
>>> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>>
>> Hmm, that's a good question. I would kind of doubt that the page
>> would be marked uptodate when the final block was written, and this
>> might be what the code above was trying to do. It wasn't doing it
>> correctly :-), but it might have possibly avoided the extra read when
>> it there was no error.
>>
>> I'll look at this some more, and see if I can't test for your scenario
>> above. Perhaps at least checking that all BHs in the page are mapped
>> + uptodate => SetPageUptodate would not be out of line.
>
> My testing is now showing the read coming through after writing to the
> 4 blocks of a 4K file, using 1K blocksize.
Sorry, but could you please clarify? Does "read coming through" mean that there is an IO sent to the disk, or that the page is uptodate and the read is handled from the page cache?
> And it seems to me that
> this is taken care of in __block_commit_write(), which is called from
> all the .write_end callbacks for ext4, at least.
It does indeed look like that should be handling this case. It would be good to verify that this is still true with your patch, just in case theory and reality don't align.
Cheers, Andreas
On Mon, Apr 25, 2011 at 5:58 PM, Andreas Dilger <[email protected]> wrote:
> On 2011-04-25, at 5:20 PM, Curt Wohlgemuth wrote:
>> On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <[email protected]> wrote:
>>> On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <[email protected]> wrote:
>>>> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>>>>> In the bio completion routine, we should not be setting
>>>>> PageUptodate at all -- it's set at sys_write() time, and is
>>>>> unaffected by success/failure of the write to disk.
>>>>>
>>>>> This can cause a page corruption bug when
>>>>>
>>>>> block size < page size
>>>>>
>>>>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>>>>> - /*
>>>>> - * If this is a partial write which happened to make
>>>>> - * all buffers uptodate then we can optimize away a
>>>>> - * bogus readpage() for the next read(). Here we
>>>>> - * 'discover' whether the page went uptodate as a
>>>>> - * result of this (potentially partial) write.
>>>>> - */
>>>>> - if (!partial_write)
>>>>> - SetPageUptodate(page);
>>>>> -
>>>>
>>>> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>>>
>>> Hmm, that's a good question. I would kind of doubt that the page
>>> would be marked uptodate when the final block was written, and this
>>> might be what the code above was trying to do. It wasn't doing it
>>> correctly :-), but it might have possibly avoided the extra read when
>>> it there was no error.
>>>
>>> I'll look at this some more, and see if I can't test for your scenario
>>> above. Perhaps at least checking that all BHs in the page are mapped
>>> + uptodate => SetPageUptodate would not be out of line.
>>
>> My testing is now showing the read coming through after writing to the
>> 4 blocks of a 4K file, using 1K blocksize.
>
> Sorry, but could you please clarify? Does "read coming through" mean that there is an IO sent to the disk, or that the page is uptodate and the read is handled from the page cache?
Wow, even I don't understand what I wrote above...
What I meant to write was: In my experiments, I do *not* see a read
going down to disk after the writes for the individual blocks. I
basically did this:
---------------------------
dd if=/dev/zero of=bar bs=1k count=1 seek=3 conv=notrunc
dd if=/dev/zero of=bar bs=1k count=1 seek=0 conv=notrunc
dd if=/dev/zero of=bar bs=1k count=1 seek=1 conv=notrunc
dd if=/dev/zero of=bar bs=1k count=1 seek=2 conv=notrunc
sync
dd if=bar of=foo bs=1k count=4
---------------------------
>
>> And it seems to me that
>> this is taken care of in __block_commit_write(), which is called from
>> all the .write_end callbacks for ext4, at least.
>
> It does indeed look like that should be handling this case. It would be good to verify that this is still true with your patch, just in case theory and reality don't align.
I see the same *non-read from disk* with 1k blocks both with my patch,
and with a vanilla kernel.
Thanks,
Curt
>
> Cheers, Andreas
On Tue, Apr 26, 2011 at 12:32 PM, Curt Wohlgemuth <[email protected]> wrote:
> On Mon, Apr 25, 2011 at 5:58 PM, Andreas Dilger <[email protected]> wrote:
>> On 2011-04-25, at 5:20 PM, Curt Wohlgemuth wrote:
>>> On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <[email protected]> wrote:
>>>> On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <[email protected]> wrote:
>>>>> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>>>>>> In the bio completion routine, we should not be setting
>>>>>> PageUptodate at all -- it's set at sys_write() time, and is
>>>>>> unaffected by success/failure of the write to disk.
>>>>>>
>>>>>> This can cause a page corruption bug when
>>>>>>
>>>>>> block size < page size
>>>>>>
>>>>>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>>>>>> - /*
>>>>>> - * If this is a partial write which happened to make
>>>>>> - * all buffers uptodate then we can optimize away a
>>>>>> - * bogus readpage() for the next read(). Here we
>>>>>> - * 'discover' whether the page went uptodate as a
>>>>>> - * result of this (potentially partial) write.
>>>>>> - */
>>>>>> - if (!partial_write)
>>>>>> - SetPageUptodate(page);
>>>>>> -
>>>>>
>>>>> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>>>>
>>>> Hmm, that's a good question. I would kind of doubt that the page
>>>> would be marked uptodate when the final block was written, and this
>>>> might be what the code above was trying to do. It wasn't doing it
>>>> correctly :-), but it might have possibly avoided the extra read when
>>>> it there was no error.
>>>>
>>>> I'll look at this some more, and see if I can't test for your scenario
>>>> above. Perhaps at least checking that all BHs in the page are mapped
>>>> + uptodate => SetPageUptodate would not be out of line.
>>>
>>> My testing is now showing the read coming through after writing to the
>>> 4 blocks of a 4K file, using 1K blocksize.
>>
>> Sorry, but could you please clarify? Does "read coming through" mean that there is an IO sent to the disk, or that the page is uptodate and the read is handled from the page cache?
>
> Wow, even I don't understand what I wrote above...
>
> What I meant to write was: In my experiments, I do *not* see a read
> going down to disk after the writes for the individual blocks. I
> basically did this:
>
> ---------------------------
> dd if=/dev/zero of=bar bs=1k count=1 seek=3 conv=notrunc
> dd if=/dev/zero of=bar bs=1k count=1 seek=0 conv=notrunc
> dd if=/dev/zero of=bar bs=1k count=1 seek=1 conv=notrunc
> dd if=/dev/zero of=bar bs=1k count=1 seek=2 conv=notrunc
>
> sync
try to evict all cache pages
echo 1 > /proc/sys/vm/drop_caches
>
> dd if=bar of=foo bs=1k count=4
> ---------------------------
>
>>
>>> And it seems to me that
>>> this is taken care of in __block_commit_write(), which is called from
>>> all the .write_end callbacks for ext4, at least.
>>
>> It does indeed look like that should be handling this case. It would be good to verify that this is still true with your patch, just in case theory and reality don't align.
>
> I see the same *non-read from disk* with 1k blocks both with my patch,
> and with a vanilla kernel.
>
> Thanks,
> Curt
>
>>
>> Cheers, Andreas
> --
> To unsubscribe from this list: send the line "unsubscribe linux-ext4" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
>
--
Best Wishes
Yongqiang Yang
This function is only called from the write path, which flushes pages to
disk; the pages' state has already been set correctly by the time
write_end() is called. Why do we handle the pages' state in this
function?
On Tue, Apr 26, 2011 at 6:40 AM, Andreas Dilger <[email protected]> wrote:
> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>> In the bio completion routine, we should not be setting
>> PageUptodate at all -- it's set at sys_write() time, and is
>> unaffected by success/failure of the write to disk.
>>
>> This can cause a page corruption bug when
>>
>> block size < page size
>>
>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>> - /*
>> - * If this is a partial write which happened to make
>> - * all buffers uptodate then we can optimize away a
>> - * bogus readpage() for the next read(). Here we
>> - * 'discover' whether the page went uptodate as a
>> - * result of this (potentially partial) write.
>> - */
>> - if (!partial_write)
>> - SetPageUptodate(page);
>> -
>
> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>
> Cheers, Andreas
>
--
Best Wishes
Yongqiang Yang
On Tue, Apr 26, 2011 at 03:41:47PM +0800, Yongqiang Yang wrote:
> This function is only called from write path which flushes pages to
> disk, actually, pages' state have been set right at time when
> write_end() is called. Why did we handle pages' state in this
> function?
This was a bug on my part. A lot of page-io.c was copied from
fs/buffer.c, and then optimized for bulk page writes, instead of
sending down a separate 4k bio write request at a time. I
(incorrectly) copied logic from block_commit_write() into the write
completion callback(). So I agree with Curt that the messing with the
page state should be removed from ext4_end_bio().
As far as messing with the buffer_dirty() logic, it's just wrong to be
clearing the buffer dirty bit at all in the write callback. In the
original codepath, the buffer dirty bit was set by
block_commit_write(), only to be cleared immediately by
__block_write_full_page(). I'm pretty sure the setting and clearing
the buffer dirty bit can be optimized out of fs/ext4/page-io.c.
What does bother me a little bit is that we have code that tests
buffer_dirty() in the delalloc logic in fs/ext4/inode.c --- note
ext4_bh_delay_on_unwritten(), and write_cache_pages_da(), where we are
testing the buffer_dirty bit for the buffers that hang off of struct
page. The only thing that might bear on this is the
ext4_block_truncate_page(), which does call mark_buffer_dirty() on the
buffer containing the truncation point (which by the way has made me
very worried about the punch code getting very well tested on 1k block
size file systems, since the edge cases when the region to be
truncated are not page-aligned are obviously going to be very hard to
get right).
I *think* we should be nuking the buffer_dirty manipulations in the
delalloc path, but we need to look very closely at that to make sure
there isn't anything going on here...
- Ted
On Mon, Apr 25, 2011 at 11:59 PM, Yongqiang Yang <[email protected]> wrote:
> On Tue, Apr 26, 2011 at 12:32 PM, Curt Wohlgemuth <[email protected]> wrote:
>> On Mon, Apr 25, 2011 at 5:58 PM, Andreas Dilger <[email protected]> wrote:
>>> On 2011-04-25, at 5:20 PM, Curt Wohlgemuth wrote:
>>>> On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <[email protected]> wrote:
>>>>> On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <[email protected]> wrote:
>>>>>> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>>>>>>> In the bio completion routine, we should not be setting
>>>>>>> PageUptodate at all -- it's set at sys_write() time, and is
>>>>>>> unaffected by success/failure of the write to disk.
>>>>>>>
>>>>>>> This can cause a page corruption bug when
>>>>>>>
>>>>>>> block size < page size
>>>>>>>
>>>>>>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>>>>>>> - /*
>>>>>>> - * If this is a partial write which happened to make
>>>>>>> - * all buffers uptodate then we can optimize away a
>>>>>>> - * bogus readpage() for the next read(). Here we
>>>>>>> - * 'discover' whether the page went uptodate as a
>>>>>>> - * result of this (potentially partial) write.
>>>>>>> - */
>>>>>>> - if (!partial_write)
>>>>>>> - SetPageUptodate(page);
>>>>>>> -
>>>>>>
>>>>>> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>>>>>
>>>>> Hmm, that's a good question. I would kind of doubt that the page
>>>>> would be marked uptodate when the final block was written, and this
>>>>> might be what the code above was trying to do. It wasn't doing it
>>>>> correctly :-), but it might have possibly avoided the extra read when
>>>>> it there was no error.
>>>>>
>>>>> I'll look at this some more, and see if I can't test for your scenario
>>>>> above. Perhaps at least checking that all BHs in the page are mapped
>>>>> + uptodate => SetPageUptodate would not be out of line.
>>>>
>>>> My testing is now showing the read coming through after writing to the
>>>> 4 blocks of a 4K file, using 1K blocksize.
>>>
>>> Sorry, but could you please clarify? Does "read coming through" mean that there is an IO sent to the disk, or that the page is uptodate and the read is handled from the page cache?
>>
>> Wow, even I don't understand what I wrote above...
>>
>> What I meant to write was: In my experiments, I do *not* see a read
>> going down to disk after the writes for the individual blocks. I
>> basically did this:
>>
>> ---------------------------
>> dd if=/dev/zero of=bar bs=1k count=1 seek=3 conv=notrunc
>> dd if=/dev/zero of=bar bs=1k count=1 seek=0 conv=notrunc
>> dd if=/dev/zero of=bar bs=1k count=1 seek=1 conv=notrunc
>> dd if=/dev/zero of=bar bs=1k count=1 seek=2 conv=notrunc
>>
>> sync
> try to evict all cache pages
>> echo 1 > /proc/sys/vm/drop_caches
But this won't prove anything. Andreas was worried about an
extraneous read from disk when the page in the page cache was in fact
up to date. If we evict the page cache, then we *better* read from
disk...
Curt
>
>>
>> dd if=bar of=foo bs=1k count=4
>> ---------------------------
>>
>>>
>>>> And it seems to me that
>>>> this is taken care of in __block_commit_write(), which is called from
>>>> all the .write_end callbacks for ext4, at least.
>>>
>>> It does indeed look like that should be handling this case. It would be good to verify that this is still true with your patch, just in case theory and reality don't align.
>>
>> I see the same *non-read from disk* with 1k blocks both with my patch,
>> and with a vanilla kernel.
>>
>> Thanks,
>> Curt
>>
>>>
>>> Cheers, Andreas
> --
> Best Wishes
> Yongqiang Yang
>
On Tue, Apr 26, 2011 at 11:37 PM, Curt Wohlgemuth <[email protected]> wrote:
> On Mon, Apr 25, 2011 at 11:59 PM, Yongqiang Yang <[email protected]> wrote:
>> On Tue, Apr 26, 2011 at 12:32 PM, Curt Wohlgemuth <[email protected]> wrote:
>>> On Mon, Apr 25, 2011 at 5:58 PM, Andreas Dilger <[email protected]> wrote:
>>>> On 2011-04-25, at 5:20 PM, Curt Wohlgemuth wrote:
>>>>> On Mon, Apr 25, 2011 at 3:45 PM, Curt Wohlgemuth <[email protected]> wrote:
>>>>>> On Mon, Apr 25, 2011 at 3:40 PM, Andreas Dilger <[email protected]> wrote:
>>>>>>> On 2011-04-25, at 2:23 PM, Curt Wohlgemuth wrote:
>>>>>>>> In the bio completion routine, we should not be setting
>>>>>>>> PageUptodate at all -- it's set at sys_write() time, and is
>>>>>>>> unaffected by success/failure of the write to disk.
>>>>>>>>
>>>>>>>> This can cause a page corruption bug when
>>>>>>>>
>>>>>>>> block size < page size
>>>>>>>>
>>>>>>>> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
>>>>>>>> - /*
>>>>>>>> - * If this is a partial write which happened to make
>>>>>>>> - * all buffers uptodate then we can optimize away a
>>>>>>>> - * bogus readpage() for the next read(). Here we
>>>>>>>> - * 'discover' whether the page went uptodate as a
>>>>>>>> - * result of this (potentially partial) write.
>>>>>>>> - */
>>>>>>>> - if (!partial_write)
>>>>>>>> - SetPageUptodate(page);
>>>>>>>> -
>>>>>>>
>>>>>>> I think this is the important part of the code - if there is a read-after-write for a file that was written in "blocksize" units (blocksize < pagesize), does the page get set uptodate when all of the blocks have been written and/or the writing is at EOF? Otherwise, a read-after-write will always cause data to be fetched from disk needlessly, even though the uptodate information is already in cache.
>>>>>>
>>>>>> Hmm, that's a good question. I would kind of doubt that the page
>>>>>> would be marked uptodate when the final block was written, and this
>>>>>> might be what the code above was trying to do. It wasn't doing it
>>>>>> correctly :-), but it might have possibly avoided the extra read when
>>>>>> it there was no error.
>>>>>>
>>>>>> I'll look at this some more, and see if I can't test for your scenario
>>>>>> above. Perhaps at least checking that all BHs in the page are mapped
>>>>>> + uptodate => SetPageUptodate would not be out of line.
>>>>>
>>>>> My testing is now showing the read coming through after writing to the
>>>>> 4 blocks of a 4K file, using 1K blocksize.
>>>>
>>>> Sorry, but could you please clarify? Does "read coming through" mean that there is an IO sent to the disk, or that the page is uptodate and the read is handled from the page cache?
>>>
>>> Wow, even I don't understand what I wrote above...
>>>
>>> What I meant to write was: In my experiments, I do *not* see a read
>>> going down to disk after the writes for the individual blocks. I
>>> basically did this:
>>>
>>> ---------------------------
>>> dd if=/dev/zero of=bar bs=1k count=1 seek=3 conv=notrunc
>>> dd if=/dev/zero of=bar bs=1k count=1 seek=0 conv=notrunc
>>> dd if=/dev/zero of=bar bs=1k count=1 seek=1 conv=notrunc
>>> dd if=/dev/zero of=bar bs=1k count=1 seek=2 conv=notrunc
>>>
>>> sync
>> try to evict all cache pages
>> echo 1 > /proc/sys/vm/drop_caches
>
> But this won't prove anything. Andreas was worried about an
> extraneous read from disk when the page in the page cache was in fact
> up to date. If we evict the page cache, then we *better* read from
> disk...
If the buffers are marked uptodate, a 'read from disk' does not happen
even when the page itself is not uptodate;
mapping->is_partially_uptodate handles that case.
>
> Curt
>
>>
>>>
>>> dd if=bar of=foo bs=1k count=4
>>> ---------------------------
>>>
>>>>
>>>>> And it seems to me that
>>>>> this is taken care of in __block_commit_write(), which is called from
>>>>> all the .write_end callbacks for ext4, at least.
>>>>
>>>> It does indeed look like that should be handling this case. It would be good to verify that this is still true with your patch, just in case theory and reality don't align.
>>>
>>> I see the same *non-read from disk* with 1k blocks both with my patch,
>>> and with a vanilla kernel.
>>>
>>> Thanks,
>>> Curt
>>>
>>>>
>>>> Cheers, Andreas
>> --
>> Best Wishes
>> Yongqiang Yang
>>
>
--
Best Wishes
Yongqiang Yang
On Mon, 25 Apr 2011, Curt Wohlgemuth wrote:
> In the bio completion routine, we should not be setting
> PageUptodate at all -- it's set at sys_write() time, and is
> unaffected by success/failure of the write to disk.
>
> This can cause a page corruption bug when
>
> block size < page size
>
> if we have only written a single block -- we might end up
> setting the entire PageUptodate, which will cause subsequent
> reads to get bad data.
>
> This commit also takes the opportunity to clean up error
> handling in ext4_end_bio(), and remove some extraneous code:
>
> - fixes ext4_end_bio() to set AS_EIO in the
> page->mapping->flags on error, which was left out by
> mistake.
> - remove the clear_buffer_dirty() call on unmapped
> buffers for each page.
> - consolidate page/buffer error handling in a single
> section.
>
> Signed-off-by: Curt Wohlgemuth <[email protected]>
> Reported-by: Jim Meyering <[email protected]>
> Reported-by: Hugh Dickins <[email protected]>
> Cc: Mingming Cao <[email protected]>
> ---
> Changelog since v2:
> - Removed clear_buffer_dirty() call
> - Consolidated error handling for pages and buffer heads
> - Loop over BHs in a page even for page size == block size, so
> we emit the correct error for such a case.
>
> Changelog since v1:
> - Added commit message text about setting AS_EIO for the
> page on error.
> - Continue to loop over all BHs in a page and emit unique
> errors for each of them.
> ---
> fs/ext4/page-io.c | 39 +++++++++++----------------------------
> 1 files changed, 11 insertions(+), 28 deletions(-)
>
> diff --git a/fs/ext4/page-io.c b/fs/ext4/page-io.c
> index b6dbd05..7bb8f76 100644
> --- a/fs/ext4/page-io.c
> +++ b/fs/ext4/page-io.c
> @@ -203,46 +203,29 @@ static void ext4_end_bio(struct bio *bio, int error)
> for (i = 0; i < io_end->num_io_pages; i++) {
> struct page *page = io_end->pages[i]->p_page;
> struct buffer_head *bh, *head;
> - int partial_write = 0;
> + loff_t offset;
> + loff_t io_end_offset;
>
> - head = page_buffers(page);
> - if (error)
> + if (error) {
> SetPageError(page);
> - BUG_ON(!head);
> - if (head->b_size != PAGE_CACHE_SIZE) {
> - loff_t offset;
> - loff_t io_end_offset = io_end->offset + io_end->size;
> + set_bit(AS_EIO, &page->mapping->flags);
> + head = page_buffers(page);
> + BUG_ON(!head);
> +
> + io_end_offset = io_end->offset + io_end->size;
>
> offset = (sector_t) page->index << PAGE_CACHE_SHIFT;
> bh = head;
> do {
> if ((offset >= io_end->offset) &&
> - (offset+bh->b_size <= io_end_offset)) {
> - if (error)
> - buffer_io_error(bh);
> -
> - }
> - if (buffer_delay(bh))
> - partial_write = 1;
> - else if (!buffer_mapped(bh))
> - clear_buffer_dirty(bh);
> - else if (buffer_dirty(bh))
> - partial_write = 1;
> + (offset+bh->b_size <= io_end_offset))
> + buffer_io_error(bh);
> +
> offset += bh->b_size;
> bh = bh->b_this_page;
> } while (bh != head);
> }
>
> - /*
> - * If this is a partial write which happened to make
> - * all buffers uptodate then we can optimize away a
> - * bogus readpage() for the next read(). Here we
> - * 'discover' whether the page went uptodate as a
> - * result of this (potentially partial) write.
> - */
> - if (!partial_write)
> - SetPageUptodate(page);
> -
> put_io_page(io_end->pages[i]);
> }
> io_end->num_io_pages = 0;
> --
> 1.7.3.1
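[For readers following along: the corruption mode the commit message
describes -- setting the whole page uptodate after writing a single
block -- can be modeled in plain userspace C. This is a hedged sketch
with invented names (cached_page, buggy_end_bio), not kernel code:]

```c
#include <assert.h>
#include <string.h>

#define BLOCKS 4
#define BSIZE 1024

struct cached_page {
	char data[BLOCKS * BSIZE];
	int uptodate;			/* PageUptodate analogue  */
	int bh_uptodate[BLOCKS];	/* BH_Uptodate analogues  */
};

/* sys_write() path: copy data into one block of the page and mark
 * only that buffer uptodate. */
void write_block(struct cached_page *p, int blk, char fill)
{
	memset(p->data + blk * BSIZE, fill, BSIZE);
	p->bh_uptodate[blk] = 1;
}

/* The bug being removed: bio completion marks the whole page
 * uptodate even though three of its four blocks are still stale. */
void buggy_end_bio(struct cached_page *p)
{
	p->uptodate = 1;
}

/* Read path: if the page claims to be uptodate, serve from the
 * cache blindly; otherwise fall back to the on-disk byte. */
char read_byte(struct cached_page *p, int off, char disk_byte)
{
	if (!p->uptodate && !p->bh_uptodate[off / BSIZE])
		return disk_byte;	/* disk read for valid data */
	return p->data[off];		/* trusts possibly-stale cache */
}
```

With the buggy completion in place, a subsequent 1k read of an
unwritten block is served from the stale page instead of going to
disk -- exactly the bad-data symptom the patch fixes.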
I'm concerned that we've reached -rc7, with Linus planning on 2.6.39
release next week, but Curt's fix above to the mblk_io corruption bug
seems to have fallen through the cracks.
I've been including it in all my testing over the last two weeks: it
works fine - and because of my own tmpfs bug, I even got to see its
error messages :) Adding in the patch is easy enough for me,
but surely we don't want others to stumble into this bug.
Thanks,
Hugh
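[Aside for readers: the AS_EIO part of the fix matters because the
writeback error must be latched on the mapping so a later fsync() can
report it. A hedged userspace model -- bit values and helper names
here are illustrative, not the kernel's:]

```c
#include <assert.h>

enum { AS_EIO_BIT = 0 };	/* illustrative bit position */

struct address_space_model {
	unsigned long flags;	/* mapping->flags analogue */
};

/* ext4_end_bio() on error: latch the failure on the mapping. */
void end_bio_error(struct address_space_model *m)
{
	m->flags |= 1UL << AS_EIO_BIT;	/* set_bit(AS_EIO, ...) */
}

/* fsync()-time check in the style of filemap's test-and-clear:
 * report -EIO once if writeback failed, then clear the flag. */
int fsync_result(struct address_space_model *m)
{
	if (m->flags & (1UL << AS_EIO_BIT)) {
		m->flags &= ~(1UL << AS_EIO_BIT);
		return -5;	/* -EIO */
	}
	return 0;
}
```

Without the set_bit() call the patch adds, the write error would be
silently dropped and fsync() would return success.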
On Tue, May 10, 2011 at 10:41:56AM -0700, Hugh Dickins wrote:
>
> I'm concerned that we've reached -rc7, with Linus planning on 2.6.39
> release next week, but Curt's fix above to the mblk_io corruption bug
> seems to have fallen through the cracks.
It's in the ext4 master branch so it's queued to be pushed during the
next merge window, and then it would go into the 2.6.39.x stable
series. So it didn't fall through the cracks; it was just a question
of whether to push late in the rc series, or waiting until the merge
window and then backporting to 2.6.39 stable. Basically I got the
patch in -rc5, and at that point, given that the vast majority of the
file systems are 4k blocksize on x86, and most Power and Itanic users
are using distribution kernels (and that's a very small number
anyway), I decided not to push it to Linus at that point.
It's a judgement call; and I could have gone the other way; it was
very much a 49/51 sort of decision.
- Ted
On Tue, May 10, 2011 at 12:17 PM, Ted Ts'o <[email protected]> wrote:
> On Tue, May 10, 2011 at 10:41:56AM -0700, Hugh Dickins wrote:
>>
>> I'm concerned that we've reached -rc7, with Linus planning on 2.6.39
>> release next week, but Curt's fix above to the mblk_io corruption bug
>> seems to have fallen through the cracks.
>
> It's in the ext4 master branch so it's queued to be pushed during the
> next merge window, and then it would go into the 2.6.39.x stable
> series. So it didn't fall through the cracks; it was just a question
> of whether to push late in the rc series, or waiting until the merge
> window and then backporting to 2.6.39 stable. Basically I got the
> patch in -rc5, and at that point, given that the vast majority of the
> file systems are 4k blocksize on x86, and most Power and Itanic users
> are using distribution kernels (and that's a very small number
> anyway), I decided not to push it to Linus at that point.
>
> It's a judgement call; and I could have gone the other way; it was
> very much a 49/51 sort of decision.
Hmm. Well, thanks for the reply. I do disagree with your decision,
and am surprised that you're still seeing it in the past tense. This
is a corruption and a regression: prime material for a fix even at
this stage.
Admittedly, in my case the data was written correctly and only
temporarily presented incorrectly; but in other tests that could then
have got written back incorrectly. And admittedly, not many people
outside of the large-page-size world are using blocksize less than
pagesize. But even so....
Moral: cc Rafael's regression list next time?
Hugh