2009-03-24 07:44:36

by Nick Piggin

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> On Fri 20-03-09 02:48:21, Nick Piggin wrote:

> > Holding mapping->private_lock over the __set_page_dirty should
> > fix it, although I guess you'd want to release it before calling
> > __mark_inode_dirty so as not to put inode_lock under there. I
> > have a patch for this if it sounds reasonable.
>
> Yes, that seems to be a bug - the function actually looked suspicious to
> me yesterday but I somehow convinced myself that it's fine. Probably
> because fsx-linux is single-threaded.


After a whole lot of chasing my own tail in the VM and buffer layers,
I think it is a problem in ext2 (and I haven't been able to reproduce
with ext3 yet, which might lend weight to that, although as we have
seen, it is very timing dependent).

That would be slightly unfortunate because we still have Jan's ext3
problem, and also another reported problem of corruption on ext3 (on
brd driver).

Anyway, when I have reproduced the problem with the test case, the
"lost" writes are all reported to be holes. Unfortunately, that doesn't
point straight to the filesystem, because ext2 allocates blocks in this
case at writeout time, so if dirty bits are getting lost, then it would
be normal to see holes.

I then put in a whole lot of extra infrastructure to track metadata about
each struct page (when it was last written out, when it last had the number
of writable ptes reach 0, when the dirty bits were last cleared etc). And
none of the normal assertions were triggering: e.g. when any page is removed
from pagecache (except truncates), it has always had all its buffers
written out *after* all ptes were made readonly or unmapped. Lots of other
tests and crap like that.

So I tried what I should have done to start with and did an e2fsck after
seeing corruption. Yes, it comes up with errors. Now that is unusual
because that should be largely insulated from the vm: if a dirty bit gets
lost, then the filesystem image should be quite happy and error-free with
a hole or unwritten data there.

I don't know ext? locking very well, except that it looks pretty overly
complex and crufty.

Usually, blocks are instantiated by write(2), under i_mutex, serialising
the allocator somewhat. mmap-write blocks are instantiated at writeout
time, unserialised. I moved truncate_mutex to cover the entire get_blocks
function, and can no longer trigger the problem. Might be a timing issue
though -- Ying, can you try this and see if you can still reproduce?

I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
Nice. Whether or not this caused the problem, can someone please tell me
why it got merged in that state?

I'm leaving ext3 running for now. It looks like a nasty task to bisect
ext2 down to that commit :( and I would be more interested in trying to
reproduce Jan's ext3 problem, however, because I'm not too interested in
diving into ext2 locking to work out exactly what is racing and how to
fix it properly. I suspect it would be most productive to wire up some
ioctls right into the block allocator/lookup and code up a userspace
tester for it that could probably stress it a lot harder than kernel
writeout can.



2009-03-24 10:27:47

by Nick Piggin

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tuesday 24 March 2009 18:44:21 Nick Piggin wrote:

> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this caused the problem, can someone please tell me
> why it got merged in that state?

Actually I must be wrong about this if the problem was reproduced in
2.6.18. Question still stands, though.


2009-03-24 10:32:04

by Andrew Morton

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <[email protected]> wrote:

> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
>
> > > Holding mapping->private_lock over the __set_page_dirty should
> > > fix it, although I guess you'd want to release it before calling
> > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > have a patch for this if it sounds reasonable.
> >
> > Yes, that seems to be a bug - the function actually looked suspicious to
> > me yesterday but I somehow convinced myself that it's fine. Probably
> > because fsx-linux is single-threaded.
>
>
> After a whole lot of chasing my own tail in the VM and buffer layers,
> I think it is a problem in ext2 (and I haven't been able to reproduce
> with ext3 yet, which might lend weight to that, although as we have
> seen, it is very timing dependent).
>
> That would be slightly unfortunate because we still have Jan's ext3
> problem, and also another reported problem of corruption on ext3 (on
> brd driver).
>
> Anyway, when I have reproduced the problem with the test case, the
> "lost" writes are all reported to be holes. Unfortunately, that doesn't
> point straight to the filesystem, because ext2 allocates blocks in this
> case at writeout time, so if dirty bits are getting lost, then it would
> be normal to see holes.
>
> I then put in a whole lot of extra infrastructure to track metadata about
> each struct page (when it was last written out, when it last had the number
> of writable ptes reach 0, when the dirty bits were last cleared etc). And
> none of the normal assertions were triggering: e.g. when any page is removed
> from pagecache (except truncates), it has always had all its buffers
> written out *after* all ptes were made readonly or unmapped. Lots of other
> tests and crap like that.
>
> So I tried what I should have done to start with and did an e2fsck after
> seeing corruption. Yes, it comes up with errors.

Do you recall what the errors were?

> Now that is unusual
> because that should be largely insulated from the vm: if a dirty bit gets
> lost, then the filesystem image should be quite happy and error-free with
> a hole or unwritten data there.
>
> I don't know ext? locking very well, except that it looks pretty overly
> complex and crufty.
>
> Usually, blocks are instantiated by write(2), under i_mutex, serialising
> the allocator somewhat. mmap-write blocks are instantiated at writeout
> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> function, and can no longer trigger the problem. Might be a timing issue
> though -- Ying, can you try this and see if you can still reproduce?
>
> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this caused the problem, can someone please tell me
> why it got merged in that state?
>
> I'm leaving ext3 running for now. It looks like a nasty task to bisect
> ext2 down to that commit :( and I would be more interested in trying to
> reproduce Jan's ext3 problem, however, because I'm not too interested in
> diving into ext2 locking to work out exactly what is racing and how to
> fix it properly. I suspect it would be most productive to wire up some
> ioctls right into the block allocator/lookup and code up a userspace
> tester for it that could probably stress it a lot harder than kernel
> writeout can.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .

2009-03-24 12:39:41

by Jan Kara

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

Hi,

On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
>
> > > Holding mapping->private_lock over the __set_page_dirty should
> > > fix it, although I guess you'd want to release it before calling
> > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > have a patch for this if it sounds reasonable.
> >
> > Yes, that seems to be a bug - the function actually looked suspicious to
> > me yesterday but I somehow convinced myself that it's fine. Probably
> > because fsx-linux is single-threaded.
>
>
> After a whole lot of chasing my own tail in the VM and buffer layers,
> I think it is a problem in ext2 (and I haven't been able to reproduce
> with ext3 yet, which might lend weight to that, although as we have
> seen, it is very timing dependent).
>
> That would be slightly unfortunate because we still have Jan's ext3
> problem, and also another reported problem of corruption on ext3 (on
> brd driver).
>
> Anyway, when I have reproduced the problem with the test case, the
> "lost" writes are all reported to be holes. Unfortunately, that doesn't
> point straight to the filesystem, because ext2 allocates blocks in this
> case at writeout time, so if dirty bits are getting lost, then it would
> be normal to see holes.
>
> I then put in a whole lot of extra infrastructure to track metadata about
> each struct page (when it was last written out, when it last had the number
> of writable ptes reach 0, when the dirty bits were last cleared etc). And
> none of the normal assertions were triggering: e.g. when any page is removed
> from pagecache (except truncates), it has always had all its buffers
> written out *after* all ptes were made readonly or unmapped. Lots of other
> tests and crap like that.
I see we're going the same way ;)

> So I tried what I should have done to start with and did an e2fsck after
> seeing corruption. Yes, it comes up with errors. Now that is unusual
> because that should be largely insulated from the vm: if a dirty bit gets
> lost, then the filesystem image should be quite happy and error-free with
> a hole or unwritten data there.
This is different for me. I see no corruption on the filesystem with
ext3. Anyway, errors from fsck would be useful to see what kind of
corruption you saw.

> I don't know ext? locking very well, except that it looks pretty overly
> complex and crufty.
>
> Usually, blocks are instantiated by write(2), under i_mutex, serialising
> the allocator somewhat. mmap-write blocks are instantiated at writeout
> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> function, and can no longer trigger the problem. Might be a timing issue
> though -- Ying, can you try this and see if you can still reproduce?
>
> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this caused the problem, can someone please tell me
> why it got merged in that state?
Maybe, I see it did some changes to ext2_get_blocks() which could be
where the problem was introduced...

> I'm leaving ext3 running for now. It looks like a nasty task to bisect
> ext2 down to that commit :( and I would be more interested in trying to
> reproduce Jan's ext3 problem, however, because I'm not too interested in
> diving into ext2 locking to work out exactly what is racing and how to
> fix it properly. I suspect it would be most productive to wire up some
> ioctls right into the block allocator/lookup and code up a userspace
> tester for it that could probably stress it a lot harder than kernel
> writeout can.
Yes, what I observed with ext3 so far is that data is properly copied and
page marked dirty when the data is copied in. But then at some point dirty
bit is cleared via block_write_full_page() but we don't get to submitting
at least one buffer in that page. I'm now debugging which path we take so
that this happens...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-24 12:55:23

by Jan Kara

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue 24-03-09 13:39:36, Jan Kara wrote:
> Hi,
>
> On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> >
> > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > fix it, although I guess you'd want to release it before calling
> > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > have a patch for this if it sounds reasonable.
> > >
> > > Yes, that seems to be a bug - the function actually looked suspicious to
> > > me yesterday but I somehow convinced myself that it's fine. Probably
> > > because fsx-linux is single-threaded.
> >
> >
> > After a whole lot of chasing my own tail in the VM and buffer layers,
> > I think it is a problem in ext2 (and I haven't been able to reproduce
> > with ext3 yet, which might lend weight to that, although as we have
> > seen, it is very timing dependent).
> >
> > That would be slightly unfortunate because we still have Jan's ext3
> > problem, and also another reported problem of corruption on ext3 (on
> > brd driver).
> >
> > Anyway, when I have reproduced the problem with the test case, the
> > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > point straight to the filesystem, because ext2 allocates blocks in this
> > case at writeout time, so if dirty bits are getting lost, then it would
> > be normal to see holes.
> >
> > I then put in a whole lot of extra infrastructure to track metadata about
> > each struct page (when it was last written out, when it last had the number
> > of writable ptes reach 0, when the dirty bits were last cleared etc). And
> > none of the normal assertions were triggering: e.g. when any page is removed
> > from pagecache (except truncates), it has always had all its buffers
> > written out *after* all ptes were made readonly or unmapped. Lots of other
> > tests and crap like that.
> I see we're going the same way ;)
>
> > So I tried what I should have done to start with and did an e2fsck after
> > seeing corruption. Yes, it comes up with errors. Now that is unusual
> > because that should be largely insulated from the vm: if a dirty bit gets
> > lost, then the filesystem image should be quite happy and error-free with
> > a hole or unwritten data there.
> This is different for me. I see no corruption on the filesystem with
> ext3. Anyway, errors from fsck would be useful to see what kind of
> corruption you saw.
>
> > I don't know ext? locking very well, except that it looks pretty overly
> > complex and crufty.
> >
> > Usually, blocks are instantiated by write(2), under i_mutex, serialising
> > the allocator somewhat. mmap-write blocks are instantiated at writeout
> > time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> > function, and can no longer trigger the problem. Might be a timing issue
> > though -- Ying, can you try this and see if you can still reproduce?
> >
> > I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> > Nice. Whether or not this caused the problem, can someone please tell me
> > why it got merged in that state?
> Maybe, I see it did some changes to ext2_get_blocks() which could be
> where the problem was introduced...
>
> > I'm leaving ext3 running for now. It looks like a nasty task to bisect
> > ext2 down to that commit :( and I would be more interested in trying to
> > reproduce Jan's ext3 problem, however, because I'm not too interested in
> > diving into ext2 locking to work out exactly what is racing and how to
> > fix it properly. I suspect it would be most productive to wire up some
> > ioctls right into the block allocator/lookup and code up a userspace
> > tester for it that could probably stress it a lot harder than kernel
> > writeout can.
> Yes, what I observed with ext3 so far is that data is properly copied and
> page marked dirty when the data is copied in. But then at some point dirty
> bit is cleared via block_write_full_page() but we don't get to submitting
> at least one buffer in that page. I'm now debugging which path we take so
> that this happens...
And one more interesting thing I don't yet fully understand - I see pages
having PageError() set when they are removed from page cache (and they have
been faulted in before). It's probably some interaction with pagecache
readahead...

Honza

2009-03-24 13:26:42

by Jan Kara

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue 24-03-09 13:55:10, Jan Kara wrote:
> On Tue 24-03-09 13:39:36, Jan Kara wrote:
> > Hi,
> >
> > On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > >
> > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > fix it, although I guess you'd want to release it before calling
> > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > have a patch for this if it sounds reasonable.
> > > >
> > > > Yes, that seems to be a bug - the function actually looked suspicious to
> > > > me yesterday but I somehow convinced myself that it's fine. Probably
> > > > because fsx-linux is single-threaded.
> > >
> > >
> > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > with ext3 yet, which might lend weight to that, although as we have
> > > seen, it is very timing dependent).
> > >
> > > That would be slightly unfortunate because we still have Jan's ext3
> > > problem, and also another reported problem of corruption on ext3 (on
> > > brd driver).
> > >
> > > Anyway, when I have reproduced the problem with the test case, the
> > > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > > point straight to the filesystem, because ext2 allocates blocks in this
> > > case at writeout time, so if dirty bits are getting lost, then it would
> > > be normal to see holes.
> > >
> > > I then put in a whole lot of extra infrastructure to track metadata about
> > > each struct page (when it was last written out, when it last had the number
> > > of writable ptes reach 0, when the dirty bits were last cleared etc). And
> > > none of the normal assertions were triggering: e.g. when any page is removed
> > > from pagecache (except truncates), it has always had all its buffers
> > > written out *after* all ptes were made readonly or unmapped. Lots of other
> > > tests and crap like that.
> > I see we're going the same way ;)
> >
> > > So I tried what I should have done to start with and did an e2fsck after
> > > seeing corruption. Yes, it comes up with errors. Now that is unusual
> > > because that should be largely insulated from the vm: if a dirty bit gets
> > > lost, then the filesystem image should be quite happy and error-free with
> > > a hole or unwritten data there.
> > This is different for me. I see no corruption on the filesystem with
> > ext3. Anyway, errors from fsck would be useful to see what kind of
> > corruption you saw.
> >
> > > I don't know ext? locking very well, except that it looks pretty overly
> > > complex and crufty.
> > >
> > > Usually, blocks are instantiated by write(2), under i_mutex, serialising
> > > the allocator somewhat. mmap-write blocks are instantiated at writeout
> > > time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> > > function, and can no longer trigger the problem. Might be a timing issue
> > > though -- Ying, can you try this and see if you can still reproduce?
> > >
> > > I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> > > Nice. Whether or not this caused the problem, can someone please tell me
> > > why it got merged in that state?
> > Maybe, I see it did some changes to ext2_get_blocks() which could be
> > where the problem was introduced...
> >
> > > I'm leaving ext3 running for now. It looks like a nasty task to bisect
> > > ext2 down to that commit :( and I would be more interested in trying to
> > > reproduce Jan's ext3 problem, however, because I'm not too interested in
> > > diving into ext2 locking to work out exactly what is racing and how to
> > > fix it properly. I suspect it would be most productive to wire up some
> > > ioctls right into the block allocator/lookup and code up a userspace
> > > tester for it that could probably stress it a lot harder than kernel
> > > writeout can.
> > Yes, what I observed with ext3 so far is that data is properly copied and
> > page marked dirty when the data is copied in. But then at some point dirty
> > bit is cleared via block_write_full_page() but we don't get to submitting
> > at least one buffer in that page. I'm now debugging which path we take so
> > that this happens...
> And one more interesting thing I don't yet fully understand - I see pages
> having PageError() set when they are removed from page cache (and they have
> been faulted in before). It's probably some interaction with pagecache
> readahead...
Argh... So the problem seems to be that get_block() occasionally returns
ENOSPC and we then discard the dirty data (hmm, we could give at least a
warning for that). I'm not yet sure why getblock behaves like this because
the filesystem seems to have enough space but anyway this seems to be some
strange fs trouble as well.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-24 14:01:45

by Chris Mason

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote:
> On Tue 24-03-09 13:55:10, Jan Kara wrote:

> > And one more interesting thing I don't yet fully understand - I see pages
> > having PageError() set when they are removed from page cache (and they have
> > been faulted in before). It's probably some interaction with pagecache
> > readahead...
> Argh... So the problem seems to be that get_block() occasionally returns
> ENOSPC and we then discard the dirty data (hmm, we could give at least a
> warning for that). I'm not yet sure why getblock behaves like this because
> the filesystem seems to have enough space but anyway this seems to be some
> strange fs trouble as well.
>

Ouch. Perhaps the free space is waiting on a journal commit?

-chris



2009-03-24 14:07:21

by Jan Kara

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue 24-03-09 10:01:45, Chris Mason wrote:
> On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote:
> > On Tue 24-03-09 13:55:10, Jan Kara wrote:
>
> > > And one more interesting thing I don't yet fully understand - I see pages
> > > having PageError() set when they are removed from page cache (and they have
> > > been faulted in before). It's probably some interaction with pagecache
> > > readahead...
> > Argh... So the problem seems to be that get_block() occasionally returns
> > ENOSPC and we then discard the dirty data (hmm, we could give at least a
> > warning for that). I'm not yet sure why getblock behaves like this because
> > the filesystem seems to have enough space but anyway this seems to be some
> > strange fs trouble as well.
> >
>
> Ouch. Perhaps the free space is waiting on a journal commit?
Yes, exactly. I've already found there's a lot of space held by the
committing transaction (it can easily hold a few hundred megs or a few gigs
with a larger journal, and my UML images aren't that big...). And the
writepage() implementation in ext3 does not have logic to retry. Also,
block_write_full_page() clears the buffers' dirty bits, so it's not easy to
retry even if we did. I'm now looking into how to fix this...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-03-24 14:30:00

by Nick Piggin

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Wednesday 25 March 2009 00:26:37 Jan Kara wrote:
> On Tue 24-03-09 13:55:10, Jan Kara wrote:
> > On Tue 24-03-09 13:39:36, Jan Kara wrote:
> > > Hi,
> > >
> > > On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > > fix it, although I guess you'd want to release it before calling
> > > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > > have a patch for this if it sounds reasonable.
> > > > >
> > > > > Yes, that seems to be a bug - the function actually looked
> > > > > suspicious to me yesterday but I somehow convinced myself that it's
> > > > > fine. Probably because fsx-linux is single-threaded.
> > > >
> > > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > > with ext3 yet, which might lend weight to that, although as we have
> > > > seen, it is very timing dependent).
> > > >
> > > > That would be slightly unfortunate because we still have Jan's ext3
> > > > problem, and also another reported problem of corruption on ext3 (on
> > > > brd driver).
> > > >
> > > > Anyway, when I have reproduced the problem with the test case, the
> > > > "lost" writes are all reported to be holes. Unfortunately, that
> > > > doesn't point straight to the filesystem, because ext2 allocates
> > > > blocks in this case at writeout time, so if dirty bits are getting
> > > > lost, then it would be normal to see holes.
> > > >
> > > > I then put in a whole lot of extra infrastructure to track metadata
> > > > about each struct page (when it was last written out, when it last
> > > > had the number of writable ptes reach 0, when the dirty bits were
> > > > last cleared etc). And none of the normal assertions were triggering:
> > > > eg. when any page is removed from pagecache (except truncates), it
> > > > has always had all its buffers written out *after* all ptes were made
> > > > readonly or unmapped. Lots of other tests and crap like that.
> > >
> > > I see we're going the same way ;)
> > >
> > > > So I tried what I should have done to start with and did an e2fsck
> > > > after seeing corruption. Yes, it comes up with errors. Now that is
> > > > unusual because that should be largely insulated from the vm: if a
> > > > dirty bit gets lost, then the filesystem image should be quite happy
> > > > and error-free with a hole or unwritten data there.
> > >
> > > This is different for me. I see no corruption on the filesystem with
> > > ext3. Anyway, errors from fsck would be useful to see what kind of
> > > corruption you saw.
> > >
> > > > I don't know ext? locking very well, except that it looks pretty
> > > > overly complex and crufty.
> > > >
> > > > Usually, blocks are instantiated by write(2), under i_mutex,
> > > > serialising the allocator somewhat. mmap-write blocks are
> > > > instantiated at writeout time, unserialised. I moved truncate_mutex
> > > > to cover the entire get_blocks function, and can no longer trigger
> > > > the problem. Might be a timing issue though -- Ying, can you try this
> > > > and see if you can still reproduce?
> > > >
> > > > I close my eyes and pick something out of a hat. a686cd89. Search for
> > > > XXX. Nice. Whether or not this caused the problem, can someone please
> > > > tell me why it got merged in that state?
> > >
> > > Maybe, I see it did some changes to ext2_get_blocks() which could be
> > > where the problem was introduced...
> > >
> > > > I'm leaving ext3 running for now. It looks like a nasty task to
> > > > bisect ext2 down to that commit :( and I would be more interested in
> > > > trying to reproduce Jan's ext3 problem, however, because I'm not too
> > > > interested in diving into ext2 locking to work out exactly what is
> > > > racing and how to fix it properly. I suspect it would be most
> > > > productive to wire up some ioctls right into the block
> > > > allocator/lookup and code up a userspace tester for it that could
> > > > probably stress it a lot harder than kernel writeout can.
> > >
> > > Yes, what I observed with ext3 so far is that data is properly copied
> > > and page marked dirty when the data is copied in. But then at some
> > > point dirty bit is cleared via block_write_full_page() but we don't get
> > > to submitting at least one buffer in that page. I'm now debugging which
> > > path we take so that this happens...
> >
> > And one more interesting thing I don't yet fully understand - I see
> > pages having PageError() set when they are removed from page cache (and
> > they have been faulted in before). It's probably some interaction with
> > pagecache readahead...
>
> Argh... So the problem seems to be that get_block() occasionally returns
> ENOSPC and we then discard the dirty data (hmm, we could give at least a
> warning for that). I'm not yet sure why getblock behaves like this because
> the filesystem seems to have enough space but anyway this seems to be some
> strange fs trouble as well.

Ah good find.

I don't think it is a very good idea for block_write_full_page recovery
to do clear_buffer_dirty for !mapped buffers. I think that should rather
be a redirty_page_for_writepage in the case that the buffer is dirty.

Perhaps not the cleanest way to solve the problem if it is just due to
transient shortage of space in ext3, but generic code shouldn't be
allowed to throw away dirty data even if it can't be written back due
to some software or hardware error.


2009-03-24 14:47:09

by Jan Kara

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> On Wednesday 25 March 2009 00:26:37 Jan Kara wrote:
> > On Tue 24-03-09 13:55:10, Jan Kara wrote:
> > > On Tue 24-03-09 13:39:36, Jan Kara wrote:
> > > > Hi,
> > > >
> > > > On Tue 24-03-09 18:44:21, Nick Piggin wrote:
> > > > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > > > fix it, although I guess you'd want to release it before calling
> > > > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > > > have a patch for this if it sounds reasonable.
> > > > > >
> > > > > > Yes, that seems to be a bug - the function actually looked
> > > > > > suspicious to me yesterday but I somehow convinced myself that it's
> > > > > > fine. Probably because fsx-linux is single-threaded.
> > > > >
> > > > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > > > with ext3 yet, which might lend weight to that, although as we have
> > > > > seen, it is very timing dependent).
> > > > >
> > > > > That would be slightly unfortunate because we still have Jan's ext3
> > > > > problem, and also another reported problem of corruption on ext3 (on
> > > > > brd driver).
> > > > >
> > > > > Anyway, when I have reproduced the problem with the test case, the
> > > > > "lost" writes are all reported to be holes. Unfortunately, that
> > > > > doesn't point straight to the filesystem, because ext2 allocates
> > > > > blocks in this case at writeout time, so if dirty bits are getting
> > > > > lost, then it would be normal to see holes.
> > > > >
> > > > > I then put in a whole lot of extra infrastructure to track metadata
> > > > > about each struct page (when it was last written out, when it last
> > > > > had the number of writable ptes reach 0, when the dirty bits were
> > > > > last cleared etc). And none of the normal assertions were triggering:
> > > > > eg. when any page is removed from pagecache (except truncates), it
> > > > > has always had all its buffers written out *after* all ptes were made
> > > > > readonly or unmapped. Lots of other tests and crap like that.
> > > >
> > > > I see we're going the same way ;)
> > > >
> > > > > So I tried what I should have done to start with and did an e2fsck
> > > > > after seeing corruption. Yes, it comes up with errors. Now that is
> > > > > unusual because that should be largely insulated from the vm: if a
> > > > > dirty bit gets lost, then the filesystem image should be quite happy
> > > > > and error-free with a hole or unwritten data there.
> > > >
> > > > This is different for me. I see no corruption on the filesystem with
> > > > ext3. Anyway, errors from fsck would be useful to see what kind of
> > > > corruption you saw.
> > > >
> > > > > I don't know ext? locking very well, except that it looks pretty
> > > > > overly complex and crufty.
> > > > >
> > > > > Usually, blocks are instantiated by write(2), under i_mutex,
> > > > > serialising the allocator somewhat. mmap-write blocks are
> > > > > instantiated at writeout time, unserialised. I moved truncate_mutex
> > > > > to cover the entire get_blocks function, and can no longer trigger
> > > > > the problem. Might be a timing issue though -- Ying, can you try this
> > > > > and see if you can still reproduce?
> > > > >
> > > > > I close my eyes and pick something out of a hat. a686cd89. Search for
> > > > > XXX. Nice. Whether or not this caused the problem, can someone please
> > > > > tell me why it got merged in that state?
> > > >
> > > > Maybe, I see it did some changes to ext2_get_blocks() which could be
> > > > where the problem was introduced...
> > > >
> > > > > I'm leaving ext3 running for now. It looks like a nasty task to
> > > > > bisect ext2 down to that commit :( and I would be more interested in
> > > > > trying to reproduce Jan's ext3 problem, however, because I'm not too
> > > > > interested in diving into ext2 locking to work out exactly what is
> > > > > racing and how to fix it properly. I suspect it would be most
> > > > > productive to wire up some ioctls right into the block
> > > > > allocator/lookup and code up a userspace tester for it that could
> > > > > probably stress it a lot harder than kernel writeout can.
> > > >
> > > > Yes, what I observed with ext3 so far is that data is properly copied
> > > > and page marked dirty when the data is copied in. But then at some
> > > > point dirty bit is cleared via block_write_full_page() but we don't get
> > > > to submitting at least one buffer in that page. I'm now debugging which
> > > > path we take so that this happens...
> > >
> > > And one more interesting thing I don't yet fully understand - I see
> > > pages having PageError() set when they are removed from page cache (and
> > > they have been faulted in before). It's probably some interaction with
> > > pagecache readahead...
> >
> > Argh... So the problem seems to be that get_block() occasionally returns
> > ENOSPC and we then discard the dirty data (hmm, we could give at least a
> > warning for that). I'm not yet sure why getblock behaves like this because
> > the filesystem seems to have enough space but anyway this seems to be some
> > strange fs trouble as well.
>
> Ah good find.
>
> I don't think it is a very good idea for block_write_full_page recovery
> to do clear_buffer_dirty for !mapped buffers. I think that should rather
> be a redirty_page_for_writepage in the case that the buffer is dirty.
>
> Perhaps not the cleanest way to solve the problem if it is just due to
> transient shortage of space in ext3, but generic code shouldn't be
> allowed to throw away dirty data even if it can't be written back due
> to some software or hardware error.
Well, that would be one possibility. But then we'd be left with dirty
pages we cannot ever release since they are constantly dirty (when the
filesystem really becomes out of space). So what I
rather want to do is something like below:

diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
index d351eab..77c526f 100644
--- a/fs/ext3/inode.c
+++ b/fs/ext3/inode.c
@@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle, struct buffer_head *bh)
}

/*
+ * Decides whether it's worthwhile to wait for transaction commit and
+ * retry allocation. If it is, the function waits and 1 is returned; otherwise
+ * 0 is returned. In both cases we redirty the page and its buffers so that
+ * data is not lost. In case we've retried too many times, we also return
+ * 0 and don't redirty the page. Data gets discarded but we cannot hang
+ * writepage forever...
+ */
+static int ext3_writepage_retry_alloc(struct page *page, int *retries,
+ struct writeback_control *wbc)
+{
+ struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb;
+ int ret = 0;
+
+ /*
+ * We don't want to slow down background writeback too much. On the
+ * other hand if most of the dirty data needs allocation, we better
+ * wait to make some progress
+ */
+ if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim &&
+ wbc->pages_skipped < wbc->nr_to_write / 2)
+ goto redirty;
+ /*
+ * Now wait if commit can free some space and we haven't retried
+ * too much
+ */
+ if (!ext3_should_retry_alloc(sb, retries))
+ return 0;
+ ret = 1;
+redirty:
+ set_page_dirty(page);
+ return ret;
+}
+
+/*
* Note that we always start a transaction even if we're not journalling
* data. This is to preserve ordering: any hole instantiation within
* __block_write_full_page -> ext3_get_block() should be journalled
@@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page *page,
handle_t *handle = NULL;
int ret = 0;
int err;
+ int retries;

if (ext3_journal_current_handle())
goto out_fail;

+restart:
handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
if (IS_ERR(handle)) {
ret = PTR_ERR(handle);
@@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page *page,
ret = block_write_full_page(page, ext3_get_block, wbc);

err = ext3_journal_stop(handle);
- if (!ret)
+ if (!ret) {
ret = err;
+ } else {
+ if (ret == -ENOSPC &&
+ ext3_writepage_retry_alloc(page, &retries, wbc))
+ goto restart;
+ }
return ret;

out_fail:

And similarly for the other two writepage implementations in ext3...
But it currently gives me:
WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145()
probably because of that set_page_dirty() in ext3_writepage_retry_alloc().

Or we could implement ext3_mkwrite() to allocate buffers already when we
make page writeable. But it costs some performance (we have to write page
full of zeros when allocating those buffers, where previously we didn't
have to do anything) and it's not trivial to make it work if pagesize >
blocksize (we should not allocate buffers outside of i_size so if i_size
= 1024, we create just one block in ext3_mkwrite() but then we need to
allocate more when we extend the file).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-24 14:56:03

by Peter Zijlstra

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
>
> Or we could implement ext3_mkwrite() to allocate buffers already when we
> make page writeable. But it costs some performance (we have to write page
> full of zeros when allocating those buffers, where previously we didn't
> have to do anything) and it's not trivial to make it work if pagesize >
> blocksize (we should not allocate buffers outside of i_size so if i_size
> = 1024, we create just one block in ext3_mkwrite() but then we need to
> allocate more when we extend the file).

I think this is the best option, failing with SIGBUS when we fail to
allocate blocks seems consistent with other filesystems as well.

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: [email protected]

2009-03-24 15:03:54

by Nick Piggin

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> On Wed 25-03-09 01:30:00, Nick Piggin wrote:

> > I don't think it is a very good idea for block_write_full_page recovery
> > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> > be a redirty_page_for_writepage in the case that the buffer is dirty.
> >
> > Perhaps not the cleanest way to solve the problem if it is just due to
> > transient shortage of space in ext3, but generic code shouldn't be
> > allowed to throw away dirty data even if it can't be written back due
> > to some software or hardware error.
>
> Well, that would be one possibility. But then we'd be left with dirty
> pages we cannot ever release since they are constantly dirty (when the
> filesystem really becomes out of space). So what I

If the filesystem becomes out of space and we have over-committed these
dirty mmapped blocks, then we most definitely want to keep them around.
An error of the system losing a few pages (or if it happens an insanely
large number of times, then slowly dying due to memory leak) is better
than an app suddenly seeing the contents of the page change to nulls
under it when the kernel decides to do some page reclaim.


> rather want to do is something like below:
>
> diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> index d351eab..77c526f 100644
> --- a/fs/ext3/inode.c
> +++ b/fs/ext3/inode.c
> @@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle,
> struct buffer_head *bh) }
>
> /*
> + * Decides whether it's worthwhile to wait for transaction commit and
> + * retry allocation. If it is, the function waits and 1 is returned; otherwise
> + * 0 is returned. In both cases we redirty the page and its buffers so that
> + * data is not lost. In case we've retried too many times, we also return
> + * 0 and don't redirty the page. Data gets discarded but we cannot hang
> + * writepage forever...
> + */
> +static int ext3_writepage_retry_alloc(struct page *page, int *retries,
> + struct writeback_control *wbc)
> +{
> + struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb;
> + int ret = 0;
> +
> + /*
> + * We don't want to slow down background writeback too much. On the
> + * other hand if most of the dirty data needs allocation, we better
> + * wait to make some progress
> + */
> + if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim &&
> + wbc->pages_skipped < wbc->nr_to_write / 2)
> + goto redirty;
> + /*
> + * Now wait if commit can free some space and we haven't retried
> + * too much
> + */
> + if (!ext3_should_retry_alloc(sb, retries))
> + return 0;
> + ret = 1;
> +redirty:
> + set_page_dirty(page);
> + return ret;
> +}
> +
> +/*
> * Note that we always start a transaction even if we're not journalling
> * data. This is to preserve ordering: any hole instantiation within
> * __block_write_full_page -> ext3_get_block() should be journalled
> @@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page
> *page, handle_t *handle = NULL;
> int ret = 0;
> int err;
> + int retries;
>
> if (ext3_journal_current_handle())
> goto out_fail;
>
> +restart:
> handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> if (IS_ERR(handle)) {
> ret = PTR_ERR(handle);
> @@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page
> *page, ret = block_write_full_page(page, ext3_get_block, wbc);
>
> err = ext3_journal_stop(handle);
> - if (!ret)
> + if (!ret) {
> ret = err;
> + } else {
> + if (ret == -ENOSPC &&
> + ext3_writepage_retry_alloc(page, &retries, wbc))
> + goto restart;
> + }
> return ret;
>
> out_fail:
>
> And similarly for the other two writepage implementations in ext3...
> But it currently gives me:
> WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145()
> probably because of that set_page_dirty() in ext3_writepage_retry_alloc().

And this is a valid warning because we don't know that all buffers are
uptodate or which ones to set as dirty, I think. Unless it is impossible
to have dirty && !uptodate pages come through this path.

But you shouldn't need to redirty the page at all if we change
block_write_full_page in the way I suggested. Because then it won't
have cleaned the buffer, and it will have done redirty_page_for_writepage.


> Or we could implement ext3_mkwrite() to allocate buffers already when we
> make page writeable. But it costs some performance (we have to write page
> full of zeros when allocating those buffers, where previously we didn't
> have to do anything) and it's not trivial to make it work if pagesize >
> blocksize (we should not allocate buffers outside of i_size so if i_size
> = 1024, we create just one block in ext3_mkwrite() but then we need to
> allocate more when we extend the file).

Well the core page_mkwrite function doesn't care about that case
properly either (it just doesn't allocate buffers on extend). I agree it
should be fixed, but it is a little hard (I need to fix it in fsblock
and I think what is required is a setattr helper for that).

2009-03-24 15:29:59

by Jan Kara

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue 24-03-09 15:56:03, Peter Zijlstra wrote:
> On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> >
> > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > make page writeable. But it costs some performance (we have to write page
> > full of zeros when allocating those buffers, where previously we didn't
> > have to do anything) and it's not trivial to make it work if pagesize >
> > blocksize (we should not allocate buffers outside of i_size so if i_size
> > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > allocate more when we extend the file).
>
> I think this is the best option, failing with SIGBUS when we fail to
> allocate blocks seems consistent with other filesystems as well.
I agree this looks attractive at first sight. But there are drawbacks
as I wrote - the problem with blocksize < pagesize, slight performance
decrease due to additional write, page faults doing allocation can take a
*long* time and overall fragmentation is going to be higher (previously
writepage wrote pages for us in the right order, now we are going to
allocate in the first-accessed order). So I'm not sure we really want to
go this way.
Hmm, maybe we could play a trick a la delayed allocation - i.e., reserve
some space in mkwrite() but don't actually allocate it. That would be done
in writepage(). This would solve all the problems I describe above. We could
use PG_Checked flag to track that the page has a reservation and behave
accordingly in writepage() / invalidatepage(). ext3 in data=journal mode
already uses the flag but the use seems to be compatible with what I want
to do now... So it may actually work.
BTW: Note that there's a plenty of filesystems that don't implement
mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
ENOSPC. So I'd not speak too much about consistency ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-03-24 15:35:01

by Nick Piggin

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote:
> On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <[email protected]>
wrote:
> > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > fix it, although I guess you'd want to release it before calling
> > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > have a patch for this if it sounds reasonable.
> > >
> > > Yes, that seems to be a bug - the function actually looked suspicious
> > > to me yesterday but I somehow convinced myself that it's fine. Probably
> > > because fsx-linux is single-threaded.
> >
> > After a whole lot of chasing my own tail in the VM and buffer layers,
> > I think it is a problem in ext2 (and I haven't been able to reproduce
> > with ext3 yet, which might lend weight to that, although as we have
> > seen, it is very timing dependent).
> >
> > That would be slightly unfortunate because we still have Jan's ext3
> > problem, and also another reported problem of corruption on ext3 (on
> > brd driver).
> >
> > Anyway, when I have reproduced the problem with the test case, the
> > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > point straight to the filesystem, because ext2 allocates blocks in this
> > case at writeout time, so if dirty bits are getting lost, then it would
> > be normal to see holes.
> >
> > I then put in a whole lot of extra infrastructure to track metadata about
> > each struct page (when it was last written out, when it last had the
> > number of writable ptes reach 0, when the dirty bits were last cleared
> > etc). And none of the normal assertions were triggering: eg. when any page
> > is removed from pagecache (except truncates), it has always had all its
> > buffers written out *after* all ptes were made readonly or unmapped. Lots
> > of other tests and crap like that.
> >
> > So I tried what I should have done to start with and did an e2fsck after
> > seeing corruption. Yes, it comes up with errors.
>
> Do you recall what the errors were?

OK, after running several tests in parallel and having 3 of them
blow up, I unmounted the fs (so error-case files are still intact).

# e2fsck -fn /dev/ram0
e2fsck 1.41.3 (12-Oct-2008)
Pass 1: Checking inodes, blocks, and sizes
Inode 16, i_blocks is 131594, should be 131566. Fix? no

Inode 18, i_blocks is 131588, should be 131576. Fix? no

Inode 21, i_blocks is 131594, should be 131552. Fix? no

Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: -(628209--628220) -628231 -628233 -(638751--638755)
-638765 -(646271--646295) -(646301--646304) -647609 -(651501--651505) -651509
-(651719--651726) -(651732--651733) -(665666--665670)
Fix? no


/dev/ram0: ********** WARNING: Filesystem still has errors **********

/dev/ram0: 21/229376 files (4.8% non-contiguous), 407105/3670016 blocks

ino 16, 18, 21 of course are the files with errors.


inode 18 is the simplest case with just one hole, so let's look at that:

#hexdump file9
0000000 ffff ffff ffff ffff ffff ffff ffff ffff
*
3c8c000 0000 0000 0000 0000 0000 0000 0000 0000
*
3c8d400 ffff ffff ffff ffff ffff ffff ffff ffff
*
4000000


Let's take a look at our hole then:

#./bmap file9 // bmap is modified to print hex offsets
[... lots of stuff ...]
3c82000-3c82c00: 26fd0400-26fd1000 (1000)
3c83000-3c83c00: 26fd3400-26fd4000 (1000)
3c84000-3c84c00: 26fc9c00-26fca800 (1000)
3c85000-3c85c00: 26fcc400-26fcd000 (1000)
3c86000-3c86c00: 26fcf400-26fd0000 (1000)
3c87000-3c87c00: 26fd2400-26fd3000 (1000)
3c88000-3c88c00: 26fd5400-26fd6000 (1000)
3c89000-3c8bc00: 26fd7400-26fda000 (3000)
3c8c000-3c8c000: 0-0 (400)
3c8c400-3c8c400: 0-0 (400)
3c8c800-3c8c800: 0-0 (400)
3c8cc00-3c8cc00: 0-0 (400)
3c8d000-3c8d000: 0-0 (400)
3c8d400-3c8dc00: 26fcb800-26fcc000 (c00)
3c8e000-3c8ec00: 26fce400-26fcf000 (1000)
3c8f000-3c8fc00: 26fd1400-26fd2000 (1000)
3c90000-3c99c00: 27924400-2792e000 (a000)
3c9a000-3c9ac00: 2792f000-2792fc00 (1000)
3c9b000-3c9bc00: 27931000-27931c00 (1000)
3c9c000-3c9cc00: 27933000-27933c00 (1000)
3c9d000-3c9dc00: 27935000-27935c00 (1000)
3c9e000-3c9ec00: 27938000-27938c00 (1000)
3c9f000-3c9fc00: 2793a000-2793ac00 (1000)
[... lots more stuff ...]

3.5G filesystem image bzip2s down to 500K if anybody wants it I
can send it privately.


2009-03-24 15:48:18

by Jan Kara

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Wed 25-03-09 02:03:54, Nick Piggin wrote:
> On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
>
> > > I don't think it is a very good idea for block_write_full_page recovery
> > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> > > be a redirty_page_for_writepage in the case that the buffer is dirty.
> > >
> > > Perhaps not the cleanest way to solve the problem if it is just due to
> > > transient shortage of space in ext3, but generic code shouldn't be
> > > allowed to throw away dirty data even if it can't be written back due
> > > to some software or hardware error.
> >
> > Well, that would be one possibility. But then we'd be left with dirty
> > pages we cannot ever release since they are constantly dirty (when the
> > filesystem really becomes out of space). So what I
>
> If the filesystem becomes out of space and we have over-committed these
> dirty mmapped blocks, then we most definitely want to keep them around.
> An error of the system losing a few pages (or if it happens an insanely
> large number of times, then slowly dying due to memory leak) is better
> than an app suddenly seeing the contents of the page change to nulls
> under it when the kernel decides to do some page reclaim.
Hmm, probably you're right. Definitely it would be much easier to track
the problem down than it is now... Thinking a bit more... But couldn't a
malicious user easily bring the machine to OOM this way? That would be
unfortunate.

> > rather want to do is something like below:
> >
> > diff --git a/fs/ext3/inode.c b/fs/ext3/inode.c
> > index d351eab..77c526f 100644
> > --- a/fs/ext3/inode.c
> > +++ b/fs/ext3/inode.c
> > @@ -1440,6 +1440,40 @@ static int journal_dirty_data_fn(handle_t *handle,
> > struct buffer_head *bh) }
> >
> > /*
> > + * Decides whether it's worthwhile to wait for transaction commit and
> > + * retry allocation. If it is, the function waits and 1 is returned; otherwise
> > + * 0 is returned. In both cases we redirty the page and its buffers so that
> > + * data is not lost. In case we've retried too many times, we also return
> > + * 0 and don't redirty the page. Data gets discarded but we cannot hang
> > + * writepage forever...
> > + */
> > +static int ext3_writepage_retry_alloc(struct page *page, int *retries,
> > + struct writeback_control *wbc)
> > +{
> > + struct super_block *sb = ((struct inode *)page->mapping->host)->i_sb;
> > + int ret = 0;
> > +
> > + /*
> > + * We don't want to slow down background writeback too much. On the
> > + * other hand if most of the dirty data needs allocation, we better
> > + * wait to make some progress
> > + */
> > + if (wbc->sync_mode == WB_SYNC_NONE && !wbc->for_reclaim &&
> > + wbc->pages_skipped < wbc->nr_to_write / 2)
> > + goto redirty;
> > + /*
> > + * Now wait if commit can free some space and we haven't retried
> > + * too much
> > + */
> > + if (!ext3_should_retry_alloc(sb, retries))
> > + return 0;
> > + ret = 1;
> > +redirty:
> > + set_page_dirty(page);
> > + return ret;
> > +}
> > +
> > +/*
> > * Note that we always start a transaction even if we're not journalling
> > * data. This is to preserve ordering: any hole instantiation within
> > * __block_write_full_page -> ext3_get_block() should be journalled
> > @@ -1564,10 +1598,12 @@ static int ext3_writeback_writepage(struct page
> > *page, handle_t *handle = NULL;
> > int ret = 0;
> > int err;
> > + int retries;
> >
> > if (ext3_journal_current_handle())
> > goto out_fail;
> >
> > +restart:
> > handle = ext3_journal_start(inode, ext3_writepage_trans_blocks(inode));
> > if (IS_ERR(handle)) {
> > ret = PTR_ERR(handle);
> > @@ -1580,8 +1616,13 @@ static int ext3_writeback_writepage(struct page
> > *page, ret = block_write_full_page(page, ext3_get_block, wbc);
> >
> > err = ext3_journal_stop(handle);
> > - if (!ret)
> > + if (!ret) {
> > ret = err;
> > + } else {
> > + if (ret == -ENOSPC &&
> > + ext3_writepage_retry_alloc(page, &retries, wbc))
> > + goto restart;
> > + }
> > return ret;
> >
> > out_fail:
> >
> > And similarly for the other two writepage implementations in ext3...
> > But it currently gives me:
> > WARNING: at fs/buffer.c:781 __set_page_dirty+0x8d/0x145()
> > probably because of that set_page_dirty() in ext3_writepage_retry_alloc().
>
> And this is a valid warning because we don't know that all buffers are
> uptodate or which ones to set as dirty, I think. Unless it is impossible
> to have dirty && !uptodate pages come through this path.
Ah, OK, thanks for explanation.

> But you shouldn't need to redirty the page at all if we change
> block_write_full_page in the way I suggested. Because then it won't
> have cleaned the buffer, and it will have done redirty_page_for_writepage.
>
> > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > make page writeable. But it costs some performance (we have to write page
> > full of zeros when allocating those buffers, where previously we didn't
> > have to do anything) and it's not trivial to make it work if pagesize >
> > blocksize (we should not allocate buffers outside of i_size so if i_size
> > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > allocate more when we extend the file).
>
> Well the core page_mkwrite function doesn't care about that case
> properly either (it just doesn't allocate buffers on extend). I agree it
> should be fixed, but it is a little hard (I need to fix it in fsblock
> and I think what is required is a setattr helper for that).
Won't it be enough if an extending truncate called page_mkwrite() on the
page which used to be the last one in the file? That would be enough for
my use although it seems a bit hacky I agree...


Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2009-03-24 17:35:11

by Jan Kara

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue 24-03-09 16:48:14, Jan Kara wrote:
> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> >
> > > > I don't think it is a very good idea for block_write_full_page recovery
> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
> > > >
> > > > Perhaps not the cleanest way to solve the problem if it is just due to
> > > > transient shortage of space in ext3, but generic code shouldn't be
> > > > allowed to throw away dirty data even if it can't be written back due
> > > > to some software or hardware error.
> > >
> > > Well, that would be one possibility. But then we'd be left with dirty
> > > pages we cannot ever release since they are constantly dirty (when the
> > > filesystem really becomes out of space). So what I
> >
> > If the filesystem becomes out of space and we have over-committed these
> > dirty mmapped blocks, then we most definitely want to keep them around.
> > An error of the system losing a few pages (or if it happens an insanely
> > large number of times, then slowly dying due to memory leak) is better
> > than an app suddenly seeing the contents of the page change to nulls
> > under it when the kernel decides to do some page reclaim.
> Hmm, probably you're right. Definitely it would be much easier to track
> the problem down than it is now... Thinking a bit more... But couldn't a
> malicious user easily bring the machine to OOM this way? That would be
> unfortunate.
OK, below is the patch which makes things work for me (i.e. no data
lost). What do you think?

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001
From: Jan Kara <[email protected]>
Date: Tue, 24 Mar 2009 16:38:22 +0100
Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page()

If getblock() fails in block_write_full_page(), we don't want to clear
dirty bits on buffers. Actually, we even want to redirty the page. This
way we just won't silently discard users data (written e.g. through mmap)
in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this
approach is that if the error is persistent we have this page pinned in
memory forever and if there are lots of such pages, we can bring the
machine OOM.

Signed-off-by: Jan Kara <[email protected]>
---
fs/buffer.c | 10 +++-------
1 files changed, 3 insertions(+), 7 deletions(-)

diff --git a/fs/buffer.c b/fs/buffer.c
index 891e1c7..ae779a0 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -1833,9 +1833,11 @@ recover:
/*
* ENOSPC, or some other error. We may already have added some
* blocks to the file, so we need to write these out to avoid
- * exposing stale data.
+ * exposing stale data. We redirty the page so that we don't
+ * lose data we are unable to write.
* The page is currently locked and not marked for writeback
*/
+ redirty_page_for_writepage(wbc, page);
bh = head;
/* Recovery: lock and submit the mapped buffers */
do {
@@ -1843,12 +1845,6 @@ recover:
!buffer_delay(bh)) {
lock_buffer(bh);
mark_buffer_async_write(bh);
- } else {
- /*
- * The buffer may have been set dirty during
- * attachment to a dirty page.
- */
- clear_buffer_dirty(bh);
}
} while ((bh = bh->b_this_page) != head);
SetPageError(page);
--
1.6.0.2


2009-03-24 20:14:34

by OGAWA Hirofumi

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

Jan Kara <[email protected]> writes:

> BTW: Note that there's a plenty of filesystems that don't implement
> mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
> ENOSPC. So I'd not speak too much about consistency ;).

FWIW, fatfs doesn't allow sparse files (mmap of a non-allocated region),
so I guess there is no problem.
--
OGAWA Hirofumi <[email protected]>


2009-03-26 00:03:58

by Ying Han

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, Mar 24, 2009 at 3:32 AM, Andrew Morton
<[email protected]> wrote:
> On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <[email protected]> wrote:
>
>> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
>> > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
>>
>> > > Holding mapping->private_lock over the __set_page_dirty should
>> > > fix it, although I guess you'd want to release it before calling
>> > > __mark_inode_dirty so as not to put inode_lock under there. I
>> > > have a patch for this if it sounds reasonable.
>> >
>> > Yes, that seems to be a bug - the function actually looked suspicious to
>> > me yesterday but I somehow convinced myself that it's fine. Probably
>> > because fsx-linux is single-threaded.
>>
>>
>> After a whole lot of chasing my own tail in the VM and buffer layers,
>> I think it is a problem in ext2 (and I haven't been able to reproduce
>> with ext3 yet, which might lend weight to that, although as we have
>> seen, it is very timing dependent).
>>
>> That would be slightly unfortunate because we still have Jan's ext3
>> problem, and also another reported problem of corruption on ext3 (on
>> brd driver).
>>
>> Anyway, when I have reproduced the problem with the test case, the
>> "lost" writes are all reported to be holes. Unfortunately, that doesn't
>> point straight to the filesystem, because ext2 allocates blocks in this
>> case at writeout time, so if dirty bits are getting lost, then it would
>> be normal to see holes.
>>
>> I then put in a whole lot of extra infrastructure to track metadata about
>> each struct page (when it was last written out, when it last had the number
>> of writable ptes reach 0, when the dirty bits were last cleared etc). And
>> none of the normal assertions were triggering: e.g. when any page is removed
>> from pagecache (except truncates), it has always had all its buffers
>> written out *after* all ptes were made readonly or unmapped. Lots of other
>> tests and crap like that.
>>
>> So I tried what I should have done to start with and did an e2fsck after
>> seeing corruption. Yes, it comes up with errors.
>
> Do you recall what the errors were?

I ran e2fsck on the partition after the failure happened, and here is
what I saw; I'm not sure if it is the same message Jan looked at:

e2fsck 1.41.3 (12-Oct-2008)
Warning! /dev/hda1 is mounted.
/dev/hda1 contains a file system with errors, check forced.
Pass 1: Checking inodes, blocks, and sizes
Pass 2: Checking directory structure
Pass 3: Checking directory connectivity
Pass 4: Checking reference counts
Pass 5: Checking group summary information
Block bitmap differences: +74915 -195111 -224680
Fix? no

Free blocks count wrong for group #6 (170, counted=169).
Fix? no

Free blocks count wrong (10120, counted=523).
Fix? no

Free inodes count wrong (95678, counted=95672).
Fix? no


/dev/hda1: ********** WARNING: Filesystem still has errors **********

/dev/hda1: 35938/131616 files (1.5% non-contiguous), 252936/263056 blocks

--Ying


>
>> Now that is unusual
>> because that should be largely insulated from the vm: if a dirty bit gets
>> lost, then the filesystem image should be quite happy and error-free with
>> a hole or unwritten data there.
>>
>> I don't know ext? locking very well, except that it looks pretty overly
>> complex and crufty.
>>
>> Usually, blocks are instantiated by write(2), under i_mutex, serialising
>> the allocator somewhat. mmap-write blocks are instantiated at writeout
>> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
>> function, and can no longer trigger the problem. Might be a timing issue
>> though -- Ying, can you try this and see if you can still reproduce?
>>
>> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
>> Nice. Whether or not this caused the problem, can someone please tell me
>> why it got merged in that state?
>>
>> I'm leaving ext3 running for now. It looks like a nasty task to bisect
>> ext2 down to that commit :( and I would be more interested in trying to
>> reproduce Jan's ext3 problem, however, because I'm not too interested in
>> diving into ext2 locking to work out exactly what is racing and how to
>> fix it properly. I suspect it would be most productive to wire up some
>> ioctls right into the block allocator/lookup and code up a userspace
>> tester for it that could probably stress it a lot harder than kernel
>> writeout can.
>


2009-03-26 08:18:43

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, Mar 24, 2009 at 03:07:21PM +0100, Jan Kara wrote:
> On Tue 24-03-09 10:01:45, Chris Mason wrote:
> > On Tue, 2009-03-24 at 14:26 +0100, Jan Kara wrote:
> > > On Tue 24-03-09 13:55:10, Jan Kara wrote:
> >
> > > > And one more interesting thing I don't yet fully understand - I see pages
> > > > having PageError() set when they are removed from page cache (and they have
> > > > been faulted in before). It's probably some interaction with pagecache
> > > > readahead...
> > > Argh... So the problem seems to be that get_block() occasionally returns
> > > ENOSPC and we then discard the dirty data (hmm, we could give at least a
> > > warning for that). I'm not yet sure why getblock behaves like this because
> > > the filesystem seems to have enough space but anyway this seems to be some
> > > strange fs trouble as well.
> > >
> >
> > Ouch. Perhaps the free space is waiting on a journal commit?
> Yes, exactly. I've already found there's a lot of space held by the
> committing transaction (it can easily hold a few hundred megs or a few gigs
> with a larger journal, and my UML images aren't that big...). And the writepage()
> implementation in ext3 does not have logic to retry. Also,
> block_write_full_page() clears the buffers' dirty bits, so it's not easy to retry
> even if we did. I'm now looking into how to fix this...

We retry block allocation in ext3_write_begin. And for mmap we should be
doing something similar to ext4_page_mkwrite so that we can be sure
that during writepage we don't need to do block allocation.

-aneesh


2009-03-26 08:47:45

by Aneesh Kumar K.V

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, Mar 24, 2009 at 04:29:59PM +0100, Jan Kara wrote:
> On Tue 24-03-09 15:56:03, Peter Zijlstra wrote:
> > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> > >
> > > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > > make page writeable. But it costs some performance (we have to write page
> > > full of zeros when allocating those buffers, where previously we didn't
> > > have to do anything) and it's not trivial to make it work if pagesize >
> > > blocksize (we should not allocate buffers outside of i_size so if i_size
> > > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > > allocate more when we extend the file).
> >
> > I think this is the best option, failing with SIGBUS when we fail to
> > allocate blocks seems consistent with other filesystems as well.
> I agree this looks attractive at first sight. But there are drawbacks
> as I wrote - the problem with blocksize < pagesize, a slight performance
> decrease due to the additional write,

It should not cause an additional write. Can you tell me why it would
result in an additional write?


>page faults doing allocation can take a
> *long* time

That is true

>and overall fragmentation is going to be higher (previously
> writepage wrote pages for us in the right order, now we are going to
> allocate in the first-accessed order). So I'm not sure we really want to
> go this way.


The block allocator should be improved to fix that. For example, ext4
mballoc also looks at the logical file block number when doing block
allocation. So if we do enough reservation, it should handle both
first-accessed-order and sequential-order allocation properly.

Another reason why I think we need ext3_page_mkwrite is: if we
really are out of space, how do we handle it? Currently the patch you
posted does redirty_page_for_writepage, which implies we can't
reclaim the page, and since get_block returns ENOSPC we can't allocate
blocks.

> Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve
> some space in mkwrite() but don't actually allocate it. That would be done
> in writepage(). This would solve all the problems I describe above. We could
> use PG_Checked flag to track that the page has a reservation and behave
> accordingly in writepage() / invalidatepage(). ext3 in data=journal mode
> already uses the flag but the use seems to be compatible with what I want
> to do now... So it may actually work.
> BTW: Note that there's a plenty of filesystems that don't implement
> mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
> ENOSPC. So I'd not speak too much about consistency ;).
>

-aneesh

2009-03-26 11:37:34

by Jan Kara

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu 26-03-09 14:17:23, Aneesh Kumar K.V wrote:
> On Tue, Mar 24, 2009 at 04:29:59PM +0100, Jan Kara wrote:
> > On Tue 24-03-09 15:56:03, Peter Zijlstra wrote:
> > > On Tue, 2009-03-24 at 15:47 +0100, Jan Kara wrote:
> > > >
> > > > Or we could implement ext3_mkwrite() to allocate buffers already when we
> > > > > make page writeable. But it costs some performance (we have to write page
> > > > full of zeros when allocating those buffers, where previously we didn't
> > > > have to do anything) and it's not trivial to make it work if pagesize >
> > > > blocksize (we should not allocate buffers outside of i_size so if i_size
> > > > = 1024, we create just one block in ext3_mkwrite() but then we need to
> > > > allocate more when we extend the file).
> > >
> > > I think this is the best option, failing with SIGBUS when we fail to
> > > allocate blocks seems consistent with other filesystems as well.
> > I agree this looks attractive at first sight. But there are drawbacks
> > as I wrote - the problem with blocksize < pagesize, slight performance
> > decrease due to additional write,
>
> It should not cause an additional write. Can you tell me why it would
> result in an additional write?
Because if you have a new page, at the time mkwrite() or set_page_dirty()
is called, it is just full of zeros. So we attach buffers full of zeros to
the running transaction to satisfy data=ordered mode requirements. Then
these get written out on transaction commit (or they can already contain some
data user has written via mmap) but we're going to write them again when
writepage() is called on the page.
Umm, but yes, thinking more about the details, we clear buffer dirty bits
at commit time so if by that time user has copied in all the data,
subsequent writepage will find that all the buffers are clean and will not
send them to disk. So in this case overhead is going to be just
journal_start() + journal_stop(). OTOH mm usually decides to write the page
only after some time so if user writes to the page often then we really do
one more write. But in this case one additional write is going to be
probably lost in the number of total writes of the page. So yes, this is
not such a big issue as I thought originally.

> >page faults doing allocation can take a
> > *long* time
>
> That is true
>
> >and overall fragmentation is going to be higher (previously
> > writepage wrote pages for us in the right order, now we are going to
> > allocate in the first-accessed order). So I'm not sure we really want to
> > go this way.
>
>
> The block allocator should be improved to fix that. For example, ext4
> mballoc also looks at the logical file block number when doing block
> allocation. So if we do enough reservation, it should handle both
> first-accessed-order and sequential-order allocation properly.
Well, we could definitely improve the ext3 allocator. But do we really want
to backport mballoc to ext3? IMO it is easier to essentially perform delayed
allocation at the time of mkwrite() and then do the real allocation at the
time of writepage(). So I'd rather vote for the mechanism I write about
below.

> Another reason why I think we need ext3_page_mkwrite is: if we
> really are out of space, how do we handle it? Currently the patch you
> posted does redirty_page_for_writepage, which implies we can't
> reclaim the page, and since get_block returns ENOSPC we can't allocate
> blocks.
I definitely agree we should somehow solve this problem but the mechanism
below seems to be an easier way to me.

> > Hmm, maybe we could play a trick ala delayed allocation - i.e., reserve
> > some space in mkwrite() but don't actually allocate it. That would be done
> > in writepage(). This would solve all the problems I describe above. We could
> > use PG_Checked flag to track that the page has a reservation and behave
> > accordingly in writepage() / invalidatepage(). ext3 in data=journal mode
> > already uses the flag but the use seems to be compatible with what I want
> > to do now... So it may actually work.
> > BTW: Note that there's a plenty of filesystems that don't implement
> > mkwrite() (e.g. ext2, UDF, VFAT...) and thus have the same problem with
> > ENOSPC. So I'd not speak too much about consistency ;).

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-03-26 18:29:47

by Jan Kara

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Wed 25-03-09 02:35:01, Nick Piggin wrote:
> On Tuesday 24 March 2009 21:32:04 Andrew Morton wrote:
> > On Tue, 24 Mar 2009 18:44:21 +1100 Nick Piggin <[email protected]>
> wrote:
> > > On Friday 20 March 2009 03:46:39 Jan Kara wrote:
> > > > On Fri 20-03-09 02:48:21, Nick Piggin wrote:
> > > > > Holding mapping->private_lock over the __set_page_dirty should
> > > > > fix it, although I guess you'd want to release it before calling
> > > > > __mark_inode_dirty so as not to put inode_lock under there. I
> > > > > have a patch for this if it sounds reasonable.
> > > >
> > > > Yes, that seems to be a bug - the function actually looked suspicious
> > > > to me yesterday but I somehow convinced myself that it's fine. Probably
> > > > because fsx-linux is single-threaded.
> > >
> > > After a whole lot of chasing my own tail in the VM and buffer layers,
> > > I think it is a problem in ext2 (and I haven't been able to reproduce
> > > with ext3 yet, which might lend weight to that, although as we have
> > > seen, it is very timing dependent).
> > >
> > > That would be slightly unfortunate because we still have Jan's ext3
> > > problem, and also another reported problem of corruption on ext3 (on
> > > brd driver).
> > >
> > > Anyway, when I have reproduced the problem with the test case, the
> > > "lost" writes are all reported to be holes. Unfortunately, that doesn't
> > > point straight to the filesystem, because ext2 allocates blocks in this
> > > case at writeout time, so if dirty bits are getting lost, then it would
> > > be normal to see holes.
> > >
> > > I then put in a whole lot of extra infrastructure to track metadata about
> > > each struct page (when it was last written out, when it last had the
> > > number of writable ptes reach 0, when the dirty bits were last cleared
> > > etc). And none of the normal assertions were triggering: e.g. when any page
> > > is removed from pagecache (except truncates), it has always had all its
> > > buffers written out *after* all ptes were made readonly or unmapped. Lots
> > > of other tests and crap like that.
> > >
> > > So I tried what I should have done to start with and did an e2fsck after
> > > seeing corruption. Yes, it comes up with errors.
> >
> > Do you recall what the errors were?
>
> OK, after running several tests in parallel and having 3 of them
> blow up, I unmounted the fs (so error-case files are still intact).
Nick, what tests do you use? On first reading the ext2 code
looks correct, so I'll probably have to reproduce the corruption...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-03-26 23:02:03

by Linus Torvalds

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.



On Thu, 26 Mar 2009, Aneesh Kumar K.V wrote:
>
> >page faults doing allocation can take a
> > *long* time
>
> That is true

Btw, this is actually a feature rather than a bug.

We want to slow down the writer, which is why we also do dirty page
balancing when marking a page dirty.

Basically, if block allocation is a performance problem, then it should be
a performance problem that is attributed to the process that _causes_ it,
rather than to some random poor unrelated process that then later ends up
writing the page out because it wants to use some memory.

This is why tracking dirty pages is so important. Yes, it also avoids
various nasty overcommit situations, but the whole "make it hurt for the
person responsible, rather than a random innocent bystander" is the more
important part of it.

Linus


2009-03-27 20:35:16

by Ying Han

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Tue, Mar 24, 2009 at 12:44 AM, Nick Piggin <[email protected]> wrote:
> On Friday 20 March 2009 03:46:39 Jan Kara wrote:
>> On Fri 20-03-09 02:48:21, Nick Piggin wrote:
>
>> > Holding mapping->private_lock over the __set_page_dirty should
>> > fix it, although I guess you'd want to release it before calling
>> > __mark_inode_dirty so as not to put inode_lock under there. I
>> > have a patch for this if it sounds reasonable.
>>
>> Yes, that seems to be a bug - the function actually looked suspicious to
>> me yesterday but I somehow convinced myself that it's fine. Probably
>> because fsx-linux is single-threaded.
>
>
> After a whole lot of chasing my own tail in the VM and buffer layers,
> I think it is a problem in ext2 (and I haven't been able to reproduce
> with ext3 yet, which might lend weight to that, although as we have
> seen, it is very timing dependent).
>
> That would be slightly unfortunate because we still have Jan's ext3
> problem, and also another reported problem of corruption on ext3 (on
> brd driver).
I believe I see the same issue on ext2 as well as on ext4.
>
> Anyway, when I have reproduced the problem with the test case, the
> "lost" writes are all reported to be holes. Unfortunately, that doesn't
> point straight to the filesystem, because ext2 allocates blocks in this
> case at writeout time, so if dirty bits are getting lost, then it would
> be normal to see holes.
>
> I then put in a whole lot of extra infrastructure to track metadata about
> each struct page (when it was last written out, when it last had the number
> of writable ptes reach 0, when the dirty bits were last cleared etc). And
> none of the normal assertions were triggering: e.g. when any page is removed
> from pagecache (except truncates), it has always had all its buffers
> written out *after* all ptes were made readonly or unmapped. Lots of other
> tests and crap like that.

Do you think there might be a race in the page reclaim path? I did a
hack which comments out wakeup_pdflush in try_to_free_pages (based on
2.6.21, just randomly picked as a kernel that has the problem). It has
run for a couple of hours and the problem has not happened yet. I am
not sure whether that is the problem or not, and I will leave it
running. The reason I tried the hack is that I can reproduce the "bad
pages" easily every time I put more memory pressure on the system.


diff --git a/mm/vmscan.c b/mm/vmscan.c
index db023e2..b4b7e1f 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1067,11 +1067,13 @@ unsigned long try_to_free_pages(struct zone **zones, g
 		 * that's undesirable in laptop mode, where we *want* lumpy
 		 * writeout. So in laptop mode, write out the whole world.
 		 */
+/*
 		if (total_scanned > sc.swap_cluster_max +
 					sc.swap_cluster_max / 2) {
 			wakeup_pdflush(laptop_mode ? 0 : total_scanned);
 			sc.may_writepage = 1;
 		}
+*/
 
 		/* Take a nap, wait for some writeback to complete */
 		if (sc.nr_scanned && priority < DEF_PRIORITY - 2)

>
> So I tried what I should have done to start with and did an e2fsck after
> seeing corruption. Yes, it comes up with errors. Now that is unusual
> because that should be largely insulated from the vm: if a dirty bit gets
> lost, then the filesystem image should be quite happy and error-free with
> a hole or unwritten data there.
>
> I don't know ext? locking very well, except that it looks pretty overly
> complex and crufty.
>
> Usually, blocks are instantiated by write(2), under i_mutex, serialising
> the allocator somewhat. mmap-write blocks are instantiated at writeout
> time, unserialised. I moved truncate_mutex to cover the entire get_blocks
> function, and can no longer trigger the problem. Might be a timing issue
> though -- Ying, can you try this and see if you can still reproduce?
>
> I close my eyes and pick something out of a hat. a686cd89. Search for XXX.
> Nice. Whether or not this caused the problem, can someone please tell me
> why it got merged in that state?
>
> I'm leaving ext3 running for now. It looks like a nasty task to bisect
> ext2 down to that commit :( and I would be more interested in trying to
> reproduce Jan's ext3 problem, however, because I'm not too interested in
> diving into ext2 locking to work out exactly what is racing and how to
> fix it properly. I suspect it would be most productive to wire up some
> ioctls right into the block allocator/lookup and code up a userspace
> tester for it that could probably stress it a lot harder than kernel
> writeout can.
>
>


2009-04-01 22:36:13

by Ying Han

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

Hi Jan:
I feel that the problem you saw is kind of different from mine. You
mentioned that you saw the PageError() message, which I don't see
on my system. I tried your patch (based on 2.6.21) on my system and
it has run OK for 2 days. Still, since I don't see the same error
message as you did, I am not convinced this is the root cause, at
least for our problem. I am still looking into it.
So, are you seeing PageError() every time the problem happens?

--Ying


On Tue, Mar 24, 2009 at 10:35 AM, Jan Kara <[email protected]> wrote:
> On Tue 24-03-09 16:48:14, Jan Kara wrote:
>> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
>> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
>> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
>> >
>> > > > I don't think it is a very good idea for block_write_full_page recovery
>> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
>> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
>> > > >
>> > > > Perhaps not the cleanest way to solve the problem if it is just due to
>> > > > transient shortage of space in ext3, but generic code shouldn't be
>> > > > allowed to throw away dirty data even if it can't be written back due
>> > > > to some software or hardware error.
>> > >
>> > > Well, that would be one possibility. But then we'd be left with dirty
>> > > pages we cannot ever release since they are constantly dirty (when the
>> > > filesystem really becomes out of space). So what I
>> >
>> > If the filesystem becomes out of space and we have over-committed these
>> > dirty mmapped blocks, then we most definitely want to keep them around.
>> > An error of the system losing a few pages (or if it happens an insanely
>> > large number of times, then slowly dying due to memory leak) is better
>> > than an app suddenly seeing the contents of the page change to nulls
>> > under it when the kernel decides to do some page reclaim.
>> Hmm, probably you're right. Definitely it would be much easier to track
>> the problem down than it is now... Thinking a bit more... But couldn't a
>> malicious user bring the machine easily to OOM this way? That would be
>> unfortunate.
> OK, below is the patch which makes things work for me (i.e. no data
> lost). What do you think?
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>
> From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001
> From: Jan Kara <[email protected]>
> Date: Tue, 24 Mar 2009 16:38:22 +0100
> Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page()
>
> If getblock() fails in block_write_full_page(), we don't want to clear
> dirty bits on buffers. Actually, we even want to redirty the page. This
> way we just won't silently discard users data (written e.g. through mmap)
> in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this
> approach is that if the error is persistent we have this page pinned in
> memory forever and if there are lots of such pages, we can bring the
> machine OOM.
>
> Signed-off-by: Jan Kara <[email protected]>
> ---
> fs/buffer.c | 10 +++-------
> 1 files changed, 3 insertions(+), 7 deletions(-)
>
> diff --git a/fs/buffer.c b/fs/buffer.c
> index 891e1c7..ae779a0 100644
> --- a/fs/buffer.c
> +++ b/fs/buffer.c
> @@ -1833,9 +1833,11 @@ recover:
> /*
> * ENOSPC, or some other error. We may already have added some
> * blocks to the file, so we need to write these out to avoid
> - * exposing stale data.
> + * exposing stale data. We redirty the page so that we don't
> + * lose data we are unable to write.
> * The page is currently locked and not marked for writeback
> */
> + redirty_page_for_writepage(wbc, page);
> bh = head;
> /* Recovery: lock and submit the mapped buffers */
> do {
> @@ -1843,12 +1845,6 @@ recover:
> !buffer_delay(bh)) {
> lock_buffer(bh);
> mark_buffer_async_write(bh);
> - } else {
> - /*
> - * The buffer may have been set dirty during
> - * attachment to a dirty page.
> - */
> - clear_buffer_dirty(bh);
> }
> } while ((bh = bh->b_this_page) != head);
> SetPageError(page);
> --
> 1.6.0.2
>
>


2009-04-02 10:11:17

by Jan Kara

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

Hi Ying,

On Wed 01-04-09 15:36:13, Ying Han wrote:
> I feel that the problem you saw is kind of different from mine. You
> mentioned that you saw the PageError() message, which I don't see
> on my system. I tried your patch (based on 2.6.21) on my system and
> it has run OK for 2 days. Still, since I don't see the same error
> message as you did, I am not convinced this is the root cause, at
> least for our problem. I am still looking into it.
> So, are you seeing PageError() every time the problem happens?
Yes, but I agree that your problem is probably different. BTW: How do you
reproduce the problem?

Honza

> On Tue, Mar 24, 2009 at 10:35 AM, Jan Kara <[email protected]> wrote:
> > On Tue 24-03-09 16:48:14, Jan Kara wrote:
> >> On Wed 25-03-09 02:03:54, Nick Piggin wrote:
> >> > On Wednesday 25 March 2009 01:47:09 Jan Kara wrote:
> >> > > On Wed 25-03-09 01:30:00, Nick Piggin wrote:
> >> >
> >> > > > I don't think it is a very good idea for block_write_full_page recovery
> >> > > > to do clear_buffer_dirty for !mapped buffers. I think that should rather
> >> > > > be a redirty_page_for_writepage in the case that the buffer is dirty.
> >> > > >
> >> > > > Perhaps not the cleanest way to solve the problem if it is just due to
> >> > > > transient shortage of space in ext3, but generic code shouldn't be
> >> > > > allowed to throw away dirty data even if it can't be written back due
> >> > > > to some software or hardware error.
> >> > >
> >> > > Well, that would be one possibility. But then we'd be left with dirty
> >> > > pages we cannot ever release since they are constantly dirty (when the
> >> > > filesystem really becomes out of space). So what I
> >> >
> >> > If the filesystem becomes out of space and we have over-committed these
> >> > dirty mmapped blocks, then we most definitely want to keep them around.
> >> > An error of the system losing a few pages (or if it happens an insanely
> >> > large number of times, then slowly dying due to memory leak) is better
> >> > than an app suddenly seeing the contents of the page change to nulls
> >> > under it when the kernel decides to do some page reclaim.
> >> Hmm, probably you're right. Definitely it would be much easier to track
> >> the problem down than it is now... Thinking a bit more... But couldn't a
> >> malicious user bring the machine easily to OOM this way? That would be
> >> unfortunate.
> > OK, below is the patch which makes things work for me (i.e. no data
> > lost). What do you think?
> >
> > Honza
> > --
> > Jan Kara <[email protected]>
> > SUSE Labs, CR
> >
> > From f423c2964dd5afbcc40c47731724d48675dd2822 Mon Sep 17 00:00:00 2001
> > From: Jan Kara <[email protected]>
> > Date: Tue, 24 Mar 2009 16:38:22 +0100
> > Subject: [PATCH] fs: Don't clear dirty bits in block_write_full_page()
> >
> > If getblock() fails in block_write_full_page(), we don't want to clear
> > dirty bits on buffers. Actually, we even want to redirty the page. This
> > way we just won't silently discard users data (written e.g. through mmap)
> > in case of ENOSPC, EDQUOT, EIO or other write error. The downside of this
> > approach is that if the error is persistent we have this page pinned in
> > memory forever and if there are lots of such pages, we can bring the
> > machine OOM.
> >
> > Signed-off-by: Jan Kara <[email protected]>
> > ---
> > fs/buffer.c | 10 +++-------
> > 1 files changed, 3 insertions(+), 7 deletions(-)
> >
> > diff --git a/fs/buffer.c b/fs/buffer.c
> > index 891e1c7..ae779a0 100644
> > --- a/fs/buffer.c
> > +++ b/fs/buffer.c
> > @@ -1833,9 +1833,11 @@ recover:
> > /*
> > * ENOSPC, or some other error. We may already have added some
> > * blocks to the file, so we need to write these out to avoid
> > - * exposing stale data.
> > + * exposing stale data. We redirty the page so that we don't
> > + * lose data we are unable to write.
> > * The page is currently locked and not marked for writeback
> > */
> > + redirty_page_for_writepage(wbc, page);
> > bh = head;
> > /* Recovery: lock and submit the mapped buffers */
> > do {
> > @@ -1843,12 +1845,6 @@ recover:
> > !buffer_delay(bh)) {
> > lock_buffer(bh);
> > mark_buffer_async_write(bh);
> > - } else {
> > - /*
> > - * The buffer may have been set dirty during
> > - * attachment to a dirty page.
> > - */
> > - clear_buffer_dirty(bh);
> > }
> > } while ((bh = bh->b_this_page) != head);
> > SetPageError(page);
> > --
> > 1.6.0.2
> >
> >
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-04-02 11:24:38

by Nick Piggin

[permalink] [raw]
Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> Hi Jan:
> I feel that the problem you saw is kind of different from mine. You
> mentioned that you saw the PageError() message, which I don't see
> on my system. I tried your patch (based on 2.6.21) on my system and
> it has run OK for 2 days. Still, since I don't see the same error
> message as you did, I am not convinced this is the root cause, at
> least for our problem. I am still looking into it.
> So, are you seeing PageError() every time the problem happens?

So I asked if you could test with my workaround of taking truncate_mutex
at the start of ext2_get_blocks, and report back. I never heard any
response after that.

To reiterate: I was able to reproduce a problem with ext2 (I was testing
on brd to get IO rates high enough to reproduce it quite frequently).
I think I narrowed the problem down to block allocation or inode block
tree corruption because I was unable to reproduce it with that hack in
place.


2009-04-02 11:34:01

by Jan Kara

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> > Hi Jan:
> > I feel that the problem you saw is kind of differnt than mine. As
> > you mentioned that you saw the PageError() message, which i don't see
> > it on my system. I tried you patch(based on 2.6.21) on my system and
> > it runs ok for 2 days, Still, since i don't see the same error message
> > as you saw, i am not convineced this is the root cause at least for
> > our problem. I am still looking into it.
> > So, are you seeing the PageError() every time the problem happened?
>
> So I asked if you could test with my workaround of taking truncate_mutex
> at the start of ext2_get_blocks, and report back. I never heard of any
> response after that.
>
> To reiterate: I was able to reproduce a problem with ext2 (I was testing
> on brd to get IO rates high enough to reproduce it quite frequently).
> I think I narrowed the problem down to block allocation or inode block
> tree corruption because I was unable to reproduce it with that hack in
> place.
Nick, what load did you use for reproduction? I'll try to reproduce it
here so that I can debug ext2...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-04-02 15:51:20

by Nick Piggin

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> > > Hi Jan:
> > > I feel that the problem you saw is kind of differnt than mine. As
> > > you mentioned that you saw the PageError() message, which i don't see
> > > it on my system. I tried you patch(based on 2.6.21) on my system and
> > > it runs ok for 2 days, Still, since i don't see the same error message
> > > as you saw, i am not convineced this is the root cause at least for
> > > our problem. I am still looking into it.
> > > So, are you seeing the PageError() every time the problem happened?
> >
> > So I asked if you could test with my workaround of taking truncate_mutex
> > at the start of ext2_get_blocks, and report back. I never heard of any
> > response after that.
> >
> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
> > on brd to get IO rates high enough to reproduce it quite frequently).
> > I think I narrowed the problem down to block allocation or inode block
> > tree corruption because I was unable to reproduce it with that hack in
> > place.
> Nick, what load did you use for reproduction? I'll try to reproduce it
> here so that I can debug ext2...

OK, I set up the filesystem like this:

modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
mkfs.ext2 -b1024 /dev/ram0 #1K buffers

Test is basically unmodified except I use 64MB files, and start 8 of them
at once (8-core system, so this improves the chances of hitting the bug). Although I
do see it with only 1 running, it takes longer to trigger.

I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
know if that really helps speed up reproducing it. It is quite random to hit,
but I was able to hit it IIRC in under a minute with that setup.


2009-04-02 17:44:14

by Ying Han

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <[email protected]> wrote:
> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> > > Hi Jan:
>> > > I feel that the problem you saw is kind of differnt than mine. As
>> > > you mentioned that you saw the PageError() message, which i don't see
>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> > > as you saw, i am not convineced this is the root cause at least for
>> > > our problem. I am still looking into it.
>> > > So, are you seeing the PageError() every time the problem happened?
>> >
>> > So I asked if you could test with my workaround of taking truncate_mutex
>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> > response after that.
>> >
>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> > I think I narrowed the problem down to block allocation or inode block
>> > tree corruption because I was unable to reproduce it with that hack in
>> > place.
>> Nick, what load did you use for reproduction? I'll try to reproduce it
>> here so that I can debug ext2...
>
> OK, I set up the filesystem like this:
>
> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
> mkfs.ext2 -b1024 /dev/ram0 #1K buffers
>
> Test is basically unmodified except I use 64MB files, and start 8 of them
> at once to (8 core system, so improve chances of hitting the bug). Although I
> do see it with only 1 running it takes longer to trigger.
>
> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
> know if that really helps speed up reproducing it. It is quite random to hit,
> but I was able to hit it IIRC in under a minute with that setup.
>

Here is how I reproduce it:
Filesystem is ext2 with blocksize 4096.
Fill up the RAM with 95% anon memory and mlockall (putting enough memory
pressure on to trigger page reclaim and background writeout).
Run one thread of the test program,

and I will see "bad pages" within a few minutes.

--Ying


2009-04-02 22:52:24

by Ying Han

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <[email protected]> wrote:
> On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <[email protected]> wrote:
>> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>>> > > Hi Jan:
>>> > > I feel that the problem you saw is kind of differnt than mine. As
>>> > > you mentioned that you saw the PageError() message, which i don't see
>>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>>> > > it runs ok for 2 days, Still, since i don't see the same error message
>>> > > as you saw, i am not convineced this is the root cause at least for
>>> > > our problem. I am still looking into it.
>>> > > So, are you seeing the PageError() every time the problem happened?
>>> >
>>> > So I asked if you could test with my workaround of taking truncate_mutex
>>> > at the start of ext2_get_blocks, and report back. I never heard of any
>>> > response after that.
>>> >
>>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>>> > on brd to get IO rates high enough to reproduce it quite frequently).
>>> > I think I narrowed the problem down to block allocation or inode block
>>> > tree corruption because I was unable to reproduce it with that hack in
>>> > place.
>>> Nick, what load did you use for reproduction? I'll try to reproduce it
>>> here so that I can debug ext2...
>>
>> OK, I set up the filesystem like this:
>>
>> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
>> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
>> mkfs.ext2 -b1024 /dev/ram0 #1K buffers
>>
>> Test is basically unmodified except I use 64MB files, and start 8 of them
>> at once to (8 core system, so improve chances of hitting the bug). Although I
>> do see it with only 1 running it takes longer to trigger.
>>
>> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> know if that really helps speed up reproducing it. It is quite random to hit,
>> but I was able to hit it IIRC in under a minute with that setup.
>>
>
> Here is how i reproduce it:
> Filesystem is ext2 with blocksize 4096
> Fill up the ram with 95% anon memory and mlockall ( put enough memory
> pressure which will trigger page reclaim and background writeout)
> Run one thread of the test program
>
> and i will see "bad pages" within few minutes.

And here is the "top" output and stdout while it is getting "bad pages":
top

PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem
3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap
261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0
262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1

stdout:

while true; do
./ftruncate_mmap;
done
Running 852 bad page
Running 315 bad page
Running 999 bad page
Running 482 bad page
Running 24 bad page

--Ying

>
> --Ying
>

2009-04-02 23:39:09

by Jan Kara

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu 02-04-09 15:52:19, Ying Han wrote:
> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <[email protected]> wrote:
> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <[email protected]> wrote:
> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> >>> > > Hi Jan:
> >>> > > I feel that the problem you saw is kind of differnt than mine. As
> >>> > > you mentioned that you saw the PageError() message, which i don't see
> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
> >>> > > as you saw, i am not convineced this is the root cause at least for
> >>> > > our problem. I am still looking into it.
> >>> > > So, are you seeing the PageError() every time the problem happened?
> >>> >
> >>> > So I asked if you could test with my workaround of taking truncate_mutex
> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
> >>> > response after that.
> >>> >
> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
> >>> > I think I narrowed the problem down to block allocation or inode block
> >>> > tree corruption because I was unable to reproduce it with that hack in
> >>> > place.
> >>> Nick, what load did you use for reproduction? I'll try to reproduce it
> >>> here so that I can debug ext2...
> >>
> >> OK, I set up the filesystem like this:
> >>
> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers
> >>
> >> Test is basically unmodified except I use 64MB files, and start 8 of them
> >> at once to (8 core system, so improve chances of hitting the bug). Although I
> >> do see it with only 1 running it takes longer to trigger.
> >>
> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
> >> know if that really helps speed up reproducing it. It is quite random to hit,
> >> but I was able to hit it IIRC in under a minute with that setup.
> >>
> >
> > Here is how i reproduce it:
> > Filesystem is ext2 with blocksize 4096
> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
> > pressure which will trigger page reclaim and background writeout)
> > Run one thread of the test program
> >
> > and i will see "bad pages" within few minutes.
>
> And here is the "top" and stdout while it is getting "bad pages"
> top
>
> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem
> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap
> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0
> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1
>
> stdout:
>
> while true; do
> ./ftruncate_mmap;
> done
> Running 852 bad page
> Running 315 bad page
> Running 999 bad page
> Running 482 bad page
> Running 24 bad page
Thanks for the help. I've tracked the problem down to a bug in
ext2_get_block(). I've already sent out a patch which should fix the issue
(at least it fixes the problem for me).
The fix is also attached if you want to try it.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


Attachments:
(No filename) (3.46 kB)
0001-ext2-Fix-data-corruption-for-racing-writes.patch (4.24 kB)

2009-04-03 00:13:19

by Ying Han

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu, Apr 2, 2009 at 4:24 AM, Nick Piggin <[email protected]> wrote:
> On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> Hi Jan:
>> I feel that the problem you saw is kind of differnt than mine. As
>> you mentioned that you saw the PageError() message, which i don't see
>> it on my system. I tried you patch(based on 2.6.21) on my system and
>> it runs ok for 2 days, Still, since i don't see the same error message
>> as you saw, i am not convineced this is the root cause at least for
>> our problem. I am still looking into it.
>> So, are you seeing the PageError() every time the problem happened?
>
> So I asked if you could test with my workaround of taking truncate_mutex
> at the start of ext2_get_blocks, and report back. I never heard of any
> response after that.

I applied the change and still get the same issue (unless I didn't do
the right thing). Here is the patch I applied, which takes truncate_mutex
at the beginning of ext2_get_blocks:

diff --git a/fs/ext2/inode.c b/fs/ext2/inode.c
index 384fc0d..94cf773 100644
--- a/fs/ext2/inode.c
+++ b/fs/ext2/inode.c
@@ -586,10 +586,13 @@ static int ext2_get_blocks(struct inode *inode,
int count = 0;
ext2_fsblk_t first_block = 0;

+ mutex_lock(&ei->truncate_mutex);
depth = ext2_block_to_path(inode,iblock,offsets,&blocks_to_boundary);

- if (depth == 0)
+ if (depth == 0) {
+ mutex_unlock(&ei->truncate_mutex);
return (err);
+ }
reread:
partial = ext2_get_branch(inode, depth, offsets, chain, &err);

@@ -625,7 +628,7 @@ reread:
if (!create || err == -EIO)
goto cleanup;

- mutex_lock(&ei->truncate_mutex);

/*
* Okay, we need to do block allocation. Lazily initialize the block
@@ -651,7 +654,7 @@ reread:
offsets + (partial - chain), partial);

if (err) {
- mutex_unlock(&ei->truncate_mutex);
goto cleanup;
}

@@ -662,13 +665,13 @@ reread:
err = ext2_clear_xip_target (inode,
le32_to_cpu(chain[depth-1].key));
if (err) {
- mutex_unlock(&ei->truncate_mutex);
goto cleanup;
}
}

ext2_splice_branch(inode, iblock, partial, indirect_blks, count);
- mutex_unlock(&ei->truncate_mutex);
set_buffer_new(bh_result);
got_it:
map_bh(bh_result, inode->i_sb, le32_to_cpu(chain[depth-1].key));
@@ -678,6 +681,7 @@ got_it:
/* Clean up and exit */
partial = chain + depth - 1; /* the whole chain */
cleanup:
+ mutex_unlock(&ei->truncate_mutex);
while (partial > chain) {
brelse(partial->bh);
partial--;

--Ying

>
> To reiterate: I was able to reproduce a problem with ext2 (I was testing
> on brd to get IO rates high enough to reproduce it quite frequently).
> I think I narrowed the problem down to block allocation or inode block
> tree corruption because I was unable to reproduce it with that hack in
> place.
>
>

2009-04-03 00:25:36

by Ying Han

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <[email protected]> wrote:
> On Thu 02-04-09 15:52:19, Ying Han wrote:
>> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <[email protected]> wrote:
>> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <[email protected]> wrote:
>> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> >>> > > Hi Jan:
>> >>> > > I feel that the problem you saw is kind of differnt than mine. As
>> >>> > > you mentioned that you saw the PageError() message, which i don't see
>> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> >>> > > as you saw, i am not convineced this is the root cause at least for
>> >>> > > our problem. I am still looking into it.
>> >>> > > So, are you seeing the PageError() every time the problem happened?
>> >>> >
>> >>> > So I asked if you could test with my workaround of taking truncate_mutex
>> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> >>> > response after that.
>> >>> >
>> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> >>> > I think I narrowed the problem down to block allocation or inode block
>> >>> > tree corruption because I was unable to reproduce it with that hack in
>> >>> > place.
>> >>> Nick, what load did you use for reproduction? I'll try to reproduce it
>> >>> here so that I can debug ext2...
>> >>
>> >> OK, I set up the filesystem like this:
>> >>
>> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
>> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
>> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers
>> >>
>> >> Test is basically unmodified except I use 64MB files, and start 8 of them
>> >> at once to (8 core system, so improve chances of hitting the bug). Although I
>> >> do see it with only 1 running it takes longer to trigger.
>> >>
>> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> >> know if that really helps speed up reproducing it. It is quite random to hit,
>> >> but I was able to hit it IIRC in under a minute with that setup.
>> >>
>> >
>> > Here is how i reproduce it:
>> > Filesystem is ext2 with blocksize 4096
>> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
>> > pressure which will trigger page reclaim and background writeout)
>> > Run one thread of the test program
>> >
>> > and i will see "bad pages" within few minutes.
>>
>> And here is the "top" and stdout while it is getting "bad pages"
>> top
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem
>> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap
>> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0
>> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1
>>
>> stdout:
>>
>> while true; do
>> ./ftruncate_mmap;
>> done
>> Running 852 bad page
>> Running 315 bad page
>> Running 999 bad page
>> Running 482 bad page
>> Running 24 bad page
> Thanks, for the help. I've debugged the problem to a bug in
> ext2_get_block(). I've already sent out a patch which should fix the issue
> (at least it fixes the problem for me).
> The fix is also attached if you want to try it.

I just did a quick run after applying the patch; unfortunately, my
problem is still there...

--Ying
>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>

2009-04-03 01:29:21

by Ying Han

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <[email protected]> wrote:
> On Thu 02-04-09 15:52:19, Ying Han wrote:
>> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <[email protected]> wrote:
>> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <[email protected]> wrote:
>> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> >>> > > Hi Jan:
>> >>> > > I feel that the problem you saw is kind of differnt than mine. As
>> >>> > > you mentioned that you saw the PageError() message, which i don't see
>> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> >>> > > as you saw, i am not convineced this is the root cause at least for
>> >>> > > our problem. I am still looking into it.
>> >>> > > So, are you seeing the PageError() every time the problem happened?
>> >>> >
>> >>> > So I asked if you could test with my workaround of taking truncate_mutex
>> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> >>> > response after that.
>> >>> >
>> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> >>> > I think I narrowed the problem down to block allocation or inode block
>> >>> > tree corruption because I was unable to reproduce it with that hack in
>> >>> > place.
>> >>> Nick, what load did you use for reproduction? I'll try to reproduce it
>> >>> here so that I can debug ext2...
>> >>
>> >> OK, I set up the filesystem like this:
>> >>
>> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
>> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
>> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers
>> >>
>> >> Test is basically unmodified except I use 64MB files, and start 8 of them
>> >> at once to (8 core system, so improve chances of hitting the bug). Although I
>> >> do see it with only 1 running it takes longer to trigger.
>> >>
>> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> >> know if that really helps speed up reproducing it. It is quite random to hit,
>> >> but I was able to hit it IIRC in under a minute with that setup.
>> >>
>> >
>> > Here is how i reproduce it:
>> > Filesystem is ext2 with blocksize 4096
>> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
>> > pressure which will trigger page reclaim and background writeout)
>> > Run one thread of the test program
>> >
>> > and i will see "bad pages" within few minutes.
>>
>> And here is the "top" and stdout while it is getting "bad pages"
>> top
>>
>> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem
>> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap
>> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0
>> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1
>>
>> stdout:
>>
>> while true; do
>> ./ftruncate_mmap;
>> done
>> Running 852 bad page
>> Running 315 bad page
>> Running 999 bad page
>> Running 482 bad page
>> Running 24 bad page
> Thanks, for the help. I've debugged the problem to a bug in
> ext2_get_block(). I've already sent out a patch which should fix the issue
> (at least it fixes the problem for me).
> The fix is also attached if you want to try it.

Hmm, now I do see that get_block() returns ENOSPC by printk'ing the err.
So did you apply the patch which does redirty_page_for_writepage as well
as this one together?

I will start the test with a kernel that has both patches applied and leave it running.

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>


2009-04-03 09:41:10

by Jan Kara

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Thu 02-04-09 18:29:21, Ying Han wrote:
> On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <[email protected]> wrote:
> > On Thu 02-04-09 15:52:19, Ying Han wrote:
> >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <[email protected]> wrote:
> >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <[email protected]> wrote:
> >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
> >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
> >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
> >> >>> > > Hi Jan:
> >> >>> > > I feel that the problem you saw is kind of differnt than mine. As
> >> >>> > > you mentioned that you saw the PageError() message, which i don't see
> >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
> >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
> >> >>> > > as you saw, i am not convineced this is the root cause at least for
> >> >>> > > our problem. I am still looking into it.
> >> >>> > > So, are you seeing the PageError() every time the problem happened?
> >> >>> >
> >> >>> > So I asked if you could test with my workaround of taking truncate_mutex
> >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
> >> >>> > response after that.
> >> >>> >
> >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
> >> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
> >> >>> > I think I narrowed the problem down to block allocation or inode block
> >> >>> > tree corruption because I was unable to reproduce it with that hack in
> >> >>> > place.
> >> >>> Nick, what load did you use for reproduction? I'll try to reproduce it
> >> >>> here so that I can debug ext2...
> >> >>
> >> >> OK, I set up the filesystem like this:
> >> >>
> >> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
> >> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
> >> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers
> >> >>
> >> >> Test is basically unmodified except I use 64MB files, and start 8 of them
> >> >> at once to (8 core system, so improve chances of hitting the bug). Although I
> >> >> do see it with only 1 running it takes longer to trigger.
> >> >>
> >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
> >> >> know if that really helps speed up reproducing it. It is quite random to hit,
> >> >> but I was able to hit it IIRC in under a minute with that setup.
> >> >>
> >> >
> >> > Here is how i reproduce it:
> >> > Filesystem is ext2 with blocksize 4096
> >> > Fill up the ram with 95% anon memory and mlockall ( put enough memory
> >> > pressure which will trigger page reclaim and background writeout)
> >> > Run one thread of the test program
> >> >
> >> > and i will see "bad pages" within few minutes.
> >>
> >> And here is the "top" and stdout while it is getting "bad pages"
> >> top
> >>
> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
> >> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem
> >> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap
> >> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0
> >> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1
> >>
> >> stdout:
> >>
> >> while true; do
> >> ./ftruncate_mmap;
> >> done
> >> Running 852 bad page
> >> Running 315 bad page
> >> Running 999 bad page
> >> Running 482 bad page
> >> Running 24 bad page
> > Thanks, for the help. I've debugged the problem to a bug in
> > ext2_get_block(). I've already sent out a patch which should fix the issue
> > (at least it fixes the problem for me).
> > The fix is also attached if you want to try it.
>
> hmm, now i do see that get_block() returns ENOSPC by printk the err.
> So did you applied the patch which redirty_page_for_writepage as well
> as this one together?
No, my patch contained only a fix in ext2_get_block(). When you see
ENOSPC, that's a completely separate issue. You may apply that patch, but
with ext2 it would be enough to make the file fit the ram disk. I.e., first
try with dd how big a file fits there, and then run your tester with at most
that file size so that you don't hit ENOSPC...

> I will start the test with kernel applied both patches and leave it for running.
OK.

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR


2009-04-03 21:35:05

by Ying Han

Subject: Re: ftruncate-mmap: pages are lost after writing to mmaped file.

On Fri, Apr 3, 2009 at 2:41 AM, Jan Kara <[email protected]> wrote:
> On Thu 02-04-09 18:29:21, Ying Han wrote:
>> On Thu, Apr 2, 2009 at 4:39 PM, Jan Kara <[email protected]> wrote:
>> > On Thu 02-04-09 15:52:19, Ying Han wrote:
>> >> On Thu, Apr 2, 2009 at 10:44 AM, Ying Han <[email protected]> wrote:
>> >> > On Thu, Apr 2, 2009 at 8:51 AM, Nick Piggin <[email protected]> wrote:
>> >> >> On Thursday 02 April 2009 22:34:01 Jan Kara wrote:
>> >> >>> On Thu 02-04-09 22:24:29, Nick Piggin wrote:
>> >> >>> > On Thursday 02 April 2009 09:36:13 Ying Han wrote:
>> >> >>> > > Hi Jan:
>> >> >>> > > I feel that the problem you saw is kind of differnt than mine. As
>> >> >>> > > you mentioned that you saw the PageError() message, which i don't see
>> >> >>> > > it on my system. I tried you patch(based on 2.6.21) on my system and
>> >> >>> > > it runs ok for 2 days, Still, since i don't see the same error message
>> >> >>> > > as you saw, i am not convineced this is the root cause at least for
>> >> >>> > > our problem. I am still looking into it.
>> >> >>> > > So, are you seeing the PageError() every time the problem happened?
>> >> >>> >
>> >> >>> > So I asked if you could test with my workaround of taking truncate_mutex
>> >> >>> > at the start of ext2_get_blocks, and report back. I never heard of any
>> >> >>> > response after that.
>> >> >>> >
>> >> >>> > To reiterate: I was able to reproduce a problem with ext2 (I was testing
>> >> >>> > on brd to get IO rates high enough to reproduce it quite frequently).
>> >> >>> > I think I narrowed the problem down to block allocation or inode block
>> >> >>> > tree corruption because I was unable to reproduce it with that hack in
>> >> >>> > place.
>> >> >>> Nick, what load did you use for reproduction? I'll try to reproduce it
>> >> >>> here so that I can debug ext2...
>> >> >>
>> >> >> OK, I set up the filesystem like this:
>> >> >>
>> >> >> modprobe rd rd_size=$[3*1024*1024] #almost fill memory so we reclaim buffers
>> >> >> dd if=/dev/zero of=/dev/ram0 bs=4k #prefill brd so we don't get alloc deadlock
>> >> >> mkfs.ext2 -b1024 /dev/ram0 #1K buffers
>> >> >>
>> >> >> Test is basically unmodified except I use 64MB files, and start 8 of them
>> >> >> at once to (8 core system, so improve chances of hitting the bug). Although I
>> >> >> do see it with only 1 running it takes longer to trigger.
>> >> >>
>> >> >> I also run a loop doing 'sync ; echo 3 > /proc/sys/vm/drop_caches' but I don't
>> >> >> know if that really helps speed up reproducing it. It is quite random to hit,
>> >> >> but I was able to hit it IIRC in under a minute with that setup.
>> >> >>
>> >> >
>> >> > Here is how I reproduce it:
>> >> > Filesystem is ext2 with blocksize 4096.
>> >> > Fill up the RAM with 95% anon memory and mlockall (put enough memory
>> >> > pressure on to trigger page reclaim and background writeout).
>> >> > Run one thread of the test program.
>> >> >
>> >> > And I will see "bad pages" within a few minutes.
>> >>
>> >> And here are "top" and stdout while it is hitting "bad pages":
>> >> top
>> >>
>> >> PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
>> >> 3487 root 20 0 52616 50m 284 R 95 0.3 3:58.85 usemem
>> >> 3810 root 20 0 129m 99m 99m D 41 0.6 0:01.87 ftruncate_mmap
>> >> 261 root 15 -5 0 0 0 D 4 0.0 0:31.08 kswapd0
>> >> 262 root 15 -5 0 0 0 D 3 0.0 0:10.26 kswapd1
>> >>
>> >> stdout:
>> >>
>> >> while true; do
>> >> ./ftruncate_mmap;
>> >> done
>> >> Running 852 bad page
>> >> Running 315 bad page
>> >> Running 999 bad page
>> >> Running 482 bad page
>> >> Running 24 bad page
>> > Thanks for the help. I've debugged the problem to a bug in
>> > ext2_get_block(). I've already sent out a patch which should fix the issue
>> > (at least it fixes the problem for me).
>> > The fix is also attached if you want to try it.
>>
>> Hmm, now I do see that get_block() returns ENOSPC (found by printk'ing the err).
>> So did you apply the patch which does redirty_page_for_writepage() as well
>> as this one, together?
> No, my patch contained only a fix in ext2_get_block(). When you see
> ENOSPC, that's a completely separate issue. You may apply that patch, but
> with ext2 it would be enough to make the file fit on the ram disk. I.e.,
> first try with dd how big a file fits there, and then run your tester with
> at most that file size so that you don't hit ENOSPC...
>
>> I will start the test with a kernel that has both patches applied and leave it running.
> OK.

I applied the patch (based on 2.6.26) in the attachment and the test
itself runs fine so far without reporting "bad pages". However, I
seem to have hit a deadlock; here is the message from /var/log, with
lockdep turned on.

kernel: 1 lock held by kswapd1/264:
kernel: #0: (&ei->truncate_mutex){--..}, at: [<ffffffff8031d529>]
ext2_get_block+0x109/0x960
kernel: INFO: task ftruncate_mmap:2950 blocked for more than 120 seconds.
kernel: "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables
this message.
kernel: ftruncate_mma D ffff81047e733a80 0 2950 2858
kernel: ffff8101798516f8 0000000000000092 0000000000000000 0000000000000046
kernel: ffff81047e0a1260 ffff81047f070000 ffff81047e0a15c0 0000000100130c66
kernel: 00000000ffffffff ffffffff8025740d 0000000000000000 0000000000000000
kernel: Call Trace:
kernel: [<ffffffff8025740d>] mark_held_locks+0x3d/0x80
kernel: [<ffffffff804d78bd>] mutex_lock_nested+0x14d/0x280
kernel: [<ffffffff804d7855>] mutex_lock_nested+0xe5/0x280
kernel: [<ffffffff8031d529>] ext2_get_block+0x109/0x960
kernel: [<ffffffff802ca2e3>] create_empty_buffers+0x43/0xb0
kernel: [<ffffffff802ca2e3>] create_empty_buffers+0x43/0xb0
kernel: [<ffffffff802ca217>] alloc_page_buffers+0x97/0x120
kernel: [<ffffffff802cbfb6>] __block_write_full_page+0x206/0x320
kernel: [<ffffffff802cbe70>] __block_write_full_page+0xc0/0x320
kernel: [<ffffffff8031d420>] ext2_get_block+0x0/0x960
kernel: [<ffffffff8027c74e>] shrink_page_list+0x4fe/0x650
kernel: [<ffffffff80257ee8>] __lock_acquire+0x3b8/0x1080
kernel: [<ffffffff8027be18>] isolate_lru_pages+0x88/0x230
kernel: [<ffffffff8027c9ea>] shrink_inactive_list+0x14a/0x3f0
kernel: [<ffffffff8027cd43>] shrink_zone+0xb3/0x130
kernel: [<ffffffff80249e90>] autoremove_wake_function+0x0/0x30
kernel: [<ffffffff8027d1a8>] try_to_free_pages+0x268/0x3e0
kernel: [<ffffffff8027bfc0>] isolate_pages_global+0x0/0x40
kernel: [<ffffffff802774f7>] __alloc_pages_internal+0x1d7/0x4a0
kernel: [<ffffffff80279b94>] __do_page_cache_readahead+0x124/0x270
kernel: [<ffffffff8027314f>] filemap_fault+0x18f/0x400
kernel: [<ffffffff80280925>] __do_fault+0x65/0x450
kernel: [<ffffffff80257ee8>] __lock_acquire+0x3b8/0x1080
kernel: [<ffffffff803475dd>] __down_read_trylock+0x1d/0x60
kernel: [<ffffffff8028389a>] handle_mm_fault+0x18a/0x7a0
kernel: [<ffffffff804dba1c>] do_page_fault+0x29c/0x930
kernel: [<ffffffff804d8b46>] trace_hardirqs_on_thunk+0x35/0x3a
kernel: [<ffffffff804d94dd>] error_exit+0x0/0xa9
kernel:
kernel: 2 locks held by ftruncate_mmap/2950:
kernel: #0: (&mm->mmap_sem){----}, at: [<ffffffff804db9af>]
do_page_fault+0x22f/0x930
kernel: #1: (&ei->truncate_mutex){--..}, at: [<ffffffff8031d529>]
ext2_get_block+0x109/0x960
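One plausible reading of the trace above, as an annotation only: the stacks from this era often contain stale frames, so treat this as a sketch of how the reported locks could interlock rather than a confirmed analysis.

```
blocked task ftruncate_mmap (pid 2950):
  holds mm->mmap_sem and an ei->truncate_mutex (per lockdep)
  a memory allocation falls into direct reclaim:
    try_to_free_pages -> shrink_page_list
      -> __block_write_full_page on a dirty ext2 page
        -> ext2_get_block -> mutex_lock(ei->truncate_mutex)  <- blocks here

kswapd1 (pid 264) meanwhile holds another ei->truncate_mutex,
also taken in ext2_get_block, and is itself in D state.

i.e. reclaim-driven writeback re-enters ext2_get_block and waits on
a truncate_mutex that is already held, so writeback under memory
pressure stops making progress.
```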

--Ying

>
> Honza
> --
> Jan Kara <[email protected]>
> SUSE Labs, CR
>