LinuxLists.cc - Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

2006-12-29 22:49:07

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Fri, Dec 29, 2006 at 06:59:02PM +0000, Linux Kernel Mailing List wrote:
> Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7658cc289288b8ae7dd2c2224549a048431222b3
> Commit: 7658cc289288b8ae7dd2c2224549a048431222b3
> Parent: 3bf8ba38f38d3647368e4edcf7d019f9f8d9184a
> Author: Linus Torvalds <[email protected]>
> AuthorDate: Fri Dec 29 10:00:58 2006 -0800
> Committer: Linus Torvalds <[email protected]>
> CommitDate: Fri Dec 29 10:00:58 2006 -0800
>
> VM: Fix nasty and subtle race in shared mmap'ed page writeback

With 2.6.20-rc2-git1, which contain this patch, I have no more Berkeley
DB corruption with Klibido.?
I'm afraid a lot of software project switched to Sqlite,? from BDB,?
because the bug this patch fix (ie. http://bogofilter.sourceforge.net/).
I've also thought, since years, it was an userland problem.

Ciao,
gelma

-------------------
? http://klibido.sourceforge.net/
? http://www.sqlite.org/
? http://www.oracle.com/database/berkeley-db/index.html

2006-12-31 03:56:33

by Nick Piggin

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

Andrea Gelmini wrote:
> On Fri, Dec 29, 2006 at 06:59:02PM +0000, Linux Kernel Mailing List wrote:
>
>>Gitweb: http://git.kernel.org/git/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=7658cc289288b8ae7dd2c2224549a048431222b3
>>Commit: 7658cc289288b8ae7dd2c2224549a048431222b3
>>Parent: 3bf8ba38f38d3647368e4edcf7d019f9f8d9184a
>>Author: Linus Torvalds <[email protected]>
>>AuthorDate: Fri Dec 29 10:00:58 2006 -0800
>>Committer: Linus Torvalds <[email protected]>
>>CommitDate: Fri Dec 29 10:00:58 2006 -0800
>>
>> VM: Fix nasty and subtle race in shared mmap'ed page writeback
>
>
> With 2.6.20-rc2-git1, which contain this patch, I have no more Berkeley
> DB corruption with Klibido.
> I'm afraid a lot of software project switched to Sqlite, from BDB,
> because the bug this patch fix (ie. http://bogofilter.sourceforge.net/).
> I've also thought, since years, it was an userland problem.

This bug was only introduced in 2.6.19, due to a change that caused pte
dirty bits to be discarded without a subsequent set_page_dirty() (nowhere
else in the kernel should have done this).

So if your corruption is years old, then it must be something else.
Maybe it is hidden by a timing change, or BDB isn't using msync properly.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2006-12-31 13:50:33

by Andrea Gelmini

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Sun, Dec 31, 2006 at 02:55:58PM +1100, Nick Piggin wrote:
> This bug was only introduced in 2.6.19, due to a change that caused pte
no, Linus said that with 2.6.19 it's easier to trigger this bug...

> So if your corruption is years old, then it must be something else.
> Maybe it is hidden by a timing change, or BDB isn't using msync properly.
I can give you a complete image where just changing kernel (everything
is same, of course) corruptions goes away.
we spent a lot, I mean a *lot*, of time looking for our code mistake,
and so on.
I don't want to seem rude, but I am sure that Berkeley DB corruption we
have seen (not just Klibido, but I also think about postgrey, and so on)
depends on this bug.
I repeat, if you have time/interest I can give you a complete machine
to see the problem.

thanks a lot for your time,
gelma

2007-01-04 03:58:02

by Nick Piggin

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

Andrea Gelmini wrote:
> On Sun, Dec 31, 2006 at 02:55:58PM +1100, Nick Piggin wrote:
>
>>This bug was only introduced in 2.6.19, due to a change that caused pte
>
> no, Linus said that with 2.6.19 it's easier to trigger this bug...

Yhat's when the bug was introduced -- 2.6.19. 2.6.18 does not have
this bug, so it cannot be years old.

>>So if your corruption is years old, then it must be something else.
>>Maybe it is hidden by a timing change, or BDB isn't using msync properly.
>
> I can give you a complete image where just changing kernel (everything
> is same, of course) corruptions goes away.
> we spent a lot, I mean a *lot*, of time looking for our code mistake,
> and so on.
> I don't want to seem rude, but I am sure that Berkeley DB corruption we
> have seen (not just Klibido, but I also think about postgrey, and so on)
> depends on this bug.
> I repeat, if you have time/interest I can give you a complete machine
> to see the problem.

You're not being rude, but I just wanted to point out that this patch
(nor the dirty page accounting also in 2.6.19) doesn't fix anything
that was in 2.6.18, AFAIKS.

I wouldn't discount a kernel bug, but it will be hard to track down
unless you can find an earlier kernel that did not cause the corruptions
and/or provide source for a minimal test case to reproduce.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-04 04:44:50

by Linus Torvalds

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Thu, 4 Jan 2007, Nick Piggin wrote:
>
> Yhat's when the bug was introduced -- 2.6.19. 2.6.18 does not have
> this bug, so it cannot be years old.

Actually, I think 2.6.18 may have a subtle variation on it.

In particular, I look back at the try_to_free_buffers() thing that I hated
so much, and it makes me wonder.. It used to do:

spin_lock(&mapping->private_lock);
ret = drop_buffers(page, &buffers_to_free);
spin_unlock(&mapping->private_lock);
if (ret) {
.. crappy comment ..
if (test_clear_page_dirty(page))
task_io_account_cancelled_write(PAGE_CACHE_SIZE);
}

and I think that at least on SMP, we had a race with another CPU doing the
"mark page dirty if it was dirty in the PTE" at the same time. Because the
marking dirty would come in, find no buffers (they just got dropped), and
then mark the page dirty (ignoring the lack of any buffers), but then the
above would do the "test_clear_page_dirty()" thing on it.

Ie the race, I think, existed where that crappy comment was.

But that much older race would only trigger on SMP (or possibly UP with
preempt).

And I haven't actually thought about it that much, so I could be full of
crap. But I don't see anything that protects against it: we may hold the
page lock, but since the code that marks things _dirty_ doesn't
necessarily always hold it, that doesn't help us. And we may hold the
"private_lock", but we drop it before we do the dirty bit clearing, and in
fact on UP+PREEMPT that very dropping could cause an active preemption to
take place, so..

I dunno.

I would certainly suggest the whole series of my "dirty cleanup" be added
on top of 2.6.18.3 (which apparently has the dirty mapped page tracking).
For older kernels? If there is a race there, it must be pretty damn hard
to hit in practice (and it must have been there for a looong time), so
trying to fix it is possibly as likely to cause problems as it migh to fix
them.

Linus

2007-01-04 05:07:57

by Nick Piggin

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

Linus Torvalds wrote:
>
> On Thu, 4 Jan 2007, Nick Piggin wrote:
>
>>Yhat's when the bug was introduced -- 2.6.19. 2.6.18 does not have
>>this bug, so it cannot be years old.
>
>
> Actually, I think 2.6.18 may have a subtle variation on it.
>
> In particular, I look back at the try_to_free_buffers() thing that I hated
> so much, and it makes me wonder.. It used to do:
>
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
> if (ret) {
> .. crappy comment ..
> if (test_clear_page_dirty(page))
> task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> }
>
> and I think that at least on SMP, we had a race with another CPU doing the
> "mark page dirty if it was dirty in the PTE" at the same time. Because the
> marking dirty would come in, find no buffers (they just got dropped), and
> then mark the page dirty (ignoring the lack of any buffers), but then the
> above would do the "test_clear_page_dirty()" thing on it.
>
> Ie the race, I think, existed where that crappy comment was.
>
> But that much older race would only trigger on SMP (or possibly UP with
> preempt).

Oh yes the try_to_free_buffers race, I think, does exist in older kernels.
Yes according to our earlier analysis it would trigger with UP+preempt and
SMP.

But the patch that Andrea was pointing to was your last patch (The Fix),
which stopped page_mkclean caller throwing out dirty bits. You probably
didn't see that in the mail I cc'ed you on.

So yes it would be interesting to see whether fixing try_to_free_buffers
fixes Andrea's problem on older kernels.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-04 05:41:27

by Andrew Morton

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Wed, 3 Jan 2007 20:44:36 -0800 (PST)
Linus Torvalds <[email protected]> wrote:

> Actually, I think 2.6.18 may have a subtle variation on it.
>
> In particular, I look back at the try_to_free_buffers() thing that I hated
> so much, and it makes me wonder.. It used to do:
>
> spin_lock(&mapping->private_lock);
> ret = drop_buffers(page, &buffers_to_free);
> spin_unlock(&mapping->private_lock);
> if (ret) {
> .. crappy comment ..
> if (test_clear_page_dirty(page))
> task_io_account_cancelled_write(PAGE_CACHE_SIZE);
> }
>
> and I think that at least on SMP, we had a race with another CPU doing the
> "mark page dirty if it was dirty in the PTE" at the same time. Because the
> marking dirty would come in, find no buffers (they just got dropped), and
> then mark the page dirty (ignoring the lack of any buffers), but then the
> above would do the "test_clear_page_dirty()" thing on it.
>

That bug was introduced in 2.6.19, with the dirty page tracking patches.

2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
prevent it.

> Ie the race, I think, existed where that crappy comment was.

The comment was complete, accurate and needed.

2007-01-04 06:04:17

by Nick Piggin

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

Andrew Morton wrote:
> On Wed, 3 Jan 2007 20:44:36 -0800 (PST)
> Linus Torvalds <[email protected]> wrote:
>
>
>>Actually, I think 2.6.18 may have a subtle variation on it.
>>
>>In particular, I look back at the try_to_free_buffers() thing that I hated
>>so much, and it makes me wonder.. It used to do:
>>
>> spin_lock(&mapping->private_lock);
>> ret = drop_buffers(page, &buffers_to_free);
>> spin_unlock(&mapping->private_lock);
>> if (ret) {
>> .. crappy comment ..
>> if (test_clear_page_dirty(page))
>> task_io_account_cancelled_write(PAGE_CACHE_SIZE);
>> }
>>
>>and I think that at least on SMP, we had a race with another CPU doing the
>>"mark page dirty if it was dirty in the PTE" at the same time. Because the
>>marking dirty would come in, find no buffers (they just got dropped), and
>>then mark the page dirty (ignoring the lack of any buffers), but then the
>>above would do the "test_clear_page_dirty()" thing on it.
>>
>
>
> That bug was introduced in 2.6.19, with the dirty page tracking patches.
>
> 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> prevent it.

Ohh, right you are, I was looking at 2.6.19 sources. The comments above
ttfb match that as well. Curious that the dirty page patches were allowed
to mess with this...

Anyway that leaves us with the question of why Andrea's database is getting
corrupted. Hopefully he can give us a minimal test-case.

>>Ie the race, I think, existed where that crappy comment was.
>
>
> The comment was complete, accurate and needed.

Indeed ;)

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-04 06:12:24

by Andrew Morton

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Thu, 04 Jan 2007 17:03:43 +1100
Nick Piggin <[email protected]> wrote:

> > That bug was introduced in 2.6.19, with the dirty page tracking patches.
> >
> > 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> > prevent it.
>
> Ohh, right you are, I was looking at 2.6.19 sources. The comments above
> ttfb match that as well. Curious that the dirty page patches were allowed
> to mess with this...

Frankly, those patches scared the crap out of me, specifically because of
the delicacy and complexity of the various dirtiness state coherencies.
But I just didn't have the bandwidth to go through them with a sufficiently
fine toothcomb, sorry.

> Anyway that leaves us with the question of why Andrea's database is getting
> corrupted. Hopefully he can give us a minimal test-case.

It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
out of the woodwork. It's the first I've ever heard of them.

2007-01-04 06:56:10

by David Miller

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

From: Andrew Morton <[email protected]>
Date: Wed, 3 Jan 2007 22:12:20 -0800

> On Thu, 04 Jan 2007 17:03:43 +1100
> Nick Piggin <[email protected]> wrote:
>
> > > That bug was introduced in 2.6.19, with the dirty page tracking patches.
> > >
> > > 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> > > prevent it.
> >
> > Ohh, right you are, I was looking at 2.6.19 sources. The comments above
> > ttfb match that as well. Curious that the dirty page patches were allowed
> > to mess with this...
>
> Frankly, those patches scared the crap out of me, specifically because of
> the delicacy and complexity of the various dirtiness state coherencies.
> But I just didn't have the bandwidth to go through them with a sufficiently
> fine toothcomb, sorry.
>
> > Anyway that leaves us with the question of why Andrea's database is getting
> > corrupted. Hopefully he can give us a minimal test-case.
>
> It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> out of the woodwork. It's the first I've ever heard of them.

Note that the original rtorrent debian bug report was against 2.6.18

2007-01-04 07:06:37

by Andrew Morton

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
David Miller <[email protected]> wrote:

> From: Andrew Morton <[email protected]>
> Date: Wed, 3 Jan 2007 22:12:20 -0800
>
> > On Thu, 04 Jan 2007 17:03:43 +1100
> > Nick Piggin <[email protected]> wrote:
> >
> > > > That bug was introduced in 2.6.19, with the dirty page tracking patches.
> > > >
> > > > 2.6.18 and earlier used ->private_lock coverage in try_to_free_buffers() to
> > > > prevent it.
> > >
> > > Ohh, right you are, I was looking at 2.6.19 sources. The comments above
> > > ttfb match that as well. Curious that the dirty page patches were allowed
> > > to mess with this...
> >
> > Frankly, those patches scared the crap out of me, specifically because of
> > the delicacy and complexity of the various dirtiness state coherencies.
> > But I just didn't have the bandwidth to go through them with a sufficiently
> > fine toothcomb, sorry.
> >
> > > Anyway that leaves us with the question of why Andrea's database is getting
> > > corrupted. Hopefully he can give us a minimal test-case.
> >
> > It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> > out of the woodwork. It's the first I've ever heard of them.
>
> Note that the original rtorrent debian bug report was against 2.6.18

I think that was 2.6.18+debian-added-dirty-page-tracking-patches.

If that memory is correct, I'll assert (and emphasise) that the cause of the
alleged BerkeleyDB corruption is not known at this time.

The post-2.6.19 "fix" might make it go away. But if it does, we do not know
why, and it might still be there, only harder to hit.

2007-01-04 07:17:12

by Nick Piggin

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

Andrew Morton wrote:
> On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
> David Miller <[email protected]> wrote:

>>>>Anyway that leaves us with the question of why Andrea's database is getting
>>>>corrupted. Hopefully he can give us a minimal test-case.
>>>
>>>It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
>>>out of the woodwork. It's the first I've ever heard of them.
>>
>>Note that the original rtorrent debian bug report was against 2.6.18
>
>
> I think that was 2.6.18+debian-added-dirty-page-tracking-patches.
>
> If that memory is correct, I'll assert (and emphasise) that the cause of the
> alleged BerkeleyDB corruption is not known at this time.

I think that's right. Even if it were plain 2.6.18 that had rtorrent
corruption, then it would be more evidence we still have an unidentified
bug, because none of the patches fixed anything we have found to be
buggy in 2.6.18.

> The post-2.6.19 "fix" might make it go away. But if it does, we do not know
> why, and it might still be there, only harder to hit.

Likely. I think it is only hiding the bug (maybe the writeout patterns
from shared dirty accounting are changing timings or codepaths).

Of course, this means that we still can't confirm whether or not it is a
kernel bug. It could be a BDB bug that's being hidden.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-04 13:08:45

by Andrea Gelmini

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Wed, Jan 03, 2007 at 10:12:20PM -0800, Andrew Morton wrote:
> > Anyway that leaves us with the question of why Andrea's database is getting
> > corrupted. Hopefully he can give us a minimal test-case.
>
> It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> out of the woodwork. It's the first I've ever heard of them.

of course, because nobody had never thought it could be a kernel bug.
since first release of klibido we had db corruption. so Bauno, main
author/programmer, introduced various check in it. so we had found db
corruption. he had talked with sleepycat (it happens long before Oracle
buys them), they were very kind, but in the meanwhile lot of code was
changed, so he had decided to wait for other tests.
anyway, in the klibido mailing list people started to complain about
corruption db. we spended a lot of time trying to find clues. anyway, to
make this part very short, after months we got clear that with Red
Hat/Suse kernel we got no crash, and with
vanilla/Debian/Ubuntu/Slackware we can reproduce it with simple action.
so, we started checking klibido code, g++ versions, qt/kde versions, and
so on. but nothing changed.
well, all this story could mean nothing, but then I said "let's look at
other projects using bdb", and I subscribed to different mailing lists.
you can change klibido with postgrey/bogofilter, and you've same story.
if you have time to look at their mailing list archive, you'll see
people complain about db corruption, people telling them "it happens",
and, usually, switching to sqlite.
nobody wrote/write to kernel mailing list because nobody thought/think
it could be a kernel bug.

anyway, the important thing is that I can give you an image of my debian
machine where you can have a lot/not at all corruption just switching
bitween debian kernel (linux-image-2.6.18-3-686) and vanilla
2.6.20-rc2-git1.

by the way, klibido works also under BSD, and we have no bug report
about db corruption (I know, we dunno how many user, which DB size, and
so on).

I repeat, all these things are not so important because klibido, but
because in common with other projects.

I put Bauno, klibido author, in Cc (it also speak english better thank
me...)

thanks a lot for your time,
gelma

2007-01-04 13:16:24

by Andrea Gelmini

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Thu, Jan 04, 2007 at 05:03:43PM +1100, Nick Piggin wrote:
> Anyway that leaves us with the question of why Andrea's database is getting
> corrupted. Hopefully he can give us a minimal test-case.

yep, I can give you a complete image of my machine, or a root access.
replicate the problem it's not complicated, but it requires some time
(you have to expire/update usenet/newsgroup headers).
I can talk with Bauno to see if he can produce an automated program that
replicate user behavior.

thanks a lot for your interest,
gelma

2007-01-04 13:17:55

by Andrea Gelmini

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Thu, Jan 04, 2007 at 04:07:18PM +1100, Nick Piggin wrote:
> But the patch that Andrea was pointing to was your last patch (The Fix),
> which stopped page_mkclean caller throwing out dirty bits. You probably
> didn't see that in the mail I cc'ed you on.

well, I pointed at that patch for reply, but as I wrote in my first
mail, I was using 2.6.20-rc2-git1.

thank a lot for your time,
gelma

2007-01-04 13:19:18

by Andrea Gelmini

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Thu, Jan 04, 2007 at 02:57:24PM +1100, Nick Piggin wrote:
> I wouldn't discount a kernel bug, but it will be hard to track down
> unless you can find an earlier kernel that did not cause the corruptions
> and/or provide source for a minimal test case to reproduce.

see my others reply, please.

thanks a lot for your time,
gelma

2007-01-04 13:25:41

by Hugh Dickins

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Wed, 3 Jan 2007, Andrew Morton wrote:
> On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
> David Miller <[email protected]> wrote:
> > From: Andrew Morton <[email protected]>
> > >
> > > It'd odd that stories of pre-2.6.19 BerkeleyDB corruption are now coming
> > > out of the woodwork. It's the first I've ever heard of them.
> >
> > Note that the original rtorrent debian bug report was against 2.6.18
>
> I think that was 2.6.18+debian-added-dirty-page-tracking-patches.

That's right. Debian's 2.6.18-3, not -stable's 2.6.18.3 as Linus feared.
I'll be sending 2.6.18-stable the fix to the msync ENOMEM-on-unmapped
issue later today (that little buglet being what led them to integrate
the much more interesting dirty page tracking patches, which happened
to fix it in passing).

Hugh

2007-01-05 05:36:39

by Nick Piggin

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

Andrea Gelmini wrote:
> On Thu, Jan 04, 2007 at 05:03:43PM +1100, Nick Piggin wrote:
>
>>Anyway that leaves us with the question of why Andrea's database is getting
>>corrupted. Hopefully he can give us a minimal test-case.
>
>
> yep, I can give you a complete image of my machine, or a root access.
> replicate the problem it's not complicated, but it requires some time
> (you have to expire/update usenet/newsgroup headers).
> I can talk with Bauno to see if he can produce an automated program that
> replicate user behavior.

Myself, I don't think I have the time or capability to help much if the
corruption involves a complicated set of userspace software, especially
if you don't have a point where a regression was introduced.

I was thinking like a 100 line C program that I can reproduce here ;)

If you can even describe the steps it does: (eg. mmap file A, write(2) to
it, truncate it, ...., should contain 1s but it contains 0s!), then we
might have some suggestions to try.

One obvious thing is change filesystems or filesystem block sizes, try
ext2 or even tmpfs. Another is to try using write(2) instead of mmap to
write data.

--
SUSE Labs, Novell Inc.
Send instant messages to your online friends http://au.messenger.yahoo.com

2007-01-05 19:12:55

by Christoph Lameter

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Wed, 3 Jan 2007, Linus Torvalds wrote:

> And I haven't actually thought about it that much, so I could be full of
> crap. But I don't see anything that protects against it: we may hold the
> page lock, but since the code that marks things _dirty_ doesn't
> necessarily always hold it, that doesn't help us. And we may hold the
> "private_lock", but we drop it before we do the dirty bit clearing, and in
> fact on UP+PREEMPT that very dropping could cause an active preemption to
> take place, so..

Maybe we should require taking the page lock before the dirty bits are
modified? That way we can avoid all the artistic code in there right now.
It may be even become comprehensible. The page lock and the dirty bit use
the same cacheline (same word actually), contention is rare and so there
is likely only a negligible performance impact.

2007-01-05 19:34:06

by Linus Torvalds

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Fri, 5 Jan 2007, Christoph Lameter wrote:
>
> Maybe we should require taking the page lock before the dirty bits are
> modified?

I think it's been suggested several times.

However, a lot of the code isn't really amenable to it as it stands now.
We very much tend to call it in critical sections, and you have to move
them all out of the locks they are now.

The dirty mmap patches actually did some of that, exactly because they
wanted to replace "set_page_dirty()" with "set_page_dirty_balance()",
which can obviously block. However, the _current_ kernels don't seem to
have the bug, and quite frankly, I think it's at least partly because
we've cleaned up and simplified the dirty bit handling even if Andrew
apparently disagrees.

So one of the biggest need for dirty bit cleanup is with the _old_
kernels, the ones before the dirty mmap patches. Nobody is going to do
that, I think - it would just be crazy.

> That way we can avoid all the artistic code in there right now.

"Artistic". Good word. That said, most of the code in there needs its own
locks for other reasons (ie the reason __set_page_dirty_nobuffers() ends
up taking the tree-lock is because of the radix tree bits, which can NOT
be protected by per-page locks _anyway_).

So I don't think you'd actually really get rid of any "artistic" code.

The real "artistry" is the fact that we actually do things in
"set_page_dirty()" even if the page was marked dirty before, exactly
because we need to re-dirty any meta-data (ie buffers). And THAT is what
causes a lot of the strangeness. And again, that really doesn't have much
to do with the page lock - the meta-data tends to generally have its own
locks again, and it really is a _separate_ issue from the "dirty" bit
itself.

So I agree: set_page_dirty() is ugly, but I don't think the real problem
has anything to do with the page lock.

The real confusion comes from how the "dirty" bit means "_some_ part of
this page may or may not be dirty", and that "set_page_dirty()" really not
only needs to set that bit, but also make sure that we know that now ALL
of this page is dirty. That particular confusion takes some time to
assimilate, and the bug in 2.6.19 was due to missing the subtlety.

And that particular confusion we've had forever, and I think the bug in
older kernels is probably due to some other place also having missed
something subtle wrt that thing. For example, anything that
tests/sets/clear just the dirty bit on its own is pretty much buggy by
design, even if it _happens_ to work.

This is partly why I got rid of the old "[test_]clear_page_dirty()",
because it just was confused about the whole thing, and seemed to think
that it was just a simple bit.

Linus

2007-01-05 20:02:55

by Christoph Lameter

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Fri, 5 Jan 2007, Linus Torvalds wrote:

> However, a lot of the code isn't really amenable to it as it stands now.
> We very much tend to call it in critical sections, and you have to move
> them all out of the locks they are now.

It looks as if most code handling the dirty bits already uses the page
lock?

> "Artistic". Good word. That said, most of the code in there needs its own
> locks for other reasons (ie the reason __set_page_dirty_nobuffers() ends
> up taking the tree-lock is because of the radix tree bits, which can NOT
> be protected by per-page locks _anyway_).

According to the comments: Most callers of __set_page_dirty_nobuffers hold
the page lock. The only exception seems to be zap_pte_range where we
transfer the dirty information from the pte to the page. This is now much
rarer since the dirty mmap tracking patches make the fault handler deal
with it.

Still, the page dirtying in zap_pte_range remains a potential trouble spot
for the remaining cases in which we need to dirty pages there since it is
not rate limited. There is a potential cause for creating deadlocks
because too many pages were dirtied and the file system cannot allocate
enough memory for writeout.

2007-01-05 21:35:26

by Linus Torvalds

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Fri, 5 Jan 2007, Christoph Lameter wrote:
>
> It looks as if most code handling the dirty bits already uses the page
> lock?

Much does. But I did some debugging (when trying to figure out the VM
corruption), and certainly not all of it does. And when I looked at some
of the code-paths, I was not at all sure that we could always get the
lock easily.

> According to the comments: Most callers of __set_page_dirty_nobuffers hold
> the page lock. The only exception seems to be zap_pte_range where we
> transfer the dirty information from the pte to the page. This is now much
> rarer since the dirty mmap tracking patches make the fault handler deal
> with it.

No, it still happens as frequently as before - because the page must STAY
dirty while it's writable. So it's still dirty when unmapped.

What the dirty mmap tracking patches do is that if a _lot_ of pages are
dirty, we'll do the balancing, and at that point, if we end up doing IO,
we mark them clean in the page tables again.

But don't think that they aren't dirty in the page tables, and that
zap-page range doesn't do stuff. I made that thinko myself at some point,
but it's wrong.

> Still, the page dirtying in zap_pte_range remains a potential trouble spot
> for the remaining cases in which we need to dirty pages there since it is
> not rate limited. There is a potential cause for creating deadlocks
> because too many pages were dirtied and the file system cannot allocate
> enough memory for writeout.

Well, this is what the dirty mmap tracking should finally have fixed. We
know we have only a _limited_ number of dirty pages, in a way that we
never knew before.

But afaik, it wasn't just zap_pte_range(), it was something else too that
didn't have the page lock. Maybe I remember wrongly, though.

Anyway, the fundamental issue doesn't go away: the page lock isn't
actually the worst of the dirty bit handling.

Linus

2007-01-08 07:46:45

by dean gaudet

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Wed, 3 Jan 2007, Andrew Morton wrote:

> On Wed, 03 Jan 2007 22:56:07 -0800 (PST)
> David Miller <[email protected]> wrote:
>
> > Note that the original rtorrent debian bug report was against 2.6.18
>
> I think that was 2.6.18+debian-added-dirty-page-tracking-patches.

i've seen it on a 2.6.18.4 box (no other patches) which was running
rtorrent at 50mbps rates for a few days... i was only using the box to
stress a new circuit before putting live traffic on it, so i didn't spend
any time debugging the failures.

at least i'm assuming this is the same problem you've all just solved:
rtorrent reports a sha1 miscompare after the download is finished when it
rechecks the entire file before it enters seeding mode?

in this setup the rtorrent was uploading to ~3k clients simultaneously at
50mbps... and downloading new torrents at a rate of ~2mbps (very gross
average -- the download is a lot more bursty than that in practice).

the box has 2GiB of ram, and there were 5.3GB of torrents "live" at any
time. rtorrent is pretty bad at dealing with high bitrates for working
sets larger than ram -- but the 10krpm scsi disk was only ~25% busy
(unlike other setups i've tried at this bitrates where the disks become
the bottleneck unless i back off on the torrent working set size).

btw if anyone has a fast pipe and wants to retest the conditions it should
be easy to reproduce... just join the electricsheep bt swarm and you'll
almost instantly fill your uplink. the clients are very hungry ;) none
of this is pirated material -- it's basically a distributed render farm
sharing animations via bt. let me know if you want help setting such a
test up.

i attached the .config in case there's anything of interest in it.

-dean

Attachments:

config-2.6.18.4.bz2 (12.44 kB)

2007-01-10 13:55:34

by Andrea Gelmini

[permalink] [raw]

Subject: Re: VM: Fix nasty and subtle race in shared mmap'ed page writeback

On Fri, Jan 05, 2007 at 04:36:06PM +1100, Nick Piggin wrote:
> I was thinking like a 100 line C program that I can reproduce here ;)
not soon, but I'll try to produce it.

> If you can even describe the steps it does: (eg. mmap file A, write(2) to
> it, truncate it, ...., should contain 1s but it contains 0s!), then we
> might have some suggestions to try.
I ignore this... I mean, BerkelyDB does all this on its own (we just
talk with the library). we never touch directly the DB.

> One obvious thing is change filesystems or filesystem block sizes, try
> ext2 or even tmpfs. Another is to try using write(2) instead of mmap to
> write data.
well, I tried all filesystems, tmpfs also. I know BDB use a lot mmap.

thanks,
gelma