2007-12-15 23:09:37

by Krzysztof Oledzki

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Sat, 15 Dec 2007, [email protected] wrote:

> http://bugzilla.kernel.org/show_bug.cgi?id=9182
>
>
> ------- Comment #33 from [email protected] 2007-12-15 14:19 -------
> Krzysztof, I'd hate point you to a hard path (at least time consuming), but
> you've done a lot of digging by now anyway. How about git bisecting between
> 2.6.20-rc2 and rc1? Here is great info on bisecting:
> http://www.kernel.org/doc/local/git-quick.html

As I'm smarter than git-bisect, I can tell that 2.6.20-rc1-git8 is as bad
as 2.6.20-rc2, but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
So it took me only 2 reboots. ;)

The guilty patch is the one I proposed just an hour ago:
http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9

So:
- 2.6.20-rc1: OK
- 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
- 2.6.20-rc1-git8: very BAD
- 2.6.20-rc2: very BAD
- 2.6.20-rc4: very BAD
- >= 2.6.20: BAD (but not *very* BAD!)

Best regards,

Krzysztof Olędzki


2007-12-16 04:36:58

by Andrew Morton

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:

>
>
> On Sat, 15 Dec 2007, [email protected] wrote:
>
> > http://bugzilla.kernel.org/show_bug.cgi?id=9182
> >
> >
> > ------- Comment #33 from [email protected] 2007-12-15 14:19 -------
> > Krzysztof, I'd hate point you to a hard path (at least time consuming), but
> > you've done a lot of digging by now anyway. How about git bisecting between
> > 2.6.20-rc2 and rc1? Here is great info on bisecting:
> > http://www.kernel.org/doc/local/git-quick.html
>
> As I'm smarter than git-bistect I can tell that 2.6.20-rc1-git8 is as bad
> as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
> So it took me only 2 reboots. ;)
>
> The guilty patch is the one I proposed just an hour ago:
> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
>
> So:
> - 2.6.20-rc1: OK
> - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
> - 2.6.20-rc1-git8: very BAD
> - 2.6.20-rc2: very BAD
> - 2.6.20-rc4: very BAD
> - >= 2.6.20: BAD (but not *very* BAD!)
>

well.. We have code which has been used by *everyone* for a year and it's
misbehaving for you alone. I wonder what you're doing that is
different/special.

Which filesystem, which mount options, what sort of workload?

Thanks.

2007-12-16 09:34:06

by Krzysztof Oledzki

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Sat, 15 Dec 2007, Andrew Morton wrote:

> On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:
>
>>
>>
>> On Sat, 15 Dec 2007, [email protected] wrote:
>>
>>> http://bugzilla.kernel.org/show_bug.cgi?id=9182
>>>
>>>
>>> ------- Comment #33 from [email protected] 2007-12-15 14:19 -------
>>> Krzysztof, I'd hate point you to a hard path (at least time consuming), but
>>> you've done a lot of digging by now anyway. How about git bisecting between
>>> 2.6.20-rc2 and rc1? Here is great info on bisecting:
>>> http://www.kernel.org/doc/local/git-quick.html
>>
>> As I'm smarter than git-bistect I can tell that 2.6.20-rc1-git8 is as bad
>> as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
>> So it took me only 2 reboots. ;)
>>
>> The guilty patch is the one I proposed just an hour ago:
>> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
>>
>> So:
>> - 2.6.20-rc1: OK
>> - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
>> - 2.6.20-rc1-git8: very BAD
>> - 2.6.20-rc2: very BAD
>> - 2.6.20-rc4: very BAD
>> - >= 2.6.20: BAD (but not *very* BAD!)
>>
>
> well.. We have code which has been used by *everyone* for a year and it's
> misbehaving for you alone.

No, not for me alone. Probably only Thomas Osterried and I have systems
where it is so easy to reproduce. Please note that the problem exists on
all my systems, but only on one is it critical. It is enough to run
"sync; sleep 1; sync; sleep 1; sync; grep Dirty /proc/meminfo" to be sure.
With >=2.6.20-rc1-git8 it *never* falls to 0 on *all* my hosts, but only
on one does it grow to ~200MB in about 2 weeks and then everything dies:
http://bugzilla.kernel.org/attachment.cgi?id=13824
http://bugzilla.kernel.org/attachment.cgi?id=13825
http://bugzilla.kernel.org/attachment.cgi?id=13826
http://bugzilla.kernel.org/attachment.cgi?id=13827

> I wonder what you're doing that is different/special.
Me too. :|

> Which filesystem, which mount options

- ext3 on RAID1 (MD): / - rootflags=data=journal
- ext3 on LVM on RAID5 (MD)
- nfs

/dev/md0 on / type ext3 (rw)
proc on /proc type proc (rw)
sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
devpts on /dev/pts type devpts (rw,nosuid,noexec)
/dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
/dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
/dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 (rw,nosuid,nodev,noatime)
shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)


> what sort of workload?
Different, depending on the host: mail (postfix + amavisd + spamassassin +
clamav + sqlgray), squid, mysql, apache, nfs, rsync, .... But it seems
that the biggest problem is on the host running the mentioned mail service.

Thanks.

Best regards,

Krzysztof Olędzki

2007-12-16 09:52:16

by Andrew Morton

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:

>
>
> On Sat, 15 Dec 2007, Andrew Morton wrote:
>
> > On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:
> >
> >>
> >>
> >> On Sat, 15 Dec 2007, [email protected] wrote:
> >>
> >>> http://bugzilla.kernel.org/show_bug.cgi?id=9182
> >>>
> >>>
> >>> ------- Comment #33 from [email protected] 2007-12-15 14:19 -------
> >>> Krzysztof, I'd hate point you to a hard path (at least time consuming), but
> >>> you've done a lot of digging by now anyway. How about git bisecting between
> >>> 2.6.20-rc2 and rc1? Here is great info on bisecting:
> >>> http://www.kernel.org/doc/local/git-quick.html
> >>
> >> As I'm smarter than git-bistect I can tell that 2.6.20-rc1-git8 is as bad
> >> as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
> >> So it took me only 2 reboots. ;)
> >>
> >> The guilty patch is the one I proposed just an hour ago:
> >> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
> >>
> >> So:
> >> - 2.6.20-rc1: OK
> >> - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
> >> - 2.6.20-rc1-git8: very BAD
> >> - 2.6.20-rc2: very BAD
> >> - 2.6.20-rc4: very BAD
> >> - >= 2.6.20: BAD (but not *very* BAD!)
> >>
> >
> > well.. We have code which has been used by *everyone* for a year and it's
> > misbehaving for you alone.
>
> No, not for me alone. Probably only I and Thomas Osterried have systems
> where it is so easy to reproduce. Please note that the problem exists on
> my all systems, but only on one it is critical. It is enough to run
> "sync; sleep 1; sunc; sleep 1; sync; grep Drirty /proc/meminfo" to be sure.
> With =>2.6.20-rc1-git8 it *never* falls to 0 an *all* my hosts but only
> on one it goes to ~200MB in about 2 weeks and then everything dies:
> http://bugzilla.kernel.org/attachment.cgi?id=13824
> http://bugzilla.kernel.org/attachment.cgi?id=13825
> http://bugzilla.kernel.org/attachment.cgi?id=13826
> http://bugzilla.kernel.org/attachment.cgi?id=13827
>
> > I wonder what you're doing that is different/special.
> Me to. :|
>
> > Which filesystem, which mount options
>
> - ext3 on RAID1 (MD): / - rootflags=data=journal

It wouldn't surprise me if this is specific to data=journal: that
journalling mode is pretty complex wrt dirty-data handling and isn't well
tested.

Does switching that to data=writeback change things?

Thomas, do you have ext3 data=journal on any filesystems?

> - ext3 on LVM on RAID5 (MD)
> - nfs
>
> /dev/md0 on / type ext3 (rw)
> proc on /proc type proc (rw)
> sysfs on /sys type sysfs (rw,nosuid,nodev,noexec)
> devpts on /dev/pts type devpts (rw,nosuid,noexec)
> /dev/mapper/VolGrp0-usr on /usr type ext3 (rw,nodev,data=journal)
> /dev/mapper/VolGrp0-var on /var type ext3 (rw,nodev,data=journal)
> /dev/mapper/VolGrp0-squid_spool on /var/cache/squid/cd0 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
> /dev/mapper/VolGrp0-squid_spool2 on /var/cache/squid/cd1 type ext3 (rw,nosuid,nodev,noatime,data=writeback)
> /dev/mapper/VolGrp0-news_spool on /var/spool/news type ext3 (rw,nosuid,nodev,noatime)
> shm on /dev/shm type tmpfs (rw,noexec,nosuid,nodev)
> usbfs on /proc/bus/usb type usbfs (rw,noexec,nosuid,devmode=0664,devgid=85)
> owl:/usr/gentoo-nfs on /usr/gentoo-nfs type nfs (ro,nosuid,nodev,noatime,bg,intr,tcp,addr=192.168.129.26)
>
>
> > what sort of workload?
> Different, depending on a host: mail (postfix + amavisd + spamassasin +
> clamav + sqlgray), squid, mysql, apache, nfs, rsync, .... But it seems
> that the biggest problem is on the host running mentioned mail service.
>

2007-12-16 13:48:28

by Krzysztof Oledzki

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Sun, 16 Dec 2007, Andrew Morton wrote:

> On Sun, 16 Dec 2007 10:33:20 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:
>
>>
>>
>> On Sat, 15 Dec 2007, Andrew Morton wrote:
>>
>>> On Sun, 16 Dec 2007 00:08:52 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:
>>>
>>>>
>>>>
>>>> On Sat, 15 Dec 2007, [email protected] wrote:
>>>>
>>>>> http://bugzilla.kernel.org/show_bug.cgi?id=9182
>>>>>
>>>>>
>>>>> ------- Comment #33 from [email protected] 2007-12-15 14:19 -------
>>>>> Krzysztof, I'd hate point you to a hard path (at least time consuming), but
>>>>> you've done a lot of digging by now anyway. How about git bisecting between
>>>>> 2.6.20-rc2 and rc1? Here is great info on bisecting:
>>>>> http://www.kernel.org/doc/local/git-quick.html
>>>>
>>>> As I'm smarter than git-bistect I can tell that 2.6.20-rc1-git8 is as bad
>>>> as 2.6.20-rc2 but 2.6.20-rc1-git8 with one patch reverted seems to be OK.
>>>> So it took me only 2 reboots. ;)
>>>>
>>>> The guilty patch is the one I proposed just an hour ago:
>>>> http://git.kernel.org/?p=linux%2Fkernel%2Fgit%2Fstable%2Flinux-2.6.20.y.git;a=commitdiff_plain;h=fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9
>>>>
>>>> So:
>>>> - 2.6.20-rc1: OK
>>>> - 2.6.20-rc1-git8 with fba2591bf4e418b6c3f9f8794c9dd8fe40ae7bd9 reverted: OK
>>>> - 2.6.20-rc1-git8: very BAD
>>>> - 2.6.20-rc2: very BAD
>>>> - 2.6.20-rc4: very BAD
>>>> - >= 2.6.20: BAD (but not *very* BAD!)
>>>>
>>>
>>> well.. We have code which has been used by *everyone* for a year and it's
>>> misbehaving for you alone.
>>
>> No, not for me alone. Probably only I and Thomas Osterried have systems
>> where it is so easy to reproduce. Please note that the problem exists on
>> my all systems, but only on one it is critical. It is enough to run
>> "sync; sleep 1; sunc; sleep 1; sync; grep Drirty /proc/meminfo" to be sure.
>> With =>2.6.20-rc1-git8 it *never* falls to 0 an *all* my hosts but only
>> on one it goes to ~200MB in about 2 weeks and then everything dies:
>> http://bugzilla.kernel.org/attachment.cgi?id=13824
>> http://bugzilla.kernel.org/attachment.cgi?id=13825
>> http://bugzilla.kernel.org/attachment.cgi?id=13826
>> http://bugzilla.kernel.org/attachment.cgi?id=13827
>>
>>> I wonder what you're doing that is different/special.
>> Me to. :|
>>
>>> Which filesystem, which mount options
>>
>> - ext3 on RAID1 (MD): / - rootflags=data=journal
>
> It wouldn't surprise me if this is specific to data=journal: that
> journalling mode is pretty complex wrt dairty-data handling and isn't well
> tested.
>
> Does switching that to data=writeback change things?

I'll confirm this tomorrow, but it seems that even switching to
data=ordered (AFAIK the default on ext3) is indeed enough to cure this problem.

Two questions remain then: why does the system die when dirty reaches ~200MB,
and what is wrong with ext3+data=journal with >=2.6.20-rc2?

Best regards,

Krzysztof Olędzki

2007-12-16 21:52:16

by Andrew Morton

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:

> >>> Which filesystem, which mount options
> >>
> >> - ext3 on RAID1 (MD): / - rootflags=data=journal
> >
> > It wouldn't surprise me if this is specific to data=journal: that
> > journalling mode is pretty complex wrt dairty-data handling and isn't well
> > tested.
> >
> > Does switching that to data=writeback change things?
>
> I'll confirm this tomorrow but it seems that even switching to
> data=ordered (AFAIK default o ext3) is indeed enough to cure this problem.

yes, sorry, I meant ordered.

> Two questions remain then: why system dies when dirty reaches ~200MB

I think you have ~2G of RAM and you're running with
/proc/sys/vm/dirty_ratio=10, yes?

If so, when that machine hits 10% * 2G of dirty memory then everyone who
wants to dirty pages gets blocked.
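
A minimal userspace sketch of the arithmetic being described here, assuming
the ~2GB and dirty_ratio=10 figures from this message (illustrative only,
not kernel code):

#include <stdio.h>

int main(void)
{
	/* Assumed numbers from the discussion: ~2GB of (low) memory and
	 * /proc/sys/vm/dirty_ratio = 10.  Once the global Dirty count sits
	 * permanently above this threshold, every new writer is throttled. */
	unsigned long total_mb = 2048;
	unsigned long dirty_ratio = 10;

	unsigned long thresh_mb = total_mb * dirty_ratio / 100;

	printf("writers start blocking around %lu MB of dirty pagecache\n",
	       thresh_mb);
	return 0;
}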

> and what is wrong with ext3+data=journal with >=2.6.20-rc2?

Ah. It has a bug in it ;)

As I said, data=journal has exceptional handling of pagecache data and is
not well tested. Someone (and I'm not sure who) will need to get in there
and fix it.

2007-12-17 14:26:31

by Jan Kara

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

> On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:
>
> > >>> Which filesystem, which mount options
> > >>
> > >> - ext3 on RAID1 (MD): / - rootflags=data=journal
> > >
> > > It wouldn't surprise me if this is specific to data=journal: that
> > > journalling mode is pretty complex wrt dairty-data handling and isn't well
> > > tested.
> > >
> > > Does switching that to data=writeback change things?
> >
> > I'll confirm this tomorrow but it seems that even switching to
> > data=ordered (AFAIK default o ext3) is indeed enough to cure this problem.
>
> yes, sorry, I meant ordered.
>
> > Two questions remain then: why system dies when dirty reaches ~200MB
>
> I think you have ~2G of RAM and you're running with
> /proc/sys/vm/dirty_ratio=10, yes?
>
> If so, when that machine hits 10% * 2G of dirty memory then everyone who
> wants to dirty pages gets blocked.
>
> > and what is wrong with ext3+data=journal with >=2.6.20-rc2?
>
> Ah. It has a bug in it ;)
>
> As I said, data=journal has exceptional handling of pagecache data and is
> not well tested. Someone (and I'm not sure who) will need to get in there
> and fix it.
It seems fsx-linux is able to trigger the leak on my test machine so
I'll have a look into it (not sure if I'll get to it today but I should
find some time for it this week)...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-12-17 17:18:39

by Krzysztof Oledzki

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Sun, 16 Dec 2007, Andrew Morton wrote:

> On Sun, 16 Dec 2007 14:46:36 +0100 (CET) Krzysztof Oledzki <[email protected]> wrote:
>
>>>>> Which filesystem, which mount options
>>>>
>>>> - ext3 on RAID1 (MD): / - rootflags=data=journal
>>>
>>> It wouldn't surprise me if this is specific to data=journal: that
>>> journalling mode is pretty complex wrt dairty-data handling and isn't well
>>> tested.
>>>
>>> Does switching that to data=writeback change things?
>>
>> I'll confirm this tomorrow but it seems that even switching to
>> data=ordered (AFAIK default o ext3) is indeed enough to cure this problem.
>
> yes, sorry, I meant ordered.

OK, I can confirm that the problem is with data=journal. With data=ordered
I get:

# uname -rns;uptime;sync;sleep 1;sync ;sleep 1; sync;grep Dirty /proc/meminfo
Linux cougar 2.6.24-rc5
17:50:34 up 1 day, 20 min, 1 user, load average: 0.99, 0.48, 0.35
Dirty: 0 kB

>> Two questions remain then: why system dies when dirty reaches ~200MB
>
> I think you have ~2G of RAM and you're running with
> /proc/sys/vm/dirty_ratio=10, yes?
>
> If so, when that machine hits 10% * 2G of dirty memory then everyone who
> wants to dirty pages gets blocked.

Oh, right. Thank you for the explanation.

>> and what is wrong with ext3+data=journal with >=2.6.20-rc2?
>
> Ah. It has a bug in it ;)
>
> As I said, data=journal has exceptional handling of pagecache data and is
> not well tested. Someone (and I'm not sure who) will need to get in there
> and fix it.

OK, I'm willing to test it. ;)

Best regards,

Krzysztof Olędzki

2007-12-19 17:45:50

by Linus Torvalds

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
>
> I'll confirm this tomorrow but it seems that even switching to data=ordered
> (AFAIK default o ext3) is indeed enough to cure this problem.

Ok, do we actually have any ext3 expert following this? I have no idea
about what the journalling code does, but I have painful memories of ext3
doing really odd buffer-head-based IO and totally bypassing all the normal
page dirty logic.

Judging by the symptoms (sorry for not following this well, it came up
while I was mostly away travelling), something probably *does* clear the
dirty bit on the pages, but the dirty *accounting* is not done properly,
so the kernel keeps thinking it has dirty pages.

Now, a simple "grep" shows that ext3 does not actually do any
ClearPageDirty() or similar on its own, although maybe I missed some other
subtle way this can happen. And the *normal* VFS routines that do
ClearPageDirty should all be doing the proper accounting.

So I see a couple of possible cases:

- actually clearing the PG_dirty bit somehow, without doing the
accounting.

This looks very unlikely. PG_dirty is always cleared by some variant of
"*ClearPageDirty()", and that bit definition isn't used for anything
else in the whole kernel judging by "grep" (the page allocator tests
the bit, that's it).

And there aren't that many hits for ClearPageDirty, and they all seem
to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if the
mapping has dirty state accounting.

The exceptions seem to be:
- the page freeing path, but that path checks that "mapping" is NULL
(so no accounting), and would complain loudly if it wasn't
- the swap state stuff ("move_from_swap_cache()"), but that should
only ever trigger for swap cache pages (we have a BUG_ON() in that
path), and those don't do dirty accounting anyway.
- pageout(), but again only for pages that have a NULL mapping.

- ext3 might be clearing (probably indirectly) the "page->mapping" thing
or similar, which in turn will make the VFS think that even a dirty
page isn't actually to be accounted for - so when the page *turned*
dirty, it was accounted as a dirty page, but then, when it was cleaned,
the accounting wasn't reversed because ->mapping had become NULL.

This would be some interaction with the truncation logic, and quite
frankly, that should be all shared with the non-journal case, so I find
this all very unlikely.

However, that second case is interesting, because the pageout case
actually has a comment like this:

	/*
	 * Some data journaling orphaned pages can have
	 * page->mapping == NULL while being dirty with clean buffers.
	 */

which really sounds like the case in question.

I may know the VM, but that special case was added due to insane
journaling filesystems, and I don't know what insane things they do. Which
is why I'm wondering if there is any ext3 person who knows the journaling
code?

How/when does it ever "orphan" pages? Because yes, if it ever does that,
and clears the ->mapping field on a mapped page, then that page will have
incremented the dirty counts when it became dirty, but will *not*
decrement the dirty count when it is an orphan.

> Two questions remain then: why system dies when dirty reaches ~200MB and what
> is wrong with ext3+data=journal with >=2.6.20-rc2?

Well, that one is probably pretty straightforward: since the kernel thinks
that there are too many dirty pages, it will ask everybody who creates
more dirty pages to clean out some *old* dirty pages, but since they don't
exist, the whole thing will basically wait forever for a writeout to clean
things out that will never happen.

200MB is 10% of your 2GB of low-mem RAM, and 10% is the default
dirty_ratio that causes synchronous waits for writeback. If you use the
normal 3:1 VM split, the hang should happen even earlier (at the ~100MB
"dirty" mark).

So that part isn't the bug. The bug is in the accounting, but I'm pretty
damn sure that the core VM itself is pretty ok, since that code has now
been stable for people for the last year or so. It seems that ext3 (with
data journaling) does something dodgy wrt some page.

But how about trying this appended patch. It should warn a few times if
some page is ever removed from a mapping while it's dirty (and the mapping
is one that should have been accounted). It also tries to "fix up" the
case, so *if* this is the cause, it should also fix the bug.

I'd love to hear if you get any stack dumps with this, and what the
backtrace is (and whether the dirty counts then stay ok).

The patch is totally untested. It compiles for me. That's all I can say.

(There's a few other places that set ->mapping to NULL, but they're pretty
esoteric. Page migration? Stuff like that).

Linus

---
mm/filemap.c | 12 ++++++++++++
1 files changed, 12 insertions(+), 0 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index 188cf5f..7560843 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -124,6 +124,18 @@ void __remove_from_page_cache(struct page *page)
 	mapping->nrpages--;
 	__dec_zone_page_state(page, NR_FILE_PAGES);
 	BUG_ON(page_mapped(page));
+
+	if (PageDirty(page) && mapping_cap_account_dirty(mapping)) {
+		static int count = 10;
+		if (count) {
+			count--;
+			WARN_ON(1);
+		}
+
+		/* Try to fix up the bug.. */
+		dec_zone_page_state(page, NR_FILE_DIRTY);
+		dec_bdi_stat(mapping->backing_dev_info, BDI_RECLAIMABLE);
+	}
 }
 
 void remove_from_page_cache(struct page *page)

2007-12-20 01:05:23

by Jan Kara

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

> On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> >
> > I'll confirm this tomorrow but it seems that even switching to data=ordered
> > (AFAIK default o ext3) is indeed enough to cure this problem.
>
> Ok, do we actually have any ext3 expert following this? I have no idea
> about what the journalling code does, but I have painful memories of ext3
> doing really odd buffer-head-based IO and totally bypassing all the normal
> page dirty logic.
>
> Judging by the symptoms (sorry for not following this well, it came up
> while I was mostly away travelling), something probably *does* clear the
> dirty bit on the pages, but the dirty *accounting* is not done properly,
> so the kernel keeps thinking it has dirty pages.
>
> Now, a simple "grep" shows that ext3 does not actually do any
> ClearPageDirty() or similar on its own, although maybe I missed some other
> subtle way this can happen. And the *normal* VFS routines that do
> ClearPageDirty should all be doing the proper accounting.
>
> So I see a couple of possible cases:
>
> - actually clearing the PG_dirty bit somehow, without doing the
> accounting.
>
> This looks very unlikely. PG_dirty is always cleared by some variant of
> "*ClearPageDirty()", and that bit definition isn't used for anything
> else in the whole kernel judging by "grep" (the page allocator tests
> the bit, that's it).
>
> And there aren't that many hits for ClearPageDirty, and they all seem
> to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if the
> mapping has dirty state accounting.
>
> The exceptions seem to be:
> - the page freeing path, but that path checks that "mapping" is NULL
> (so no accounting), and would complain loudly if it wasn't
> - the swap state stuff ("move_from_swap_cache()"), but that should
> only ever trigger for swap cache pages (we have a BUG_ON() in that
> path), and those don't do dirty accounting anyway.
> - pageout(), but again only for pages that have a NULL mapping.
>
> - ext3 might be clearing (probably indirectly) the "page->mapping" thing
> or similar, which in turn will make the VFS think that even a dirty
> page isn't actually to be accounted for - so when the page *turned*
> dirty, it was accounted as a dirty page, but then, when it was cleaned,
> the accounting wasn't reversed because ->mapping had become NULL.
>
> This would be some interaction with the truncation logic, and quite
> frankly, that should be all shared with the non-journal case, so I find
> this all very unlikely.
>
> However, that second case is interesting, because the pageout case
> actually has a comment like this:
>
> /*
> * Some data journaling orphaned pages can have
> * page->mapping == NULL while being dirty with clean buffers.
> */
>
> which really sounds like the case in question.
>
> I may know the VM, but that special case was added due to insane
> journaling filesystems, and I don't know what insane things they do. Which
> is why I'm wondering if there is any ext3 person who knows the journaling
> code?
Yes, I'm looking into the problem... I think those orphan pages
without mapping are created because we cannot drop truncated
buffers/pages immediately. There can be a committing transaction that
still needs the data in those buffers and until it commits we have to
keep the pages (and even maybe write them to disk etc.). But eventually,
we should write the buffers, call try_to_free_buffers() which calls
cancel_dirty_page() and everything should be happy... in theory ;)
In practice, I have not yet narrowed down where the problem is.
fsx-linux is able to trigger the problem on my test machine so as
suspected it is some bad interaction of writes (plain writes, no mmap),
truncates and probably writeback. Small tests don't seem to trigger the
problem (fsx needs at least a few hundred operations to trigger the
problem) - on the other hand when some sequence of operations causes
lost dirty pages, they are lost deterministically in every run. Also the
file fsx operates on can be fairly small - 2MB was enough - so page
reclaim and such stuff probably isn't the thing we interact with.
Tomorrow I'll try more...

> How/when does it ever "orphan" pages? Because yes, if it ever does that,
> and clears the ->mapping field on a mapped page, then that page will have
> incremented the dirty counts when it became dirty, but will *not*
> decrement the dirty count when it is an orphan.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-12-20 01:19:33

by Nick Piggin

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On Thursday 20 December 2007 12:05, Jan Kara wrote:
> > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > I'll confirm this tomorrow but it seems that even switching to
> > > data=ordered (AFAIK default o ext3) is indeed enough to cure this
> > > problem.
> >
> > Ok, do we actually have any ext3 expert following this? I have no idea
> > about what the journalling code does, but I have painful memories of ext3
> > doing really odd buffer-head-based IO and totally bypassing all the
> > normal page dirty logic.
> >
> > Judging by the symptoms (sorry for not following this well, it came up
> > while I was mostly away travelling), something probably *does* clear the
> > dirty bit on the pages, but the dirty *accounting* is not done properly,
> > so the kernel keeps thinking it has dirty pages.
> >
> > Now, a simple "grep" shows that ext3 does not actually do any
> > ClearPageDirty() or similar on its own, although maybe I missed some
> > other subtle way this can happen. And the *normal* VFS routines that do
> > ClearPageDirty should all be doing the proper accounting.
> >
> > So I see a couple of possible cases:
> >
> > - actually clearing the PG_dirty bit somehow, without doing the
> > accounting.
> >
> > This looks very unlikely. PG_dirty is always cleared by some variant
> > of "*ClearPageDirty()", and that bit definition isn't used for anything
> > else in the whole kernel judging by "grep" (the page allocator tests the
> > bit, that's it).
> >
> > And there aren't that many hits for ClearPageDirty, and they all seem
> > to do the proper "dec_zone_page_state(page, NR_FILE_DIRTY);" etc if
> > the mapping has dirty state accounting.
> >
> > The exceptions seem to be:
> > - the page freeing path, but that path checks that "mapping" is NULL
> > (so no accounting), and would complain loudly if it wasn't
> > - the swap state stuff ("move_from_swap_cache()"), but that should
> > only ever trigger for swap cache pages (we have a BUG_ON() in that
> > path), and those don't do dirty accounting anyway.
> > - pageout(), but again only for pages that have a NULL mapping.
> >
> > - ext3 might be clearing (probably indirectly) the "page->mapping" thing
> > or similar, which in turn will make the VFS think that even a dirty
> > page isn't actually to be accounted for - so when the page *turned*
> > dirty, it was accounted as a dirty page, but then, when it was
> > cleaned, the accounting wasn't reversed because ->mapping had become
> > NULL.
> >
> > This would be some interaction with the truncation logic, and quite
> > frankly, that should be all shared with the non-journal case, so I
> > find this all very unlikely.
> >
> > However, that second case is interesting, because the pageout case
> > actually has a comment like this:
> >
> > /*
> > * Some data journaling orphaned pages can have
> > * page->mapping == NULL while being dirty with clean buffers.
> > */
> >
> > which really sounds like the case in question.
> >
> > I may know the VM, but that special case was added due to insane
> > journaling filesystems, and I don't know what insane things they do.
> > Which is why I'm wondering if there is any ext3 person who knows the
> > journaling code?
>
> Yes, I'm looking into the problem... I think those orphan pages
> without mapping are created because we cannot drop truncated
> buffers/pages immediately. There can be a committing transaction that
> still needs the data in those buffers and until it commits we have to
> keep the pages (and even maybe write them to disk etc.). But eventually,
> we should write the buffers, call try_to_free_buffers() which calls
> cancel_dirty_page() and everything should be happy... in theory ;)

If mapping is NULL, then try_to_free_buffers won't call cancel_dirty_page,
I think?

I don't know whether ext3 can be changed to not require/allow these dirty
pages, but I would rather Linus's dirty page accounting fix go into that
path (the /* can this still happen? */ in try_to_free_buffers()), if possible.

Then you could also have a WARN_ON in __remove_from_page_cache().

2007-12-20 14:24:32

by Björn Steinbrink

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
>
>
> On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> >
> > I'll confirm this tomorrow but it seems that even switching to data=ordered
> > (AFAIK default o ext3) is indeed enough to cure this problem.
>
> Ok, do we actually have any ext3 expert following this? I have no idea
> about what the journalling code does, but I have painful memories of ext3
> doing really odd buffer-head-based IO and totally bypassing all the normal
> page dirty logic.
>
> Judging by the symptoms (sorry for not following this well, it came up
> while I was mostly away travelling), something probably *does* clear the
> dirty bit on the pages, but the dirty *accounting* is not done properly,
> so the kernel keeps thinking it has dirty pages.
>
> Now, a simple "grep" shows that ext3 does not actually do any
> ClearPageDirty() or similar on its own, although maybe I missed some other
> subtle way this can happen. And the *normal* VFS routines that do
> ClearPageDirty should all be doing the proper accounting.
>
> So I see a couple of possible cases:
>
> - actually clearing the PG_dirty bit somehow, without doing the
> accounting.
>
> This looks very unlikely. PG_dirty is always cleared by some variant of
> "*ClearPageDirty()", and that bit definition isn't used for anything
> else in the whole kernel judging by "grep" (the page allocator tests
> the bit, that's it).

OK, so I looked for PG_dirty anyway.

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
bail out if the page is dirty.

Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
truncate_complete_page, because it called cancel_dirty_page (and thus
cleared PG_dirty) after try_to_free_buffers was called via
do_invalidatepage.

Now, if I'm not mistaken, we can end up as follows.

truncate_complete_page()
  cancel_dirty_page()               // PG_dirty cleared, decr. dirty pages
  do_invalidatepage()
    ext3_invalidatepage()
      journal_invalidatepage()
        journal_unmap_buffer()
          __dispose_buffer()
            __journal_unfile_buffer()
              __journal_temp_unlink_buffer()
                mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

If journal_unmap_buffer then returns 0, try_to_free_buffers is not
called and neither is cancel_dirty_page, so the dirty pages accounting
is not decreased again.
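
A toy userspace model of that imbalance (illustrative only, not kernel code;
the counter name just mirrors NR_FILE_DIRTY):

#include <stdio.h>

/* Per truncated data=journal page: the page was accounted when it became
 * dirty, cancel_dirty_page() takes that back, mark_buffer_dirty() (via
 * ext3_invalidatepage) re-adds it, and nothing ever subtracts it again
 * once ->mapping is gone. */
static long nr_file_dirty;

static void truncate_one_dirty_page(void)
{
	nr_file_dirty++;	/* page dirtied and accounted normally              */
	nr_file_dirty--;	/* cancel_dirty_page() in truncate_complete_page()  */
	nr_file_dirty++;	/* mark_buffer_dirty() during do_invalidatepage()   */
	/* journal_unmap_buffer() returned 0: no try_to_free_buffers(), no
	 * cancel_dirty_page(), so no matching decrement ever happens.       */
}

int main(void)
{
	int i;

	for (i = 0; i < 1000; i++)
		truncate_one_dirty_page();

	/* The pages are long gone, but the system-wide count keeps growing. */
	printf("phantom dirty pages after 1000 truncates: %ld\n", nr_file_dirty);
	return 0;
}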

As try_to_free_buffers got its ext3 hack back in
ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
3e67c0987d7567ad666641164a153dca9a43b11d should be reverted? (Except for
the accounting fix in cancel_dirty_page, of course).


On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
io accounting for cancelled writes always happened if the page
was dirty, regardless of page->mapping. This was also already true for
the old test_clear_page_dirty code, and the commit log for
8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
change either, so maybe the "if (account_size)" block should be moved
out of the "if (mapping && ...)" block?

Björn - not sending patches because he needs sleep and wouldn't have a
damn clue about what to write as a commit message anyway

2007-12-20 15:05:25

by Jan Kara

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

> On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> >
> >
> > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > >
> > > I'll confirm this tomorrow but it seems that even switching to data=ordered
> > > (AFAIK default o ext3) is indeed enough to cure this problem.
> >
> > Ok, do we actually have any ext3 expert following this? I have no idea
> > about what the journalling code does, but I have painful memories of ext3
> > doing really odd buffer-head-based IO and totally bypassing all the normal
> > page dirty logic.
> >
> > Judging by the symptoms (sorry for not following this well, it came up
> > while I was mostly away travelling), something probably *does* clear the
> > dirty bit on the pages, but the dirty *accounting* is not done properly,
> > so the kernel keeps thinking it has dirty pages.
> >
> > Now, a simple "grep" shows that ext3 does not actually do any
> > ClearPageDirty() or similar on its own, although maybe I missed some other
> > subtle way this can happen. And the *normal* VFS routines that do
> > ClearPageDirty should all be doing the proper accounting.
> >
> > So I see a couple of possible cases:
> >
> > - actually clearing the PG_dirty bit somehow, without doing the
> > accounting.
> >
> > This looks very unlikely. PG_dirty is always cleared by some variant of
> > "*ClearPageDirty()", and that bit definition isn't used for anything
> > else in the whole kernel judging by "grep" (the page allocator tests
> > the bit, that's it).
>
> OK, so I looked for PG_dirty anyway.
>
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
>
> Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
>
> Now, if I'm not mistaken, we can end up as follows.
>
> truncate_complete_page()
> cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> do_invalidatepage()
> ext3_invalidatepage()
> journal_invalidatepage()
> journal_unmap_buffer()
> __dispose_buffer()
> __journal_unfile_buffer()
> __journal_temp_unlink_buffer()
> mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
>
> If journal_unmap_buffer then returns 0, try_to_free_buffers is not
> called and neither is cancel_dirty_page, so the dirty pages accounting
> is not decreased again.
Yes, this can happen. The call to mark_buffer_dirty() is fallout
from journal_unfile_buffer() trying to synchronise the JBD private dirty bit
(jbddirty) with the standard dirty bit. We could actually clear the
jbddirty bit before calling journal_unfile_buffer() so that this doesn't
happen but since Linus changed remove_from_pagecache() to not care about
redirtying the page I guess it's not needed any more...

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-12-20 16:05:35

by Jan Kara

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

> > On 2007.12.19 09:44:50 -0800, Linus Torvalds wrote:
> > >
> > >
> > > On Sun, 16 Dec 2007, Krzysztof Oledzki wrote:
> > > >
> > > > I'll confirm this tomorrow but it seems that even switching to data=ordered
> > > > (AFAIK default o ext3) is indeed enough to cure this problem.
> > >
> > > Ok, do we actually have any ext3 expert following this? I have no idea
> > > about what the journalling code does, but I have painful memories of ext3
> > > doing really odd buffer-head-based IO and totally bypassing all the normal
> > > page dirty logic.
> > >
> > > Judging by the symptoms (sorry for not following this well, it came up
> > > while I was mostly away travelling), something probably *does* clear the
> > > dirty bit on the pages, but the dirty *accounting* is not done properly,
> > > so the kernel keeps thinking it has dirty pages.
> > >
> > > Now, a simple "grep" shows that ext3 does not actually do any
> > > ClearPageDirty() or similar on its own, although maybe I missed some other
> > > subtle way this can happen. And the *normal* VFS routines that do
> > > ClearPageDirty should all be doing the proper accounting.
> > >
> > > So I see a couple of possible cases:
> > >
> > > - actually clearing the PG_dirty bit somehow, without doing the
> > > accounting.
> > >
> > > This looks very unlikely. PG_dirty is always cleared by some variant of
> > > "*ClearPageDirty()", and that bit definition isn't used for anything
> > > else in the whole kernel judging by "grep" (the page allocator tests
> > > the bit, that's it).
> >
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> > cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> > do_invalidatepage()
> > ext3_invalidatepage()
> > journal_invalidatepage()
> > journal_unmap_buffer()
> > __dispose_buffer()
> > __journal_unfile_buffer()
> > __journal_temp_unlink_buffer()
> > mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
> >
> > If journal_unmap_buffer then returns 0, try_to_free_buffers is not
> > called and neither is cancel_dirty_page, so the dirty pages accounting
> > is not decreased again.
> Yes, this can happen. The call to mark_buffer_dirty() is a fallout
> from journal_unfile_buffer() trying to sychronise JBD private dirty bit
> (jbddirty) with the standard dirty bit. We could actually clear the
> jbddirty bit before calling journal_unfile_buffer() so that this doesn't
> happen but since Linus changed remove_from_pagecache() to not care about
> redirtying the page I guess it's not needed any more...
Oops, sorry, I spoke too soon. After thinking more about it, I think we
cannot clear the dirty bit (at least not jbddirty) in all cases, and in
fact moving cancel_dirty_page() after the do_invalidatepage() call only
hides the real problem.
Let's recap what JBD/ext3 code requires in case of truncation. A
life-cycle of a journaled buffer looks as follows: When we want to write
some data to it, it gets attached to the running transaction. When the
transaction is committing, the buffer is written to the journal.
Sometime later, the buffer is written to its final place in the
filesystem - this is called checkpoint - and can be released.
Now suppose a write to the buffer happens in one transaction and you
truncate the buffer in the next one. You cannot just free the buffer
immediately - it can, for example, happen that the transaction in which
the write happened hasn't committed yet. So we just leave the dirty
buffer there and it should be cleaned up later when the committing
transaction writes the data where it needs...
The problem is that when the commit code writes the buffer, it
eventually calls try_to_free_buffers() but as Nick pointed out,
->mapping is set to NULL by that time so we don't even call
cancel_dirty_page() and so the number of dirty pages is not properly
decreased. Of course, we could decrease the number of dirty pages after
we return from do_invalidatepage when clearing ->mapping but that would
make dirty accounting imprecise - we really still have those dirty data
which need writeout. But it's probably the best workaround I can
currently think of.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-12-20 16:26:57

by Linus Torvalds

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Thu, 20 Dec 2007, Björn Steinbrink wrote:
>
> OK, so I looked for PG_dirty anyway.
>
> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> bail out if the page is dirty.
>
> Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
> truncate_complete_page, because it called cancel_dirty_page (and thus
> cleared PG_dirty) after try_to_free_buffers was called via
> do_invalidatepage.
>
> Now, if I'm not mistaken, we can end up as follows.
>
> truncate_complete_page()
> cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> do_invalidatepage()
> ext3_invalidatepage()
> journal_invalidatepage()
> journal_unmap_buffer()
> __dispose_buffer()
> __journal_unfile_buffer()
> __journal_temp_unlink_buffer()
> mark_buffer_dirty(); // PG_dirty set, incr. dirty pages

Good, this seems to be the exact path that actually triggers it. I got to
journal_unmap_buffer(), but was too lazy to actually then bother to follow
it all the way down - I decided that I didn't actually really even care
what the low-level FS layer did, I had already convinced myself that it
obviously must be dirtying the page some way, since that matched the
symptoms exactly (ie only the journaling case was impacted, and this was
all about the journal).

But perhaps more importantly: regardless of what the low-level filesystem
did at that point, the VM accounting shouldn't care, and should be robust
in the face of a low-level filesystem doing strange and wonderful things.
But thanks for bothering to go through the whole history and figure out
what exactly is up.

> As try_to_free_buffers got its ext3 hack back in
> ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> 3e67c0987d7567ad666641164a153dca9a43b11d should be reverted? (Except for
> the accounting fix in cancel_dirty_page, of course).

Yes, I think we have room for cleanups now, and I agree: we ended up
reinstating some questionable code in the VM just because we didn't really
know or understand what was going on in the ext3 journal code.

Of course, it may well be that there is something *else* going on too, but
I do believe that this whole case is what it was all about, and the hacks
end up just (a) making the VM harder to understand (because they cause
non-obvious VM code to work around some very specific filesystem
behaviour) and (b) the hacks obviously hid the _real_ issue, but I think
we've established the real cause, and the hacks clearly weren't enough to
really hide it 100% anyway.

However, there's no way I'll play with that right now (I'm planning on an
-rc6 today), but it might be worth it to make a test-cleanup patch for -mm
which does some VM cleanups:

- don't touch dirty pages in fs/buffer.c (ie undo the meat of commit
ecdfc9787fe527491baefc22dce8b2dbd5b2908d, but not resurrecting the
debugging code)

- remove the calling of "cancel_dirty_page()" entirely from
"truncate_complete_page()", and let "remove_from_page_cache()" just
always handle it (and probably just add a "ClearPageDirty()" to match
the "ClearPageUptodate()").

- remove "cancel_dirty_page()" from "truncate_huge_page()", which seems
to be the exact same issue (ie we should just use the logic in
remove_from_page_cache()).

at that point "cancel_dirty_page()" literally is only used for what its
name implies, and the only in-tree use of it seems to be NFS for when
the filesystem gets called for ->invalidatepage - which makes tons of
conceptual sense, but I suspect we could drop it from there too, since the
VM layer _will_ cancel the dirtiness at a VM level when it then later
removes it from the page cache.

So we essentially seem to be able to simplify things a bit by getting rid
of a hack in try_to_free_buffers(), and potentially getting rid of
cancel_dirty_page() entirely.

It would imply that we need to do the task_io_account_cancelled_write()
inside "remove_from_page_cache()", but that should be ok (I don't see any
locking issues there).
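
A rough, untested sketch of what the second of those cleanups might look
like; the page_cache_release() tail is assumed from the kernel of that era,
and remove_from_page_cache()/__remove_from_page_cache() are assumed to pick
up the dirty handling as in the patch earlier in the thread:

static void
truncate_complete_page(struct address_space *mapping, struct page *page)
{
	if (page->mapping != mapping)
		return;

	if (PagePrivate(page))
		do_invalidatepage(page, 0);

	/*
	 * No cancel_dirty_page() here any more: remove_from_page_cache()
	 * would do a ClearPageDirty() to match the ClearPageUptodate()
	 * below, and __remove_from_page_cache() drops NR_FILE_DIRTY and
	 * BDI_RECLAIMABLE for accounted mappings.
	 */
	remove_from_page_cache(page);
	ClearPageUptodate(page);
	ClearPageMappedToDisk(page);
	page_cache_release(page);		/* pagecache ref */
}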

> On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> io accounting for cancelled writes happened always happened if the page
> was dirty, regardless of page->mapping. This was also already true for
> the old test_clear_page_dirty code, and the commit log for
> 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> change either, so maybe the "if (account_size)" block should be moved
> out of the if "(mapping && ...)" block?

I think the "if (account_size)" thing was *purely* for me being worried
about hugetlb entries, and I think that's the only thing that passes in a
zero account size.

But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set (exactly because we
cannot account for those pages *anyway*), so I think we could go further
than move the account_size outside of the test, I think we could probably
remove that test entirely and drop the whole thing.

The thing is, task_io_account_cancelled_write() doesn't make sense on
mappings that don't do dirty accounting, since those mappings are all
special anyway: they don't actually do any real IO, they are all in-ram
things. So I think it should stay inside the

if (mapping && mapping_cap_account_dirty(mapping))
..

test, simply because I don't think it makes any conceptual sense outside
of it.

Hmm?

But none of this seems really critical - just simple cleanups.

Linus

2007-12-20 17:26:07

by Jan Kara

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> >
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> > cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> > do_invalidatepage()
> > ext3_invalidatepage()
> > journal_invalidatepage()
> > journal_unmap_buffer()
> > __dispose_buffer()
> > __journal_unfile_buffer()
> > __journal_temp_unlink_buffer()
> > mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
>
> Good, this seems to be the exact path that actually triggers it. I got to
> journal_unmap_buffer(), but was too lazy to actually then bother to follow
> it all the way down - I decided that I didn't actually really even care
> what the low-level FS layer did, I had already convinced myself that it
> obviously must be dirtying the page some way, since that matched the
> symptoms exactly (ie only the journaling case was impacted, and this was
> all about the journal).
>
> But perhaps more importantly: regardless of what the low-level filesystem
> did at that point, the VM accounting shouldn't care, and should be robust
> in the face of a low-level filesystem doing strange and wonderful things.
> But thanks for bothering to go through the whole history and figure out
> what exactly is up.
As I wrote in my previous email, this solution works but hides the
fact that the page really *has* dirty data in it and *is* pinned in memory
until the commit code gets to writing it. So in theory it could disturb
the writeout logic by having more dirty data in memory than vm thinks it
has. Not that I'd have a better fix now but I wanted to point out this
problem.

Honza
--
Jan Kara <[email protected]>
SuSE CR Labs

2007-12-20 19:25:08

by Linus Torvalds

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Thu, 20 Dec 2007, Jan Kara wrote:
>
> As I wrote in my previous email, this solution works but hides the
> fact that the page really *has* dirty data in it and *is* pinned in memory
> until the commit code gets to writing it. So in theory it could disturb
> the writeout logic by having more dirty data in memory than vm thinks it
> has. Not that I'd have a better fix now but I wanted to point out this
> problem.

Well, I worry more about the VM being sane - and by the time we actually
hit this case, as far as VM sanity is concerned, the page no longer really
exists. It's been removed from the page cache, and it only really exists
as any other random kernel allocation.

The fact that low-level filesystems (in this case ext3 journaling) do
their own insane things is not something the VM even _should_ care about.
It's just an internal FS allocation, and the FS can do whatever the hell
it wants with it, including doing IO etc.

The kernel doesn't consider any other random IO pages to be "dirty" either
(eg if you do direct-IO writes using low-level SCSI commands, the VM
doesn't consider that to be any special dirty stuff, it's just random page
allocations again). This is really no different.

In other words: the Linux "VM" subsystem is really two different parts: the
low-level page allocator (which obviously knows that the page is still in
*use*, since it hasn't been free'd), and the higher-level file mapping and
caching stuff that knows about things like page "dirtiness". And once
you've done a "remove_from_page_cache()", the higher levels are no longer
involved, and dirty accounting simply doesn't get into the picture.

Linus

2007-12-20 22:28:40

by Björn Steinbrink

Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:
>
>
> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
> >
> > OK, so I looked for PG_dirty anyway.
> >
> > In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
> > bail out if the page is dirty.
> >
> > Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
> > truncate_complete_page, because it called cancel_dirty_page (and thus
> > cleared PG_dirty) after try_to_free_buffers was called via
> > do_invalidatepage.
> >
> > Now, if I'm not mistaken, we can end up as follows.
> >
> > truncate_complete_page()
> > cancel_dirty_page() // PG_dirty cleared, decr. dirty pages
> > do_invalidatepage()
> > ext3_invalidatepage()
> > journal_invalidatepage()
> > journal_unmap_buffer()
> > __dispose_buffer()
> > __journal_unfile_buffer()
> > __journal_temp_unlink_buffer()
> > mark_buffer_dirty(); // PG_dirty set, incr. dirty pages
>
> Good, this seems to be the exact path that actually triggers it. I got to
> journal_unmap_buffer(), but was too lazy to actually then bother to follow
> it all the way down - I decided that I didn't actually really even care
> what the low-level FS layer did, I had already convinced myself that it
> obviously must be dirtying the page some way, since that matched the
> symptoms exactly (ie only the journaling case was impacted, and this was
> all about the journal).
>
> But perhaps more importantly: regardless of what the low-level filesystem
> did at that point, the VM accounting shouldn't care, and should be robust
> in the face of a low-level filesystem doing strange and wonderful things.
> But thanks for bothering to go through the whole history and figure out
> what exactly is up.

Oh well, after seeing the move of cancel_dirty_page, I just went
backwards from __set_page_dirty using cscope + some smart guessing and
quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)

> > As try_to_free_buffers got its ext3 hack back in
> > ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
> > 3e67c0987d7567ad666641164a153dca9a43b11d should be reverted? (Except for
> > the accounting fix in cancel_dirty_page, of course).
>
> Yes, I think we have room for cleanups now, and I agree: we ended up
> reinstating some questionable code in the VM just because we didn't really
> know or understand what was going on in the ext3 journal code.

Hm, you attributed more to my mail than there was actually in it. I
didn't even start to think of cleanups (because I don't know jack about
the whole ext3/jbd stuff, so I simply cannot come up with any cleanups
(yet?)). What I meant is that we only did a half-revert of that hackery.

When try_to_free_buffers started to check for PG_dirty, the
cancel_dirty_page call had to be called before do_invalidatepage, to
"fix" a _huge_ leak. But that caused the accouting breakage we're now
seeing, because we never account for the pages that got redirtied during
do_invalidatepage.

Then the change to try_to_free_buffers got reverted, so we no longer
need to call cancel_dirty_page before do_invalidatepage, but still we
do. Thus the accounting bug remains. So what I meant to suggest was
simply to actually "finish" the revert we started.

Or expressed as a patch:

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	remove_from_page_cache(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);

I'll be the last one to comment on whether or not that causes inaccurate
accounting, so I'll just watch you and Jan battle that out until someone
comes up with a post-.24 patch to provide a clean fix for the issue.

Krzysztof, could you give this patch a test run?

If that "fixes" the problem for now, I'll try to come up with some
usable commit message, or if someone wants to beat me to it, you can
already have my

Signed-off-by: Björn Steinbrink <[email protected]>

> > On a side note, before 8368e328dfe1c534957051333a87b3210a12743b the task
> > io accounting for cancelled writes happened always happened if the page
> > was dirty, regardless of page->mapping. This was also already true for
> > the old test_clear_page_dirty code, and the commit log for
> > 8368e328dfe1c534957051333a87b3210a12743b doesn't mention that semantic
> > change either, so maybe the "if (account_size)" block should be moved
> > out of the if "(mapping && ...)" block?
>
> I think the "if (account_size)" thing was *purely* for me being worried
> about hugetlb entries, and I think that's the only thing that passes in a
> zero account size.
>
> But hugetlbfs already has BDI_CAP_NO_ACCT_DIRTY set (exactly because we
> cannot account for those pages *anyway*), so I think we could go further
> than move the account_size outside of the test, I think we could probably
> remove that test entirely and drop the whole thing.
>
> The thing is, task_io_account_cancelled_write() doesn't make sense on
> mappings that don't do dirty accounting, since those mappings are all
> special anyway: they don't actually do any real IO, they are all in-ram
> things. So I think it should stay inside the
>
> if (mapping && mapping_cap_account_dirty(mapping))
> ..
>
> test, simply because I don't think it makes any conceptual sense outside
> of it.
>
> Hmm?

Yeah, you're right. It's symmetric with __set_page_dirty, which only
calls task_io_account_write when there's a mapping.
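
For illustration, here is a minimal sketch of that symmetry. It is
written from the discussion above rather than copied from the actual
mm/ sources; the sketch_* names are made up, and details like locking
and the per-BDI counters are left out:

#include <linux/mm.h>
#include <linux/pagemap.h>
#include <linux/backing-dev.h>
#include <linux/task_io_accounting_ops.h>

/* Dirtying side: only account when the mapping does dirty accounting. */
static void sketch_account_page_dirtied(struct page *page,
					struct address_space *mapping)
{
	if (mapping && mapping_cap_account_dirty(mapping)) {
		inc_zone_page_state(page, NR_FILE_DIRTY);
		task_io_account_write(PAGE_CACHE_SIZE);
	}
}

/* Cancelling side: mirror the same test before undoing the accounting. */
static void sketch_cancel_dirty_page(struct page *page,
				     unsigned int account_size)
{
	if (TestClearPageDirty(page)) {
		struct address_space *mapping = page->mapping;

		if (mapping && mapping_cap_account_dirty(mapping)) {
			dec_zone_page_state(page, NR_FILE_DIRTY);
			if (account_size)
				task_io_account_cancelled_write(account_size);
		}
	}
}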

Björn

2007-12-21 02:09:31

by Nick Piggin

[permalink] [raw]
Subject: Re: [Bug 9182] Critical memory leak (dirty pages)

On Friday 21 December 2007 06:24, Linus Torvalds wrote:
> On Thu, 20 Dec 2007, Jan Kara wrote:
> > As I wrote in my previous email, this solution works but hides the
> > fact that the page really *has* dirty data in it and *is* pinned in
> > memory until the commit code gets to writing it. So in theory it could
> > disturb the writeout logic by having more dirty data in memory than vm
> > thinks it has. Not that I'd have a better fix now but I wanted to point
> > out this problem.
>
> Well, I worry more about the VM being sane - and by the time we actually
> hit this case, as far as VM sanity is concerned, the page no longer really
> exists. It's been removed from the page cache, and it only really exists
> as any other random kernel allocation.

It does allow the VM to just not worry about this. However I don't
really like these kinds of catch-all conditions that are hard to get
rid of and can encourage bad behaviour.

It would be nice if the "insane" things were made to clean up after
themselves.


> The fact that low-level filesystems (in this case ext3 journaling) do
> their own insane things is not something the VM even _should_ care about.
> It's just an internal FS allocation, and the FS can do whatever the hell
> it wants with it, including doing IO etc.
>
> The kernel doesn't consider any other random IO pages to be "dirty" either
> (eg if you do direct-IO writes using low-level SCSI commands, the VM
> doesn't consider that to be any special dirty stuff, it's just random page
> allocations again). This is really no different.
>
> In other words: the Linux "VM" subsystem is really two different parts: the
> low-level page allocator (which obviously knows that the page is still in
> *use*, since it hasn't been free'd), and the higher-level file mapping and
> caching stuff that knows about things like page "dirtiness". And once
> you've done a "remove_from_page_cache()", the higher levels are no longer
> involved, and dirty accounting simply doesn't get into the picture.

That's all true... it would simply be nice to ask the filesystems to do
this. But anyway I think your patch is pretty reasonable for the moment.

2007-12-21 20:00:55

by Krzysztof Oledzki

[permalink] [raw]
Subject: Re: [Bug 9182] Critical memory leak (dirty pages)



On Thu, 20 Dec 2007, Björn Steinbrink wrote:

> On 2007.12.20 08:25:56 -0800, Linus Torvalds wrote:
>>
>>
>> On Thu, 20 Dec 2007, Björn Steinbrink wrote:
>>>
>>> OK, so I looked for PG_dirty anyway.
>>>
>>> In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0 you made try_to_free_buffers
>>> bail out if the page is dirty.
>>>
>>> Then in 3e67c0987d7567ad666641164a153dca9a43b11d, Andrew fixed
>>> truncate_complete_page, because it called cancel_dirty_page (and thus
>>> cleared PG_dirty) after try_to_free_buffers was called via
>>> do_invalidatepage.
>>>
>>> Now, if I'm not mistaken, we can end up as follows.
>>>
>>> truncate_complete_page()
>>>   cancel_dirty_page()                  // PG_dirty cleared, decr. dirty pages
>>>   do_invalidatepage()
>>>     ext3_invalidatepage()
>>>       journal_invalidatepage()
>>>         journal_unmap_buffer()
>>>           __dispose_buffer()
>>>             __journal_unfile_buffer()
>>>               __journal_temp_unlink_buffer()
>>>                 mark_buffer_dirty();   // PG_dirty set, incr. dirty pages
>>
>> Good, this seems to be the exact path that actually triggers it. I got to
>> journal_unmap_buffer(), but was too lazy to actually then bother to follow
>> it all the way down - I decided that I didn't actually really even care
>> what the low-level FS layer did, I had already convinced myself that it
>> obviously must be dirtying the page some way, since that matched the
>> symptoms exactly (ie only the journaling case was impacted, and this was
>> all about the journal).
>>
>> But perhaps more importantly: regardless of what the low-level filesystem
>> did at that point, the VM accounting shouldn't care, and should be robust
>> in the face of a low-level filesystem doing strange and wonderful things.
>> But thanks for bothering to go through the whole history and figure out
>> what exactly is up.
>
> Oh well, after seeing the move of cancel_dirty_page, I just went
> backwards from __set_page_dirty using cscope + some smart guessing and
> quickly ended up at ext3_invalidatepage, so it wasn't that hard :-)
>
>>> As try_to_free_buffers got its ext3 hack back in
>>> ecdfc9787fe527491baefc22dce8b2dbd5b2908d, maybe
>>> 3e67c0987d7567ad666641164a153dca9a43b11d should be reverted? (Except for
>>> the accounting fix in cancel_dirty_page, of course).
>>
>> Yes, I think we have room for cleanups now, and I agree: we ended up
>> reinstating some questionable code in the VM just because we didn't really
>> know or understand what was going on in the ext3 journal code.
>
> Hm, you attributed more to my mail than there was actually in it. I
> didn't even start to think of cleanups (because I don't know jack about
> the whole ext3/jbd stuff, so I simply cannot come up with any cleanups
> (yet?)). What I meant is that we only did a half-revert of that hackery.
>
> When try_to_free_buffers started to check for PG_dirty,
> cancel_dirty_page had to be called before do_invalidatepage, to
> "fix" a _huge_ leak. But that caused the accounting breakage we're now
> seeing, because we never account for the pages that got redirtied during
> do_invalidatepage.
>
> Then the change to try_to_free_buffers got reverted, so we no longer
> need to call cancel_dirty_page before do_invalidatepage, but still we
> do. Thus the accounting bug remains. So what I meant to suggest was
> simply to actually "finish" the revert we started.
>
> Or expressed as a patch:
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index cadc156..2974903 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
>  	if (page->mapping != mapping)
>  		return;
>  
> -	cancel_dirty_page(page, PAGE_CACHE_SIZE);
> -
>  	if (PagePrivate(page))
>  		do_invalidatepage(page, 0);
>  
> +	cancel_dirty_page(page, PAGE_CACHE_SIZE);
> +
>  	remove_from_page_cache(page);
>  	ClearPageUptodate(page);
>  	ClearPageMappedToDisk(page);
>
> I'll be the last one to comment on whether or not that causes inaccurate
> accounting, so I'll just watch you and Jan battle that out until someone
> comes up with a post-.24 patch to provide a clean fix for the issue.
>
> Krzysztof, could you give this patch a test run?
>
> If that "fixes" the problem for now, I'll try to come up with some
> usable commit message, or if someone wants to beat me to it, you can
> already have my
>
> Signed-off-by: Björn Steinbrink <[email protected]>

Checked with 2.6.24-rc5 + debug/fixup patch from Linus + the above fix.
After 3h there have been no warnings about __remove_from_page_cache(),
so it seems to be OK.

Tested-by: Krzysztof Piotr Oledzki <[email protected]>

Best regards,

Krzysztof Olędzki

2007-12-21 20:42:30

by Björn Steinbrink

[permalink] [raw]
Subject: [PATCH] Fix dirty page accounting leak with ext3 data=journal

In 46d2277c796f9f4937bfa668c40b2e3f43e93dd0, try_to_free_buffers was
changed to bail out if the page was dirty. That caused
truncate_complete_page to leak massive amounts of memory, because the
dirty bit was only cleared after the call to try_to_free_buffers. So the
call to cancel_dirty_page was moved up to have the dirty bit cleared
early in 3e67c0987d7567ad666641164a153dca9a43b11d.

The problem with that fix is that the page can be redirtied after
cancel_dirty_page was called, e.g. like this:

truncate_complete_page()
  cancel_dirty_page()                  // PG_dirty cleared, decr. dirty pages
  do_invalidatepage()
    ext3_invalidatepage()
      journal_invalidatepage()
        journal_unmap_buffer()
          __dispose_buffer()
            __journal_unfile_buffer()
              __journal_temp_unlink_buffer()
                mark_buffer_dirty();   // PG_dirty set, incr. dirty pages

And then we end up with dirty pages being wrongly accounted.

In ecdfc9787fe527491baefc22dce8b2dbd5b2908d the changes to
try_to_free_buffers were reverted, so the original reason for the
massive memory leak is gone. We can therefore also revert the move of
the call to cancel_dirty_page in truncate_complete_page and get the
accounting right again.

Signed-off-by: Björn Steinbrink <[email protected]>
Tested-by: Krzysztof Piotr Oledzki <[email protected]>

---
I'm not sure if it matters, but as opposed to the final check in
__remove_from_page_cache, this one also cares about the task io
accounting, so maybe we want to use this instead, although it's not
quite the clean fix either.
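
For reference, the final check in __remove_from_page_cache that I mean
is roughly of this shape; this is a hypothetical sketch written from
the discussion, not the actual mm/filemap.c code, and the sketch_*
name is made up:

#include <linux/mm.h>
#include <linux/backing-dev.h>

/*
 * If a page reaches final removal from the page cache still marked
 * dirty, on a mapping that does dirty accounting, fix up the dirty
 * counter.  Note that no task io accounting happens here.
 */
static void sketch_fixup_redirtied_page(struct page *page,
					struct address_space *mapping)
{
	if (PageDirty(page) && mapping_cap_account_dirty(mapping))
		dec_zone_page_state(page, NR_FILE_DIRTY);
}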

diff --git a/mm/truncate.c b/mm/truncate.c
index cadc156..2974903 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -98,11 +98,11 @@ truncate_complete_page(struct address_space *mapping, struct page *page)
 	if (page->mapping != mapping)
 		return;
 
-	cancel_dirty_page(page, PAGE_CACHE_SIZE);
-
 	if (PagePrivate(page))
 		do_invalidatepage(page, 0);
 
+	cancel_dirty_page(page, PAGE_CACHE_SIZE);
+
 	remove_from_page_cache(page);
 	ClearPageUptodate(page);
 	ClearPageMappedToDisk(page);