LinuxLists.cc - Page alloc problems with 2.6.32-rc kernels

2009-10-27 09:50:25

Subject: Page alloc problems with 2.6.32-rc kernels

Hello list,

I noticed that I get page alloc errors on one of my machines.
It is an old server I use as a fileserver with 512MB RAM. Before
2.6.3[12] I did not see this problem at all.
This only seems to happen when I use mutt on this machine.
I know that it means I am running out of memory here, but I am wondering
why I did not see this before....

mutt: page allocation failure. order:1, mode:0x20
Pid: 21746, comm: mutt Not tainted 2.6.32-rc5 #2
Call Trace:
[<c11981ef>] ? printk+0x18/0x21
[<c1050945>] __alloc_pages_nodemask+0x417/0x4d8
[<c1069edc>] cache_alloc_refill+0x250/0x496
[<c106a1c6>] __kmalloc+0xa4/0xa8
[<c110ddef>] tty_buffer_request_room+0x88/0x11d
[<c110dfb8>] tty_insert_flip_string+0x27/0x93
[<c110e8d9>] pty_write+0x24/0x4a
[<c110a24f>] n_tty_write+0x163/0x3ab
[<c1020960>] ? default_wake_function+0x0/0xd
[<c1107f3c>] tty_write+0x129/0x1c2
[<c110a0ec>] ? n_tty_write+0x0/0x3ab
[<c106cfe0>] vfs_write+0x8e/0x142
[<c1107e13>] ? tty_write+0x0/0x1c2
[<c106d13d>] sys_write+0x3d/0x6b
[<c1002bb5>] syscall_call+0x7/0xb
Mem-Info:
DMA per-cpu:
CPU 0: hi: 0, btch: 1 usd: 0
Normal per-cpu:
CPU 0: hi: 186, btch: 31 usd: 52
active_anon:17616 inactive_anon:20870 isolated_anon:0
active_file:27963 inactive_file:43021 isolated_file:0
unevictable:0 dirty:284 writeback:0 unstable:0 buffer:5599
free:1401 slab_reclaimable:2966 slab_unreclaimable:3731
mapped:4083 shmem:88 pagetables:515 bounce:0
DMA free:2196kB min:84kB low:104kB high:124kB active_anon:1496kB inactive_anon:2340kB active_file:3212kB inactive_file:5884kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:15804kB mlocked:0kB dirty:40kB writeback:0kB mapped:648kB shmem:0kB slab_reclaimable:244kB slab_unreclaimable:244kB kernel_stack:176kB pagetables:64kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 491 491
Normal free:3408kB min:2792kB low:3488kB high:4188kB active_anon:68968kB inactive_anon:81140kB active_file:108640kB inactive_file:166200kB unevictable:0kB isolated(anon):0kB isolated(file):0kB present:503616kB mlocked:0kB dirty:1096kB writeback:0kB mapped:15684kB shmem:352kB slab_reclaimable:11620kB slab_unreclaimable:14680kB kernel_stack:1688kB pagetables:1996kB unstable:0kB bounce:0kB writeback_tmp:0kB pages_scanned:0 all_unreclaimable? no
lowmem_reserve[]: 0 0 0
DMA: 549*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 2196kB
Normal: 754*4kB 35*8kB 7*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB
73515 total pagecache pages
2437 pages in swap cache
Swap cache stats: add 22864, delete 20427, find 290544/291230
Free swap = 456788kB
Total swap = 524280kB
130976 pages RAM
1952 pages reserved
52425 pages shared
86057 pages non-shared

Please put me on CC since I am not subscribed to the list.

Kind regards,
Michael
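
Editor's note: the two buddy-list lines in the report above ("DMA: 549*4kB ..." and "Normal: 754*4kB ...") can be checked mechanically. The sketch below parses such a line, recomputes the total, and then walks a simplified model of the per-order watermark check for an order:1 request. The function names are invented for this note, 4kB pages are assumed, and the watermark logic is reconstructed from memory of zone_watermark_ok() in kernels of that era, with both reductions for atomic allocations assumed to apply. Treat it as an illustration and check mm/page_alloc.c in your own tree before relying on it.

```python
import re

NORMAL = ("Normal: 754*4kB 35*8kB 7*16kB 0*32kB 0*64kB 0*128kB 0*256kB "
          "0*512kB 0*1024kB 0*2048kB 0*4096kB = 3408kB")


def parse_buddy_line(line):
    """Map allocation order -> number of free blocks (4kB base pages assumed)."""
    counts = {}
    for count, size_kb in re.findall(r"(\d+)\*(\d+)kB", line):
        order = (int(size_kb) // 4).bit_length() - 1
        counts[order] = int(count)
    return counts


def total_free_kb(counts):
    """Recompute the '= NNNNkB' total printed at the end of the line."""
    return sum(n * (4 << order) for order, n in counts.items())


def blocks_at_or_above(counts, order):
    """Free blocks that could satisfy a request of the given order."""
    return sum(n for o, n in counts.items() if o >= order)


def watermark_ok(counts, min_pages, order, lowmem_reserve=0):
    """Simplified model of the watermark check (an assumption, see the note)."""
    free = total_free_kb(counts) // 4 - (1 << order) + 1
    mark = min_pages
    mark -= mark // 2   # assumed reduction for high-priority requests
    mark -= mark // 4   # assumed reduction for requests that cannot sleep
    if free <= mark + lowmem_reserve:
        return False
    for o in range(order):
        free -= counts.get(o, 0) << o   # smaller blocks cannot serve this order
        mark >>= 1
        if free <= mark:
            return False
    return True


normal = parse_buddy_line(NORMAL)
print(total_free_kb(normal), "kB free,",
      blocks_at_or_above(normal, 1), "blocks of order >= 1")
print("order 0 ok:", watermark_ok(normal, 2792 // 4, 0))
print("order 1 ok:", watermark_ok(normal, 2792 // 4, 1))
```

Under these assumptions the Normal zone passes the check for a single page but not for order 1: almost all of its free memory is in 754 isolated 4kB pages, and only 42 blocks of order 1 or larger remain. The DMA zone has no block above order 0 at all.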

2009-10-27 12:54:40

by Frans Pop

Subject: Re: Page alloc problems with 2.6.32-rc kernels

Michael Guntsche wrote:
> I noticed that I get page alloc errors on one of my machines.
> It is an old server I use as a fileserver with 512MB RAM. Before
> 2.6.3[12] I did not see this problem at all.
> This only seems to happen when I use mutt on this machine.
> I know that it means I am running out of memory here, but I am wondering
> why I did not see this before....
>
> mutt: page allocation failure. order:1, mode:0x20
> Pid: 21746, comm: mutt Not tainted 2.6.32-rc5 #2

This is a known issue that's being heavily investigated, but is proving
very elusive. See http://lkml.org/lkml/2009/10/22/128 for an overview.

Can you easily reproduce the problem, or does it happen at random moments?
If you can reproduce it, it would be great if you could test the patches
mentioned in that mail.

Cheers,
FJP

2009-10-27 13:01:21

by Jiri Kosina

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On Tue, 27 Oct 2009, Michael Guntsche wrote:

> Hello list,
>
> I noticed that I get page alloc errors on one of my machines.
> It is an old server I use as a fileserver with 512MB RAM. Before
> 2.6.3[12] I did not see this problem at all.
> This only seems to happen when I use mutt on this machine.
> I know that it means I am running out of memory here, but I am wondering
> why I did not see this before....
>
> mutt: page allocation failure. order:1, mode:0x20
> Pid: 21746, comm: mutt Not tainted 2.6.32-rc5 #2
> Call Trace:
> [<c11981ef>] ? printk+0x18/0x21
> [<c1050945>] __alloc_pages_nodemask+0x417/0x4d8
> [<c1069edc>] cache_alloc_refill+0x250/0x496
> [<c106a1c6>] __kmalloc+0xa4/0xa8
> [<c110ddef>] tty_buffer_request_room+0x88/0x11d
> [<c110dfb8>] tty_insert_flip_string+0x27/0x93
> [<c110e8d9>] pty_write+0x24/0x4a
> [<c110a24f>] n_tty_write+0x163/0x3ab
> [<c1020960>] ? default_wake_function+0x0/0xd
> [<c1107f3c>] tty_write+0x129/0x1c2
> [<c110a0ec>] ? n_tty_write+0x0/0x3ab
> [<c106cfe0>] vfs_write+0x8e/0x142
> [<c1107e13>] ? tty_write+0x0/0x1c2
> [<c106d13d>] sys_write+0x3d/0x6b
> [<c1002bb5>] syscall_call+0x7/0xb

This is a consequence of a more general problem in the allocator, and is
probably not tty-specific. Please see the thread below:

http://lkml.org/lkml/2009/10/22/128

--
Jiri Kosina
SUSE Labs, Novell Inc.
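
Editor's note: the "mode:0x20" in the failure message is the allocation's GFP mask, and decoding it helps explain why the problem is not tied to the tty code. The sketch below uses bit values recalled from include/linux/gfp.h around 2.6.32; they are an assumption here (later kernels renumbered these flags), and the table and function names are made up for this note.

```python
# Low GFP bits as recalled for 2.6.32-era kernels (assumption: verify
# against include/linux/gfp.h in the tree you are actually running).
GFP_BITS = {
    0x01: "__GFP_DMA",
    0x02: "__GFP_HIGHMEM",
    0x04: "__GFP_DMA32",
    0x08: "__GFP_MOVABLE",
    0x10: "__GFP_WAIT",
    0x20: "__GFP_HIGH",
    0x40: "__GFP_IO",
    0x80: "__GFP_FS",
}


def decode_gfp(mode):
    """Return (flag names set in mode, leftover bits not in the table)."""
    names = [name for bit, name in sorted(GFP_BITS.items()) if mode & bit]
    known = 0
    for bit in GFP_BITS:
        known |= bit
    return names, mode & ~known


def may_sleep(mode):
    """True if the caller allowed the allocator to block and reclaim."""
    return bool(mode & 0x10)


names, leftover = decode_gfp(0x20)
print(names, hex(leftover), "may sleep:", may_sleep(0x20))
```

If the table is right, 0x20 is __GFP_HIGH alone, which is how GFP_ATOMIC was defined at the time: the caller may dip into reserves but may not sleep, so the allocator cannot wait for reclaim to free a contiguous pair of pages. Any atomic order:1 request would be exposed in the same way, whichever subsystem issued it.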

2009-10-27 14:00:15

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On 27 Oct 09 11:38, Frans Pop wrote:
> Michael Guntsche wrote:
> > I noticed that I get page alloc errors on one of my machines.
> > It is an old server I use as a fileserver with 512MB RAM. Before
> > 2.6.3[12] I did not see this problem at all.
> This is a known issue that's being heavily investigated, but is proving
> very elusive. See http://lkml.org/lkml/2009/10/22/128 for an overview.
>
> Can you easily reproduce the problem, or does it happen at random moments?
> If you can reproduce it, it would be great if you could test the patches
> mentioned in that mail.
Hello Frans,

Thanks for the info. I will try to reproduce it with those patches
applied, but it happens randomly: I had no problems for several days and
just saw it again yesterday. Nevertheless I will see if those patches help
and will report back.

Michael

2009-10-27 14:00:22

by Mel Gorman

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On Tue, Oct 27, 2009 at 11:38:00AM +0100, Frans Pop wrote:
> Michael Guntsche wrote:
> > I noticed that I get page alloc errors on one of my machines.
> > It is an old server I use as a fileserver with 512MB RAM. Before
> > 2.6.3[12] I did not see this problem at all.
> > This only seems to happen when I use mutt on this machine.
> > I know that it means I am running out of memory here, but I am wondering
> > why I did not see this before....
> >
> > mutt: page allocation failure. order:1, mode:0x20
> > Pid: 21746, comm: mutt Not tainted 2.6.32-rc5 #2
>
> This is a known issue that's being heavily investigated, but is proving
> very elusive. See http://lkml.org/lkml/2009/10/22/128 for an overview.
>
> Can you easily reproduce the problem, or does it happen at random moments?
> If you can reproduce it, it would be great if you could test the patches
> mentioned in that mail.
>

Specifically, this looks very like the bug

"page allocation failure message kernel 2.6.31.4 (tty-related)"

The first three patches on the thread Frans Pop linked are expected to close
that one out. Michael, I'm hoping they'll help you too. Whether they
help or not, let me know please.

Thanks

--
Mel Gorman
Part-time Phd Student                          Linux Technology Center
University of Limerick                         IBM Dublin Software Lab

2009-10-29 08:48:29

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On 27 Oct 09 14:00, Mel Gorman wrote:
> Specifically, this looks very like the bug
>
> "page allocation failure message kernel 2.6.31.4 (tty-related)"
>
> The first three patches on the thread Frans Pop linked are expected to close
> that one out. Michael, I'm hoping they'll help you too. Whether they
> help or not, let me know please.

Just a quick update on that one. I applied the first three patches and
so far the problem has not reappeared. But as I said before, there is no
easy way for me to reproduce it, so I will continue testing.

Kind regards,
Michael

2009-10-29 10:05:56

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On 27 Oct 09 14:00, Mel Gorman wrote:
> The first three patches on the thread Frans Pop linked are expected to close
> that one out. Michael, I'm hoping they'll help you too. Whether they
> help or not, let me know please.

Please ignore my last mail. After sending it out via mutt and closing it
I got three page alloc errors. I will now try with 1,2,3+patch 4 and
report back.

/Michael

2009-10-29 15:24:19

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

Quick update,

With patches 1+2+3+4 I STILL get the problems. I am applying patch 5 now
and will report back.

/Michael

2009-10-30 19:28:22

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

Hello Mel and rest of the list,

I have the server running with all patches applied and it runs
without any issues. Since adding patch5 seems to make a difference I
will revert 1-4 and only apply patch 5 to see if it works too. I will
report back as soon as I have news.

Kind regards,
Michael

2009-11-02 12:20:19

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

Missed the CC's

Good morning,

> I have the server running with all patches applied and it runs
> without any issues. Since adding patch5 seems to make a difference I
> will revert 1-4 and only apply patch 5 to see if it works too. I will
> report back as soon as I have news.

Current status of my tests here: with only patch 5 applied (the revert)
I am not able to reproduce the problem. Reading through the ml archives
I noticed that this revert is somewhat controversial since the reverted
changes seem to fix other bugs. Is it possible that reverting those fixes
just hides the bug I am seeing instead of fixing it?

Kind regards,
Michael

2009-11-04 00:14:10

by Frans Pop

Subject: Re: Page alloc problems with 2.6.32-rc kernels

Adding a few more CCs.

On Monday 02 November 2009, Michael Guntsche wrote:
> > I have the server running with all patches applied and it runs
> > without any issues. Since adding patch5 seems to make a difference I
> > will revert 1-4 and only apply patch 5 to see if it works too. I will
> > report back as soon as I have news.
>
> Current status of my tests here. With only patch 5 applied (the revert)
> I am not able to reproduce the problem. Reading through the ml archives
> I noticed that this revert is somewhat controversial since it seems to
> fix other bugs. Is it possible that reverting those fixes just hides the
> bug I am seeing instead of fixing it?

Thanks Michael. That means we now have two cases where reverting the
congestion_wait() changes from .31-rc3 (8aa7e847d8 + 373c0a7ed3) makes a
clear and significant difference.

I wonder if more effort could/should be made on this aspect.

Cheers,
FJP

2009-11-04 07:17:59

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On 04 Nov 09 01:14, Frans Pop wrote:
> Thanks Michael. That means we now have two cases where reverting the
> congestion_wait() changes from .31-rc3 (8aa7e847d8 + 373c0a7ed3) makes a
> clear and significant difference.
>
> I wonder if more effort could/should be made on this aspect.

Good morning Frans,

As a cross check I reverted the revert here and tried to reproduce the
problem again. It is a lot harder to trigger for me now (I have not been
able to reproduce it yet). I did update my local git tree, though. Can you
reproduce this problem on your side with current git?

Kind regards,
Michael

2009-11-04 22:14:32

by Frans Pop

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On Wednesday 04 November 2009, Michael Guntsche wrote:
> On 04 Nov 09 01:14, Frans Pop wrote:
> > Thanks Michael. That means we now have two cases where reverting the
> > congestion_wait() changes from .31-rc3 (8aa7e847d8 + 373c0a7ed3) makes
> > a clear and significant difference.
> >
> > I wonder if more effort could/should be made on this aspect.
>
> As a cross check I reverted the revert here and tried to reproduce the
> problem again. It is a lot harder to trigger for me now (I was not able
> to reproduce it yet). I did update my local git tree though,

OK. Can you tell us a bit more about your setup:
- how much RAM does the system have?
- what's so special about mutt in your case that it triggers these errors?
- do you maybe have a huge mailbox, so mutt uses a lot of memory?
- does starting/using mutt cause swapping when you see the errors?
- do you use disk encryption at all?
- if you do, what is encrypted: the file system, swap, both?

From your first mail it does look as if you had little free memory and that
swap was in use.

> can you reproduce this problem on your side with current git?

Yes I can, but my test case is somewhat special as it forces a huge amount
of swapping. It does look as if your problem may also be related to
swapping activity.

Cheers,
FJP
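
Editor's note: the observation above about little free memory and swap being in use can be read straight off the Mem-Info dump in the first mail. The sketch below pulls the relevant figures out of an excerpt of that report; the function name is invented for this note and a 4kB page size is assumed.

```python
import re

# Excerpt of the Mem-Info block from the first mail in this thread.
REPORT = """\
free:1401 slab_reclaimable:2966 slab_unreclaimable:3731
Free swap = 456788kB
Total swap = 524280kB
130976 pages RAM
"""


def summarize(report, page_kb=4):
    """Extract free-memory and swap-usage figures from a Mem-Info excerpt."""
    free_pages = int(re.search(r"^free:(\d+)", report, re.M).group(1))
    ram_pages = int(re.search(r"^(\d+) pages RAM", report, re.M).group(1))
    free_swap = int(re.search(r"Free swap\s*=\s*(\d+)kB", report).group(1))
    total_swap = int(re.search(r"Total swap\s*=\s*(\d+)kB", report).group(1))
    return {
        "free_kb": free_pages * page_kb,
        "free_fraction": free_pages / ram_pages,
        "swap_used_kb": total_swap - free_swap,
        "swap_used_fraction": (total_swap - free_swap) / total_swap,
    }


for key, value in summarize(REPORT).items():
    print(key, value)
```

For the report in this thread that works out to roughly 5.5MB free, about 1% of RAM, with about 66MB of the 512MB swap in use at the moment of the failure.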

2009-11-04 23:07:25

by Michael Guntsche

Subject: Re: Page alloc problems with 2.6.32-rc kernels

On 04 Nov 09 23:14, Frans Pop wrote:
> OK. Can you tell us a bit more about your setup:
> - how much RAM does the system have?
512MB
> - what's so special about mutt in your case that it triggers these errors?
> - do you maybe have a huge mailbox, so mutt uses a lot of memory?
> - does starting/use mutt cause swapping when you see the errors?
Mutt is accessing a maildir directory with several subdirectories
directly. All of them are added as mailboxes so I can jump to unread
mails. During startup and folder changing mutt is accessing all the
mailboxes.
>
> From your first mail it does look as if you had little free memory and that
> swap was in use.
I noticed that as well.
During the last few days memory usage was not that high, so maybe this
is the reason why I did not see any errors. I will continue running
latest git and see if I get the errors again when more memory is being
used.

Kind regards,
Michael