2008-10-26 22:57:36

by Dave Chinner

Subject: Order 0 page allocation failure under heavy I/O load


I've been running a workload in a UML recently to reproduce a
problem, and I've been seeing all sorts of latency problems on
the host. The host is running a standard Debian kernel:

$ uname -a
Linux disturbed 2.6.26-1-amd64 #1 SMP Wed Sep 10 15:31:12 UTC 2008 x86_64 GNU/Linux

Basically, the workload running in the UML is:

# fsstress -p 1024 -n 100000 -d /mnt/xfs2/fsstress.dir

Which runs 1024 fsstress processes inside the indicated directory.
Being UML, that translates to 1024 processes on the host doing I/O
to a single file in an XFS filesystem. The problem is that this
load appears to be triggering OOM on the host. The host filesystem
is XFS on a 2 disk MD raid0 stripe.

The host will hang for tens of seconds at a time with both CPU cores
pegged at 100%, and eventually I get this in dmesg:

[1304740.261506] linux: page allocation failure. order:0, mode:0x10000
[1304740.261516] Pid: 10705, comm: linux Tainted: P 2.6.26-1-amd64 #1
[1304740.261520]
[1304740.261520] Call Trace:
[1304740.261557] [<ffffffff802768db>] __alloc_pages_internal+0x3ab/0x3c4
[1304740.261574] [<ffffffff80295248>] kmem_getpages+0x96/0x15f
[1304740.261580] [<ffffffff8029589d>] fallback_alloc+0x170/0x1e6
[1304740.261592] [<ffffffff802954b5>] kmem_cache_alloc_node+0x105/0x138
[1304740.261599] [<ffffffff802955ec>] cache_grow+0xdc/0x21d
[1304740.261609] [<ffffffff802958da>] fallback_alloc+0x1ad/0x1e6
[1304740.261620] [<ffffffff80295edd>] kmem_cache_alloc+0xc4/0xf6
[1304740.261625] [<ffffffff8027310c>] mempool_alloc+0x24/0xda
[1304740.261638] [<ffffffff802bd9a8>] bio_alloc_bioset+0x89/0xd9
[1304740.261657] [<ffffffffa016b2b2>] :dm_mod:clone_bio+0x3a/0x79
[1304740.261674] [<ffffffffa016c120>] :dm_mod:__split_bio+0x13a/0x374
[1304740.261697] [<ffffffffa016c8a1>] :dm_mod:dm_request+0x105/0x127
[1304740.261705] [<ffffffff8030c317>] generic_make_request+0x2fe/0x339
[1304740.261709] [<ffffffff8027310c>] mempool_alloc+0x24/0xda
[1304740.261750] [<ffffffffa01dd354>] :xfs:xfs_cluster_write+0xcd/0xf2
[1304740.261763] [<ffffffff8030d6eb>] submit_bio+0xdb/0xe2
[1304740.261796] [<ffffffffa01dcaf6>] :xfs:xfs_submit_ioend_bio+0x1e/0x27
[1304740.261825] [<ffffffffa01dcbbb>] :xfs:xfs_submit_ioend+0xa7/0xc6
[1304740.261857] [<ffffffffa01dd9fc>] :xfs:xfs_page_state_convert+0x500/0x54f
[1304740.261868] [<ffffffff8027bef4>] vma_prio_tree_next+0x3c/0x52
[1304740.261911] [<ffffffffa01ddbaa>] :xfs:xfs_vm_writepage+0xb4/0xea
[1304740.261920] [<ffffffff80276e57>] __writepage+0xa/0x23
[1304740.261924] [<ffffffff8027731c>] write_cache_pages+0x182/0x2b1
[1304740.261928] [<ffffffff80276e4d>] __writepage+0x0/0x23
[1304740.261952] [<ffffffff80277487>] do_writepages+0x20/0x2d
[1304740.261957] [<ffffffff802b62a4>] __writeback_single_inode+0x144/0x29d
[1304740.261966] [<ffffffff8031c625>] prop_fraction_single+0x35/0x55
[1304740.261976] [<ffffffff802b6768>] sync_sb_inodes+0x1b1/0x293
[1304740.261985] [<ffffffff802b6b96>] writeback_inodes+0x62/0xb3
[1304740.261991] [<ffffffff80277929>] balance_dirty_pages_ratelimited_nr+0x155/0x2e7
[1304740.262010] [<ffffffff8027eb42>] do_wp_page+0x578/0x5b2
[1304740.262027] [<ffffffff80281980>] handle_mm_fault+0x7dd/0x867
[1304740.262037] [<ffffffff80246021>] autoremove_wake_function+0x0/0x2e
[1304740.262051] [<ffffffff80221f78>] do_page_fault+0x5d8/0x9c8
[1304740.262061] [<ffffffff80213822>] genregs_get+0x4f/0x70
[1304740.262072] [<ffffffff80429b89>] error_exit+0x0/0x60
[1304740.262089]
[1304740.262091] Mem-info:
[1304740.262093] Node 0 DMA per-cpu:
[1304740.262096] CPU 0: hi: 0, btch: 1 usd: 0
[1304740.262099] CPU 1: hi: 0, btch: 1 usd: 0
[1304740.262101] Node 0 DMA32 per-cpu:
[1304740.262104] CPU 0: hi: 186, btch: 31 usd: 176
[1304740.262107] CPU 1: hi: 186, btch: 31 usd: 172
[1304740.262111] Active:254755 inactive:180546 dirty:13547 writeback:20016 unstable:0
[1304740.262113] free:3059 slab:39487 mapped:141190 pagetables:16401 bounce:0
[1304740.262116] Node 0 DMA free:8032kB min:28kB low:32kB high:40kB active:1444kB inactive:112kB present:10792kB pages_scanned:64 all_unreclaimable? no
[1304740.262122] lowmem_reserve[]: 0 2004 2004 2004
[1304740.262126] Node 0 DMA32 free:4204kB min:5712kB low:7140kB high:8568kB active:1017576kB inactive:722072kB present:2052256kB pages_scanned:0 all_unreclaimable? no
[1304740.262133] lowmem_reserve[]: 0 0 0 0
[1304740.262136] Node 0 DMA: 160*4kB 82*8kB 32*16kB 11*32kB 8*64kB 4*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8048kB
[1304740.262146] Node 0 DMA32: 26*4kB 1*8kB 1*16kB 0*32kB 1*64kB 1*128kB 1*256kB 1*512kB 1*1024kB 1*2048kB 0*4096kB = 4160kB
[1304740.262155] 362921 total pagecache pages
[1304740.262158] Swap cache: add 461446, delete 411499, find 5485707/5511715
[1304740.262161] Free swap = 3240688kB
[1304740.262163] Total swap = 4152744kB
[1304740.274260] 524272 pages of RAM
[1304740.274260] 8378 reserved pages
[1304740.274260] 650528 pages shared
[1304740.274260] 49947 pages swap cached

This allocation failure occurred when something wrote to the root filesystem,
which is LVM on an MD RAID1 mirror. It appears to be bio mempool exhaustion
that is triggering the allocation failure report. The report doesn't come out
every time the system goes catatonic under this workload - it has appeared
in two out of about 10 runs. However, every single run of the workload has
caused the hang-for-tens-of-seconds problem on the host.

Cheers,

Dave.
--
Dave Chinner
[email protected]


2008-10-27 05:54:24

by Cláudio Martins

Subject: Re: Order 0 page allocation failure under heavy I/O load

On Sunday 26 October 2008, Dave Chinner wrote:

> The host will hang for tens of seconds at a time with both CPU cores
> pegged at 100%, and eventually I get this in dmesg:
>
> [1304740.261506] linux: page allocation failure. order:0, mode:0x10000
> [1304740.261516] Pid: 10705, comm: linux Tainted: P 2.6.26-1-amd64

Hello,

Have you tried increasing vm.min_free_kbytes to something higher, say >= 30000?

I think I have hit something similar on a box running nbd-server to export a
3-disk SATA RAID0 block device over Gigabit Ethernet. Increasing
min_free_kbytes got rid of the messages for me... though the load on your
system is much higher than on mine.
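
For reference, it's just a sysctl; something like the following should do
it (32768 is only an illustrative value above the 30000 mark):

 # sysctl -w vm.min_free_kbytes=32768

plus an equivalent vm.min_free_kbytes line in /etc/sysctl.conf if you want
the setting to persist across reboots.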

Regards

Cláudio

2008-10-27 06:22:28

by Dave Chinner

Subject: Re: Order 0 page allocation failure under heavy I/O load

On Mon, Oct 27, 2008 at 06:47:31AM +0100, Claudio Martins wrote:
> On Sunday 26 October 2008, Dave Chinner wrote:
>
> > The host will hang for tens of seconds at a time with both CPU cores
> > pegged at 100%, and eventually I get this in dmesg:
> >
> > [1304740.261506] linux: page allocation failure. order:0, mode:0x10000
> > [1304740.261516] Pid: 10705, comm: linux Tainted: P 2.6.26-1-amd64
>
> Hello,
>
> Have you tried increasing vm.min_free_kbytes to something higher, say >= 30000?

No, because I've found the XFS bug the workload was triggering so
I don't need to run it anymore.

I reported the problem because it appears that we've reported an
allocation failure without very much reclaim scanning (64 pages in
DMA zone, 0 pages in DMA32 zone), and there are apparently pages
available for allocation in the DMA zone:

[1304740.262136] Node 0 DMA: 160*4kB 82*8kB 32*16kB 11*32kB 8*64kB 4*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8048kB

So it appears that memory reclaim has not found the free pages it
apparently has available....

Fundamentally, I/O from a single CPU to a single disk on a machine
with 2GB RAM should not be able to cause allocation failures at all,
especially when the I/O is pure data I/O to a single file. Something
in the default config is busted if I can do that, and that's why
I reported the bug.

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-10-27 08:04:53

by Peter Zijlstra

Subject: Re: Order 0 page allocation failure under heavy I/O load

On Mon, 2008-10-27 at 17:22 +1100, Dave Chinner wrote:
> On Mon, Oct 27, 2008 at 06:47:31AM +0100, Claudio Martins wrote:
> > On Sunday 26 October 2008, Dave Chinner wrote:
> >
> > > The host will hang for tens of seconds at a time with both CPU cores
> > > pegged at 100%, and eventually I get this in dmesg:
> > >
> > > [1304740.261506] linux: page allocation failure. order:0, mode:0x10000
> > > [1304740.261516] Pid: 10705, comm: linux Tainted: P 2.6.26-1-amd64

> No, because I've found the XFS bug the workload was triggering so
> I don't need to run it anymore.
>
> I reported the problem because it appears that we've reported an
> allocation failure without very much reclaim scanning (64 pages in
> DMA zone, 0 pages in DMA32 zone), and there are apparently pages
> available for allocation in the DMA zone:
>
> [1304740.262136] Node 0 DMA: 160*4kB 82*8kB 32*16kB 11*32kB 8*64kB 4*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8048kB
>
> So it appears that memory reclaim has not found the free pages it
> apparently has available....
>
> Fundamentally, I/O from a single CPU to a single disk on a machine
> with 2GB RAM should not be able to cause allocation failures at all,
> especially when the I/O is pure data I/O to a single file. Something
> in the default config is busted if I can do that, and that's why
> I reported the bug.

The allocation is 'mode:0x10000', which is __GFP_NOMEMALLOC. That means
the allocation doesn't have __GFP_WAIT, so it cannot do reclaim, and it
doesn't have __GFP_HIGH, so it can't access the emergency reserves.
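
For reference, decoding that mask against the flag values in
include/linux/gfp.h of that era (quoted from memory, so check your own
tree):

 #define __GFP_WAIT	((__force gfp_t)0x10u)	/* can wait and reclaim */
 #define __GFP_HIGH	((__force gfp_t)0x20u)	/* may use emergency pools */
 #define __GFP_NOMEMALLOC ((__force gfp_t)0x10000u) /* don't use emergency reserves */

 /* mode:0x10000 is therefore __GFP_NOMEMALLOC and nothing else */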

The DMA zone is special; part of it is off-limits to anything but
__GFP_DMA allocations.

You just ran the system very low on memory, and then tried an allocation
that can't do anything about it... I don't find it very surprising that it
fails.

The 'bug', if any, is having such a poor allocation within your I/O path.
Not something to blame on the VM.

2008-10-27 10:17:18

by Dave Chinner

Subject: Re: Order 0 page allocation failure under heavy I/O load

On Mon, Oct 27, 2008 at 09:04:56AM +0100, Peter Zijlstra wrote:
> On Mon, 2008-10-27 at 17:22 +1100, Dave Chinner wrote:
> > On Mon, Oct 27, 2008 at 06:47:31AM +0100, Claudio Martins wrote:
> > > On Sunday 26 October 2008, Dave Chinner wrote:
> > >
> > > > The host will hang for tens of seconds at a time with both CPU cores
> > > > pegged at 100%, and eventually I get this in dmesg:
> > > >
> > > > [1304740.261506] linux: page allocation failure. order:0, mode:0x10000
> > > > [1304740.261516] Pid: 10705, comm: linux Tainted: P 2.6.26-1-amd64
>
> > No, because I've found the XFS bug the workload was triggering so
> > I don't need to run it anymore.
> >
> > I reported the problem because it appears that we've reported an
> > allocation failure without very much reclaim scanning (64 pages in
> > DMA zone, 0 pages in DMA32 zone), and there are apparently pages
> > available for allocation in the DMA zone:
> >
> > [1304740.262136] Node 0 DMA: 160*4kB 82*8kB 32*16kB 11*32kB 8*64kB 4*128kB 3*256kB 0*512kB 0*1024kB 0*2048kB 1*4096kB = 8048kB
> >
> > So it appears that memory reclaim has not found the free pages it
> > apparently has available....
> >
> > Fundamentally, I/O from a single CPU to a single disk on a machine
> > with 2GB RAM should not be able to cause allocation failures at all,
> > especially when the I/O is pure data I/O to a single file. Something
> > in the default config is busted if I can do that, and that's why
> > I reported the bug.
>
> The allocation is 'mode:0x10000', which is __GFP_NOMEMALLOC. That means
> the allocation doesn't have __GFP_WAIT, so it cannot do reclaim, and it
> doesn't have __GFP_HIGH, so it can't access the emergency reserves.

How did we get a gfp_mask with only __GFP_NOMEMALLOC? A mempool
allocation sets that and many more flags (like __GFP_NORETRY) but
they aren't present in that mask....
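
From memory, mempool_alloc() in mm/mempool.c massages the caller's mask
roughly like this, so I'd expect at least __GFP_NORETRY and __GFP_NOWARN
to show up in the reported mode as well:

	gfp_mask |= __GFP_NOMEMALLOC;	/* don't allocate emergency reserves */
	gfp_mask |= __GFP_NORETRY;	/* don't loop in __alloc_pages */
	gfp_mask |= __GFP_NOWARN;	/* failures are OK, fall back to the pool */

	/* and the first attempt is made without __GFP_WAIT or __GFP_IO */
	gfp_temp = gfp_mask & ~(__GFP_WAIT|__GFP_IO);
	element = pool->alloc(gfp_temp, pool->pool_data);

If __GFP_NOWARN were in the mask we shouldn't even have seen this report
printed....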

> The DMA zone is special; part of it is off-limits to anything but
> __GFP_DMA allocations.

So if it wasn't a __GFP_DMA allocation - then what ran out of
memory? There appeared to be memory available in the DMA32 zone....

> You just ran the system very low on memory, and then tried an allocation
> that can't do anything about it... I don't find it very surprising that it
> fails.

I didn't run the system low on memory - the *kernel* did. The
page cache is holding most of the memory, and most of that is clean:

Active:254755 inactive:180546 dirty:13547 writeback:20016 unstable:0
free:3059 slab:39487 mapped:141190 pagetables:16401 bounce:0

> The 'bug', if any, is having such a poor allocation within your I/O path.
> Not something to blame on the VM.

The I/O path started with a page fault and a call to
balance_dirty_pages_ratelimited_nr(), i.e. all the I/O is being done
by the VM, and the allocation failure appears to be caused by
the VM holding all the clean free memory in the page cache where
the I/O layers can't access it. That really does seem like a VM
balance problem to me, not an I/O layer problem....

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-10-28 13:20:39

by Miquel van Smoorenburg

Subject: Re: Order 0 page allocation failure under heavy I/O load

On Mon, 2008-10-27 at 09:57 +1100, Dave Chinner wrote:
> I've been running a workload in a UML recently to reproduce a
> problem, and I've been seeing all sorts of latency problems on
> the host. The host is running a standard Debian kernel:
>
> $ uname -a
> Linux disturbed 2.6.26-1-amd64 #1 SMP Wed Sep 10 15:31:12 UTC 2008 x86_64 GNU/Linux
>
> Basically, the workload running in the UML is:
>
> # fsstress -p 1024 -n 100000 -d /mnt/xfs2/fsstress.dir
>
> Which runs 1024 fsstress processes inside the indicated directory.
> Being UML, that translates to 1024 processes on the host doing I/O
> to a single file in an XFS filesystem. The problem is that this
> load appears to be triggering OOM on the host. The host filesystem
> is XFS on a 2 disk MD raid0 stripe.
>
> The host will hang for tens of seconds at a time with both CPU cores
> pegged at 100%, and eventually I get this in dmesg:
>
> [1304740.261506] linux: page allocation failure. order:0, mode:0x10000
> [1304740.261516] Pid: 10705, comm: linux Tainted: P 2.6.26-1-amd64 #1
> [1304740.261520]
> [1304740.261520] Call Trace:
> [1304740.261557] [<ffffffff802768db>] __alloc_pages_internal+0x3ab/0x3c4
> [1304740.261574] [<ffffffff80295248>] kmem_getpages+0x96/0x15f

I saw the same thing, on i386 though; I never saw it on x86_64. For i386
it helped to recompile with the 2G/2G split set. But it appears that my
problem has been solved in 2.6.26.6 by the commit below. Perhaps you're
hitting something similar. Your kernel version looks like a Debian
version number, and if 2.6.26.6 fixes your problem, please file a Debian
bug report so that lenny won't get released with this bug....

commit 6b546b3dbbc51800bdbd075da923288c6a4fe5af
Author: Mel Gorman <[email protected]>
Date: Sat Sep 13 22:05:39 2008 +0000

mm: mark the correct zone as full when scanning zonelists

commit 5bead2a0680687b9576d57c177988e8aa082b922 upstream

The iterator for_each_zone_zonelist() uses a struct zoneref *z cursor when
scanning zonelists to keep track of where in the zonelist it is. The
zoneref that is returned corresponds to the next zone that is to be
scanned, not the current one. It was intended to be treated as an opaque
list.

When the page allocator is scanning a zonelist, it marks elements in the
zonelist corresponding to zones that are temporarily full. As the
zonelist is being updated, it uses the cursor here:

	if (NUMA_BUILD)
		zlc_mark_zone_full(zonelist, z);

This is intended to prevent rescanning in the near future but the zoneref
cursor does not correspond to the zone that has been found to be full.
This is an easy misunderstanding to make so this patch corrects the
problem by changing zoneref cursor to be the current zone being scanned
instead of the next one.
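
In other words (my sketch of the mechanics, not the actual patch), the
cursor has already advanced past the zone it just returned, so the mark
lands on the wrong zonelist element:

	/* conceptual sketch only; zone_is_full() is a made-up stand-in */
	struct zoneref *z = zonelist->_zonerefs;
	struct zone *zone;

	while ((zone = zonelist_zone(z++)) != NULL) {
		/* z now points one element past 'zone' */
		if (zone_is_full(zone))
			zlc_mark_zone_full(zonelist, z);	/* marks the *next* entry */
	}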

Mike.