2010-08-02 13:19:19

by Chris Webb

[permalink] [raw]
Subject: Over-eager swapping

We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
virtual machines on each of them, and I'm having some trouble with over-eager
swapping on some (but not all) of the machines. This is resulting in
customer reports of very poor response latency from the virtual machines
which have been swapped out, despite the hosts apparently having large
amounts of free memory, and running fine if swap is turned off.

All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
machines which apparently doesn't exhibit the problem, and a cluster of
2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
the affected machines is at

http://cdw.me.uk/tmp/config-2.6.32.7

This differs very little from the config on the unaffected Xeon machines,
essentially just

-CONFIG_MCORE2=y
+CONFIG_MK8=y
-CONFIG_X86_P6_NOP=y

On a typical affected machine, the virtual machines and other processes
would apparently leave around 5.5GB of RAM available for buffers, but the
system seems to want to swap out 3GB of anonymous pages to give itself more
like 9GB of buffers:

# cat /proc/meminfo
MemTotal: 33083420 kB
MemFree: 693164 kB
Buffers: 8834380 kB
Cached: 11212 kB
SwapCached: 1443524 kB
Active: 21656844 kB
Inactive: 8119352 kB
Active(anon): 17203092 kB
Inactive(anon): 3729032 kB
Active(file): 4453752 kB
Inactive(file): 4390320 kB
Unevictable: 5472 kB
Mlocked: 5472 kB
SwapTotal: 25165816 kB
SwapFree: 21854572 kB
Dirty: 4300 kB
Writeback: 4 kB
AnonPages: 20780368 kB
Mapped: 6056 kB
Shmem: 56 kB
Slab: 961512 kB
SReclaimable: 438276 kB
SUnreclaim: 523236 kB
KernelStack: 10152 kB
PageTables: 67176 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 41707524 kB
Committed_AS: 39870868 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 150880 kB
VmallocChunk: 34342404996 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 5824 kB
DirectMap2M: 3205120 kB
DirectMap1G: 30408704 kB
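For reference, the imbalance can be read straight off the dump above; a minimal sketch with the relevant figures pasted in as data:

```python
# Quantify the imbalance from the /proc/meminfo dump above
# (values pasted verbatim; all figures are in kB).
meminfo = {
    "Buffers": 8834380,
    "Cached": 11212,
    "SwapTotal": 25165816,
    "SwapFree": 21854572,
}

swap_used = meminfo["SwapTotal"] - meminfo["SwapFree"]  # anon pages pushed to disk
file_cache = meminfo["Buffers"] + meminfo["Cached"]     # reclaimable page cache

print(f"swap in use: {swap_used / 2**20:.1f} GB")   # 3.2 GB swapped out...
print(f"file cache:  {file_cache / 2**20:.1f} GB")  # ...to hold 8.4 GB of buffers
```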

We see this despite the machine having vm.swappiness set to 0 in an attempt
to skew the reclaim as far as possible in favour of releasing page cache
instead of swapping anonymous pages.

After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB.

We could run these machines without swap (in the worst cases we're
already doing so), but I'd prefer to have a reserve of swap available in
case of genuine emergency. If it's a choice between swapping out a guest or
oom-killing it, I'd prefer to swap... but I really don't want to swap out
running virtual machines in order to have eight gigabytes of page cache
instead of five!

Is this a problem with the page reclaim priorities, or am I just tuning
these hosts incorrectly? Is there more detailed info than /proc/meminfo
available which might shed more light on what's going wrong here?
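For anyone digging into a similar host, a few standard sources that do go beyond /proc/meminfo (availability varies slightly by kernel version and config):

```shell
# Per-zone watermarks, LRU sizes and free counts, broken down by NUMA node
cat /proc/zoneinfo

# Cumulative reclaim and swap activity counters
grep -E '^(pgscan|pgsteal|pswp)' /proc/vmstat

# Free block counts per order, per zone (fragmentation at a glance)
cat /proc/buddyinfo
```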

Best wishes,

Chris.


2010-08-02 23:56:05

by Minchan Kim

Subject: Re: Over-eager swapping

On Mon, Aug 2, 2010 at 9:47 PM, Chris Webb <[email protected]> wrote:
> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm have some trouble with over-eager
> swapping on some (but not all) of the machines. This is resulting in
> customer reports of very poor response latency from the virtual machines
> which have been swapped out, despite the hosts apparently having large
> amounts of free memory, and running fine if swap is turned off.
>
> All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
> 32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
> machines which apparently doesn't exhibit the problem, and a cluster of
> 2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
> the affected machines is at
>
>  http://cdw.me.uk/tmp/config-2.6.32.7
>
> This differs very little from the config on the unaffected Xeon machines,
> essentially just
>
>  -CONFIG_MCORE2=y
>  +CONFIG_MK8=y
>  -CONFIG_X86_P6_NOP=y
>
> On a typical affected machine, the virtual machines and other processes
> would apparently leave around 5.5GB of RAM available for buffers, but the
> system seems to want to swap out 3GB of anonymous pages to give itself more
> like 9GB of buffers:
>
>  [/proc/meminfo dump snipped]
>
> We see this despite the machine having vm.swappiness set to 0 in an attempt
> to skew the reclaim as far as possible in favour of releasing page cache
> instead of swapping anonymous pages.
>

Hmm, strange.
We reclaim only anon pages when the system has little page cache
(ie, file + free <= high watermark).
But in your meminfo, your system has lots of page cache pages,
so that isn't likely.
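A loose sketch of that condition (illustrative only, not the kernel code; the watermark figure below is made up for the example):

```python
# Illustrative rendering of the check described above: even with
# swappiness == 0, reclaim targets anon pages once free memory plus
# file cache falls to the zone's high watermark.
def should_force_anon_reclaim(nr_free_kb, nr_file_kb, high_watermark_kb):
    """The 'file + free <= high_water_mark' condition."""
    return nr_free_kb + nr_file_kb <= high_watermark_kb

# Chris's host: ~693 MB free, ~8.6 GB of file pages -- nowhere near any
# plausible watermark, hence "it isn't likely".
print(should_force_anon_reclaim(693164, 8844072, 11584))  # False
```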

Another possibility is _zone_reclaim_ on NUMA.
Your working set has many anonymous pages.

zone_reclaim sets its priority to ZONE_RECLAIM_PRIORITY.
That can switch the reclaim mode to lumpy, which can page out anon pages.

Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?

--
Kind regards,
Minchan Kim

2010-08-03 03:33:07

by Chris Webb

Subject: Re: Over-eager swapping

Minchan Kim <[email protected]> writes:

> Another possibility is _zone_reclaim_ in NUMA.
> Your working set has many anonymous page.
>
> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
> It can make reclaim mode to lumpy so it can page out anon pages.
>
> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?

Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
these are

# cat /proc/sys/vm/zone_reclaim_mode
0
# cat /proc/sys/vm/min_unmapped_ratio
1

I haven't changed either of these from the kernel default.

Many thanks,

Chris.

2010-08-03 04:09:20

by Minchan Kim

Subject: Re: Over-eager swapping

On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <[email protected]> wrote:
> Minchan Kim <[email protected]> writes:
>
>> Another possibility is _zone_reclaim_ in NUMA.
>> Your working set has many anonymous page.
>>
>> The zone_reclaim set priority to ZONE_RECLAIM_PRIORITY.
>> It can make reclaim mode to lumpy so it can page out anon pages.
>>
>> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
>
> Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> these are
>
>  # cat /proc/sys/vm/zone_reclaim_mode
>  0
>  # cat /proc/sys/vm/min_unmapped_ratio
>  1

If zone_reclaim_mode is zero, it doesn't swap out anon pages.

1) How is the VM reclaiming anonymous pages even though vm.swappiness is
zero and there is a big page cache?
2) I find it suspicious that your file pages are almost all in Buffers
while Cached is barely 10M.
Why do the buffers remain while anon pages are swapped out and cached
pages are reclaimed?

Hmm. I have no idea. :(

--
Kind regards,
Minchan Kim

2010-08-03 04:28:48

by Fengguang Wu

Subject: Re: Over-eager swapping

On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <[email protected]> wrote:
> > [...]
>
> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.

If there are lots of order-1 or higher allocations, anonymous pages
will be randomly evicted, regardless of their LRU ages. This is
probably another factor behind what the users are seeing. Are there easy
ways to confirm this other than patching the kernel?

Chris, what's in your /proc/slabinfo?

Thanks,
Fengguang

2010-08-03 04:47:39

by Minchan Kim

Subject: Re: Over-eager swapping

On Tue, Aug 3, 2010 at 1:28 PM, Wu Fengguang <[email protected]> wrote:
> On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
>> [...]
>>
>> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
>
> If there are lots of order-1 or higher allocations, anonymous pages
> will be randomly evicted, regardless of their LRU ages. This is

I thought the amount swapped out (ie, 3G) is too huge to be explained by
entering lumpy mode.
But it's possible. :)

> probably another factor why the users claim. Are there easy ways to
> confirm this other than patching the kernel?

cat /proc/buddyinfo can help?

Off-topic:
It would be better to add a new vmstat counter for lumpy reclaim entry.

Pseudo code.

diff --git a/mm/vmscan.c b/mm/vmscan.c
index 0f9f624..d10ff4e 100644
--- a/mm/vmscan.c
+++ b/mm/vmscan.c
@@ -1641,7 +1641,7 @@ out:
 	}
 }
 
-static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
+static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc,
+				   struct zone *zone)
 {
 	/*
 	 * If we need a large contiguous chunk of memory, or have
@@ -1654,6 +1654,9 @@ static void set_lumpy_reclaim_mode(int priority, struct scan_control *sc)
 		sc->lumpy_reclaim_mode = 1;
 	else
 		sc->lumpy_reclaim_mode = 0;
+
+	if (sc->lumpy_reclaim_mode)
+		inc_zone_state(zone, NR_LUMPY);
 }
 
 /*
@@ -1670,7 +1673,7 @@ static void shrink_zone(int priority, struct zone *zone,
 
 	get_scan_count(zone, sc, nr, priority);
 
-	set_lumpy_reclaim_mode(priority, sc);
+	set_lumpy_reclaim_mode(priority, sc, zone);
 
 	while (nr[LRU_INACTIVE_ANON] || nr[LRU_ACTIVE_FILE] ||
 	       nr[LRU_INACTIVE_FILE]) {

--
Kind regards,
Minchan Kim

2010-08-03 06:39:52

by Fengguang Wu

Subject: Re: Over-eager swapping

On Tue, Aug 03, 2010 at 12:47:36PM +0800, Minchan Kim wrote:
> On Tue, Aug 3, 2010 at 1:28 PM, Wu Fengguang <[email protected]> wrote:
> > On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> >> [...]
> >>
> >> if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
> >
> > If there are lots of order-1 or higher allocations, anonymous pages
> > will be randomly evicted, regardless of their LRU ages. This is
>
> I thought swapped out page is huge (ie, 3G) even though it enters lumpy mode.
> But it's possible. :)
>
> > probably another factor why the users claim. Are there easy ways to
> > confirm this other than patching the kernel?
>
> cat /proc/buddyinfo can help?

Some high order slab caches may show up there :)

> Off-topic:
> It would be better to add new vmstat of lumpy entrance.

I think it's a good debug entry. Although convenient, lumpy reclaim
comes with some bad side effects. When something goes wrong,
it helps to check the number of lumpy reclaims.

Thanks,
Fengguang

> [pseudo-code patch snipped]

2010-08-03 21:51:54

by Chris Webb

Subject: Re: Over-eager swapping

Wu Fengguang <[email protected]> writes:

> Chris, what's in your /proc/slabinfo?

Hi. Sorry for the slow reply. The exact machine from which I previously
extracted that /proc/meminfo has unfortunately had swap turned off by a
colleague while I was away, presumably because its behaviour became too
bad. However, here is info from another member of the cluster, this time
with 5GB of buffers and 2GB of swap in use, i.e. the same general problem:

# cat /proc/meminfo
MemTotal: 33084008 kB
MemFree: 2291464 kB
Buffers: 4908468 kB
Cached: 16056 kB
SwapCached: 1427480 kB
Active: 22885508 kB
Inactive: 5719520 kB
Active(anon): 20466488 kB
Inactive(anon): 3215888 kB
Active(file): 2419020 kB
Inactive(file): 2503632 kB
Unevictable: 10688 kB
Mlocked: 10688 kB
SwapTotal: 25165816 kB
SwapFree: 22798248 kB
Dirty: 2616 kB
Writeback: 0 kB
AnonPages: 23410296 kB
Mapped: 6324 kB
Shmem: 56 kB
Slab: 692296 kB
SReclaimable: 189032 kB
SUnreclaim: 503264 kB
KernelStack: 4568 kB
PageTables: 65588 kB
NFS_Unstable: 0 kB
Bounce: 0 kB
WritebackTmp: 0 kB
CommitLimit: 41707820 kB
Committed_AS: 34859884 kB
VmallocTotal: 34359738367 kB
VmallocUsed: 147616 kB
VmallocChunk: 34342399496 kB
HardwareCorrupted: 0 kB
HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB
DirectMap4k: 5888 kB
DirectMap2M: 2156544 kB
DirectMap1G: 31457280 kB

# cat /proc/slabinfo
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
kmalloc_dma-512 32 32 512 32 4 : tunables 0 0 0 : slabdata 1 1 0
nf_conntrack_expect 312 312 208 39 2 : tunables 0 0 0 : slabdata 8 8 0
nf_conntrack 240 240 272 30 2 : tunables 0 0 0 : slabdata 8 8 0
dm_raid1_read_record 0 0 1064 30 8 : tunables 0 0 0 : slabdata 0 0 0
dm_crypt_io 240 260 152 26 1 : tunables 0 0 0 : slabdata 10 10 0
kcopyd_job 0 0 368 22 2 : tunables 0 0 0 : slabdata 0 0 0
dm_uevent 0 0 2608 12 8 : tunables 0 0 0 : slabdata 0 0 0
dm_rq_target_io 0 0 376 21 2 : tunables 0 0 0 : slabdata 0 0 0
cfq_queue 0 0 168 24 1 : tunables 0 0 0 : slabdata 0 0 0
bsg_cmd 0 0 312 26 2 : tunables 0 0 0 : slabdata 0 0 0
mqueue_inode_cache 36 36 896 36 8 : tunables 0 0 0 : slabdata 1 1 0
udf_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_request 0 0 632 25 4 : tunables 0 0 0 : slabdata 0 0 0
fuse_inode 0 0 704 23 4 : tunables 0 0 0 : slabdata 0 0 0
ntfs_big_inode_cache 0 0 832 39 8 : tunables 0 0 0 : slabdata 0 0 0
ntfs_inode_cache 0 0 264 31 2 : tunables 0 0 0 : slabdata 0 0 0
isofs_inode_cache 0 0 616 26 4 : tunables 0 0 0 : slabdata 0 0 0
fat_inode_cache 0 0 648 25 4 : tunables 0 0 0 : slabdata 0 0 0
fat_cache 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
hugetlbfs_inode_cache 28 28 584 28 4 : tunables 0 0 0 : slabdata 1 1 0
squashfs_inode_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
journal_handle 1360 1360 24 170 1 : tunables 0 0 0 : slabdata 8 8 0
journal_head 288 288 112 36 1 : tunables 0 0 0 : slabdata 8 8 0
revoke_table 512 512 16 256 1 : tunables 0 0 0 : slabdata 2 2 0
revoke_record 1024 1024 32 128 1 : tunables 0 0 0 : slabdata 8 8 0
ext4_inode_cache 0 0 896 36 8 : tunables 0 0 0 : slabdata 0 0 0
ext4_free_block_extents 0 0 56 73 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_alloc_context 0 0 144 28 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_prealloc_space 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
ext4_system_zone 0 0 40 102 1 : tunables 0 0 0 : slabdata 0 0 0
ext2_inode_cache 0 0 752 21 4 : tunables 0 0 0 : slabdata 0 0 0
ext3_inode_cache 2371 2457 768 21 4 : tunables 0 0 0 : slabdata 117 117 0
ext3_xattr 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
configfs_dir_cache 0 0 88 46 1 : tunables 0 0 0 : slabdata 0 0 0
kioctx 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
inotify_inode_mark_entry 36 36 112 36 1 : tunables 0 0 0 : slabdata 1 1 0
posix_timers_cache 224 224 144 28 1 : tunables 0 0 0 : slabdata 8 8 0
kvm_vcpu 38 45 10256 3 8 : tunables 0 0 0 : slabdata 15 15 0
kvm_rmap_desc 19408 21828 40 102 1 : tunables 0 0 0 : slabdata 214 214 0
kvm_pte_chain 14514 28543 56 73 1 : tunables 0 0 0 : slabdata 391 391 0
UDP-Lite 0 0 768 21 4 : tunables 0 0 0 : slabdata 0 0 0
ip_dst_cache 221 231 384 21 2 : tunables 0 0 0 : slabdata 11 11 0
UDP 168 168 768 21 4 : tunables 0 0 0 : slabdata 8 8 0
tw_sock_TCP 256 256 256 32 2 : tunables 0 0 0 : slabdata 8 8 0
TCP 191 220 1472 22 8 : tunables 0 0 0 : slabdata 10 10 0
blkdev_queue 178 210 2128 15 8 : tunables 0 0 0 : slabdata 14 14 0
blkdev_requests 608 816 336 24 2 : tunables 0 0 0 : slabdata 34 34 0
fsnotify_event 0 0 104 39 1 : tunables 0 0 0 : slabdata 0 0 0
sock_inode_cache 250 300 640 25 4 : tunables 0 0 0 : slabdata 12 12 0
file_lock_cache 176 176 184 22 1 : tunables 0 0 0 : slabdata 8 8 0
shmem_inode_cache 1617 1827 776 21 4 : tunables 0 0 0 : slabdata 87 87 0
Acpi-ParseExt 1692 1736 72 56 1 : tunables 0 0 0 : slabdata 31 31 0
proc_inode_cache 1182 1326 616 26 4 : tunables 0 0 0 : slabdata 51 51 0
sigqueue 200 200 160 25 1 : tunables 0 0 0 : slabdata 8 8 0
radix_tree_node 65891 69542 560 29 4 : tunables 0 0 0 : slabdata 2398 2398 0
bdev_cache 312 312 832 39 8 : tunables 0 0 0 : slabdata 8 8 0
sysfs_dir_cache 21585 22287 80 51 1 : tunables 0 0 0 : slabdata 437 437 0
inode_cache 2903 2996 568 28 4 : tunables 0 0 0 : slabdata 107 107 0
dentry 8532 8631 192 21 1 : tunables 0 0 0 : slabdata 411 411 0
buffer_head 1227688 1296648 112 36 1 : tunables 0 0 0 : slabdata 36018 36018 0
vm_area_struct 18494 19389 176 23 1 : tunables 0 0 0 : slabdata 843 843 0
files_cache 236 322 704 23 4 : tunables 0 0 0 : slabdata 14 14 0
signal_cache 606 702 832 39 8 : tunables 0 0 0 : slabdata 18 18 0
sighand_cache 415 480 2112 15 8 : tunables 0 0 0 : slabdata 32 32 0
task_struct 671 840 1616 20 8 : tunables 0 0 0 : slabdata 42 42 0
anon_vma 1511 1920 32 128 1 : tunables 0 0 0 : slabdata 15 15 0
shared_policy_node 255 255 48 85 1 : tunables 0 0 0 : slabdata 3 3 0
numa_policy 19205 20910 24 170 1 : tunables 0 0 0 : slabdata 123 123 0
idr_layer_cache 373 390 544 30 4 : tunables 0 0 0 : slabdata 13 13 0
kmalloc-8192 36 36 8192 4 8 : tunables 0 0 0 : slabdata 9 9 0
kmalloc-4096 2284 2592 4096 8 8 : tunables 0 0 0 : slabdata 324 324 0
kmalloc-2048 750 896 2048 16 8 : tunables 0 0 0 : slabdata 56 56 0
kmalloc-1024 4025 4320 1024 32 8 : tunables 0 0 0 : slabdata 135 135 0
kmalloc-512 1358 1760 512 32 4 : tunables 0 0 0 : slabdata 55 55 0
kmalloc-256 1402 1952 256 32 2 : tunables 0 0 0 : slabdata 61 61 0
kmalloc-128 8625 9280 128 32 1 : tunables 0 0 0 : slabdata 290 290 0
kmalloc-64 7030122 7455232 64 64 1 : tunables 0 0 0 : slabdata 116488 116488 0
kmalloc-32 18603 19712 32 128 1 : tunables 0 0 0 : slabdata 154 154 0
kmalloc-16 8895 9728 16 256 1 : tunables 0 0 0 : slabdata 38 38 0
kmalloc-8 9047 10752 8 512 1 : tunables 0 0 0 : slabdata 21 21 0
kmalloc-192 5130 9135 192 21 1 : tunables 0 0 0 : slabdata 435 435 0
kmalloc-96 1905 2940 96 42 1 : tunables 0 0 0 : slabdata 70 70 0
kmem_cache_node 196 256 64 64 1 : tunables 0 0 0 : slabdata 4 4 0

# cat /proc/buddyinfo
Node 0, zone DMA 2 0 2 2 2 2 2 1 2 2 2
Node 0, zone DMA32 61877 10368 111 10 2 3 1 0 0 0 0
Node 0, zone Normal 2036 0 14 12 6 3 3 0 1 0 0
Node 1, zone Normal 483348 15 2 3 7 1 3 1 0 0 0
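Reading the buddyinfo rows above (the columns are free block counts for orders 0 through 10), the Node 1 Normal zone has plenty of free pages but almost none of them in blocks of order >= 1; a quick check, with that row pasted in as data:

```python
# Each /proc/buddyinfo row lists free block counts for orders 0..10.
# The Node 1 Normal row above is pasted in as data: nearly half a
# million free pages, but almost none in order >= 1 blocks, so any
# order-1+ allocation has to fall back to reclaim.
row = "Node 1, zone Normal 483348 15 2 3 7 1 3 1 0 0 0"
counts = [int(c) for c in row.split()[4:]]  # counts[order] = free blocks

free_pages = sum(n << order for order, n in enumerate(counts))
high_order_pages = sum(n << order for order, n in enumerate(counts) if order >= 1)

print(free_pages)        # 483874 free 4k pages (~1.8 GB)
print(high_order_pages)  # only 526 of them sit in order >= 1 blocks
```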

Best wishes,

Chris.

2010-08-04 02:21:57

by Fengguang Wu

Subject: Re: Over-eager swapping

Chris,

Your slabinfo does contain many order 1-3 slab caches, this is a major source
of high order allocations and hence lumpy reclaim. fork() is another.

In another thread, Pekka Enberg offers a tip:

You can pass "slub_debug=O" as a kernel parameter to disable higher
order allocations if you want to test things.

Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
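The allocation order of each cache can be read off the <pagesperslab> column of slabinfo (1 page = order 0, 2 = order 1, 4 = order 2, 8 = order 3); a small sketch using lines from your dump:

```python
# The <pagesperslab> column gives each cache's allocation order:
# 1 page -> order 0, 2 -> order 1, 4 -> order 2, 8 -> order 3.
# Sample lines copied from the dump above; the parse itself is generic.
sample = """\
kvm_vcpu 38 45 10256 3 8 : tunables 0 0 0 : slabdata 15 15 0
buffer_head 1227688 1296648 112 36 1 : tunables 0 0 0 : slabdata 36018 36018 0
kmalloc-4096 2284 2592 4096 8 8 : tunables 0 0 0 : slabdata 324 324 0
kmalloc-64 7030122 7455232 64 64 1 : tunables 0 0 0 : slabdata 116488 116488 0
"""

flagged = []
for line in sample.splitlines():
    fields = line.split()
    name, pages_per_slab = fields[0], int(fields[5])
    order = pages_per_slab.bit_length() - 1  # log2 of pages per slab
    if order > 0:
        flagged.append((name, order))

print(flagged)  # [('kvm_vcpu', 3), ('kmalloc-4096', 3)]
```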

Thanks,
Fengguang


On Wed, Aug 04, 2010 at 05:49:46AM +0800, Chris Webb wrote:
> Wu Fengguang <[email protected]> writes:
>
> > Chris, what's in your /proc/slabinfo?
>
> Hi. Sorry for the slow reply. The exact machine from which I previously
> extracted that /proc/memstat has unfortunately had swap turned off by a
> colleague while I was away, presumably because its behaviour because too
> bad. However, here is info from another member of the cluster, this time
> with 5GB of buffers and 2GB of swap in use, i.e. the same general problem:
>
> [meminfo, slabinfo and buddyinfo snipped]

2010-08-04 03:10:59

by Minchan Kim

Subject: Re: Over-eager swapping

On Wed, Aug 4, 2010 at 11:21 AM, Wu Fengguang <[email protected]> wrote:
> Chris,
>
> Your slabinfo does contain many order 1-3 slab caches, this is a major source
> of high order allocations and hence lumpy reclaim. fork() is another.
>
> In another thread, Pekka Enberg offers a tip:
>
>        You can pass "slub_debug=O" as a kernel parameter to disable higher
>        order allocations if you want to test things.
>
> Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
>
> Thanks,
> Fengguang

He said the following:
"After running swapoff -a, the machine is immediately much healthier. Even
while the swap is still being reduced, load goes down and response times in
virtual machines are much improved. Once the swap is completely gone, there
are still several gigabytes of RAM left free which are used for buffers, and
the virtual machines are no longer laggy because they are no longer swapped
out. Running swapon -a again, the affected machine waits for about a minute
with zero swap in use, before the amount of swap in use very rapidly
increases to around 2GB and then continues to increase more steadily to 3GB."

1. His system works well without swap.
2. His system's swap use increases by 2GB rapidly, then more steadily to 3GB.

So I thought it isn't likely to be related to normal lumpy reclaim.

Of course, without swap, lumpy reclaim can scan more file pages to make
contiguous page frames, so it could still work well. But I can't
understand 2.

Hmm, I have no idea. :(

Off-Topic:

Hi, Pekka.

The documentation says:
"Debugging options may require the minimum possible slab order to increase as
a result of storing the metadata (for example, caches with PAGE_SIZE object
sizes). This has a higher likelihood of resulting in slab allocation errors
in low memory situations or if there's high fragmentation of memory. To
switch off debugging for such caches by default, use

       slub_debug=O"

But when I tested this on my machine (2.6.34), with slub_debug=O it
increased objsize and pagesperslab. It even increased the number of slabs
(though I'm not sure about this part, since the measurements might not be
from the same time after booting).
What am I missing?

But SLAB seems to consume fewer pages than SLUB. Hmm.
Is SLAB a better fit than SLUB on small-memory systems (e.g. embedded)?


--
Kind regards,
Minchan Kim


Attachments:
slub_debug.log (5.65 kB)
slub_debug_disable.log (7.86 kB)

2010-08-04 03:24:06

by Fengguang Wu

Subject: Re: Over-eager swapping

On Wed, Aug 04, 2010 at 11:10:46AM +0800, Minchan Kim wrote:
> On Wed, Aug 4, 2010 at 11:21 AM, Wu Fengguang <[email protected]> wrote:
> > Chris,
> >
> > Your slabinfo does contain many order 1-3 slab caches, this is a major source
> > of high order allocations and hence lumpy reclaim. fork() is another.
> >
> > In another thread, Pekka Enberg offers a tip:
> >
> >        You can pass "slub_debug=O" as a kernel parameter to disable higher
> >        order allocations if you want to test things.
> >
> > Note that the parameter works on a CONFIG_SLUB_DEBUG=y kernel.
> >
> > Thanks,
> > Fengguang
>
> He said the following:
> "After running swapoff -a, the machine is immediately much healthier. Even
> while the swap is still being reduced, load goes down and response times in
> virtual machines are much improved. Once the swap is completely gone, there
> are still several gigabytes of RAM left free which are used for buffers, and
> the virtual machines are no longer laggy because they are no longer swapped
> out.
>
> Running swapon -a again, the affected machine waits for about a minute
> with zero swap in use,

This is interesting. Why is it waiting for 1m here? Are there high CPU
loads? Would you do a

echo t > /proc/sysrq-trigger

and show us the dmesg?

Thanks,
Fengguang

> before the amount of swap in use very rapidly
> increases to around 2GB and then continues to increase more steadily to 3GB."
>
> 1. His system works well without swap.
> 2. His system's swap use grows by 2GB rapidly, then more steadily to 3GB.
>
> So I don't think this is likely to be related to normal lumpy reclaim.
>
> Of course, without swap, lumpy reclaim can scan more file pages to make
> contiguous page frames, so the system could still work well. But I can't
> explain point 2.
>
> Hmm, I have no idea. :(
>
> Off-Topic:
>
> Hi, Pekka.
>
> The documentation says:
> "Debugging options may require the minimum possible slab order to increase as
> a result of storing the metadata (for example, caches with PAGE_SIZE object
> sizes).  This has a higher likelihood of resulting in slab allocation errors
> in low memory situations or if there's high fragmentation of memory.  To
> switch off debugging for such caches by default, use
>
>        slub_debug=O"
>
> But when I tested this on my machine (2.6.34), with slub_debug=O it
> increased objsize and pagesperslab. It even increased the number of slabs
> (though I'm not sure about this part, since the measurements might not be
> from the same time after booting).
> What am I missing?
>
> But SLAB seems to consume fewer pages than SLUB. Hmm.
> Is SLAB a better fit than SLUB on small-memory systems (e.g. embedded)?
>
>
> --
> Kind regards,
> Minchan Kim

>
> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kvm_vcpu 0 0 9200 3 8 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc_dma-512 16 16 512 16 2 : tunables 0 0 0 : slabdata 1 1 0
> RAWv6 17 17 960 17 4 : tunables 0 0 0 : slabdata 1 1 0
> UDPLITEv6 0 0 960 17 4 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 51 51 960 17 4 : tunables 0 0 0 : slabdata 3 3 0
> TCPv6 72 72 1728 18 8 : tunables 0 0 0 : slabdata 4 4 0
> nf_conntrack_c10a8540 0 0 280 29 2 : tunables 0 0 0 : slabdata 0 0 0
> dm_raid1_read_record 0 0 1056 31 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2464 13 8 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
> fuse_request 18 18 432 18 2 : tunables 0 0 0 : slabdata 1 1 0
> fuse_inode 21 21 768 21 4 : tunables 0 0 0 : slabdata 1 1 0
> nfsd4_stateowners 0 0 344 23 2 : tunables 0 0 0 : slabdata 0 0 0
> nfs_read_data 72 72 448 18 2 : tunables 0 0 0 : slabdata 4 4 0
> nfs_inode_cache 0 0 1040 31 8 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_inode_cache 0 0 1280 25 8 : tunables 0 0 0 : slabdata 0 0 0
> hugetlbfs_inode_cache 24 24 656 24 4 : tunables 0 0 0 : slabdata 1 1 0
> ext4_inode_cache 0 0 1128 29 8 : tunables 0 0 0 : slabdata 0 0 0
> ext2_inode_cache 0 0 944 17 4 : tunables 0 0 0 : slabdata 0 0 0
> ext3_inode_cache 5032 5032 928 17 4 : tunables 0 0 0 : slabdata 296 296 0
> rpc_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
> UNIX 532 532 832 19 4 : tunables 0 0 0 : slabdata 28 28 0
> UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> UDP 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
> TCP 60 60 1600 20 8 : tunables 0 0 0 : slabdata 3 3 0
> sgpool-128 48 48 2560 12 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-64 100 100 1280 25 8 : tunables 0 0 0 : slabdata 4 4 0
> blkdev_queue 76 76 1688 19 8 : tunables 0 0 0 : slabdata 4 4 0
> biovec-256 10 10 3072 10 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-128 21 21 1536 21 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-64 84 84 768 21 4 : tunables 0 0 0 : slabdata 4 4 0
> bip-256 10 10 3200 10 8 : tunables 0 0 0 : slabdata 1 1 0
> bip-128 0 0 1664 19 8 : tunables 0 0 0 : slabdata 0 0 0
> bip-64 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
> bip-16 100 100 320 25 2 : tunables 0 0 0 : slabdata 4 4 0
> sock_inode_cache 609 609 768 21 4 : tunables 0 0 0 : slabdata 29 29 0
> skbuff_fclone_cache 84 84 384 21 2 : tunables 0 0 0 : slabdata 4 4 0
> shmem_inode_cache 1835 1840 784 20 4 : tunables 0 0 0 : slabdata 92 92 0
> taskstats 96 96 328 24 2 : tunables 0 0 0 : slabdata 4 4 0
> proc_inode_cache 1584 1584 680 24 4 : tunables 0 0 0 : slabdata 66 66 0
> bdev_cache 72 72 896 18 4 : tunables 0 0 0 : slabdata 4 4 0
> inode_cache 7126 7128 656 24 4 : tunables 0 0 0 : slabdata 297 297 0
> signal_cache 332 350 640 25 4 : tunables 0 0 0 : slabdata 14 14 0
> sighand_cache 246 253 1408 23 8 : tunables 0 0 0 : slabdata 11 11 0
> task_xstate 193 196 576 28 4 : tunables 0 0 0 : slabdata 7 7 0
> task_struct 274 285 5472 5 8 : tunables 0 0 0 : slabdata 57 57 0
> radix_tree_node 3208 3213 296 27 2 : tunables 0 0 0 : slabdata 119 119 0
> kmalloc-8192 20 20 8192 4 8 : tunables 0 0 0 : slabdata 5 5 0
> kmalloc-4096 78 80 4096 8 8 : tunables 0 0 0 : slabdata 10 10 0
> kmalloc-2048 400 400 2048 16 8 : tunables 0 0 0 : slabdata 25 25 0
> kmalloc-1024 326 336 1024 16 4 : tunables 0 0 0 : slabdata 21 21 0
> kmalloc-512 758 784 512 16 2 : tunables 0 0 0 : slabdata 49 49 0

> slabinfo - version: 2.1
> # name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : tunables <limit> <batchcount> <sharedfactor> : slabdata <active_slabs> <num_slabs> <sharedavail>
> kvm_vcpu 0 0 9248 3 8 : tunables 0 0 0 : slabdata 0 0 0
> kmalloc_dma-512 29 29 560 29 4 : tunables 0 0 0 : slabdata 1 1 0
> clip_arp_cache 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> ip6_dst_cache 25 25 320 25 2 : tunables 0 0 0 : slabdata 1 1 0
> ndisc_cache 25 25 320 25 2 : tunables 0 0 0 : slabdata 1 1 0
> RAWv6 16 16 1024 16 4 : tunables 0 0 0 : slabdata 1 1 0
> UDPLITEv6 0 0 960 17 4 : tunables 0 0 0 : slabdata 0 0 0
> UDPv6 68 68 960 17 4 : tunables 0 0 0 : slabdata 4 4 0
> tw_sock_TCPv6 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> TCPv6 36 36 1792 18 8 : tunables 0 0 0 : slabdata 2 2 0
> nf_conntrack_c10a8540 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> dm_raid1_read_record 0 0 1096 29 8 : tunables 0 0 0 : slabdata 0 0 0
> kcopyd_job 0 0 376 21 2 : tunables 0 0 0 : slabdata 0 0 0
> dm_uevent 0 0 2504 13 8 : tunables 0 0 0 : slabdata 0 0 0
> dm_rq_target_io 0 0 272 30 2 : tunables 0 0 0 : slabdata 0 0 0
> mqueue_inode_cache 17 17 960 17 4 : tunables 0 0 0 : slabdata 1 1 0
> fuse_request 17 17 480 17 2 : tunables 0 0 0 : slabdata 1 1 0
> fuse_inode 19 19 832 19 4 : tunables 0 0 0 : slabdata 1 1 0
> nfsd4_stateowners 0 0 392 20 2 : tunables 0 0 0 : slabdata 0 0 0
> nfs_write_data 48 48 512 16 2 : tunables 0 0 0 : slabdata 3 3 0
> nfs_read_data 32 32 512 16 2 : tunables 0 0 0 : slabdata 2 2 0
> nfs_inode_cache 0 0 1080 30 8 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_key_record_cache 0 0 576 28 4 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_sb_cache 0 0 640 25 4 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_inode_cache 0 0 1280 25 8 : tunables 0 0 0 : slabdata 0 0 0
> ecryptfs_auth_tok_list_item 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
> hugetlbfs_inode_cache 23 23 696 23 4 : tunables 0 0 0 : slabdata 1 1 0
> ext4_inode_cache 0 0 1168 28 8 : tunables 0 0 0 : slabdata 0 0 0
> ext2_inode_cache 0 0 984 16 4 : tunables 0 0 0 : slabdata 0 0 0
> ext3_inode_cache 5391 5392 968 16 4 : tunables 0 0 0 : slabdata 337 337 0
> dquot 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> kioctx 0 0 384 21 2 : tunables 0 0 0 : slabdata 0 0 0
> rpc_buffers 30 30 2112 15 8 : tunables 0 0 0 : slabdata 2 2 0
> rpc_inode_cache 18 18 896 18 4 : tunables 0 0 0 : slabdata 1 1 0
> UNIX 556 558 896 18 4 : tunables 0 0 0 : slabdata 31 31 0
> UDP-Lite 0 0 832 19 4 : tunables 0 0 0 : slabdata 0 0 0
> ip_dst_cache 125 125 320 25 2 : tunables 0 0 0 : slabdata 5 5 0
> arp_cache 100 100 320 25 2 : tunables 0 0 0 : slabdata 4 4 0
> RAW 19 19 832 19 4 : tunables 0 0 0 : slabdata 1 1 0
> UDP 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
> TCP 76 76 1664 19 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-128 48 48 2624 12 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-64 96 96 1344 24 8 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-32 92 92 704 23 4 : tunables 0 0 0 : slabdata 4 4 0
> sgpool-16 84 84 384 21 2 : tunables 0 0 0 : slabdata 4 4 0
> blkdev_queue 72 72 1736 18 8 : tunables 0 0 0 : slabdata 4 4 0
> biovec-256 10 10 3136 10 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-128 20 20 1600 20 8 : tunables 0 0 0 : slabdata 1 1 0
> biovec-64 76 76 832 19 4 : tunables 0 0 0 : slabdata 4 4 0
> bip-256 10 10 3200 10 8 : tunables 0 0 0 : slabdata 1 1 0
> bip-128 0 0 1664 19 8 : tunables 0 0 0 : slabdata 0 0 0
> bip-64 0 0 896 18 4 : tunables 0 0 0 : slabdata 0 0 0
> bip-16 0 0 320 25 2 : tunables 0 0 0 : slabdata 0 0 0
> sock_inode_cache 629 630 768 21 4 : tunables 0 0 0 : slabdata 30 30 0
> skbuff_fclone_cache 72 72 448 18 2 : tunables 0 0 0 : slabdata 4 4 0
> shmem_inode_cache 1862 1862 824 19 4 : tunables 0 0 0 : slabdata 98 98 0
> taskstats 84 84 376 21 2 : tunables 0 0 0 : slabdata 4 4 0
> proc_inode_cache 1623 1650 720 22 4 : tunables 0 0 0 : slabdata 75 75 0
> bdev_cache 68 68 960 17 4 : tunables 0 0 0 : slabdata 4 4 0
> inode_cache 7125 7130 696 23 4 : tunables 0 0 0 : slabdata 310 310 0
> mm_struct 135 138 704 23 4 : tunables 0 0 0 : slabdata 6 6 0
> files_cache 142 150 320 25 2 : tunables 0 0 0 : slabdata 6 6 0
> signal_cache 229 230 704 23 4 : tunables 0 0 0 : slabdata 10 10 0
> sighand_cache 228 230 1408 23 8 : tunables 0 0 0 : slabdata 10 10 0
> task_xstate 195 200 640 25 4 : tunables 0 0 0 : slabdata 8 8 0
> task_struct 271 285 5520 5 8 : tunables 0 0 0 : slabdata 57 57 0
> radix_tree_node 3484 3504 336 24 2 : tunables 0 0 0 : slabdata 146 146 0
> kmalloc-8192 20 20 8192 4 8 : tunables 0 0 0 : slabdata 5 5 0
> kmalloc-4096 79 80 4096 8 8 : tunables 0 0 0 : slabdata 10 10 0
> kmalloc-2048 388 390 2096 15 8 : tunables 0 0 0 : slabdata 26 26 0
> kmalloc-1024 382 390 1072 30 8 : tunables 0 0 0 : slabdata 13 13 0
> kmalloc-512 796 812 560 29 4 : tunables 0 0 0 : slabdata 28 28 0
> kmalloc-256 153 156 304 26 2 : tunables 0 0 0 : slabdata 6 6 0

2010-08-04 10:00:20

by Chris Webb

Subject: Re: Over-eager swapping

Wu Fengguang <[email protected]> writes:

> This is interesting. Why is it waiting for 1m here? Are there high CPU
> loads? Would you do a
>
> echo t > /proc/sysrq-trigger
>
> and show us the dmesg?

Annoyingly, magic-sysrq isn't compiled in on these kernels. Is there another
way I can get this info for you? Replacing the kernels on the machines is a
painful job as I have to give the clients running on them quite a bit of
notice of the reboot, and I haven't been able to reproduce the problem on a
test machine.

I also think the swap use is much better following a reboot, and only starts
to spiral out of control after the machines have been running for a week or
so.

However, your suggestion is right that the CPU loads on these machines are
typically quite high. The large number of kvm virtual machines they run means
that loads of eight or even sixteen in /proc/loadavg are not unusual, and
these are higher when there's swap than after it has been removed. I assume
this is mostly because of increased IO wait, as this number increases
significantly in top.

Cheers,

Chris.

2010-08-04 11:49:48

by Fengguang Wu

Subject: Re: Over-eager swapping

On Wed, Aug 04, 2010 at 05:58:12PM +0800, Chris Webb wrote:
> Wu Fengguang <[email protected]> writes:
>
> > This is interesting. Why is it waiting for 1m here? Are there high CPU
> > loads? Would you do a
> >
> > echo t > /proc/sysrq-trigger
> >
> > and show us the dmesg?
>
> Annoyingly, magic-sysrq isn't compiled in on these kernels. Is there another
> way I can get this info for you? Replacing the kernels on the machines is a
> painful job as I have to give the clients running on them quite a bit of
> notice of the reboot, and I haven't been able to reproduce the problem on a
> test machine.

Maybe turn off KSM? It helps to isolate problems. It's a relatively new
and complex feature, after all.

> I also think the swap use is much better following a reboot, and only starts
> to spiral out of control after the machines have been running for a week or
> so.

Something deteriorates over a long time... It may take time to catch this bug.

> However, your suggestion is right that the CPU loads on these machines are
> typically quite high. The large number of kvm virtual machines they run means
> that loads of eight or even sixteen in /proc/loadavg are not unusual, and
> these are higher when there's swap than after it has been removed. I assume
> this is mostly because of increased IO wait, as this number increases
> significantly in top.

iowait = CPU (idle) waiting for disk IO

So iowait indicates not CPU load, but rather disk load :)

Thanks,
Fengguang
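
The distinction Fengguang draws can be read straight out of /proc/stat, where iowait is a separate column from busy CPU time. A sketch (the helper is mine, and the numbers below are illustrative, not from the thread):

```python
# Sketch (hypothetical helper): split a /proc/stat "cpu" line into its
# jiffy counters. iowait is idle time spent waiting for disk I/O,
# counted separately from user/system CPU time.
def iowait_fraction(cpu_line):
    """cpu_line format: 'cpu user nice system idle iowait irq softirq ...'"""
    fields = [int(f) for f in cpu_line.split()[1:]]
    iowait = fields[4]  # fifth counter is iowait
    return iowait / sum(fields)

# Illustrative numbers only:
print(round(iowait_fraction("cpu 4000 100 1500 8000 6000 50 350"), 2))  # 0.3
```

A high value here with low user/system time means the load average is being driven by processes blocked on disk (e.g. swap I/O), not by CPU work.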

2010-08-04 12:06:37

by Chris Webb

Subject: Re: Over-eager swapping

Wu Fengguang <[email protected]> writes:

> Maybe turn off KSM? It helps to isolate problems. It's a relatively new
> and complex feature, after all.

Good idea! I'll give that a go on one of the machines without swap at the
moment, re-add the swap with ksm turned off, and see what happens.

> > However, your suggestion is right that the CPU loads on these machines are
> > typically quite high. The large number of kvm virtual machines they run mean
> > thatl oads of eight or even sixteen in /proc/loadavg are not unusual, and
> > these are higher when there's swap than after it has been removed. I assume
> > this is mostly because of increased IO wait, as this number increases
> > significantly in top.
>
> iowait = CPU (idle) waiting for disk IO
>
> So iowait indicates not CPU load, but rather disk load :)

Sorry, yes, I wrote very unclearly here. What I should have written is that
the load numbers are fairly high even without swap, when the IO wait figure
is pretty small. This is presumably normal CPU load from the guests.

The load average rises significantly when swap is added, but I think that
rise is due to an increase in processes waiting for IO (io wait %age
increases considerably) rather than extra CPU work. Presumably this is the
IO from swapping.

Cheers,

Chris.

2010-08-18 14:38:57

by Fengguang Wu

Subject: Re: Over-eager swapping

Chris,

Did you enable any NUMA policy? That could start swapping even if
there are lots of free pages in some nodes.

Are your free pages equally distributed over the nodes? Or limited to
some of the nodes? Try this command:

grep MemFree /sys/devices/system/node/node*/meminfo

Thanks,
Fengguang
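
The output of that grep is easy to post-process; a sketch that parses it and reports the spread in free memory across nodes (the helper names are mine, the sample values are the ones later posted in this thread):

```python
# Sketch: parse 'grep MemFree /sys/devices/system/node/node*/meminfo'
# output and compare free memory across NUMA nodes.
import re

def node_free_kb(grep_output):
    """Map node id -> MemFree in kB."""
    free = {}
    for line in grep_output.splitlines():
        m = re.search(r'node(\d+)/meminfo:Node \d+ MemFree:\s+(\d+) kB', line)
        if m:
            free[int(m.group(1))] = int(m.group(2))
    return free

# Values posted later in the thread for the machine 3GB into swap:
sample = """\
/sys/devices/system/node/node0/meminfo:Node 0 MemFree: 954500 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemFree: 2374528 kB
"""
free = node_free_kb(sample)
print(free)
print(min(free.values()) / max(free.values()))  # node0 holds ~40% of node1's free memory
```

A ratio well below 1 is the imbalance Fengguang is probing for: one node can run dry and trigger reclaim while the other still has gigabytes free.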

2010-08-18 14:49:24

by Chris Webb

Subject: Re: Over-eager swapping

Wu Fengguang <[email protected]> writes:

> Did you enable any NUMA policy? That could start swapping even if
> there are lots of free pages in some nodes.

Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
NUMA behaviour, but NUMA support is definitely compiled into the kernel:

# zgrep NUMA /proc/config.gz
CONFIG_NUMA_IRQ_DESC=y
CONFIG_NUMA=y
CONFIG_K8_NUMA=y
CONFIG_X86_64_ACPI_NUMA=y
# CONFIG_NUMA_EMU is not set
CONFIG_ACPI_NUMA=y
# grep -i numa /var/log/dmesg.boot
NUMA: Allocated memnodemap from b000 - 1b540
NUMA: Using 20 for the hash shift.

> Are your free pages equally distributed over the nodes? Or limited to
> some of the nodes? Try this command:
>
> grep MemFree /sys/devices/system/node/node*/meminfo

My worst-case machines currently have swap completely turned off to make them
usable for clients, but I have one machine which is about 3GB into swap with
8GB of buffers and 3GB free. This shows

# grep MemFree /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemFree: 954500 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemFree: 2374528 kB

I could definitely imagine that one of the nodes could have dipped down to
zero in the past. I'll try enabling swap on one of our machines with the bad
problem late tonight and repeat the experiment. The node meminfo on this box
currently looks like

# grep MemFree /sys/devices/system/node/node*/meminfo
/sys/devices/system/node/node0/meminfo:Node 0 MemFree: 82732 kB
/sys/devices/system/node/node1/meminfo:Node 1 MemFree: 1723896 kB

Best wishes,

Chris.

2010-08-18 15:21:28

by Fengguang Wu

Subject: Re: Over-eager swapping

Andi, Christoph and Lee:

This looks like an "unbalanced NUMA memory usage leading to premature
swapping" problem.

Thanks,
Fengguang

On Wed, Aug 18, 2010 at 10:46:59PM +0800, Chris Webb wrote:
> Wu Fengguang <[email protected]> writes:
>
> > Did you enable any NUMA policy? That could start swapping even if
> > there are lots of free pages in some nodes.
>
> Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
> NUMA behaviour, but NUMA support is definitely compiled into the kernel:
>
> # zgrep NUMA /proc/config.gz
> CONFIG_NUMA_IRQ_DESC=y
> CONFIG_NUMA=y
> CONFIG_K8_NUMA=y
> CONFIG_X86_64_ACPI_NUMA=y
> # CONFIG_NUMA_EMU is not set
> CONFIG_ACPI_NUMA=y
> # grep -i numa /var/log/dmesg.boot
> NUMA: Allocated memnodemap from b000 - 1b540
> NUMA: Using 20 for the hash shift.
>
> > Are your free pages equally distributed over the nodes? Or limited to
> > some of the nodes? Try this command:
> >
> > grep MemFree /sys/devices/system/node/node*/meminfo
>
> My worst-case machines currently have swap completely turned off to make them
> usable for clients, but I have one machine which is about 3GB into swap with
> 8GB of buffers and 3GB free. This shows
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 954500 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 2374528 kB
>
> I could definitely imagine that one of the nodes could have dipped down to
> zero in the past. I'll try enabling swap on one of our machines with the bad
> problem late tonight and repeat the experiment. The node meminfo on this box
> currently looks like
>
> # grep MemFree /sys/devices/system/node/node*/meminfo
> /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 82732 kB
> /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 1723896 kB
>
> Best wishes,
>
> Chris.

2010-08-18 15:57:16

by Christoph Lameter

Subject: Re: Over-eager swapping

On Wed, 18 Aug 2010, Wu Fengguang wrote:

> Andi, Christoph and Lee:
>
> This looks like an "unbalanced NUMA memory usage leading to premature
> swapping" problem.

Is zone reclaim active? It may not activate on smaller systems, leading
to unbalanced memory usage between nodes.

2010-08-18 15:57:08

by Lee Schermerhorn

Subject: Re: Over-eager swapping

On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> Andi, Christoph and Lee:
>
> This looks like an "unbalanced NUMA memory usage leading to premature
> swapping" problem.

What is the value of the vm.zone_reclaim_mode sysctl? If it is !0, the
system will go into zone reclaim before allocating off-node pages.
However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
!= 0. And even then, zone reclaim should only reclaim file pages, not
anon. In theory...

Note: zone_reclaim_mode will be enabled by default [= 1] if the SLIT
contains any distances > 2.0 [20]. Check SLIT values via 'numactl
--hardware'.

Lee
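
The bitmask Lee is describing decodes as follows; a small sketch (helper name is mine, bit meanings per Documentation/sysctl/vm.txt):

```python
# Sketch decoding the vm.zone_reclaim_mode bitmask Lee describes.
# Bit values per Documentation/sysctl/vm.txt:
ZONE_RECLAIM_BITS = {
    1: "zone reclaim on",
    2: "zone reclaim writes dirty pages out",
    4: "zone reclaim swaps pages",
}

def describe_zone_reclaim(mode):
    """List the behaviours enabled by a given zone_reclaim_mode value."""
    return [desc for bit, desc in sorted(ZONE_RECLAIM_BITS.items()) if mode & bit]

print(describe_zone_reclaim(0))  # [] -- reclaim goes straight off-node
print(describe_zone_reclaim(5))  # reclaim on, and allowed to swap anon pages
```

So only a mode with bit 4 set would let zone reclaim touch anonymous pages, which is why mode 0 on Chris's machines makes the swapping here surprising.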

>
> Thanks,
> Fengguang
>
> On Wed, Aug 18, 2010 at 10:46:59PM +0800, Chris Webb wrote:
> > Wu Fengguang <[email protected]> writes:
> >
> > > Did you enable any NUMA policy? That could start swapping even if
> > > there are lots of free pages in some nodes.
> >
> > Hi. Thanks for the follow-up. We haven't done any configuration or tuning of
> > NUMA behaviour, but NUMA support is definitely compiled into the kernel:
> >
> > # zgrep NUMA /proc/config.gz
> > CONFIG_NUMA_IRQ_DESC=y
> > CONFIG_NUMA=y
> > CONFIG_K8_NUMA=y
> > CONFIG_X86_64_ACPI_NUMA=y
> > # CONFIG_NUMA_EMU is not set
> > CONFIG_ACPI_NUMA=y
> > # grep -i numa /var/log/dmesg.boot
> > NUMA: Allocated memnodemap from b000 - 1b540
> > NUMA: Using 20 for the hash shift.
> >
> > > Are your free pages equally distributed over the nodes? Or limited to
> > > some of the nodes? Try this command:
> > >
> > > grep MemFree /sys/devices/system/node/node*/meminfo
> >
> > My worst-case machines currently have swap completely turned off to make them
> > usable for clients, but I have one machine which is about 3GB into swap with
> > 8GB of buffers and 3GB free. This shows
> >
> > # grep MemFree /sys/devices/system/node/node*/meminfo
> > /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 954500 kB
> > /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 2374528 kB
> >
> > I could definitely imagine that one of the nodes could have dipped down to
> > zero in the past. I'll try enabling swap on one of our machines with the bad
> > problem late tonight and repeat the experiment. The node meminfo on this box
> > currently looks like
> >
> > # grep MemFree /sys/devices/system/node/node*/meminfo
> > /sys/devices/system/node/node0/meminfo:Node 0 MemFree: 82732 kB
> > /sys/devices/system/node/node1/meminfo:Node 1 MemFree: 1723896 kB
> >
> > Best wishes,
> >
> > Chris.
>

2010-08-18 16:00:55

by Chris Webb

Subject: Re: Over-eager swapping

Lee Schermerhorn <[email protected]> writes:

> On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> > Andi, Christoph and Lee:
> >
> > This looks like an "unbalanced NUMA memory usage leading to premature
> > swapping" problem.
>
> What is the value of the vm.zone_reclaim_mode sysctl? If it is !0, the
> system will go into zone reclaim before allocating off-node pages.
> However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
> != 0. And even then, zone reclaim should only reclaim file pages, not
> anon. In theory...

Hi. This is zero on all our machines:

# sysctl vm.zone_reclaim_mode
vm.zone_reclaim_mode = 0

Cheers,

Chris.

2010-08-18 16:13:10

by Christoph Lameter

Subject: Re: Over-eager swapping

On Wed, 18 Aug 2010, Chris Webb wrote:

> > != 0. And even then, zone reclaim should only reclaim file pages, not
> > anon. In theory...
>
> Hi. This is zero on all our machines:
>
> # sysctl vm.zone_reclaim_mode
> vm.zone_reclaim_mode = 0

Set it to 1.
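
For persistence across reboots, the setting Christoph suggests can also go in /etc/sysctl.conf (a config sketch; apply it live as root with `sysctl -w vm.zone_reclaim_mode=1`):

```
# /etc/sysctl.conf -- bit 1 turns zone reclaim on, so pages are reclaimed
# in-zone before the allocator falls back to another node
vm.zone_reclaim_mode = 1
```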

2010-08-18 16:14:12

by Fengguang Wu

Subject: Re: Over-eager swapping

On Wed, Aug 18, 2010 at 11:58:25PM +0800, Chris Webb wrote:
> Lee Schermerhorn <[email protected]> writes:
>
> > On Wed, 2010-08-18 at 23:21 +0800, Wu Fengguang wrote:
> > > Andi, Christoph and Lee:
> > >
> > > This looks like an "unbalanced NUMA memory usage leading to premature
> > > swapping" problem.
> >
> > What is the value of the vm.zone_reclaim_mode sysctl? If it is !0, the
> > system will go into zone reclaim before allocating off-node pages.
> > However, it shouldn't "swap" in this case unless (zone_reclaim_mode & 4)
> > != 0. And even then, zone reclaim should only reclaim file pages, not
> > anon. In theory...
>
> Hi. This is zero on all our machines:
>
> # sysctl vm.zone_reclaim_mode
> vm.zone_reclaim_mode = 0

Chris, can you post /proc/vmstat on the problem machines?

Thanks,
Fengguang

2010-08-18 16:20:27

by Fengguang Wu

Subject: Re: Over-eager swapping

On Wed, Aug 18, 2010 at 11:57:09PM +0800, Christoph Lameter wrote:
> On Wed, 18 Aug 2010, Wu Fengguang wrote:
>
> > Andi, Christoph and Lee:
> >
> > This looks like an "unbalanced NUMA memory usage leading to premature
> > swapping" problem.
>
> Is zone reclaim active? It may not activate on smaller systems leading
> to an unbalance memory usage between node.

Another possibility is that there are many low-watermark page allocations,
triggering kswapd page-out activity.

Thanks,
Fengguang

2010-08-18 16:33:56

by Chris Webb

Subject: Re: Over-eager swapping

Wu Fengguang <[email protected]> writes:

> Chris, can you post /proc/vmstat on the problem machines?

Here's /proc/vmstat from one of the bad machines with swap taken out:

# cat /proc/vmstat
nr_free_pages 115572
nr_inactive_anon 562140
nr_active_anon 5015609
nr_inactive_file 997097
nr_active_file 996989
nr_unevictable 1368
nr_mlock 1368
nr_anon_pages 5862299
nr_mapped 1414
nr_file_pages 1994569
nr_dirty 619
nr_writeback 0
nr_slab_reclaimable 88883
nr_slab_unreclaimable 129859
nr_page_table_pages 15744
nr_kernel_stack 1132
nr_unstable 0
nr_bounce 0
nr_vmscan_write 68708505
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 14
numa_hit 15295188815
numa_miss 9391232519
numa_foreign 9391232519
numa_interleave 16982
numa_local 15294742520
numa_other 9391678814
pgpgin 20644565778
pgpgout 28740368207
pswpin 63818244
pswpout 61199234
pgalloc_dma 0
pgalloc_dma32 4967135753
pgalloc_normal 19812671901
pgalloc_movable 0
pgfree 24779926775
pgactivate 1290396237
pgdeactivate 1289759899
pgfault 19993995783
pgmajfault 21059190
pgrefill_dma 0
pgrefill_dma32 133366009
pgrefill_normal 921184739
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 1275354745
pgsteal_normal 5641309780
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 1333139288
pgscan_kswapd_normal 5870516663
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 1064518
pgscan_direct_normal 13317302
pgscan_direct_movable 0
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 1682790400
kswapd_steal 6902288285
kswapd_inodesteal 4909342
pageoutrun 65408579
allocstall 33223
pgrotated 68402979
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 3538872
unevictable_pgs_scanned 0
unevictable_pgs_rescued 4989403
unevictable_pgs_mlocked 5192009
unevictable_pgs_munlocked 4989074
unevictable_pgs_cleared 2295
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0

The not-so-bad machine that I mentioned previously, which is 3GB into swap, has:

# cat /proc/vmstat
nr_free_pages 898394
nr_inactive_anon 834445
nr_active_anon 4118034
nr_inactive_file 904411
nr_active_file 910902
nr_unevictable 2440
nr_mlock 2440
nr_anon_pages 4836349
nr_mapped 1553
nr_file_pages 2243152
nr_dirty 1097
nr_writeback 0
nr_slab_reclaimable 88788
nr_slab_unreclaimable 127310
nr_page_table_pages 14762
nr_kernel_stack 532
nr_unstable 0
nr_bounce 0
nr_vmscan_write 37404214
nr_writeback_temp 0
nr_isolated_anon 0
nr_isolated_file 0
nr_shmem 12
numa_hit 14220178949
numa_miss 3903552922
numa_foreign 3903552922
numa_interleave 16282
numa_local 14219905325
numa_other 3903826546
pgpgin 6500403846
pgpgout 13255814979
pswpin 36384510
pswpout 36380545
pgalloc_dma 4
pgalloc_dma32 2019546454
pgalloc_normal 16466621455
pgalloc_movable 0
pgfree 18487068066
pgactivate 530670561
pgdeactivate 506674301
pgfault 19986735100
pgmajfault 10611234
pgrefill_dma 0
pgrefill_dma32 41306492
pgrefill_normal 318767138
pgrefill_movable 0
pgsteal_dma 0
pgsteal_dma32 214447663
pgsteal_normal 1645250232
pgsteal_movable 0
pgscan_kswapd_dma 0
pgscan_kswapd_dma32 218030201
pgscan_kswapd_normal 1812499810
pgscan_kswapd_movable 0
pgscan_direct_dma 0
pgscan_direct_dma32 157144
pgscan_direct_normal 1095919
pgscan_direct_movable 0
zone_reclaim_failed 0
pginodesteal 0
slabs_scanned 50051072
kswapd_steal 1858447127
kswapd_inodesteal 202297
pageoutrun 15070446
allocstall 3104
pgrotated 37181651
htlb_buddy_alloc_success 0
htlb_buddy_alloc_fail 0
unevictable_pgs_culled 2113384
unevictable_pgs_scanned 0
unevictable_pgs_rescued 3055005
unevictable_pgs_mlocked 3184675
unevictable_pgs_munlocked 3045129
unevictable_pgs_cleared 10034
unevictable_pgs_stranded 0
unevictable_pgs_mlockfreed 0

Best wishes,

Chris.
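
The numa_hit/numa_miss counters in these dumps quantify the imbalance directly; a sketch computing the off-node allocation ratio from the figures in the first dump (the helper is mine):

```python
# Sketch: fraction of page allocations that fell back to a remote NUMA
# node, from the numa_hit / numa_miss counters in /proc/vmstat.
def off_node_ratio(numa_hit, numa_miss):
    """numa_miss counts allocations satisfied from a non-preferred node."""
    return numa_miss / (numa_hit + numa_miss)

# Figures from the first /proc/vmstat dump above (the bad machine):
ratio = off_node_ratio(15295188815, 9391232519)
print(round(ratio, 2))  # roughly 0.38: over a third of allocations went off-node
```

That more than a third of all allocations missed their preferred node fits the "unbalanced NUMA usage leading to premature swapping" theory: the preferred node is routinely under pressure even while the other has memory free.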

2010-08-18 16:34:31

by Chris Webb

Subject: Re: Over-eager swapping

Christoph Lameter <[email protected]> writes:

> On Wed, 18 Aug 2010, Chris Webb wrote:
>
> > > != 0. And even then, zone reclaim should only reclaim file pages, not
> > > anon. In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
>
> Set it to 1.

I'll try this tonight: setting this to one and re-adding swap on one of the
problem machines.

Best wishes,

Chris.

2010-08-18 16:45:46

by Balbir Singh

Subject: Re: Over-eager swapping

* Chris Webb <[email protected]> [2010-08-02 13:47:35]:

> We run a number of relatively large x86-64 hosts with twenty or so qemu-kvm
> virtual machines on each of them, and I'm having some trouble with over-eager
> swapping on some (but not all) of the machines. This is resulting in
> customer reports of very poor response latency from the virtual machines
> which have been swapped out, despite the hosts apparently having large
> amounts of free memory, and running fine if swap is turned off.
>
> All of the hosts are running a 2.6.32.7 kernel and have ksm enabled with
> 32GB of RAM and 2x quad-core processors. There is a cluster of Xeon E5420
> machines which apparently doesn't exhibit the problem, and a cluster of
> 2352/2378 Opteron (NUMA) machines, some of which do. The kernel config of
> the affected machines is at
>
> http://cdw.me.uk/tmp/config-2.6.32.7
>
> This differs very little from the config on the unaffected Xeon machines,
> essentially just
>
> -CONFIG_MCORE2=y
> +CONFIG_MK8=y
> -CONFIG_X86_P6_NOP=y
>
> On a typical affected machine, the virtual machines and other processes
> would apparently leave around 5.5GB of RAM available for buffers, but the
> system seems to want to swap out 3GB of anonymous pages to give itself more
> like 9GB of buffers:
>
> # cat /proc/meminfo
> MemTotal: 33083420 kB
> MemFree: 693164 kB
> Buffers: 8834380 kB
> Cached: 11212 kB
> SwapCached: 1443524 kB
> Active: 21656844 kB
> Inactive: 8119352 kB
> Active(anon): 17203092 kB
> Inactive(anon): 3729032 kB
> Active(file): 4453752 kB
> Inactive(file): 4390320 kB
> Unevictable: 5472 kB
> Mlocked: 5472 kB
> SwapTotal: 25165816 kB
> SwapFree: 21854572 kB
> Dirty: 4300 kB
> Writeback: 4 kB
> AnonPages: 20780368 kB
> Mapped: 6056 kB
> Shmem: 56 kB
> Slab: 961512 kB
> SReclaimable: 438276 kB
> SUnreclaim: 523236 kB
> KernelStack: 10152 kB
> PageTables: 67176 kB
> NFS_Unstable: 0 kB
> Bounce: 0 kB
> WritebackTmp: 0 kB
> CommitLimit: 41707524 kB
> Committed_AS: 39870868 kB
> VmallocTotal: 34359738367 kB
> VmallocUsed: 150880 kB
> VmallocChunk: 34342404996 kB
> HardwareCorrupted: 0 kB
> HugePages_Total: 0
> HugePages_Free: 0
> HugePages_Rsvd: 0
> HugePages_Surp: 0
> Hugepagesize: 2048 kB
> DirectMap4k: 5824 kB
> DirectMap2M: 3205120 kB
> DirectMap1G: 30408704 kB
>
> We see this despite the machine having vm.swappiness set to 0 in an attempt
> to skew the reclaim as far as possible in favour of releasing page cache
> instead of swapping anonymous pages.
>
> After running swapoff -a, the machine is immediately much healthier. Even
> while the swap is still being reduced, load goes down and response times in
> virtual machines are much improved. Once the swap is completely gone, there
> are still several gigabytes of RAM left free which are used for buffers, and
> the virtual machines are no longer laggy because they are no longer swapped
> out. Running swapon -a again, the affected machine waits for about a minute
> with zero swap in use, before the amount of swap in use very rapidly
> increases to around 2GB and then continues to increase more steadily to 3GB.
>
> We could run these machines without swap (in the worst cases we're
> already doing so), but I'd prefer to have a reserve of swap available in
> case of genuine emergency. If it's a choice between swapping out a guest or
> oom-killing it, I'd prefer to swap... but I really don't want to swap out
> running virtual machines in order to have eight gigabytes of page cache
> instead of five!
>
> Is this a problem with the page reclaim priorities, or am I just tuning
> these hosts incorrectly? Is there more detailed info than /proc/meminfo
> available which might shed more light on what's going wrong here?
>

Can you give an idea of what the meminfo inside the guest looks like?
Have you looked at
http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772


--
Three Cheers,
Balbir

2010-08-19 05:13:49

by Balbir Singh

Subject: Re: Over-eager swapping

* Wu Fengguang <[email protected]> [2010-08-03 12:28:35]:

> On Tue, Aug 03, 2010 at 12:09:18PM +0800, Minchan Kim wrote:
> > On Tue, Aug 3, 2010 at 12:31 PM, Chris Webb <[email protected]> wrote:
> > > Minchan Kim <[email protected]> writes:
> > >
> > >> Another possibility is _zone_reclaim_ in NUMA.
> > >> Your working set has many anonymous pages.
> > >>
> > >> zone_reclaim sets its priority to ZONE_RECLAIM_PRIORITY.
> > >> That can make the reclaim mode lumpy, so it can page out anon pages.
> > >>
> > >> Could you show me /proc/sys/vm/[zone_reclaim_mode/min_unmapped_ratio] ?
> > >
> > > Sure, no problem. On the machine with the /proc/meminfo I showed earlier,
> > > these are
> > >
> > > # cat /proc/sys/vm/zone_reclaim_mode
> > > 0
> > > # cat /proc/sys/vm/min_unmapped_ratio
> > > 1
> >
> > if zone_reclaim_mode is zero, it doesn't swap out anon_pages.
>
> If there are lots of order-1 or higher allocations, anonymous pages
> will be randomly evicted, regardless of their LRU ages. This is
> probably another factor behind what the users are reporting. Are there
> easy ways to confirm this other than patching the kernel?
>
> Chris, what's in your /proc/slabinfo?
>
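
One quick way to act on Wu's slabinfo question is to look for slab caches
whose slabs span more than one page, since growing those forces order>=1
allocations. A rough sketch (the awk field position matches the slabinfo
2.1 format; the sample lines below are illustrative, not from the affected
host):

```shell
# Print slab caches that need higher-order page allocations to grow.
# In /proc/slabinfo (version 2.1), field 6 is <pagesperslab>; any
# value above 1 means each new slab is an order>=1 allocation.
high_order_slabs() {
    awk 'NR > 2 && $6 > 1 { printf "%s pages_per_slab=%d\n", $1, $6 }'
}

# Normally: high_order_slabs < /proc/slabinfo
# Illustrative input with made-up numbers:
high_order_slabs <<'EOF'
slabinfo - version: 2.1
# name <active_objs> <num_objs> <objsize> <objperslab> <pagesperslab> : ...
kmalloc-4096          120    120   4096    8    8 : tunables 0 0 0
kmalloc-64           4096   4096     64   64    1 : tunables 0 0 0
task_struct           210    210   5872    5    8 : tunables 0 0 0
EOF
```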

I don't know if Chris saw the link I pointed to earlier, but one of
the reclaim challenges with virtual machines is that cached memory
in the guest (in fact all memory) shows up as anonymous on the host.
If the guests are doing a lot of caching and the guest reclaim sees
no reason to evict the cache, the host will see pressure.

That is one of the reasons I wanted to see meminfo inside the guest if
possible. Setting swappiness to 0 inside the guest is one way of
avoiding double caching that might take place, but I've not found it
to be very effective.

Do we have reason to believe the problem can be solved entirely in the
host?

--
Three Cheers,
Balbir

2010-08-19 05:16:20

by Balbir Singh

Subject: Re: Over-eager swapping

* Christoph Lameter <[email protected]> [2010-08-18 11:13:03]:

> On Wed, 18 Aug 2010, Chris Webb wrote:
>
> > > != 0. And even then, zone reclaim should only reclaim file pages, not
> > > anon. In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
>
> Set it to 1.
>

Isn't that bad in terms of how we treat the cost of remote node
allocations? Is local zone_reclaim() always a good thing, or is it
something for Chris to try and see if it helps his situation?

--
Three Cheers,
Balbir

2010-08-19 09:28:03

by Chris Webb

Subject: Re: Over-eager swapping

Balbir Singh <[email protected]> writes:

> Can you give an idea of what the meminfo inside the guest looks like?

Sorry for the slow reply here. Unfortunately not, as these guests are run on
behalf of customers. They install them with operating systems of their
choice, and run them on our service.

> Have you looked at
> http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772

Yes, I've been watching these discussions with interest. Our application is
one where we have little to no control over what goes on inside the guests,
but these sorts of things definitely make sense where the two are under the
same administrative control.

> Do we have reason to believe the problem can be solved entirely in the
> host?

It's not clear to me why this should be difficult, given that the total size
of vm allocated to guests (and system processes) is always strictly less
than the total amount of RAM available in the host. I do understand that it
won't allow for as impressive overcommit (except by ksm) or be as efficient,
because file-backed guest pages won't get evicted by pressure in the host as
they are indistinguishable from anonymous pages.

After all, a solution that isn't ideal, but does work, is to turn off swap
completely! This is what we've been doing to date. The only problem with
this is that we can't dip into swap in an emergency if there's no swap there
at all.

Best wishes,

Chris.

2010-08-19 10:23:20

by Chris Webb

Subject: Re: Over-eager swapping

Christoph Lameter <[email protected]> writes:

> On Wed, 18 Aug 2010, Chris Webb wrote:
>
> > > != 0. And even then, zone reclaim should only reclaim file pages, not
> > > anon. In theory...
> >
> > Hi. This is zero on all our machines:
> >
> > # sysctl vm.zone_reclaim_mode
> > vm.zone_reclaim_mode = 0
>
> Set it to 1.

I tried this on a handful of the problem hosts before re-adding their swap.
One of them now runs without dipping into swap. The other three I tried had
the same behaviour of sitting at zero swap usage for a while, before
suddenly spiralling up with %wait going through the roof. I had to swapoff
on them to bring them back into a sane state. So it looks like it helps a
bit, but doesn't cure the problem.

I could definitely believe an explanation that we're swapping in preference
to allocating remote zone pages somehow, given the imbalance in free memory
between the nodes which we saw. However, the documentation for
vm.zone_reclaim_mode suggests to me that when it is set to zero, pages from
remote zones should be allocated automatically in preference to swapping,
given that zone_reclaim_mode & 4 == 0. Is that right?
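
The value being discussed is a small bitmask, which can be decoded along
these lines (bit meanings are from Documentation/sysctl/vm.txt; the helper
function name is illustrative):

```shell
# Decode vm.zone_reclaim_mode: 1 = zone reclaim on, 2 = may write out
# dirty file pages, 4 = may unmap and swap pages during zone reclaim.
decode_zone_reclaim_mode() {
    mode=$1
    if [ "$mode" -eq 0 ]; then
        echo "off: allocations fall back to remote nodes"
        return 0
    fi
    if [ $((mode & 1)) -ne 0 ]; then echo "zone reclaim enabled"; fi
    if [ $((mode & 2)) -ne 0 ]; then echo "may write out dirty file pages"; fi
    if [ $((mode & 4)) -ne 0 ]; then echo "may unmap and swap pages"; fi
    return 0
}

# Normally: decode_zone_reclaim_mode "$(cat /proc/sys/vm/zone_reclaim_mode)"
decode_zone_reclaim_mode 1
```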

Cheers,

Chris.

2010-08-19 15:14:24

by Balbir Singh

Subject: Re: Over-eager swapping

* Chris Webb <[email protected]> [2010-08-19 10:25:36]:

> Balbir Singh <[email protected]> writes:
>
> > Can you give an idea of what the meminfo inside the guest looks like?
>
> Sorry for the slow reply here. Unfortunately not, as these guests are run on
> behalf of customers. They install them with operating systems of their
> choice, and run them on our service.
>

Thanks for clarifying.

> > Have you looked at
> > http://kerneltrap.org/mailarchive/linux-kernel/2010/6/8/4580772
>
> Yes, I've been watching these discussions with interest. Our application is
> one where we have little to no control over what goes on inside the guests,
> but these sorts of things definitely make sense where the two are under the
> same administrative control.
>

Not necessarily, in some cases you can use a guest that uses less
page cache, but that might not matter in your case at the moment.

> > Do we have reason to believe the problem can be solved entirely in the
> > host?
>
> It's not clear to me why this should be difficult, given that the total size
> of vm allocated to guests (and system processes) is always strictly less
> than the total amount of RAM available in the host. I do understand that it
> won't allow for as impressive overcommit (except by ksm) or be as efficient,
> because file-backed guest pages won't get evicted by pressure in the host as
> they are indistinguishable from anonymous pages.
>
> After all, a solution that isn't ideal, but does work, is to turn off swap
> completely! This is what we've been doing to date. The only problem with
> this is that we can't dip into swap in an emergency if there's no swap there
> at all.

If you are not overcommitting, it should work. In my experiments I've
seen a lot of memory used by the host as page cache on behalf of the
guest; I've done my experiments using cgroups to identify accurate
usage.
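
Balbir's cgroup approach can be sketched roughly as follows for a
2.6.32-era (cgroup v1) host; the mount point, group name, and $QEMU_PID
are illustrative, not from the thread:

```shell
# Put a guest's qemu-kvm process into its own memory cgroup so its
# real usage (rss vs cache) can be read back accurately.
mkdir -p /cgroup/memory
mount -t cgroup -o memory none /cgroup/memory
mkdir /cgroup/memory/guest1
echo "$QEMU_PID" > /cgroup/memory/guest1/tasks

# Accurate per-guest accounting:
cat /cgroup/memory/guest1/memory.usage_in_bytes
grep -E '^(rss|cache) ' /cgroup/memory/guest1/memory.stat
```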

--
Three Cheers,
Balbir

2010-08-19 19:03:50

by Christoph Lameter

Subject: Re: Over-eager swapping

On Thu, 19 Aug 2010, Chris Webb wrote:

> I tried this on a handful of the problem hosts before re-adding their swap.
> One of them now runs without dipping into swap. The other three I tried had
> the same behaviour of sitting at zero swap usage for a while, before
> suddenly spiralling up with %wait going through the roof. I had to swapoff
> on them to bring them back into a sane state. So it looks like it helps a
> bit, but doesn't cure the problem.
>
> I could definitely believe an explanation that we're swapping in preference
> to allocating remote zone pages somehow, given the imbalance in free memory
> between the nodes which we saw. However, the documentation for
> vm.zone_reclaim_mode suggests to me that when it is set to zero, pages from
> remote zones should be allocated automatically in preference to swapping,
> given that zone_reclaim_mode & 4 == 0. Is that right?

If zone reclaim is off then pages from other nodes will be allocated if a
node is filled up with page cache.

Zone reclaim typically only evicts clean page cache pages in order to keep
the additional overhead down. Enabling swapping allows a more aggressive
form of recovering memory in preference to going off-node.

The VM should work fine even without zone reclaim.
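
With zone reclaim off, the fallback Christoph describes shows up in the
per-node counters under /sys/devices/system/node/node*/numastat: numa_miss
grows as one node's allocations spill onto another. A small sketch (the
counter values in the sample input are made up for illustration):

```shell
# Summarise what fraction of a node's allocations were remote
# fallbacks, from numastat-style "name value" pairs.
numa_fallback_summary() {
    awk '$1 == "numa_hit"  { hit  = $2 }
         $1 == "numa_miss" { miss = $2 }
         END { printf "miss_pct=%.1f\n", 100 * miss / (hit + miss) }'
}

# Normally: numa_fallback_summary < /sys/devices/system/node/node0/numastat
numa_fallback_summary <<'EOF'
numa_hit 900000
numa_miss 100000
numa_foreign 100000
interleave_hit 0
local_node 900000
other_node 100000
EOF
```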