2010-11-12 16:20:23

by Peter Schüller

Subject: Sudden and massive page cache eviction

Hello,

We have been seeing sudden and repeated evictions of huge amounts of
page cache on some of our servers for reasons that we cannot explain.
We are hoping that someone familiar with the vm subsystem may be able
to shed some light on the issue and perhaps confirm whether it is
plausibly a kernel bug or not. I will try to present the information
most-important-first, but this post will unavoidably be a bit long -
sorry.

First, here is a good example of the symptom (more graphs later on):

http://files.spotify.com/memcut/b_daily_allcut.png

After looking into this we have seen similar incidents on servers
running completely different software; but in this particular case
this machine is running a service which is heavily dependent on the
buffer cache to deal with incoming request load. The direct effect is
that we end up in complete I/O saturation (average queue depth goes to
150-250 and stays there indefinitely, or until we actively intervene,
e.g. by warming up caches). Our interpretation is that the eviction is
not the result of something along the lines of a large file being
removed; given the effects on I/O load it is clear that the data being
evicted is in fact part of the active set used by the service running
on the machine.

The I/O load on these systems comes mainly from three things:

(1) Seek-bound I/O generated by lookups in a BDB (b-tree traversal).
(2) Seek-bound I/O generated by traversal of prefix directory trees
(i.e., 00/01/0001334234...., a poor man's b-tree on top of ext3).
(3) Seek-bound I/O reading small segments of small-to-medium sized
files contained in the prefix tree.

The prefix tree consists of 8*2^16 directory entries in total, with
individual files numbering in the tens of millions per server.
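
To illustrate the layout (the exact naming scheme here is simplified,
so treat it as a sketch rather than a description of our real code), a
file ends up in a two-level directory derived from the leading
characters of its name:

# hypothetical helper; "id" stands in for the real file names
prefix_path() {
    local id="$1"
    echo "${id:0:2}/${id:2:2}/${id}"
}
# prefix_path 0001334234...  ->  00/01/0001334234...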

We initially ran 2.6.32-bpo.5-amd64 (Debian backports kernel) and have
subsequently upgraded some of them to 2.6.36-rc6-amd64 (Debian
experimental repo). While it initially looked like it was behaving
better, it slowly reverted to not making a difference (maybe as a
function of uptime, but we have not had the opportunity to test this
by re-booting some of them so it is an untested hypothesis).

Most of the activity on this system (ignoring the usual stuff like
ssh/cron/syslog/etc) is coming from Python processes that consume
non-trivial amounts of heap space, plus the disk activity and some
POSIX shared memory caching utilized by the BDB library.

We have correlated the incidence of these page evictions with higher
load on the system; i.e., they tend to happen during high-load periods,
and in addition we tend to see additional machines having problems as
a result of us "fixing" a machine that experienced an eviction (we
have some limited cascading effects that cause slightly higher load
on other servers in the cluster when we do that).

We believe the most plausible way an application bug could trigger
this behavior would require that (1) the application allocates the
memory, and (2) actually touches the pages. We believe this to be
unlikely in this case because:

(1) We see similar sudden evictions on various other servers, which
we noticed when we started looking for them.
(2) The fact that it tends to trigger in correlation with load suggests
that it is not a functional bug in the service as such, since higher
load is in this case unlikely to trigger any code paths that do
anything unique with respect to memory allocation - in particular
because the domain logic is all Python, and none of it really deals
with data chunks.
(3) If we did manage to allocate something in the Python heap, we
would have to be "lucky" (or unlucky) for Python to consistently be
able to munmap()/brk() back down afterwards.

Some additional "sample" graphs showing a few instances of the problem:

http://files.spotify.com/memcut/a_daily.png
http://files.spotify.com/memcut/a_weekly.png
http://files.spotify.com/memcut/b_daily_allcut.png
http://files.spotify.com/memcut/c_monthly.png
http://files.spotify.com/memcut/c_yearly.png
http://files.spotify.com/memcut/d_monthly.png
http://files.spotify.com/memcut/d_yearly.png
http://files.spotify.com/memcut/a_monthly.png
http://files.spotify.com/memcut/a_yearly.png
http://files.spotify.com/memcut/c_daily.png
http://files.spotify.com/memcut/c_weekly.png
http://files.spotify.com/memcut/d_daily.png
http://files.spotify.com/memcut/d_weekly.png

And here is an example from a server only running PostgreSQL (where a
sudden drop of gigabytes of page cache is hard to explain, because we
are not DROP:ing tables, nor do we have multi-gigabyte WAL archive
sizes, nor do we have a use case which would imply ftruncate() on
table files):

http://files.spotify.com/memcut/postgresql_weekly.png

As you can see it's not as significant there, but it seems, at least
visually, to be the same "type" of effect. We've seen similar behavior
on various machines, although depending on the service running it may
or may not be explainable by regular file removal.

Further, we have observed the kernel's unwillingness to retain data in
the page cache under interesting circumstances:

(1) page cache eviction happens
(2) we warm up our BDB files by cat:ing them (simple but effective;
see the sketch below)
(3) within a matter of minutes, while there are still several GB of
free memory (truly free, not page cache), the files are evicted again
(as evidenced by re-cat:ing them a little while later)
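
The warm-up itself is nothing fancy - roughly the following, with the
path being a placeholder for wherever the BDB files actually live:

# read the BDB files sequentially to pull them back into the page cache
for f in /path/to/bdb/*.db; do
    cat "$f" > /dev/null
done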

We understand that this last observation (3) may be due to NUMA-related
allocation issues, and we should probably try to use numactl to ask
for a more even allocation. We have not yet tried this. However, it is
not clear how any issue of that kind would cause sudden eviction of
data already *in* the page cache (on whichever node).

--
/ Peter Schuller aka scode


2010-11-23 00:12:37

by Andrew Morton

Subject: Re: Sudden and massive page cache eviction

(cc linux-mm)

On Fri, 12 Nov 2010 17:20:21 +0100
Peter Schüller <[email protected]> wrote:

> [snip - original message quoted in full]

2010-11-23 08:38:29

by Dave Hansen

Subject: Re: Sudden and massive page cache eviction

On Mon, 2010-11-22 at 16:11 -0800, Andrew Morton wrote:
> > We understand that this last observation (3) may be due to NUMA-related
> > allocation issues, and we should probably try to use numactl to ask
> > for a more even allocation. We have not yet tried this. However, it is
> > not clear how any issue of that kind would cause sudden eviction of
> > data already *in* the page cache (on whichever node).

You don't have anybody messing with /proc/sys/vm/drop_caches, do you?

That can cause massive, otherwise unprovoked page cache eviction.

-- Dave

2010-11-23 09:44:43

by Peter Schüller

Subject: Re: Sudden and massive page cache eviction

> You don't have anybody messing with /proc/sys/vm/drop_caches, do you?

Highly unlikely given that (1) evictions, while often very
significant, are usually not *complete* (although the first graph
example I provided had a more or less complete eviction) and (2) the
evictions are not obviously periodic indicating some kind of cron job,
and (3) we see the evictions happening across a wide variety of
machines.

So yes, I feel confident that we are not accidentally doing that.

(FWIW though, drop_caches is great. I only recently found out about
it, and it's really useful when benchmarking.)
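
For anyone else reading along, the way we use it for benchmarking is
roughly the following, run as root before each cold-cache run:

sync                                # flush dirty data first
echo 3 > /proc/sys/vm/drop_caches   # drop page cache plus dentries/inodes

Writing 1 instead drops only the page cache, and 2 only dentries and
inodes.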

--
/ Peter Schuller aka scode

2010-11-23 16:19:53

by Dave Hansen

Subject: Re: Sudden and massive page cache eviction

On Tue, 2010-11-23 at 10:44 +0100, Peter Schüller wrote:
> > You don't have anybody messing with /proc/sys/vm/drop_caches, do you?
>
> Highly unlikely given that (1) evictions, while often very
> significant, are usually not *complete* (although the first graph
> example I provided had a more or less complete eviction) and (2) the
> evictions are not obviously periodic indicating some kind of cron job,
> and (3) we see the evictions happening across a wide variety of
> machines.
>
> So yes, I feel confident that we are not accidentally doing that.

Yeah, drop_caches doesn't seem very likely.

Your postgres data looks the cleanest and is probably the easiest to
analyze. Might as well start there:

http://files.spotify.com/memcut/postgresql_weekly.png

As you said, it might not be the same as the others, but it's a decent
place to start. If someone used drop_caches or if someone was randomly
truncating files, we'd expect to see the active/inactive lines both drop
by relatively equivalent amounts, and see them happen at _exactly_ the
same time as the cache eviction. The eviction about 1/3 of the way
through Wednesday in the above graph kinda looks this way, but it's the
exception.

Just eyeballing it, _most_ of the evictions seem to happen after some
movement in the active/inactive lists. We see an "inactive" uptick as
we start to launder pages, and the page activation doesn't keep up with
it. This is a _bit_ weird since we don't see any slab cache or other
users coming to fill the new space. Something _wanted_ the memory, so
why isn't it being used?

Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
allocations happening in the kernel?

If you could start recording /proc/{vmstat,buddyinfo,meminfo,slabinfo},
it would be immensely useful. The munin graphs are really great, but
they don't have the detail which you can get from stuff like vmstat.

> Further, we have observed the kernel's unwillingness to retain data in
> the page cache under interesting circumstances:
>
> (1) page cache eviction happens
> (2) we warm up our BDB files by cat:ing them (simple but effective)
> (3) within a matter of minutes, while there are still several GB of
> free memory (truly free, not page cache), the files are evicted again
> (as evidenced by re-cat:ing them a little while later)
>
> We understand that this last observation (3) may be due to NUMA-related
> allocation issues, and we should probably try to use numactl to ask
> for a more even allocation. We have not yet tried this. However, it is
> not clear how any issue of that kind would cause sudden eviction of
> data already *in* the page cache (on whichever node).

For a page-cache-heavy workload where you care a lot more about things
being _in_ cache rather than having good NUMA locality, you probably
want "zone_reclaim_mode" set to 0:

http://www.kernel.org/doc/Documentation/sysctl/vm.txt

That'll be a bit more comprehensive than messing with numactl. It
really is the best thing if you just don't care about NUMA latencies all
that much. What kind of hardware is this, btw?

-- Dave

2010-11-24 14:02:35

by Peter Schüller

Subject: Re: Sudden and massive page cache eviction

Hello,

first of all, thank you very much for taking the time to analyze the situation!

> Yeah, drop_caches doesn't seem very likely.
>
> Your postgres data looks the cleanest and is probably the easiest to
> analyze.  Might as well start there:
>
>        http://files.spotify.com/memcut/postgresql_weekly.png

Since you wanted to look at that primarily I re-visited that
particular case to confirm what was really happening. Unfortunately I
have to retract my claim here, because it turns out that we have
backups running locally on the machine before shipping them away, and
it seems that indeed the cache evictions on that machine are
correlated with the removal of said backup after it was shipped away
(and testing confirms the behavior).

Of course that is entirely expected (that removal of a recently
written file will cause a sudden spike in free memory) so the
PostgreSQL graph is a red herring.

This was unfortunate, and a result of me picking that host fairly
ad hoc for the purpose of summarizing the situation in my post to the
list. We have spent considerable time trying to make sure that the
evictions are indeed anomalous and not e.g. due to a file removal in
the case of the actual service where we are negatively affected, but I
was not sufficiently careful before proclaiming that we seem to see
similar effects on other hosts.

It may still be the case, but I am not finding a case which is
sufficiently clear at this time (being sure requires really looking at
what is going on with each class of machine and eliminating various
forms of backups, log rotation and other behavior exhibited). This
also means that in general it's not certain that we are in fact seeing
this behavior on others at all.

However, all the other observations still stand, including the very
direct correlation in time with load spikes, the lack of correlation
with backup jobs, and the fact that the eviction seems to be of
actively used data given the resulting I/O storm. So I feel confident
in saying that we definitely do have an actual issue (although as
previously indicated we have not proven conclusively that there is
absolutely no userspace application allocating and touching lots of
pages suddenly, but it seems very unlikely).

> Just eyeballing it, _most_ of the evictions seem to happen after some
> movement in the active/inactive lists.  We see an "inactive" uptick as
> we start to launder pages, and the page activation doesn't keep up with
> it.  This is a _bit_ weird since we don't see any slab cache or other
> users coming to fill the new space.  Something _wanted_ the memory, so
> why isn't it being used?

In this case we have the writing of a backup file (with corresponding
page touching for reading data). This is followed by a period of
reading the recently written file, followed by it being removed.

> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
> allocations happening in the kernel?

No; we're not using huge pages at all (not consciously). Looking at
/proc/meminfo I can confirm that we just see this:

HugePages_Total: 0
HugePages_Free: 0
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: 2048 kB

So hopefully that should not be a factor.

> If you could start recording /proc/{vmstat,buddyinfo,meminfo,slabinfo},
> it would be immensely useful.  The munin graphs are really great, but
> they don't have the detail which you can get from stuff like vmstat.

Absolutely. I'll get some recording of those going and run for
sufficient duration to correlate with page evictions.
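
Concretely, I am thinking of something along these lines (a rough
sketch; one timestamped snapshot directory per minute):

# snapshot the relevant /proc files once a minute
while true; do
    d="memstat/$(date +%FT%T)"
    mkdir -p "$d"
    for f in vmstat buddyinfo meminfo slabinfo; do
        cat /proc/$f > "$d/$f"
    done
    sleep 60
done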

> For a page-cache-heavy workload where you care a lot more about things
> being _in_ cache rather than having good NUMA locality, you probably
> want "zone_reclaim_mode" set to 0:
>
>        http://www.kernel.org/doc/Documentation/sysctl/vm.txt
>
> That'll be a bit more comprehensive than messing with numactl.  It
> really is the best thing if you just don't care about NUMA latencies all
> that much.

Thanks! That looks to be exactly what we would like in this case and,
if I interpret you and the documentation correctly, obviates the need to
ask for interleaved allocation.
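
If I understand correctly, applying it should be as simple as:

sysctl vm.zone_reclaim_mode        # check the current value
sysctl -w vm.zone_reclaim_mode=0   # disable zone reclaim
# (equivalently: echo 0 > /proc/sys/vm/zone_reclaim_mode; an entry in
# /etc/sysctl.conf makes it stick across reboots)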

>  What kind of hardware is this, btw?

It varies somewhat in age, but all of them are Intel. The oldest ones
have 16 GB of memory and are of this CPU type:

cpu family : 6
model : 23
model name : Intel(R) Xeon(R) CPU L5420 @ 2.50GHz
stepping : 6
cpu MHz : 2494.168
cache size : 6144 KB

While newer ones have ~36 GB memory and are of this CPU type:

cpu family : 6
model : 26
model name : Intel(R) Xeon(R) CPU E5504 @ 2.00GHz
stepping : 5
cpu MHz : 2000.049
cache size : 4096 KB

Some variation beyond that may exist, but that is the span (and all
are Intel, 8 cores or more).

numactl --show on older machines:

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0
nodebind: 0
membind: 0

And on newer machines:

policy: default
preferred node: current
physcpubind: 0 1 2 3 4 5 6 7
cpubind: 0 1
nodebind: 0 1
membind: 0 1

--
/ Peter Schuller aka scode

2010-11-24 14:14:48

by Peter Schüller

Subject: Re: Sudden and massive page cache eviction

>> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
>> allocations happening in the kernel?

I forgot to address the second part of this question: How would I best
inspect whether the kernel is doing that?

Looking at the kmalloc() sizes from vmstat -m I have the following on
one of the machines (so very few larger than 4096). But I suspect you
are asking for something different?

kmalloc-8192      52     56   8192    4
kmalloc-4096   33927  62040   4096    8
kmalloc-2048     338    416   2048   16
kmalloc-1024   76211 246976   1024   32
kmalloc-512     1134   1216    512   32
kmalloc-256   109523 324928    256   32
kmalloc-128     3902   4288    128   32
kmalloc-64    105296 105536     64   64
kmalloc-32      2120   2176     32  128
kmalloc-16      4607   4608     16  256
kmalloc-8       6655   6656      8  512
kmalloc-192     6546   9030    192   21
kmalloc-96     29694  32298     96   42

--
/ Peter Schuller aka scode

2010-11-24 14:20:32

by Pekka Enberg

Subject: Re: Sudden and massive page cache eviction

Hi Peter,

2010/11/24 Peter Schüller <[email protected]>:
>>> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
>>> allocations happening in the kernel?
>
> I forgot to address the second part of this question: How would I best
> inspect whether the kernel is doing that?

You can, for example, record

cat /proc/meminfo | grep Huge

for large page allocations.

> Looking at the kmalloc() sizes from vmstat -m I have the following on
> one of the machines (so very few larger than 4096). But I suspect you
> are asking for something different?

The "pagesperslab" column of /proc/slabinfo tells you how many pages
slab allocates from the page allocator.
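
For example, assuming the usual slabinfo 2.x layout (where pagesperslab
is the sixth field), this lists the caches that grab more than one page
per slab, i.e. the ones whose growth needs higher-order allocations:

awk 'NR > 2 && $6 > 1 { printf "%-24s pagesperslab=%s\n", $1, $6 }' \
    /proc/slabinfo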

Pekka

2010-11-24 15:32:42

by Peter Schüller

Subject: Re: Sudden and massive page cache eviction

>> I forgot to address the second part of this question: How would I best
>> inspect whether the kernel is doing that?
>
> You can, for example, record
>
>  cat /proc/meminfo | grep Huge
>
> for large page allocations.

Those show zero as per my other post. However, I got the impression Dave
was asking about regular but larger-than-one-page allocations internal
to the kernel, while the Huge* lines in /proc/meminfo refer to
allocations specifically done by userland applications doing huge page
allocation on a system with huge pages enabled - or am I confused?

> The "pagesperslab" column of /proc/slabinfo tells you how many pages
> slab allocates from the page allocator.

Seems to be what vmstat -m reports.

--
/ Peter Schuller aka scode

2010-11-24 17:32:14

by Dave Hansen

Subject: Re: Sudden and massive page cache eviction

On Wed, 2010-11-24 at 15:14 +0100, Peter Schüller wrote:
> >> Do you have any large page (hugetlbfs) or other multi-order (> 1 page)
> >> allocations happening in the kernel?
>
> I forgot to address the second part of this question: How would I best
> inspect whether the kernel is doing that?

I found out yesterday how to do it with tracing, but it's not a horribly
simple thing to do in any case. You can watch the entries in slabinfo
and see if any of the ones with sizes over 4096 bytes are getting used
often. You can also watch /proc/buddyinfo and see how often columns
other than the first couple are moving around.
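
Something as simple as

watch -d -n 1 cat /proc/buddyinfo

is usually enough; -d highlights the columns that changed between
samples, so churn in the higher orders stands out.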

Jumbo ethernet frames would be the most common reason to see these
allocations. It's _probably_ not an issue in your case.

-- Dave

2010-11-24 17:46:30

by Pekka Enberg

Subject: Re: Sudden and massive page cache eviction

Hi Peter,

2010/11/24 Peter Schüller <[email protected]>:
>>> I forgot to address the second part of this question: How would I best
>>> inspect whether the kernel is doing that?
>>
>> You can, for example, record
>>
>>  cat /proc/meminfo | grep Huge
>>
>> for large page allocations.
>
> Those show zero as per my other post. However, I got the impression Dave
> was asking about regular but larger-than-one-page allocations internal
> to the kernel, while the Huge* lines in /proc/meminfo refer to
> allocations specifically done by userland applications doing huge page
> allocation on a system with huge pages enabled - or am I confused?

He was asking about both (large page allocations and higher order allocations).

>> The "pagesperslab" column of /proc/slabinfo tells you how many pages
>> slab allocates from the page allocator.
>
> Seems to be what vmstat -m reports.

No, "vmstat -m" reports the _total number_ of pages allocated. We're
interested in how many pages the slab allocator grabs whenever it needs
to allocate memory for a new slab. That's represented by the
"pagesperslab" column of /proc/slabinfo, from which you can deduce the
page allocation order.

Pekka

2010-11-25 01:18:53

by Simon Kirby

Subject: Re: Sudden and massive page cache eviction

On Wed, Nov 24, 2010 at 04:32:39PM +0100, Peter Schüller wrote:

> >> I forgot to address the second part of this question: How would I best
> >> inspect whether the kernel is doing that?
> >
> > You can, for example, record
> >
> >  cat /proc/meminfo | grep Huge
> >
> > for large page allocations.
>
> Those show zero as per my other post. However, I got the impression Dave
> was asking about regular but larger-than-one-page allocations internal
> to the kernel, while the Huge* lines in /proc/meminfo refer to
> allocations specifically done by userland applications doing huge page
> allocation on a system with huge pages enabled - or am I confused?

Your page cache dents don't seem quite as big, so it may be something
else, but if it's the same problem we're seeing here, it seems to have to
do with when an order=3 new_slab allocation comes in to grow the kmalloc
slab cache for an __alloc_skb (network packet). This is normal even
without jumbo frames now. When there are no zones with order=3
zone_watermark_ok(), kswapd is woken, which frees things all over the
place to try to get zone_watermark_ok(order=3) to be happy.

We're seeing this throw out a huge number of pages, and we're seeing it
happen even with lots of memory free in the zone. CONFIG_COMPACTION also
currently does not help because try_to_compact_pages() returns early with
COMPACT_SKIPPED if order <= PAGE_ALLOC_COSTLY_ORDER, and, you guessed it,
PAGE_ALLOC_COSTLY_ORDER is set to 3.

I reimplemented zone_pages_ok(order=3) in userspace, and I can see it
happen:

Code here: http://0x.ca/sim/ref/2.6.36/buddyinfo_scroll

Zone order:0 1 2 3 4 5 6 7 8 9 A nr_free state

DMA32 19026 33652 4897 13 5 1 2 0 0 0 0 106262 337 <= 256
Normal 450 0 0 0 0 0 0 0 0 0 0 450 -7 <= 238
DMA32 19301 33869 4665 12 5 1 2 0 0 0 0 106035 329 <= 256
Normal 450 0 0 0 0 0 0 0 0 0 0 450 -7 <= 238
DMA32 19332 33931 4603 9 5 1 2 0 0 0 0 105918 305 <= 256
Normal 450 0 0 0 0 0 0 0 0 0 0 450 -7 <= 238
DMA32 19467 34057 4468 6 5 1 2 0 0 0 0 105741 281 <= 256
Normal 450 0 0 0 0 0 0 0 0 0 0 450 -7 <= 238
DMA32 19591 34181 4344 5 5 1 2 0 0 0 0 105609 273 <= 256
Normal 450 0 0 0 0 0 0 0 0 0 0 450 -7 <= 238
DMA32 19856 34348 4109 2 5 1 2 0 0 0 0 105244 249 <= 256 !!!
Normal 450 0 0 0 0 0 0 0 0 0 0 450 -7 <= 238
DMA32 24088 36476 5437 144 5 1 2 0 0 0 0 120180 1385 <= 256
Normal 1024 1 0 0 0 0 0 0 0 0 0 1026 -5 <= 238
DMA32 26453 37440 6676 623 53 1 2 0 0 0 0 134029 5985 <= 256
Normal 8700 100 0 0 0 0 0 0 0 0 0 8900 193 <= 238
DMA32 48881 38161 7142 966 81 1 2 0 0 0 0 162955 9177 <= 256
Normal 8936 102 0 1 0 0 0 0 0 0 0 9148 205 <= 238
DMA32 66046 40051 7871 1409 135 2 2 0 0 0 0 191256 13617 <= 256
Normal 9019 18 0 0 0 0 0 0 0 0 0 9055 29 <= 238
DMA32 67133 48671 8231 1578 143 2 2 0 0 0 0 212503 15097 <= 256

So, kswapd was woken up at the line that ends in "!!!" there, because
free_pages(249) <= min(256), and so zone_watermark_ok() returned 0, when
an order=3 allocation came in.

Maybe try out that script and see if you see something similar.

Simon-

2010-11-25 15:33:04

by Peter Schüller

Subject: Re: Sudden and massive page cache eviction

> simple thing to do in any case.  You can watch the entries in slabinfo
> and see if any of the ones with sizes over 4096 bytes are getting used
> often.  You can also watch /proc/buddyinfo and see how often columns
> other than the first couple are moving around.

I collected some information from
/proc/{buddyinfo,meminfo,slabinfo,vmstat} and let it sit, polling
approximately once per minute. I have some results correlated with
another page eviction in graphs. The graph is here:

http://files.spotify.com/memcut/memgraph-20101124.png

The last sudden eviction there occurred somewhere between 22:30 and
22:45. Some URL:s that can be compared for those periods:

Before:
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/vmstat
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/buddyinfo
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/meminfo
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/slabinfo

After:
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/vmstat
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/buddyinfo
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/meminfo
http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/slabinfo

A more complete set of files for several minutes before/after/during
is in this tarball:

http://files.spotify.com/memcut/memgraph-20101124.tar.gz

Diffing the two slabinfos above yields, among other things, the
following expected or at least plausible decreases, since they
correlate with the symptoms:

-ext3_inode_cache 254638 693552 760 43 8 : tunables 0 0 0 : slabdata 16130 16130 0
+ext3_inode_cache 81698 468958 760 43 8 : tunables 0 0 0 : slabdata 10906 10906 0
-dentry 96504 344232 192 42 2 : tunables 0 0 0 : slabdata 8196 8196 0
+dentry 55628 243810 192 42 2 : tunables 0 0 0 : slabdata 5805 5805 0

And if my understanding is correct, buffer_head would be meta-data
associated with cached pages and thus be expected to drop in
correlation with less data cached:

-buffer_head 2109250 4979052 104 39 1 : tunables 0 0 0 : slabdata 127668 127668 0
+buffer_head 838859 4059822 104 39 1 : tunables 0 0 0 : slabdata 104098 104098 0

My knowledge of the implementation is lacking far too much to know
where best to look for the likely culprit in terms of the root cause
of the eviction. The one thing I thought looked suspicious was the
kmalloc increases:

-kmalloc-4096 301 328 4096 8 8 : tunables 0 0 0 : slabdata 41 41 0
+kmalloc-4096 637 680 4096 8 8 : tunables 0 0 0 : slabdata 85 85 0
-kmalloc-2048 18215 19696 2048 16 8 : tunables 0 0 0 : slabdata 1231 1231 0
+kmalloc-2048 41908 51792 2048 16 8 : tunables 0 0 0 : slabdata 3237 3237 0
-kmalloc-1024 85444 97280 1024 32 8 : tunables 0 0 0 : slabdata 3040 3040 0
+kmalloc-1024 267031 327104 1024 32 8 : tunables 0 0 0 : slabdata 10222 10222 0
-kmalloc-512 1988 2176 512 32 4 : tunables 0 0 0 : slabdata 68 68 0
+kmalloc-512 1692 2080 512 32 4 : tunables 0 0 0 : slabdata 65 65 0
-kmalloc-256 102588 119776 256 32 2 : tunables 0 0 0 : slabdata 3743 3743 0
+kmalloc-256 308470 370720 256 32 2 : tunables 0 0 0 : slabdata 11585 11585 0
-kmalloc-128 8435 9760 128 32 1 : tunables 0 0 0 : slabdata 305 305 0
+kmalloc-128 8524 9760 128 32 1 : tunables 0 0 0 : slabdata 305 305 0
-kmalloc-64 96176 405440 64 64 1 : tunables 0 0 0 : slabdata 6335 6335 0
+kmalloc-64 50001 352448 64 64 1 : tunables 0 0 0 : slabdata 5507 5507 0

If my interpretation and understanding are correct, this indicates,
for example, an increase from ~3000 to ~10000 order-3 allocations
backing the 1 kB kmalloc cache - roughly 0.2 GB (7000 slabs * 8 pages
* 4096 bytes ~= 219 MB). Add the other ones and we get some more, but
only a few hundred megs in total.

Going by the hypothesis that we are seeing the same thing as reported
by Simon Kirby (I'll respond to that E-Mail separately), the total
amount is (as far as I understand) not the important part, but the
fact that we saw a non-trivial increase in 3-order allocations would
perhaps be a consistent observation in that frequent 3-order
allocations might be more likely to trigger the behavior Simon
reports.

I can do additional post-processing on the data that was dumped (such
as graphing it), but I'm not sure which parts would be the most
interesting for figuring out what is going on. Is there something else
I should be collecting, for that matter, beyond the
vmstat/slabinfo/buddyinfo/meminfo?

--
/ Peter Schuller aka scode

2010-11-25 15:59:28

by Peter Schüller

Subject: Re: Sudden and massive page cache eviction

> Your page cache dents don't seem quite as big, so it may be something
> else, but if it's the same problem we're seeing here, it seems to have to
> do with when an order=3 new_slab allocation comes in to grow the kmalloc
> slab cache for an __alloc_skb (network packet).  This is normal even
> without jumbo frames now.  When there are no zones with order=3
> zone_watermark_ok(), kswapd is woken, which frees things all over the
> place to try to get zone_watermark_ok(order=3) to be happy.
> We're seeing this throw out a huge number of pages, and we're seeing it
> happen even with lots of memory free in the zone.

Is there some way to observe this directly (the amount evicted for low
watermark reasons)?

If not, would logging/summing the return value of balance_pgdat() in
kswapd() (mm/vmscan.c) be the way to accomplish this?

My understanding (and I am saying it just so that people can tell me
if I'm wrong) is that what you're saying implies that kswapd keeps
getting woken up in wakeup_kswapd() due to zone_watermark_ok() failing,
but kswapd()'s invocation of balance_pgdat() is unable to bring levels
above the low water mark and evicts large amounts of data while trying?

> I reimplemented zone_pages_ok(order=3) in userspace, and I can see it
> happen:

(For the ML record/others: I believe that was meant to be
zone_watermark_ok(), not zone_pages_ok(). It's in mm/page_alloc.c)

> Code here: http://0x.ca/sim/ref/2.6.36/buddyinfo_scroll

[snip output]

> So, kswapd was woken up at the line that ends in "!!!" there, because
> free_pages(249) <= min(256), and so zone_watermark_ok() returned 0, when
> an order=3 allocation came in.
>
> Maybe try out that script and see if you see something similar.

Thanks! That looks great. I'll try to set up data collection where
this can be observed and then correlated with a graph and the
vmstat/slabinfo that I just posted, the next time we see an eviction.

(For the record it triggers constantly on my desktop, but that is with
2.6.32 and I'm assuming it is due to differences in that kernel, so
I'm not bothering to investigate. It's not triggering constantly on the
2.6.36-rc6 kernel on the production system, and hopefully we can see
it trigger during the evictions.)

--
/ Peter Schuller aka scode

2010-12-01 06:36:55

by Simon Kirby

Subject: Re: Sudden and massive page cache eviction

Hello! Sorry, I didn't see your email until now.

On Thu, Nov 25, 2010 at 04:59:25PM +0100, Peter Schüller wrote:

> > Your page cache dents don't seem quite as big, so it may be something
> > else, but if it's the same problem we're seeing here, it seems to have to
> > do with when an order=3 new_slab allocation comes in to grow the kmalloc
> > slab cache for an __alloc_skb (network packet). This is normal even
> > without jumbo frames now. When there are no zones with order=3
> > zone_watermark_ok(), kswapd is woken, which frees things all over the
> > place to try to get zone_watermark_ok(order=3) to be happy.
> > We're seeing this throw out a huge number of pages, and we're seeing it
> > happen even with lots of memory free in the zone.
>
> Is there some way to observe this directly (the amount evicted for low
> watermark reasons)?
>
> If not, would logging/summing the return value of balance_pgdat() in
> kswapd() (mm/vmscan.c) be the way to accomplish this?

The way I could "see" it was with http://0x.ca/sim/ref/2.6.36/vmstat_dump
, which updates fast enough (if your terminal doesn't suck) to actually
see the 10 Hz wakeups and similar patterns. pageoutrun goes up each time
balance_pgdat() passes the "loop_again" label.

Watching http://0x.ca/sim/ref/2.6.36/buddyinfo_dump at the same time
shows that while pageoutrun is increasing, Normal pages are being
allocated and freed all very quickly but not reaching the watermark,
and the other kswapd-related counters seem to show this.
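
If you don't want to run the scripts, even something like

watch -n 0.5 'grep -E "pageoutrun|allocstall|kswapd" /proc/vmstat'

makes the pattern fairly visible: pageoutrun climbing steadily while
free memory keeps growing.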

> My understanding (and I am saying it just so that people can tell me
> if I'm wrong) is that what you're saying implies that kswapd keeps
> getting woken up in wakeup_kswapd() due to zone_watermark_ok() failing,
> but kswapd()'s invocation of balance_pgdat() is unable to bring levels
> above the low water mark and evicts large amounts of data while trying?

Yes, though there are a couple of issues here. See the "Free memory
never fully used, swapping" thread and Mel Gorman's comments/patches.

His patches fix kswapd fighting the allocator for me, but I'm still
running into problems with what seems to be fragmentation making it
difficult for kswapd to meet the higher order watermarks after a few
days. Even with everything in slub set to order 0 (except order 1 where
absolutely required), I still end up with lots of free memory after a
week or so of normal load.

> (For the ML record/others: I believe that was meant to be
> zone_watermark_ok(), not zone_pages_ok(). It's in mm/page_alloc.c)

Yes :)

> > Code here: http://0x.ca/sim/ref/2.6.36/buddyinfo_scroll
>
> [snip output]
>
> > So, kswapd was woken up at the line that ends in "!!!" there, because
> > free_pages(249) <= min(256), and so zone_watermark_ok() returned 0, when
> > an order=3 allocation came in.
> >
> > Maybe try out that script and see if you see something similar.
>
> Thanks! That looks great. I'll try to set up data collection where
> this can be observed and then correlated with a graph and the
> vmstat/slabinfo that I just posted, the next time we see an eviction.
>
> (For the record it triggers constantly on my desktop, but that is with
> 2.6.32 and I'm assuming it is due to differences in that kernel, so
> I'm not bothering to investigate. It's not triggering constantly on the
> 2.6.36-rc6 kernel on the production system, and hopefully we can see
> it trigger during the evictions.)

Well, nothing seems to care about higher order watermarks unless
something actually tries to allocate from them, so if you don't have
any order-3 allocations, it's unlikely that they will be met after
you've filled free memory. This isn't immediately obvious, because
zone_watermark_ok() works by subtracting free pages in _lower_ orders
from the total free pages in a zone. So, it's the free blocks in the
specified order and _bigger_ that matter in buddyinfo. Also,
min_free_kbytes and lowmem_reserve_ratio prevent all of the buddies
from being split even if _only_ order 0 allocations have ever occurred.

Anyway, how did it go? Did you find anything interesting? :)

Simon-

2010-12-01 09:15:43

by Simon Kirby

Subject: Re: Sudden and massive page cache eviction

On Thu, Nov 25, 2010 at 04:33:01PM +0100, Peter Schüller wrote:

> > simple thing to do in any case. You can watch the entries in slabinfo
> > and see if any of the ones with sizes over 4096 bytes are getting used
> > often. You can also watch /proc/buddyinfo and see how often columns
> > other than the first couple are moving around.
>
> I collected some information from
> /proc/{buddyinfo,meminfo,slabinfo,vmstat} and let it sit, polling
> approximately once per minute. I have some results correlated with
> another page eviction in graphs. The graph is here:
>
> http://files.spotify.com/memcut/memgraph-20101124.png
>
> The last sudden eviction there occurred somewhere between 22:30 and
> 22:45. Some URL:s that can be compared for those periods:
>
> Before:
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/vmstat
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/buddyinfo
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/meminfo
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:39:30/slabinfo
>
> After:
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/vmstat
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/buddyinfo
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/meminfo
> http://files.spotify.com/memcut/memstat-20101124/2010-11-24T22:45:31/slabinfo

Disclaimer: I have no idea what I'm doing. :)

Your buddyinfo looks to be pretty low for order 3 and above, before and
after the sudden eviction, so my guess is that it's probably related to
the issues I'm seeing with fragmentation, but maybe not fighting between
zones, since you seem to have a larger Normal zone than DMA32. (Not
sure, you didn't post /proc/zoneinfo). Also, you seem to be on an
actual NUMA system, so other things are happening there, too.

If you have munin installed (it looks like you do), try enabling the
buddyinfo plugin available since munin 1.4.4. It graphs the buddyinfo
data, so it could be lined up with the memory graphs (thanks Erik).
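
Enabling it is the usual munin plugin dance - something like the
following, though the exact paths may differ on your distribution:

ln -s /usr/share/munin/plugins/buddyinfo /etc/munin/plugins/buddyinfo
/etc/init.d/munin-node restart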

[snip]

> kmalloc increases:
>
> -kmalloc-4096 301 328 4096 8 8 : tunables 0 0 0 : slabdata 41 41 0
> +kmalloc-4096 637 680 4096 8 8 : tunables 0 0 0 : slabdata 85 85 0
> -kmalloc-2048 18215 19696 2048 16 8 : tunables 0 0 0 : slabdata 1231 1231 0
> +kmalloc-2048 41908 51792 2048 16 8 : tunables 0 0 0 : slabdata 3237 3237 0
> -kmalloc-1024 85444 97280 1024 32 8 : tunables 0 0 0 : slabdata 3040 3040 0
> +kmalloc-1024 267031 327104 1024 32 8 : tunables 0 0 0 : slabdata 10222 10222 0

Note that all of the above are actually attempting order-3 allocations
first; see /sys/kernel/slab/kmalloc-1024/order, for instance. The "8"
means "8 pages per slab", which means order 3 is the attempted allocation
size.
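
A quick way to see the order for all of the kmalloc caches at once:

grep . /sys/kernel/slab/kmalloc-*/order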

I did the following on a system to test, but the free memory did not
actually improve. It seems that even only order 1 allocations are enough
to reclaim too much order 0. Even a "while true; do sleep .01; done" caused
free memory to start increasing due to order 1 (task_struct allocation)
watermarks waking kswapd, while our other usual VM activity is happening.

#!/bin/bash
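# force every slab cache to single-page slabs where the object fits in
# a page, and order 1 only where the object itself is larger than a page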

for i in /sys/kernel/slab/*/; do
    if [ `cat $i/object_size` -le 4096 ]; then
        echo 0 > $i/order
    else
        echo 1 > $i/order
    fi
done

But this is on another machine, without Mel's patch, and with 8 GB
memory, so a bigger Normal zone.

[snip]

> If my interpretation and understanding are correct, this indicates,
> for example, an increase from ~3000 to ~10000 order-3 allocations
> backing the 1 kB kmalloc cache - roughly 0.2 GB (7000 slabs * 8 pages
> * 4096 bytes ~= 219 MB). Add the other ones and we get some more, but
> only a few hundred megs in total.
>
> Going by the hypothesis that we are seeing the same thing as reported
> by Simon Kirby (I'll respond to that E-Mail separately), the total
> amount is (as far as I understand) not the important part, but the
> fact that we saw a non-trivial increase in 3-order allocations would
> perhaps be a consistent observation in that frequent 3-order
> allocations might be more likely to trigger the behavior Simon
> reports.

Try installing the "perf" tool. It can be built from the kernel tree in
tools/perf, and then you usually can just copy the binary around. You
can use it to trace the points which cause kswapd to wake up, which will
show which processes are doing it, the order, flags, etc.

Just before the eviction is about to happen (or whenever), try this:

perf record --event vmscan:mm_vmscan_wakeup_kswapd --filter 'order>=3' \
--call-graph -a sleep 30

Then view the recorded events with "perf trace", which should spit out
something like this:

lmtp-3531 [003] 432339.243851: mm_vmscan_wakeup_kswapd: nid=0 zid=2 order=3
lmtp-3531 [003] 432339.243856: mm_vmscan_wakeup_kswapd: nid=0 zid=1 order=3

The process which woke kswapd may not be directly responsible for the
allocation as a network interrupt or something could have happened on
top of it. See "perf report", which is a bit dodgy at least for me, to
see the stack traces, which might make things clearer. For example, my
traces show that even kswapd wakes kswapd sometimes, but it's because of
a trace like this:

- 9.09% kswapd0 [kernel.kallsyms] [k] wakeup_kswapd
wakeup_kswapd
__alloc_pages_nodemask
alloc_pages_current
new_slab
__slab_alloc
__kmalloc_node_track_caller
__alloc_skb
__netdev_alloc_skb
bnx2_poll_work
bnx2_poll
net_rx_action
__do_softirq
call_softirq
do_softirq
irq_exit
do_IRQ
ret_from_intr
truncate_inode_pages
proc_evict_inode
evict
iput
dentry_iput
d_kill
__shrink_dcache_sb
shrink_dcache_memory
shrink_slab
kswapd

Anyway, maybe you'll see some interesting traces. If kswapd isn't waking
very often, you can also trace "kmem:mm_page_alloc" or similar (see "perf
list"), or try a smaller order or a longer sleep.
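
For example, mirroring the command above (untested on my side, but the
mm_page_alloc tracepoint does carry an "order" field):

perf record --event kmem:mm_page_alloc --filter 'order>=3' \
    --call-graph -a sleep 30

followed by the same "perf trace" / "perf report" steps.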

Cheers,

Simon-