2013-03-05 18:21:55

by Howard Chu

Subject: mmap vs fs cache

I'm testing our memory-mapped database code on a small VM. The machine has
32GB of RAM and the size of the DB on disk is ~44GB. The database library
mmaps the entire file as a single region and starts accessing it as a tree of
B+trees. Running on an Ubuntu 3.5.0-23 kernel, XFS on a local disk.
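
For reference, the mapping itself is nothing exotic - conceptually just one
read-only mmap covering the whole file. A minimal sketch (the path and the
error handling are placeholders, not the actual library code):

#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

/* Minimal sketch of the kind of mapping described above. */
int main(void)
{
	int fd = open("/srv/db/data.mdb", O_RDONLY);	/* hypothetical path */
	struct stat st;

	if (fd < 0 || fstat(fd, &st) < 0) {
		perror("open/fstat");
		return 1;
	}

	/* One region covering the entire ~44GB file; pages are faulted in
	 * on demand as the B+trees are walked. */
	void *map = mmap(NULL, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
	if (map == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* ... read-only tree lookups go straight through "map" ... */

	munmap(map, st.st_size);
	close(fd);
	return 0;
}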

If I start running read-only queries against the DB with a freshly started
server, I see that my process (OpenLDAP slapd) quickly grows to an RSS of
about 16GB in tandem with the FS cache. (I.e., "top" shows 16GB cached, and
slapd is 16GB.)
If I confine my queries to the first 20% of the data then it all fits in RAM
and queries are nice and fast.

If I extend the query range to cover more of the data, approaching the size of
physical RAM, I see something strange - the FS cache keeps growing, but the
slapd process size grows at a slower rate. This is rather puzzling to me since
the only thing triggering reads is accesses through the mmap region.
Eventually the FS cache grows to basically all of the 32GB of RAM (+/- some
text/data space...) but the slapd process only reaches 25GB, at which point it
actually starts to shrink - apparently the FS cache is now stealing pages from
it. I find that a bit puzzling; if the pages are present in memory, and the
only reason they were paged in was to satisfy an mmap reference, why aren't
they simply assigned to the slapd process?

The current behavior gets even more aggravating: I can run a test that spans
exactly 30GB of the data. One would expect that the slapd process should
simply grow to 30GB in size, and then remain static for the remainder of the
test. Instead, the server grows to 25GB, the FS cache grows to 32GB, and
starts stealing pages from the server, shrinking it back down to 19GB or so.

If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this condition,
the FS cache shrinks back to 25GB, matching the slapd process size.
This then frees up enough RAM for slapd to grow further. If I don't do this,
the test is constantly paging in data from disk. Even so, the FS cache
continues to grow faster than the slapd process size, so the system may run
out of free RAM again, and I have to drop caches multiple times before slapd
finally grows to the full 30GB. Once it gets to that size the test runs
entirely from RAM with zero I/Os, but it doesn't get there without a lot of
babysitting.

2 questions:
why is there data in the FS cache that isn't owned by (the mmap of) the
process that caused it to be paged in in the first place?
is there a tunable knob to discourage the page cache from stealing from the
process?

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/


2013-03-05 23:55:07

by Howard Chu

Subject: Re: mmap vs fs cache

Howard Chu wrote:
> 2 questions:
> why is there data in the FS cache that isn't owned by (the mmap of) the
> process that caused it to be paged in in the first place?
> is there a tunable knob to discourage the page cache from stealing from the
> process?

This Unmapped page cache control http://lwn.net/Articles/436010/ sounds like
it might have been helpful here. I.e., having a way to prioritize so that
unmapped cache pages get reclaimed in preference to mapped pages could help.
Though I still don't understand why these pages in the cache aren't mapped in
the first place.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-06 10:14:15

by Howard Chu

Subject: Re: mmap vs fs cache

Howard Chu wrote:
> Howard Chu wrote:
>> 2 questions:
>> why is there data in the FS cache that isn't owned by (the mmap of) the
>> process that caused it to be paged in in the first place?
>> is there a tunable knob to discourage the page cache from stealing from the
>> process?
>
> This Unmapped page cache control http://lwn.net/Articles/436010/ sounds like
> it might have been helpful here. I.e., having a way to prioritize so that
> unmapped cache pages get reclaimed in preference to mapped pages could help.
> Though I still don't understand why these pages in the cache aren't mapped in
> the first place.
>
As implied by this post
http://lkml.indiana.edu/hypermail/linux/kernel/0701.3/0354.html setting
swappiness to 0 seems to give the desired effect of preventing mapped pages
from being reclaimed. If this is an intended effect, it would be nice to have
this documented in Documentation/sysctl/vm.txt. If this is not the intended
effect, please don't "fix" this without providing a supported means of doing
the same.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-06 10:22:53

by Howard Chu

Subject: Re: mmap vs fs cache

Howard Chu wrote:
> Howard Chu wrote:
>> Howard Chu wrote:
>>> 2 questions:
>>> why is there data in the FS cache that isn't owned by (the mmap of) the
>>> process that caused it to be paged in in the first place?
>>> is there a tunable knob to discourage the page cache from stealing from the
>>> process?
>>
>> This Unmapped page cache control http://lwn.net/Articles/436010/ sounds like
>> it might have been helpful here. I.e., having a way to prioritize so that
>> unmapped cache pages get reclaimed in preference to mapped pages could help.
>> Though I still don't understand why these pages in the cache aren't mapped in
>> the first place.
>>
> As implied by this post
> http://lkml.indiana.edu/hypermail/linux/kernel/0701.3/0354.html setting
> swappiness to 0 seems to give the desired effect of preventing mapped pages
> from being reclaimed.

I spoke too soon, after a few minutes of load the process size started
shrinking again. My original questions still stand.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-07 15:43:19

by Jan Kara

Subject: Re: mmap vs fs cache

Added mm list to CC.

On Tue 05-03-13 09:57:34, Howard Chu wrote:
> I'm testing our memory-mapped database code on a small VM. The
> machine has 32GB of RAM and the size of the DB on disk is ~44GB. The
> database library mmaps the entire file as a single region and starts
> accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23
> kernel, XFS on a local disk.
>
> If I start running read-only queries against the DB with a freshly
> started server, I see that my process (OpenLDAP slapd) quickly grows
> to an RSS of about 16GB in tandem with the FS cache. (I.e., "top"
> shows 16GB cached, and slapd is 16GB.)
> If I confine my queries to the first 20% of the data then it all
> fits in RAM and queries are nice and fast.
>
> if I extend the query range to cover more of the data, approaching
> the size of physical RAM, I see something strange - the FS cache
> keeps growing, but the slapd process size grows at a slower rate.
> This is rather puzzling to me since the only thing triggering reads
> is accesses through the mmap region. Eventually the FS cache grows
> to basically all of the 32GB of RAM (+/- some text/data space...)
> but the slapd process only reaches 25GB, at which point it actually
> starts to shrink - apparently the FS cache is now stealing pages
> from it. I find that a bit puzzling; if the pages are present in
> memory, and the only reason they were paged in was to satisfy an
> mmap reference, why aren't they simply assigned to the slapd
> process?
>
> The current behavior gets even more aggravating: I can run a test
> that spans exactly 30GB of the data. One would expect that the slapd
> process should simply grow to 30GB in size, and then remain static
> for the remainder of the test. Instead, the server grows to 25GB,
> the FS cache grows to 32GB, and starts stealing pages from the
> server, shrinking it back down to 19GB or so.
>
> If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
> condition, the FS cache shrinks back to 25GB, matching the slapd
> process size.
> This then frees up enough RAM for slapd to grow further. If I don't
> do this, the test is constantly paging in data from disk. Even so,
> the FS cache continues to grow faster than the slapd process size,
> so the system may run out of free RAM again, and I have to drop
> caches multiple times before slapd finally grows to the full 30GB.
> Once it gets to that size the test runs entirely from RAM with zero
> I/Os, but it doesn't get there without a lot of babysitting.
>
> 2 questions:
> why is there data in the FS cache that isn't owned by (the mmap
> of) the process that caused it to be paged in in the first place?
> is there a tunable knob to discourage the page cache from stealing
> from the process?
>
> --
> -- Howard Chu
> CTO, Symas Corp. http://www.symas.com
> Director, Highland Sun http://highlandsun.com/hyc/
> Chief Architect, OpenLDAP http://www.openldap.org/project/
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-08 02:09:08

by Johannes Weiner

Subject: Re: mmap vs fs cache

On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:
> Added mm list to CC.
>
> On Tue 05-03-13 09:57:34, Howard Chu wrote:
> > I'm testing our memory-mapped database code on a small VM. The
> > machine has 32GB of RAM and the size of the DB on disk is ~44GB. The
> > database library mmaps the entire file as a single region and starts
> > accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23
> > kernel, XFS on a local disk.
> >
> > If I start running read-only queries against the DB with a freshly
> > started server, I see that my process (OpenLDAP slapd) quickly grows
> > to an RSS of about 16GB in tandem with the FS cache. (I.e., "top"
> > shows 16GB cached, and slapd is 16GB.)
> > If I confine my queries to the first 20% of the data then it all
> > fits in RAM and queries are nice and fast.
> >
> > if I extend the query range to cover more of the data, approaching
> > the size of physical RAM, I see something strange - the FS cache
> > keeps growing, but the slapd process size grows at a slower rate.
> > This is rather puzzling to me since the only thing triggering reads
> > is accesses through the mmap region. Eventually the FS cache grows
> > to basically all of the 32GB of RAM (+/- some text/data space...)
> > but the slapd process only reaches 25GB, at which point it actually
> > starts to shrink - apparently the FS cache is now stealing pages
> > from it. I find that a bit puzzling; if the pages are present in
> > memory, and the only reason they were paged in was to satisfy an
> > mmap reference, why aren't they simply assigned to the slapd
> > process?
> >
> > The current behavior gets even more aggravating: I can run a test
> > that spans exactly 30GB of the data. One would expect that the slapd
> > process should simply grow to 30GB in size, and then remain static
> > for the remainder of the test. Instead, the server grows to 25GB,
> > the FS cache grows to 32GB, and starts stealing pages from the
> > server, shrinking it back down to 19GB or so.
> >
> > If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
> > condition, the FS cache shrinks back to 25GB, matching the slapd
> > process size.
> > This then frees up enough RAM for slapd to grow further. If I don't
> > do this, the test is constantly paging in data from disk. Even so,
> > the FS cache continues to grow faster than the slapd process size,
> > so the system may run out of free RAM again, and I have to drop
> > caches multiple times before slapd finally grows to the full 30GB.
> > Once it gets to that size the test runs entirely from RAM with zero
> > I/Os, but it doesn't get there without a lot of babysitting.
> >
> > 2 questions:
> > why is there data in the FS cache that isn't owned by (the mmap
> > of) the process that caused it to be paged in in the first place?

The filesystem cache is shared among processes because the filesystem
is also shared among processes. If another task were to access the
same file, we still should only have one copy of that data in memory.

It sounds to me like slapd is itself caching all the data it reads.
If that is true, shouldn't it really be using direct IO to prevent
this double buffering of filesystem data in memory?

> > is there a tunable knob to discourage the page cache from stealing
> > from the process?

Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
defaults to 60.

2013-03-08 07:46:48

by Howard Chu

Subject: Re: mmap vs fs cache

Johannes Weiner wrote:
> On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:

>>> 2 questions:
>>> why is there data in the FS cache that isn't owned by (the mmap
>>> of) the process that caused it to be paged in in the first place?
>
> The filesystem cache is shared among processes because the filesystem
> is also shared among processes. If another task were to access the
> same file, we still should only have one copy of that data in memory.

That's irrelevant to the question. As I already explained, the first 16GB that
was paged in didn't behave this way. Perhaps "owned" was the wrong word, since
this is a MAP_SHARED mapping. But the point is that the memory is not being
accounted in slapd's process size, when it was before, up to 16GB.

> It sounds to me like slapd is itself caching all the data it reads.

You're misreading the information then. slapd is doing no caching of its own,
its RSS and SHR memory size are both the same. All it is using is the mmap,
nothing else. The RSS == SHR == FS cache, up to 16GB. RSS is always == SHR,
but above 16GB they grow more slowly than the FS cache.

> If that is true, shouldn't it really be using direct IO to prevent
> this double buffering of filesystem data in memory?

There is no double buffering.

>>> is there a tunable knob to discourage the page cache from stealing
>>> from the process?
>
> Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
> defaults to 60.

I've already tried setting it to 0 with no effect.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-08 08:40:20

by Kirill A. Shutemov

Subject: Re: mmap vs fs cache

On Thu, Mar 07, 2013 at 11:46:39PM -0800, Howard Chu wrote:
> You're misreading the information then. slapd is doing no caching of
> its own, its RSS and SHR memory size are both the same. All it is
> using is the mmap, nothing else. The RSS == SHR == FS cache, up to
> 16GB. RSS is always == SHR, but above 16GB they grow more slowly
> than the FS cache.

It only means that some pages got unmapped from your process. That can
happen, for instance, due to page migration. There's nothing to worry about:
the page will be mapped back in on the next fault, and it will only be a minor
fault since the page is in the page cache anyway.

--
Kirill A. Shutemov

2013-03-08 09:40:46

by Howard Chu

Subject: Re: mmap vs fs cache

Kirill A. Shutemov wrote:
> On Thu, Mar 07, 2013 at 11:46:39PM -0800, Howard Chu wrote:
>> You're misreading the information then. slapd is doing no caching of
>> its own, its RSS and SHR memory size are both the same. All it is
>> using is the mmap, nothing else. The RSS == SHR == FS cache, up to
>> 16GB. RSS is always == SHR, but above 16GB they grow more slowly
>> than the FS cache.
>
> It only means that some pages got unmapped from your process. That can
> happen, for instance, due to page migration. There's nothing to worry about:
> the page will be mapped back in on the next fault, and it will only be a minor
> fault since the page is in the page cache anyway.

Unfortunately there *is* something to worry about. As I said already - when
the test spans 30GB, the FS cache fills up the rest of RAM and the test is
doing a lot of real I/O even though it shouldn't need to. Please read the
entire original post before replying.

There is no way that a process that is accessing only 30GB of a mmap should be
able to fill up 32GB of RAM. There's nothing else running on the machine, I've
killed or suspended everything else in userland besides a couple shells
running top and vmstat. When I manually drop_caches repeatedly, then
eventually slapd RSS/SHR grows to 30GB and the physical I/O stops.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-08 14:48:46

by Chris Friesen

Subject: Re: mmap vs fs cache

On 03/08/2013 03:40 AM, Howard Chu wrote:

> There is no way that a process that is accessing only 30GB of a mmap
> should be able to fill up 32GB of RAM. There's nothing else running on
> the machine, I've killed or suspended everything else in userland
> besides a couple shells running top and vmstat. When I manually
> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
> the physical I/O stops.

Is it possible that the kernel is doing some sort of automatic
readahead, but it ends up reading pages corresponding to data that isn't
ever queried and so doesn't get mapped by the application?

Chris

2013-03-08 15:01:07

by Howard Chu

Subject: Re: mmap vs fs cache

Chris Friesen wrote:
> On 03/08/2013 03:40 AM, Howard Chu wrote:
>
>> There is no way that a process that is accessing only 30GB of a mmap
>> should be able to fill up 32GB of RAM. There's nothing else running on
>> the machine, I've killed or suspended everything else in userland
>> besides a couple shells running top and vmstat. When I manually
>> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
>> the physical I/O stops.
>
> Is it possible that the kernel is doing some sort of automatic
> readahead, but it ends up reading pages corresponding to data that isn't
> ever queried and so doesn't get mapped by the application?

Yes, that's what I was thinking. I added a posix_madvise(..POSIX_MADV_RANDOM)
but that had no effect on the test.

First obvious conclusion - kswapd is being too aggressive. When free memory
hits the low watermark, the reclaim shrinks slapd down from 25GB to 18-19GB,
while the page cache still contains ~7GB of unmapped pages. Ideally I'd like a
tuning knob so I can say to keep no more than 2GB of unmapped pages in the
cache. (And the desired effect of that would be to allow user processes to
grow to 30GB total, in this case.)

I mentioned this "unmapped page cache control" post already
http://lwn.net/Articles/436010/ but it seems that the idea was ultimately
rejected. Is there anything else similar in current kernels?

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-08 15:26:49

by Chris Friesen

Subject: Re: mmap vs fs cache

On 03/08/2013 09:00 AM, Howard Chu wrote:

> First obvious conclusion - kswapd is being too aggressive. When free
> memory hits the low watermark, the reclaim shrinks slapd down from 25GB
> to 18-19GB, while the page cache still contains ~7GB of unmapped pages.
> Ideally I'd like a tuning knob so I can say to keep no more than 2GB of
> unmapped pages in the cache. (And the desired effect of that would be to
> allow user processes to grow to 30GB total, in this case.)
>
> I mentioned this "unmapped page cache control" post already
> http://lwn.net/Articles/436010/ but it seems that the idea was
> ultimately rejected. Is there anything else similar in current kernels?

Sorry, I'm not aware of anything. I'm not a filesystem/vm guy though,
so maybe there's something I don't know about.

I would have expected both posix_madvise(..POSIX_MADV_RANDOM) and
swappiness to help, but it doesn't sound like they're working.

Chris

2013-03-08 16:17:04

by Johannes Weiner

Subject: Re: mmap vs fs cache

On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
> Chris Friesen wrote:
> >On 03/08/2013 03:40 AM, Howard Chu wrote:
> >
> >>There is no way that a process that is accessing only 30GB of a mmap
> >>should be able to fill up 32GB of RAM. There's nothing else running on
> >>the machine, I've killed or suspended everything else in userland
> >>besides a couple shells running top and vmstat. When I manually
> >>drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
> >>the physical I/O stops.
> >
> >Is it possible that the kernel is doing some sort of automatic
> >readahead, but it ends up reading pages corresponding to data that isn't
> >ever queried and so doesn't get mapped by the application?
>
> Yes, that's what I was thinking. I added a
> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> test.
>
> First obvious conclusion - kswapd is being too aggressive. When free
> memory hits the low watermark, the reclaim shrinks slapd down from
> 25GB to 18-19GB, while the page cache still contains ~7GB of
> unmapped pages. Ideally I'd like a tuning knob so I can say to keep
> no more than 2GB of unmapped pages in the cache. (And the desired
> effect of that would be to allow user processes to grow to 30GB
> total, in this case.)

We should find out where the unmapped page cache is coming from if you
are only accessing mapped file cache and disabled readahead.

How do you arrive at this number of unmapped page cache?

What could happen is that previously used and activated pages do not
get evicted anymore since there is a constant supply of younger
reclaimable cache that is actually thrashing. Whenever you drop the
caches, you get rid of those stale active pages and allow the
previously thrashing cache to get activated. However, that would
require that there is already a significant amount of active file
pages before your workload starts (check the nr_active_file number in
/proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
before launching to eliminate this option) OR that the set of pages
accessed during your workload changes and the combined set of pages
accessed by your workload is bigger than available memory -- which you
claimed would not happen because you only access the 30GB file area on
that system.

2013-03-08 20:05:05

by Howard Chu

Subject: Re: mmap vs fs cache

Johannes Weiner wrote:
> On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
>> Chris Friesen wrote:
>>> On 03/08/2013 03:40 AM, Howard Chu wrote:
>>>
>>>> There is no way that a process that is accessing only 30GB of a mmap
>>>> should be able to fill up 32GB of RAM. There's nothing else running on
>>>> the machine, I've killed or suspended everything else in userland
>>>> besides a couple shells running top and vmstat. When I manually
>>>> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
>>>> the physical I/O stops.
>>>
>>> Is it possible that the kernel is doing some sort of automatic
>>> readahead, but it ends up reading pages corresponding to data that isn't
>>> ever queried and so doesn't get mapped by the application?
>>
>> Yes, that's what I was thinking. I added a
>> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
>> test.
>>
>> First obvious conclusion - kswapd is being too aggressive. When free
>> memory hits the low watermark, the reclaim shrinks slapd down from
>> 25GB to 18-19GB, while the page cache still contains ~7GB of
>> unmapped pages. Ideally I'd like a tuning knob so I can say to keep
>> no more than 2GB of unmapped pages in the cache. (And the desired
>> effect of that would be to allow user processes to grow to 30GB
>> total, in this case.)
>
> We should find out where the unmapped page cache is coming from if you
> are only accessing mapped file cache and disabled readahead.
>
> How do you arrive at this number of unmapped page cache?

This number is pretty obvious. When slapd has grown to 25GB, the page cache
has grown to 32GB (less about 200MB, the minfree). So: 7GB unmapped in the cache.

> What could happen is that previously used and activated pages do not
> get evicted anymore since there is a constant supply of younger
> reclaimable cache that is actually thrashing. Whenever you drop the
> caches, you get rid of those stale active pages and allow the
> previously thrashing cache to get activated. However, that would
> require that there is already a significant amount of active file
> pages before your workload starts (check the nr_active_file number in
> /proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
> before launching to eliminate this option) OR that the set of pages
> accessed during your workload changes and the combined set of pages
> accessed by your workload is bigger than available memory -- which you
> claimed would not happen because you only access the 30GB file area on
> that system.

There are no other active pages before the test begins. There's nothing else
running. caches have been dropped completely at the beginning.

The test clearly is accessing only 30GB of data. Once slapd reaches this
process size, the test can be stopped and restarted any number of times, run
for any number of hours continuously, and memory use on the system is
unchanged, and no pageins occur.

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-09 01:22:23

by Phillip Susi

Subject: Re: mmap vs fs cache

On 03/08/2013 10:00 AM, Howard Chu wrote:
> Yes, that's what I was thinking. I added a
> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> test.

Yep, that's because it isn't implemented.

You might try MADV_WILLNEED to schedule it to be read in first. I
believe that will only read in the requested page, without additional
readahead, and then when you fault on the page, it already has IO
scheduled, so the extra readahead will also be skipped.
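
Something along these lines, i.e. asking for the specific page shortly before
touching it ("map" and "offset" are placeholders for the application's mapping
and the byte offset it needs next; I haven't verified the readahead behaviour,
so treat it as a sketch of the idea):

#include <stddef.h>
#include <sys/mman.h>
#include <unistd.h>

/* Sketch of the suggestion above: hint the kernel to start I/O for the
 * page we are about to dereference. */
static void prefetch_page(void *map, size_t offset)
{
	size_t psz = (size_t)sysconf(_SC_PAGESIZE);
	char *page = (char *)map + (offset & ~(psz - 1));

	madvise(page, psz, MADV_WILLNEED);	/* best effort, errors ignored */
}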



2013-03-09 02:34:40

by Ric Mason

Subject: Re: mmap vs fs cache

Hi Johannes,
On 03/08/2013 10:08 AM, Johannes Weiner wrote:
> On Thu, Mar 07, 2013 at 04:43:12PM +0100, Jan Kara wrote:
>> Added mm list to CC.
>>
>> On Tue 05-03-13 09:57:34, Howard Chu wrote:
>>> I'm testing our memory-mapped database code on a small VM. The
>>> machine has 32GB of RAM and the size of the DB on disk is ~44GB. The
>>> database library mmaps the entire file as a single region and starts
>>> accessing it as a tree of B+trees. Running on an Ubuntu 3.5.0-23
>>> kernel, XFS on a local disk.
>>>
>>> If I start running read-only queries against the DB with a freshly
>>> started server, I see that my process (OpenLDAP slapd) quickly grows
>>> to an RSS of about 16GB in tandem with the FS cache. (I.e., "top"
>>> shows 16GB cached, and slapd is 16GB.)
>>> If I confine my queries to the first 20% of the data then it all
>>> fits in RAM and queries are nice and fast.
>>>
>>> if I extend the query range to cover more of the data, approaching
>>> the size of physical RAM, I see something strange - the FS cache
>>> keeps growing, but the slapd process size grows at a slower rate.
>>> This is rather puzzling to me since the only thing triggering reads
>>> is accesses through the mmap region. Eventually the FS cache grows
>>> to basically all of the 32GB of RAM (+/- some text/data space...)
>>> but the slapd process only reaches 25GB, at which point it actually
>>> starts to shrink - apparently the FS cache is now stealing pages
>>> from it. I find that a bit puzzling; if the pages are present in
>>> memory, and the only reason they were paged in was to satisfy an
>>> mmap reference, why aren't they simply assigned to the slapd
>>> process?
>>>
>>> The current behavior gets even more aggravating: I can run a test
>>> that spans exactly 30GB of the data. One would expect that the slapd
>>> process should simply grow to 30GB in size, and then remain static
>>> for the remainder of the test. Instead, the server grows to 25GB,
>>> the FS cache grows to 32GB, and starts stealing pages from the
>>> server, shrinking it back down to 19GB or so.
>>>
>>> If I do an "echo 1 > /proc/sys/vm/drop_caches" at the onset of this
>>> condition, the FS cache shrinks back to 25GB, matching the slapd
>>> process size.
>>> This then frees up enough RAM for slapd to grow further. If I don't
>>> do this, the test is constantly paging in data from disk. Even so,
>>> the FS cache continues to grow faster than the slapd process size,
>>> so the system may run out of free RAM again, and I have to drop
>>> caches multiple times before slapd finally grows to the full 30GB.
>>> Once it gets to that size the test runs entirely from RAM with zero
>>> I/Os, but it doesn't get there without a lot of babysitting.
>>>
>>> 2 questions:
>>> why is there data in the FS cache that isn't owned by (the mmap
>>> of) the process that caused it to be paged in in the first place?
> The filesystem cache is shared among processes because the filesystem
> is also shared among processes. If another task were to access the
> same file, we still should only have one copy of that data in memory.
>
> It sounds to me like slapd is itself caching all the data it reads.
> If that is true, shouldn't it really be using direct IO to prevent
> this double buffering of filesystem data in memory?

When is using direct IO better, and when is using the page cache better?

>
>>> is there a tunable knob to discourage the page cache from stealing
>>> from the process?
> Try reducing /proc/sys/vm/swappiness, which ranges from 0-100 and
> defaults to 60.

Why reduce it? IIUC, swappiness is used to determine how aggressively to
reclaim anonymous pages; if the value is high, more anonymous pages will
be reclaimed.


2013-03-09 03:28:43

by Ric Mason

Subject: Re: mmap vs fs cache

Hi Johannes,
On 03/09/2013 12:16 AM, Johannes Weiner wrote:
> On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
>> Chris Friesen wrote:
>>> On 03/08/2013 03:40 AM, Howard Chu wrote:
>>>
>>>> There is no way that a process that is accessing only 30GB of a mmap
>>>> should be able to fill up 32GB of RAM. There's nothing else running on
>>>> the machine, I've killed or suspended everything else in userland
>>>> besides a couple shells running top and vmstat. When I manually
>>>> drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
>>>> the physical I/O stops.
>>> Is it possible that the kernel is doing some sort of automatic
>>> readahead, but it ends up reading pages corresponding to data that isn't
>>> ever queried and so doesn't get mapped by the application?
>> Yes, that's what I was thinking. I added a
>> posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
>> test.
>>
>> First obvious conclusion - kswapd is being too aggressive. When free
>> memory hits the low watermark, the reclaim shrinks slapd down from
>> 25GB to 18-19GB, while the page cache still contains ~7GB of
>> unmapped pages. Ideally I'd like a tuning knob so I can say to keep
>> no more than 2GB of unmapped pages in the cache. (And the desired
>> effect of that would be to allow user processes to grow to 30GB
>> total, in this case.)
> We should find out where the unmapped page cache is coming from if you
> are only accessing mapped file cache and disabled readahead.
>
> How do you arrive at this number of unmapped page cache?
>
> What could happen is that previously used and activated pages do not
> get evicted anymore since there is a constant supply of younger

If a user process exits, are its file pages and anonymous pages freed
immediately, or do they go through page reclaim?

> reclaimable cache that is actually thrashing. Whenever you drop the
> caches, you get rid of those stale active pages and allow the
> previously thrashing cache to get activated. However, that would
> require that there is already a significant amount of active file

Why do you emphasize a *significant* amount of active file pages?

> pages before your workload starts (check the nr_active_file number in
> /proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
> before launching to eliminate this option) OR that the set of pages
> accessed during your workload changes and the combined set of pages
> accessed by your workload is bigger than available memory -- which you
> claimed would not happen because you only access the 30GB file area on
> that system.

2013-03-11 11:52:26

by Jan Kara

Subject: Re: mmap vs fs cache

On Fri 08-03-13 20:22:19, Phillip Susi wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> On 03/08/2013 10:00 AM, Howard Chu wrote:
> > Yes, that's what I was thinking. I added a
> > posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> > test.
>
> Yep, that's because it isn't implemented.
Why do you think so? AFAICS it is implemented by setting VM_RAND_READ
flag in the VMA and do_async_mmap_readahead() and do_sync_mmap_readahead()
check for the flag and don't do anything if it is set...

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-11 12:04:32

by Jan Kara

Subject: Re: mmap vs fs cache

On Fri 08-03-13 12:04:46, Howard Chu wrote:
> Johannes Weiner wrote:
> >On Fri, Mar 08, 2013 at 07:00:55AM -0800, Howard Chu wrote:
> >>Chris Friesen wrote:
> >>>On 03/08/2013 03:40 AM, Howard Chu wrote:
> >>>
> >>>>There is no way that a process that is accessing only 30GB of a mmap
> >>>>should be able to fill up 32GB of RAM. There's nothing else running on
> >>>>the machine, I've killed or suspended everything else in userland
> >>>>besides a couple shells running top and vmstat. When I manually
> >>>>drop_caches repeatedly, then eventually slapd RSS/SHR grows to 30GB and
> >>>>the physical I/O stops.
> >>>
> >>>Is it possible that the kernel is doing some sort of automatic
> >>>readahead, but it ends up reading pages corresponding to data that isn't
> >>>ever queried and so doesn't get mapped by the application?
> >>
> >>Yes, that's what I was thinking. I added a
> >>posix_madvise(..POSIX_MADV_RANDOM) but that had no effect on the
> >>test.
> >>
> >>First obvious conclusion - kswapd is being too aggressive. When free
> >>memory hits the low watermark, the reclaim shrinks slapd down from
> >>25GB to 18-19GB, while the page cache still contains ~7GB of
> >>unmapped pages. Ideally I'd like a tuning knob so I can say to keep
> >>no more than 2GB of unmapped pages in the cache. (And the desired
> >>effect of that would be to allow user processes to grow to 30GB
> >>total, in this case.)
> >
> >We should find out where the unmapped page cache is coming from if you
> >are only accessing mapped file cache and disabled readahead.
> >
> >How do you arrive at this number of unmapped page cache?
>
> This number is pretty obvious. When slapd has grown to 25GB, the
This 25G is presumably from /proc/pid/statm, right?

> page cache has grown to 32GB (less about 200MB, the minfree). So:
And this value is from where? /proc/meminfo - Cached line?

> 7GB unmapped in the cache.
>
> >What could happen is that previously used and activated pages do not
> >get evicted anymore since there is a constant supply of younger
> >reclaimable cache that is actually thrashing. Whenever you drop the
> >caches, you get rid of those stale active pages and allow the
> >previously thrashing cache to get activated. However, that would
> >require that there is already a significant amount of active file
> >pages before your workload starts (check the nr_active_file number in
> >/proc/vmstat before launching slapd, try sync; echo 3 >drop_caches
> >before launching to eliminate this option) OR that the set of pages
> >accessed during your workload changes and the combined set of pages
> >accessed by your workload is bigger than available memory -- which you
> >claimed would not happen because you only access the 30GB file area on
> >that system.
>
> There are no other active pages before the test begins. There's
> nothing else running. caches have been dropped completely at the
> beginning.
>
> The test clearly is accessing only 30GB of data. Once slapd reaches
> this process size, the test can be stopped and restarted any number
> of times, run for any number of hours continuously, and memory use
> on the system is unchanged, and no pageins occur.
Interesting. It might be worth trying what happens if you do
madvise(..., MADV_DONTNEED) on the data file instead of dropping caches
with /proc/sys/vm/drop_caches. That way we can establish whether the extra
cached data is in the data file (things will look the same way as with
drop_caches) or somewhere else (there will still be unmapped page cache).
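
Something like the following, applied to the data file's mapping ("map" and
"maplen" being whatever was passed to mmap(); this is just a sketch of the
experiment, not a fix):

#include <stddef.h>
#include <sys/mman.h>

/* Drop this process's view of the mapped data file instead of using the
 * global drop_caches knob.  For a shared file mapping the pages can then
 * be reclaimed and are re-read from the page cache or disk on the next
 * access. */
static int drop_mapped_file(void *map, size_t maplen)
{
	return madvise(map, maplen, MADV_DONTNEED);
}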

Honza
--
Jan Kara <[email protected]>
SUSE Labs, CR

2013-03-11 12:40:51

by Howard Chu

Subject: Re: mmap vs fs cache

Jan Kara wrote:
> On Fri 08-03-13 12:04:46, Howard Chu wrote:
>> The test clearly is accessing only 30GB of data. Once slapd reaches
>> this process size, the test can be stopped and restarted any number
>> of times, run for any number of hours continuously, and memory use
>> on the system is unchanged, and no pageins occur.
> Interesting. It might be worth trying what happens if you do
> madvise(..., MADV_DONTNEED) on the data file instead of dropping caches
> with /proc/sys/vm/drop_caches. That way we can establish whether the extra
> cached data is in the data file (things will look the same way as with
> drop_caches) or somewhere else (there will be still unmapped page cache).

I screwed up. My madvise(RANDOM) call used the wrong address/len so it didn't
cover the whole region. After fixing this, the test now runs as expected - the
slapd process size grows to 30GB without any problem. Sorry for the noise.
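
For the record, the corrected call just covers the entire mapping, e.g.
("map" and "maplen" standing in for the exact address and length used with
mmap()):

#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

/* Advise random access over the whole mapped region, not a subrange. */
static void advise_random(void *map, size_t maplen)
{
	int rc = posix_madvise(map, maplen, POSIX_MADV_RANDOM);

	if (rc != 0)	/* posix_madvise returns an errno value, not -1 */
		fprintf(stderr, "posix_madvise: %s\n", strerror(rc));
}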

--
-- Howard Chu
CTO, Symas Corp. http://www.symas.com
Director, Highland Sun http://highlandsun.com/hyc/
Chief Architect, OpenLDAP http://www.openldap.org/project/

2013-03-11 15:03:28

by Phillip Susi

Subject: Re: mmap vs fs cache

On 3/11/2013 7:52 AM, Jan Kara wrote:
>> Yep, that's because it isn't implemented.
> Why do you think so? AFAICS it is implemented by setting VM_RAND_READ
> flag in the VMA and do_async_mmap_readahead() and do_sync_mmap_readahead()
> check for the flag and don't do anything if it is set...

Oh, don't know how I missed that... I was just looking for it the other
day and couldn't find any references to VM_RandomReadHint so I assumed
it hadn't been implemented.