2008-07-15 23:03:57

by Eric Rannaud

Subject: madvise(2) MADV_SEQUENTIAL behavior

mm/madvise.c and madvise(2) say:

* MADV_SEQUENTIAL - pages in the given range will probably be accessed
* once, so they can be aggressively read ahead, and
* can be freed soon after they are accessed.


But as the sample program at the end of this post shows, and as I
understand the code in mm/filemap.c, MADV_SEQUENTIAL will only increase
the amount of read ahead for the specified page range, but will not
influence the rate at which the pages just read will be freed from
memory.

When the sample program is run on a large file, say 4GB on a machine
with 3GB of RAM, its resident size grows enough to evict pretty much
everything else (on 2.6.25.9-40.fc8).

Right before the program below is done reading the 4GB file:

7f6c3e654000-7f6d3e654000 r--s 00000000 fd:02 98125 /tmp/bigfile
Size: 4194304 kB
Rss: 2472220 kB
Pss: 2472220 kB
Shared_Clean: 0 kB
Shared_Dirty: 0 kB
Private_Clean: 2472220 kB
Private_Dirty: 0 kB
Referenced: 718748 kB


I'm well aware that the kernel is free to ignore the advice given
through madvise(2) (posix_fadvise(2) seems to behave similarly, btw), so
I'm certainly not claiming this is a bug. However, I was wondering what
the rationale behind it was, and whether the manpages should be updated
to be more accurate.

There is a very straightforward workaround: MADV_DONTNEED on the range
just read, every so often, will be very effective at controlling the
resident size of the mapping. (mm/madvise.c:madvise_dontneed() calls
zap_page_range())

Thanks.



---
# dd if=/dev/zero of=/tmp/bigfile bs=1024 count=$((4*1024*1024))
# gcc test.c
# Run:
file=/tmp/bigfile; ./a.out $file & pid=$! ; while true; do cat /proc/$pid/smaps | grep -A 8 $file; sleep 1; done

# cat test.c

#include <sys/types.h>
#include <sys/stat.h>
#include <sys/mman.h>
#include <fcntl.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <errno.h>

int main(int argc, char **argv)
{
    if (argc != 2)
        return -EINVAL;

    char *fn = argv[1];
    int fd = open(fn, O_RDONLY);
    if (fd < 0)
        return -errno;

    struct stat st;
    int ret = fstat(fd, &st);
    if (ret)
        return -errno;

    unsigned char *map = mmap(0, st.st_size, PROT_READ, MAP_SHARED, fd, 0);
    if (map == MAP_FAILED)
        return -errno;

    ret = madvise(map, st.st_size, MADV_SEQUENTIAL);
    if (ret) {
        fprintf(stderr, "madvise failed\n");
        return -errno;
    }

    const int pagesize = sysconf(_SC_PAGESIZE);
    unsigned char dummy = 0;
    off_t i;

    for (i = 0; i < st.st_size; i += pagesize) {
        dummy += map[i];
    }

    munmap(map, st.st_size);
    close(fd);

    return dummy;
}



2008-07-16 12:14:36

by Peter Zijlstra

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

On Tue, 2008-07-15 at 23:03 +0000, Eric Rannaud wrote:
> mm/madvise.c and madvise(2) say:
>
> * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> * once, so they can be aggressively read ahead, and
> * can be freed soon after they are accessed.
>
>
> But as the sample program at the end of this post shows, and as I
> understand the code in mm/filemap.c, MADV_SEQUENTIAL will only increase
> the amount of read ahead for the specified page range, but will not
> influence the rate at which the pages just read will be freed from
> memory.

Correct, various attempts have been made to actually implement this, but
none made it through.

My last attempt was:
http://lkml.org/lkml/2007/7/21/219

Rik recently tried something else based on his split-lru series:
http://lkml.org/lkml/2008/7/15/465


2008-07-16 14:50:40

by Rik van Riel

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

On Wed, 16 Jul 2008 14:14:55 +0200
Peter Zijlstra wrote:

> On Tue, 2008-07-15 at 23:03 +0000, Eric Rannaud wrote:
> > mm/madvise.c and madvise(2) say:
> >
> > * MADV_SEQUENTIAL - pages in the given range will probably be accessed
> > * once, so they can be aggressively read ahead, and
> > * can be freed soon after they are accessed.
> >
> >
> > But as the sample program at the end of this post shows, and as I
> > understand the code in mm/filemap.c, MADV_SEQUENTIAL will only increase
> > the amount of read ahead for the specified page range, but will not
> > influence the rate at which the pages just read will be freed from
> > memory.
>
> Correct, various attempts have been made to actually implement this, but
> none made it through.
>
> My last attempt was:
> http://lkml.org/lkml/2007/7/21/219
>
> Rik recently tried something else based on his split-lru series:
> http://lkml.org/lkml/2008/7/15/465

My patch is not going to help with mmap, though.

I believe that for mmap MADV_SEQUENTIAL, we will have to do
an unmap-behind from the fault path. Not every time, but
maybe once per megabyte, unmapping the megabyte behind us.

That way the normal page cache policies (use once, etc) can
take care of page eviction, which should help if the file
is also in use by another process.

--
All Rights Reversed

2008-07-16 21:05:43

by Chris Snook

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

Rik van Riel wrote:
> On Wed, 16 Jul 2008 14:14:55 +0200
> Peter Zijlstra wrote:
>
>> On Tue, 2008-07-15 at 23:03 +0000, Eric Rannaud wrote:
>>> mm/madvise.c and madvise(2) say:
>>>
>>> * MADV_SEQUENTIAL - pages in the given range will probably be accessed
>>> * once, so they can be aggressively read ahead, and
>>> * can be freed soon after they are accessed.
>>>
>>>
>>> But as the sample program at the end of this post shows, and as I
>>> understand the code in mm/filemap.c, MADV_SEQUENTIAL will only increase
>>> the amount of read ahead for the specified page range, but will not
>>> influence the rate at which the pages just read will be freed from
>>> memory.
>> Correct, various attempts have been made to actually implement this, but
>> none made it through.
>>
>> My last attempt was:
>> http://lkml.org/lkml/2007/7/21/219
>>
>> Rik recently tried something else based on his split-lru series:
>> http://lkml.org/lkml/2008/7/15/465
>
> My patch is not going to help with mmap, though.
>
> I believe that for mmap MADV_SEQUENTIAL, we will have to do
> an unmap-behind from the fault path. Not every time, but
> maybe once per megabyte, unmapping the megabyte behind us.
>
> That way the normal page cache policies (use once, etc) can
> take care of page eviction, which should help if the file
> is also in use by another process.
>

Wouldn't it just be easier to not move pages to the active list when
they're referenced via an MADV_SEQUENTIAL mapping? If we keep them on
the inactive list, they'll be candidates for reclaiming, but they'll
still be in pagecache when another task scans through, as long as we're
not under memory pressure.

-- Chris

2008-07-17 00:02:14

by Eric Rannaud

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

On Wed, 2008-07-16 at 17:05 -0400, Chris Snook wrote:
> Rik van Riel wrote:
> > I believe that for mmap MADV_SEQUENTIAL, we will have to do
> > an unmap-behind from the fault path. Not every time, but
> > maybe once per megabyte, unmapping the megabyte behind us.
>
> Wouldn't it just be easier to not move pages to the active list when
> they're referenced via an MADV_SEQUENTIAL mapping? If we keep them on
> the inactive list, they'll be candidates for reclaiming, but they'll
> still be in pagecache when another task scans through, as long as we're
> not under memory pressure.

This approach, instead of invalidating the pages right away, would
provide a middle ground: a way to tell the kernel "these pages are not
too important".

Whereas if MADV_SEQUENTIAL just invalidates the pages once per megabyte
(say), then it's only doing what is already possible using MADV_DONTNEED
("drop these pages now"). It would automate the process, but it would not
provide a more subtle hint, which could be quite useful.

As I see it, there are two basic concepts here:
- no_reuse (like FADV_NOREUSE)
- more_ra (more readahead)
(DONTNEED being another different concept)

Then:
MADV_SEQUENTIAL = more_ra | no_reuse
FADV_SEQUENTIAL = more_ra | no_reuse
FADV_NOREUSE = no_reuse

Right now, only the 'more_ra' part is implemented. 'no_reuse' could be
implemented as Chris suggests.
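
To make the decomposition concrete, here is a small sketch. All of the
names (HINT_MORE_RA, HINT_NO_REUSE, enum advice, intended_hints(),
honoured_hints()) are hypothetical and exist only for this
illustration; the kernel defines no such flags:

```c
/* Hypothetical components of an access hint, as listed above. */
enum hint_component {
    HINT_MORE_RA  = 1 << 0,  /* enlarge the readahead window */
    HINT_NO_REUSE = 1 << 1   /* pages are unlikely to be touched again */
};

/* Hypothetical stand-ins for the three advice values discussed. */
enum advice { ADV_MADV_SEQUENTIAL, ADV_FADV_SEQUENTIAL, ADV_FADV_NOREUSE };

/* What each advice value is meant to convey. */
unsigned intended_hints(enum advice a)
{
    switch (a) {
    case ADV_MADV_SEQUENTIAL:
    case ADV_FADV_SEQUENTIAL:
        return HINT_MORE_RA | HINT_NO_REUSE;
    case ADV_FADV_NOREUSE:
        return HINT_NO_REUSE;
    }
    return 0;
}

/* What is acted upon today, per the observation above: only more_ra. */
unsigned honoured_hints(enum advice a)
{
    return intended_hints(a) & HINT_MORE_RA;
}
```

In this model FADV_NOREUSE currently maps to no action at all, and both
_SEQUENTIAL variants lose their no_reuse half.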

It looks like the disagreement a year ago around Peter's approach was
mostly around the question of whether using read ahead as a heuristic
for "drop behind" was safe for all workloads.

Would it be less controversial to remove the heuristic (ra->size ==
ra->ra_pages), and to do something only if the user asked for
_SEQUENTIAL or _NOREUSE?

It might encourage user space applications to start using
FADV_SEQUENTIAL or FADV_NOREUSE more often (as it would become
worthwhile to do so), and if they do (especially cron jobs), the problem
of the slow desktop in the morning would progressively solve itself.

Thanks.

2008-07-17 06:14:47

by Nick Piggin

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

On Thursday 17 July 2008 10:01, Eric Rannaud wrote:
> On Wed, 2008-07-16 at 17:05 -0400, Chris Snook wrote:
> > Rik van Riel wrote:
> > > I believe that for mmap MADV_SEQUENTIAL, we will have to do
> > > an unmap-behind from the fault path. Not every time, but
> > > maybe once per megabyte, unmapping the megabyte behind us.
> >
> > Wouldn't it just be easier to not move pages to the active list when
> > they're referenced via an MADV_SEQUENTIAL mapping? If we keep them on
> > the inactive list, they'll be candidates for reclaiming, but they'll
> > still be in pagecache when another task scans through, as long as we're
> > not under memory pressure.
>
> This approach, instead of invalidating the pages right away, would
> provide a middle ground: a way to tell the kernel "these pages are not
> too important".
>
> Whereas if MADV_SEQUENTIAL just invalidates the pages once per megabyte
> (say), then it's only doing what is already possible using MADV_DONTNEED
> ("drop these pages now"). It would automate the process, but it would not
> provide a more subtle hint, which could be quite useful.
>
> As I see it, there are two basic concepts here:
> - no_reuse (like FADV_NOREUSE)
> - more_ra (more readahead)
> (DONTNEED being another different concept)
>
> Then:
> MADV_SEQUENTIAL = more_ra | no_reuse
> FADV_SEQUENTIAL = more_ra | no_reuse
> FADV_NOREUSE = no_reuse
>
> Right now, only the 'more_ra' part is implemented. 'no_reuse' could be
> implemented as Chris suggests.
>
> It looks like the disagreement a year ago around Peter's approach was
> mostly around the question of whether using read ahead as a heuristic
> for "drop behind" was safe for all workloads.
>
> Would it be less controversial to remove the heuristic (ra->size ==
> ra->ra_pages), and to do something only if the user asked for
> _SEQUENTIAL or _NOREUSE?

It's far far easier to tell the kernel "I am no longer using these
pages" than to say "I will not use these pages sometime in the future
after I have used them". The former can be done synchronously and with
a much higher efficiency than it takes to scan through LRU lists to
figure this out.

We should be using the SEQUENTIAL to open up readahead windows, and ask
userspace applications to use DONTNEED to drop if it is important. IMO.


> It might encourage user space applications to start using
> FADV_SEQUENTIAL or FADV_NOREUSE more often (as it would become
> worthwhile to do so), and if they do (especially cron jobs), the problem
> of the slow desktop in the morning would progressively solve itself.

The slow desktop in the morning should not happen even without such a
call, because the kernel should not throw out frequently used data (even
if it is not quite so recent) in favour of streaming data.

OK, I figure it doesn't do such a good job now, which is sad, but making
all apps micromanage the pagecache to get reasonable performance on a
2GB+ desktop system is even more sad ;)

2008-07-17 14:20:44

by Rik van Riel

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

On Wed, 16 Jul 2008 17:05:14 -0400
Chris Snook wrote:

> > I believe that for mmap MADV_SEQUENTIAL, we will have to do
> > an unmap-behind from the fault path. Not every time, but
> > maybe once per megabyte, unmapping the megabyte behind us.
> >
> > That way the normal page cache policies (use once, etc) can
> > take care of page eviction, which should help if the file
> > is also in use by another process.
>
> Wouldn't it just be easier to not move pages to the active list when
> they're referenced via an MADV_SEQUENTIAL mapping?

You want to check the MADV_SEQUENTIAL hint at pageout time and
discard the referenced bit from the pte?

> If we keep them on the inactive list, they'll be candidates for
> reclaiming

Only if we ignore the referenced bit. Which I guess we can do.

--
All Rights Reversed

2008-07-17 14:22:42

by Rik van Riel

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

On Thu, 17 Jul 2008 16:14:29 +1000
Nick Piggin wrote:

> > It might encourage user space applications to start using
> > FADV_SEQUENTIAL or FADV_NOREUSE more often (as it would become
> > worthwhile to do so), and if they do (especially cron jobs), the problem
> > of the slow desktop in the morning would progressively solve itself.
>
> The slow desktop in the morning should not happen even without such a
> call, because the kernel should not throw out frequently used data (even
> if it is not quite so recent) in favour of streaming data.
>
> OK, I figure it doesn't do such a good job now, which is sad,

Do you have any tests in mind that we could use to decide
whether the patch I posted Tuesday would do a decent job
at protecting frequently used data from streaming data?

http://lkml.org/lkml/2008/7/15/465

--
All Rights Reversed

2008-07-17 18:05:28

by Chris Snook

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

Rik van Riel wrote:
> On Thu, 17 Jul 2008 16:14:29 +1000
> Nick Piggin wrote:
>
>>> It might encourage user space applications to start using
>>> FADV_SEQUENTIAL or FADV_NOREUSE more often (as it would become
>>> worthwhile to do so), and if they do (especially cron jobs), the problem
>>> of the slow desktop in the morning would progressively solve itself.
>> The slow desktop in the morning should not happen even without such a
>> call, because the kernel should not throw out frequently used data (even
>> if it is not quite so recent) in favour of streaming data.
>>
>> OK, I figure it doesn't do such a good job now, which is sad,
>
> Do you have any tests in mind that we could use to decide
> whether the patch I posted Tuesday would do a decent job
> at protecting frequently used data from streaming data?
>
> http://lkml.org/lkml/2008/7/15/465
>

1) start up a memory-hogging Java app
2) run a full-system backup

If it works well, the Java app shouldn't slow down much.

-- Chris

2008-07-17 18:09:18

by Peter Zijlstra

Subject: Re: madvise(2) MADV_SEQUENTIAL behavior

Sorry can't resist...

On Thu, 2008-07-17 at 14:04 -0400, Chris Snook wrote:

> 1) start up a memory-hogging Java app

Is there any other kind? :-)