LinuxLists.cc - Best way to pin a page in ext4?

2014-09-15 18:51:01

Subject: Best way to pin a page in ext4?

Hi,

In ext4, we currently use the page cache to store the allocation
bitmaps. The pages are associated with an internal, in-memory inode
which is located in EXT4_SB(sb)->s_buddy_cache. Since the pages can be
reconstructed at will, either by reading them from disk (in the case of
the actual allocation bitmap), or by calculating the buddy bitmap from
the allocation bitmap, normally we allow the VM to eject the pages as
necessary.

For a specialty use case, I've been requested to have an optional mode
where the on-disk bitmaps are pinned into memory; this is a situation
where the file system size is known in advance, and the user is willing
to trade off the locked-down memory for the latency gains required by
this use case.

It seems that the simplest way to do that is to use mlock_vma_page()
when the file system is first mounted, and then use munlock_vma_page()
when the file system is unmounted. However, these functions are in
mm/internal.h, so I figured I'd better ask permission before using
them. Does this sound like a sane way to do things?

The other approach would be to keep an elevated refcount on the pages in
question, but it seemed it would be more efficient to use the mlock
facility since that keeps the pages on an unevictable list.

Does using the mlock/munlock_vma_page() functions make sense? Any
pitfalls I should worry about? Note that these pages are never mapped
into userspace, so there is no associated vma; fortunately the functions
don't take a vma argument, their name notwithstanding.....

Thanks,

- Ted

--
To unsubscribe, send a message with 'unsubscribe linux-mm' in
the body to [email protected]. For more info on Linux MM,
see: http://www.linux-mm.org/ .
Don't email: <a href=mailto:"[email protected]"> [email protected] </a>

2014-09-15 20:57:23

by Andreas Dilger

Subject: Re: Best way to pin a page in ext4?

On Sep 15, 2014, at 12:51 PM, Theodore Ts'o <[email protected]> wrote:
> In ext4, we currently use the page cache to store the allocation
> bitmaps. The pages are associated with an internal, in-memory inode
> which is located in EXT4_SB(sb)->s_buddy_cache. Since the pages can be
> reconstructed at will, either by reading them from disk (in the case of
> the actual allocation bitmap), or by calculating the buddy bitmap from
> the allocation bitmap, normally we allow the VM to eject the pages as
> necessary.
>
> For a specialty use case, I've been requested to have an optional mode
> where the on-disk bitmaps are pinned into memory; this is a situation
> where the file system size is known in advance, and the user is willing
> to trade off the locked-down memory for the latency gains required by
> this use case.

As discussed in http://lists.openwall.net/linux-ext4/2013/03/25/15
the bitmap pages were being evicted under memory pressure even when
they are in active use. That turned out to be an MM problem and not an
ext4 problem in the end, and was fixed in commit c53954a092d in 3.11,
in case you are running an older kernel.

There was a discussion on whether we were doing all of the right calls
to mark_page_accessed() in the ext4 code to ensure that these bitmaps
were being kept at the hot end of the LRU.

> It seems that the simplest way to do that is to use mlock_vma_page()
> when the file system is first mounted, and then use munlock_vma_page()
> when the file system is unmounted. However, these functions are in
> mm/internal.h, so I figured I'd better ask permission before using
> them. Does this sound like a sane way to do things?
>
> The other approach would be to keep an elevated refcount on the pages in
> question, but it seemed it would be more efficient to use the mlock
> facility since that keeps the pages on an unevictable list.

It doesn't seem unreasonable to just grab an extra refcount on the pages
when they are first loaded. However, the memory usage may be fairly
high (32MB per 1TB of disk) so this definitely can't be generally used,
and it would be nice to make sure that ext4 is already doing the right
thing to keep these important pages in cache.

The other option is to improve the in-memory description of free blocks
and use an extent map or rbtree to handle this instead of bitmaps. That
may also speed up allocation in general, but is a lot more work...

> Does using the mlock/munlock_vma_page() functions make sense? Any
> pitfalls I should worry about? Note that these pages are never mapped
> into userspace, so there is no associated vma; fortunately the functions
> don't take a vma argument, their name notwithstanding.....
>
> Thanks,
>
> - Ted

Cheers, Andreas

2014-09-16 18:07:59

by Theodore Ts'o

Subject: Re: Best way to pin a page in ext4?

On Mon, Sep 15, 2014 at 02:57:23PM -0600, Andreas Dilger wrote:
>
> As discussed in http://lists.openwall.net/linux-ext4/2013/03/25/15
> the bitmap pages were being evicted under memory pressure even when
> they are active use. That turned out to be an MM problem and not an
> ext4 problem in the end, and was fixed in commit c53954a092d in 3.11,
> in case you are running an older kernel.

Yes, I remember. And that could potentially be a contributing factor,
since the user in question is using 3.2. However, the user in
question has a use case where bitmap pinning is probably going to be
needed given the likely allocation patterns of a DVR; if the pages
aren't pinned, it's likely that by the time the DVR needs to fallocate
space for a new show, the bitmap pages would have been aged out due to
not being frequently accessed enough, even if the usage tracking was
backported to a 3.2 kernel.

> > The other approach would be to keep an elevated refcount on the pages in
> > question, but it seemed it would be more efficient to use the mlock
> > facility since that keeps the pages on an unevictable list.
>
> It doesn't seem unreasonable to just grab an extra refcount on the pages
> when they are first loaded.

Well yes, but using mlock_vma_page() would be a bit more efficient,
and technically, more correct than simply elevating the refcount.

> However, the memory usage may be fairly
> high (32MB per 1TB of disk) so this definitely can't be generally used,
> and it would be nice to make sure that ext4 is already doing the right
> thing to keep these important pages in cache.

Well, as I mentioned above, the use case in question is a DVR, where
having the disk need to suddenly seek to a large number of block groups, and
thus pull in a largish number of allocation bitmaps, might be harmful
for a video replay that might be happening at the same time that the
DVR needs to fallocate space for a new TV show to be recorded.

And for a 2TB disk, the developer in question felt that he could
afford pinning 64MB. So no, it's not a general solution, but it's
probably good enough for now.
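
As a rough check on the figures above (32MB per 1TB, 64MB for 2TB),
here is a minimal sketch of the arithmetic. It assumes 4KiB blocks and
one single-block bitmap per block group, so that each group spans
8 * blocksize blocks. bitmap_bytes_for_fs() is a hypothetical helper
for illustration, not an ext4 function, and it counts only the on-disk
block bitmaps, not the buddy bitmaps calculated from them.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Hypothetical helper: bytes of block bitmap needed for a file system
 * of fs_bytes, assuming one bitmap block per block group and one bit
 * per block (so a group covers 8 * block_size blocks).
 */
static uint64_t bitmap_bytes_for_fs(uint64_t fs_bytes, uint64_t block_size)
{
	uint64_t blocks_per_group, group_bytes, groups;

	assert(block_size > 0);

	blocks_per_group = 8 * block_size;           /* bits in one bitmap block */
	group_bytes = blocks_per_group * block_size; /* 128MiB for 4KiB blocks */
	groups = (fs_bytes + group_bytes - 1) / group_bytes; /* round up */

	return groups * block_size;                  /* one bitmap block per group */
}
```

With 4KiB blocks, a 1TiB file system has 8192 block groups and a 2TiB
one has 16384, which is where the 32MB and 64MB numbers come from.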

Long run, I think we really need to consider trying to cache free
space information in some kind of in-memory rbtree, with a bail-out in
the worst case where the free space is horrendously fragmented in a
particular block group. But as a quick hack, using mlock_vma_page()
was the simplest short term solution.

The main question then for the mm developers is whether there would be
objections to making mlock/munlock_vma_page() EXPORT_SYMBOL_GPL and
moving the function declarations from mm/internal.h to
include/linux/mm.h?

Cheers,

- Ted

2014-09-16 18:34:37

by Christoph Lameter (Ampere)

Subject: Re: Best way to pin a page in ext4?

On Tue, 16 Sep 2014, Theodore Ts'o wrote:

> > It doesn't seem unreasonable to just grab an extra refcount on the pages
> > when they are first loaded.
>
> Well yes, but using mlock_vma_page() would be a bit more efficient,
> and technically, more correct than simply elevating the refcount.

mlocked pages can be affected by page migration. They are not
pinned since POSIX only says that the pages must stay in memory. So the OS
is free to move them around physical memory.

Pinned pages have an elevated refcount. Note also Peter Zijlstra's
recent work on pinned pages.

https://lkml.org/lkml/2014/5/26/345

2014-09-16 18:56:39

by Theodore Ts'o

Subject: Re: Best way to pin a page in ext4?

On Tue, Sep 16, 2014 at 01:34:37PM -0500, Christoph Lameter wrote:
> On Tue, 16 Sep 2014, Theodore Ts'o wrote:
>
> > > It doesn't seem unreasonable to just grab an extra refcount on the pages
> > > when they are first loaded.
> >
> > Well yes, but using mlock_vma_page() would be a bit more efficient,
> > and technically, more correct than simply elevating the refcount.
>
> mlocked pages can be affected by page migration. They are not
> pinned since POSIX only says that the pages must stay in memory. So the OS
> is free to move them around physical memory.

And indeed, that would be a better reason to use mlock_vma_page()
rather than elevating the refcount; we just need the page to stay in
memory. If the mm system needs to move the page around to coalesce
for hugepages, or some such, that's fine.

(And so the subject line in my original post is wrong; apologies, I'm
a fs developer, not a mm developer, and so I used the wrong
terminology.)

Cheers,

- Ted

2014-09-17 00:09:07

by Hugh Dickins

Subject: Re: Best way to pin a page in ext4?

On Tue, 16 Sep 2014, Theodore Ts'o wrote:
> On Mon, Sep 15, 2014 at 02:57:23PM -0600, Andreas Dilger wrote:
> >
> > As discussed in http://lists.openwall.net/linux-ext4/2013/03/25/15
> > the bitmap pages were being evicted under memory pressure even when
> > they are active use. That turned out to be an MM problem and not an
> > ext4 problem in the end, and was fixed in commit c53954a092d in 3.11,
> > in case you are running an older kernel.
>
> Yes, I remember. And that could potentially be a contributing factor,
> since the user in question is using 3.2. However, the user in
> question has a use case where bitmap pinning is probably going to be
> needed given the likely allocation patterns of a DVR; if the pages
> aren't pinned, it's likely that by the time the DVR needs to fallocate
> space for a new show, the bitmap pages would have been aged out due to
> not being frequently accessed enough, even if the usage tracking was
> backported to a 3.2 kernel.
>
> > > The other approach would be to keep an elevated refcount on the pages in
> > > question, but it seemed it would be more efficient to use the mlock
> > > facility since that keeps the pages on an unevictable list.
> >
> > It doesn't seem unreasonable to just grab an extra refcount on the pages
> > when they are first loaded.
>
> Well yes, but using mlock_vma_page() would be a bit more efficient,
> and technically, more correct than simply elevating the refcount.
>
> > However, the memory usage may be fairly
> > high (32MB per 1TB of disk) so this definitely can't be generally used,
> > and it would be nice to make sure that ext4 is already doing the right
> > thing to keep these important pages in cache.
>
> Well, as I mentioned above, the use case in question is a DVR, where
> having the disk need to suddenly seek to a large number of block groups, and
> thus pull in a largish number of allocation bitmaps, might be harmful
> for a video replay that might be happening at the same time that the
> DVR needs to fallocate space for a new TV show to be recorded.
>
> And for a 2TB disk, the developer in question felt that he could
> afford pinning 64MB. So no, it's not a general solution, but it's
> probably good enough for now.
>
> Long run, I think we really need to consider trying to cache free
> space information in some kind of in-memory rbtree, with a bail-out in
> the worst case where the free space is horrendously fragmented in a
> particular block group. But as a quick hack, using mlock_vma_page()
> was the simplest short term solution.
>
> The main question then for the mm developers is whether there would be
> objections to making mlock/munlock_vma_page() EXPORT_SYMBOL_GPL and
> moving the function declarations from mm/internal.h to
> include/linux/mm.h?

Yes, I'm afraid there is a pitfall, and there would be objections.

It's not accidental that the function is called mlock_vma_page(): it
and PageMlocked are about support for mlock'ed areas of user memory;
and if you look hard, you'll find that PageMlocked needs to be
"supported" by at least one VM_LOCKED vma.

You might (I'm not certain) be able to get away with extending the
use of mlock_vma_page() and munlock_vma_page() in this (admittedly
attractive) way, up until someone mmap's that range (and mlocks
then munlocks it? again, I'm not certain if that's necessary).
Then the PageMlocked flag is liable to be cleared, because the
page will not be found in any mlock'ed vma; and the page can
then be reclaimed behind your back (statistics gone wrong too?
again I'm not sure).

Now, I expect it's unlikely (impossible?) for anyone to mmap your
bitmap pages while they're being used as filesystem metadata (rather
than mere blockdev pages). But you can see why we would prefer not
to export those functions.

I suspect that to handle your special case, we would need to declare
another page flag: but it would need a lot more uses to justify that.
For now I agree with Andreas, just grab an extra refcount; but you're
right that leaving these pages on evictable LRUs is regrettable,
and can be inefficient under reclaim.
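
To illustrate the trade-off, here is a toy userspace model, not kernel
code: struct toy_page, toy_reclaim(), toy_pin() and toy_unpin() are
all hypothetical names made up for this sketch. It assumes a reclaim
pass that never visits pages on an unevictable list, but that does
have to visit (and then skip) evictable-LRU pages that are held by an
extra reference.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for struct page; NOT the kernel's definition. */
struct toy_page {
	int refcount;    /* 1 = held by the page cache only */
	int unevictable; /* 1 = kept off the evictable LRU */
	int resident;    /* 1 = still in memory */
};

struct toy_reclaim_stats {
	size_t scanned;  /* pages the pass had to look at */
	size_t freed;    /* pages actually evicted */
};

/* Take / drop the extra reference that keeps a page from being freed. */
static void toy_pin(struct toy_page *p)
{
	assert(p->resident);
	p->refcount++;
}

static void toy_unpin(struct toy_page *p)
{
	assert(p->refcount > 1);
	p->refcount--;
}

/* One simplified reclaim pass over an array of pages. */
static struct toy_reclaim_stats toy_reclaim(struct toy_page *pages, size_t n)
{
	struct toy_reclaim_stats st = { 0, 0 };
	size_t i;

	for (i = 0; i < n; i++) {
		struct toy_page *p = &pages[i];

		if (!p->resident || p->unevictable)
			continue;   /* not on the evictable LRU at all */
		st.scanned++;
		if (p->refcount > 1)
			continue;   /* pinned: visited, but cannot be freed */
		p->resident = 0;
		st.freed++;
	}
	return st;
}
```

In this model a refcount-pinned page survives the pass but still costs
a visit every time, whereas an unevictable page is never looked at.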

On the page migration issue: it's not quite as straightforward as
Christoph suggests. He and I agree completely that mlocked pages
should be migratable, but some real-time-minded people disagree:
so normal compaction is still forbidden to migrate mlocked pages in
the vanilla kernel (though we in Google patch that prohibition out).
So pinning by refcount is no worse for compaction than mlocking,
in the vanilla kernel.

Hugh

2014-09-17 01:25:42

by Theodore Ts'o

Subject: Re: Best way to pin a page in ext4?

On Tue, Sep 16, 2014 at 05:07:18PM -0700, Hugh Dickins wrote:
> You might (I'm not certain) be able to get away with extending the
> use of mlock_vma_page() and munlock_vma_page() in this (admittedly
> attractive) way, up until someone mmap's that range (and mlocks
> then munlocks it? again, I'm not certain if that's necessary).
> Then the PageMlocked flag is liable to be cleared, because the
> page will not be found in any mlock'ed vma; and the page can
> then be reclaimed behind your back (statistics gone wrong too?
> again I'm not sure).
>
> Now, I expect it's unlikely (impossible?) for anyone to mmap your
> bitmap pages while they're being used as filesystem metadata (rather
> than mere blockdev pages). But you can see why we would prefer not
> to export those functions.

Yes, it's impossible for anyone to mmap the pages from
EXT4_SB(sb)->s_buddy_cache inode, because it's not exposed in any way
to userspace. But I can see why you wouldn't want it to be used
almost anywhere else.

> For now I agree with Andreas, just grab an extra refcount; but you're
> right that leaving these pages on evictable LRUs is regrettable,
> and can be inefficient under reclaim.

OK, fair enough, that seems simpler all around.

Cheers,

- Ted

2014-09-17 03:31:24

by Christoph Lameter (Ampere)

Subject: Re: Best way to pin a page in ext4?

On Tue, 16 Sep 2014, Hugh Dickins wrote:

> On the page migration issue: it's not quite as straightforward as
> Christoph suggests. He and I agree completely that mlocked pages
> should be migratable, but some real-time-minded people disagree:
> so normal compaction is still forbidden to migrate mlocked pages in
> the vanilla kernel (though we in Google patch that prohibition out).
> So pinning by refcount is no worse for compaction than mlocking,
> in the vanilla kernel.

Note though that compaction is not the only mechanism that uses page
migration.

2014-09-17 13:56:14

by Peter Zijlstra

Subject: Re: Best way to pin a page in ext4?

On Tue, Sep 16, 2014 at 05:07:18PM -0700, Hugh Dickins wrote:
> On the page migration issue: it's not quite as straightforward as
> Christoph suggests. He and I agree completely that mlocked pages
> should be migratable, but some real-time-minded people disagree:
> so normal compaction is still forbidden to migrate mlocked pages in
> the vanilla kernel (though we in Google patch that prohibition out).
> So pinning by refcount is no worse for compaction than mlocking,
> in the vanilla kernel.

These realtime people are fully aware of this -- they should be at
least, I've been telling them for years.

Also, they would be very happy with means to actually pin pages -- as
per the patches Christoph referred to. The advantage of also having
mpin() and co is that we can migrate the memory into non-movable blocks
before returning etc.

In any case, I think we can (and should) change the behaviour of mlock
to be migratable (possibly with an easy way to revert in -rt for
migratory purposes until we get mpin sorted).

2014-09-17 13:57:19

by Peter Zijlstra

Subject: Re: Best way to pin a page in ext4?

On Tue, Sep 16, 2014 at 10:31:24PM -0500, Christoph Lameter wrote:
> On Tue, 16 Sep 2014, Hugh Dickins wrote:
>
> > On the page migration issue: it's not quite as straightforward as
> > Christoph suggests. He and I agree completely that mlocked pages
> > should be migratable, but some real-time-minded people disagree:
> > so normal compaction is still forbidden to migrate mlocked pages in
> > the vanilla kernel (though we in Google patch that prohibition out).
> > So pinning by refcount is no worse for compaction than mlocking,
> > in the vanilla kernel.
>
> Note though that compaction is not the only mechanism that uses page
> migration.

Agreed, and not all migration paths check for mlocked iirc. ISTR it is
very much possible for mlocked pages to get migrated in mainline.

2014-09-17 20:39:24

by Hugh Dickins

Subject: Re: Best way to pin a page in ext4?

On Wed, 17 Sep 2014, Peter Zijlstra wrote:
> On Tue, Sep 16, 2014 at 10:31:24PM -0500, Christoph Lameter wrote:
> > On Tue, 16 Sep 2014, Hugh Dickins wrote:
> >
> > > On the page migration issue: it's not quite as straightforward as
> > > Christoph suggests. He and I agree completely that mlocked pages
> > > should be migratable, but some real-time-minded people disagree:
> > > so normal compaction is still forbidden to migrate mlocked pages in
> > > the vanilla kernel (though we in Google patch that prohibition out).
> > > So pinning by refcount is no worse for compaction than mlocking,
> > > in the vanilla kernel.
> >
> > Note though that compaction is not the only mechanism that uses page
> > migration.

True: offhand, I think memory hotremove, and CMA, and explicit mempolicy
changes, are all (for good reason) allowed to migrate mlocked pages; but
the case which most interests many is migration for compaction.

>
> Agreed, and not all migration paths check for mlocked iirc. ISTR it is
> very much possible for mlocked pages to get migrated in mainline.

I think all the checks are for unevictable; and certainly we permit
races whereby an mlocked page may miss the unevictable LRU, until
subsequent reclaim corrects the omission. But I think that's the
extent to which mlocked pages might be migrated for compaction at
present.

Hugh