This patch set implements the POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
advice for shmem files, which can be helpful for drivers that want to
manage the pages of shmem files on their own, e.g. files created through
shmem_file_setup[_with_mnt]().
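For illustration, a minimal userspace sketch (not part of the series) of
how the new advice could be exercised on a shmem-backed fd created with
memfd_create(); the name, sizes and the split between the two calls are
arbitrary, and error checking is omitted:

#define _GNU_SOURCE
#include <fcntl.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 64 << 20;
	int fd = memfd_create("shmem-advice-demo", 0);	/* shmem-backed file */

	ftruncate(fd, len);

	/* Populate the file so there is something to write out. */
	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
	memset(p, 0xaa, len);
	munmap(p, len);

	/*
	 * Ask shmem to write the whole file out to swap and reclaim it
	 * (a no-op for shmem without this series)...
	 */
	posix_fadvise(fd, 0, len, POSIX_FADV_DONTNEED);

	/* ...then bring the first 16M back in ahead of the next use. */
	posix_fadvise(fd, 0, 16 << 20, POSIX_FADV_WILLNEED);

	close(fd);
	return 0;
}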
Changes in V7:
-- Use folio based interface, shmem_read_folio(), for FADV_WILLNEED.
-- Don't swap the SHM_LOCK'ed pages.
Changes in V6:
-- Replaced pages with folios in the shmem changes.
-- https://lore.kernel.org/all/[email protected]/
Changes in V5:
-- Moved the 'endbyte' calculations to a header function for use by shmem_fadvise().
-- Addressed comments from Suren.
-- No changes in resend. Retested on the latest tip.
-- https://lore.kernel.org/all/[email protected]/
Changes in V4:
-- Changed the code to use reclaim_pages() to write out the shmem pages to swap and then reclaim them.
-- Addressed comments from Mark Hemment and Matthew.
-- fadvise() on shmem file may even unmap a page.
-- https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Changes in V3:
-- Considered THP pages while doing FADV_[DONT|WILL]NEED, as identified by Matthew.
-- xarray used properly, as identified by Matthew.
-- Excluded mapped pages as they require unmapping and the fadvise man page doesn't talk about them.
-- RESEND: Fixed the compilation issue when CONFIG_TMPFS is not defined.
-- https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Changes in V2:
-- Rearranged the code to not sleep with rcu_lock held while using the xas_() functionality.
-- Addressed the comments from Suren.
-- https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Changes in V1:
-- Created the interface for fadvise(2) to work on shmem files.
-- https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
Charan Teja Kalla (2):
mm: fadvise: move 'endbyte' calculations to helper function
mm: shmem: implement POSIX_FADV_[WILL|DONT]NEED for shmem
mm/fadvise.c | 11 +-----
mm/internal.h | 21 +++++++++++
mm/shmem.c | 116 ++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
3 files changed, 138 insertions(+), 10 deletions(-)
--
2.7.4
On Tue, 14 Feb 2023 18:21:48 +0530 Charan Teja Kalla <[email protected]> wrote:
> This patch set implements the POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
> advice for shmem files, which can be helpful for drivers that want to
> manage the pages of shmem files on their own, e.g. files created through
> shmem_file_setup[_with_mnt]().
Could we please have some additional review of this series?
Thanks.
On Tue, Mar 28, 2023 at 3:50 PM Andrew Morton <[email protected]> wrote:
>
> On Tue, 14 Feb 2023 18:21:48 +0530 Charan Teja Kalla <[email protected]> wrote:
>
> > This patch set implements the POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
> > advice for shmem files, which can be helpful for drivers that want to
> > manage the pages of shmem files on their own, e.g. files created through
> > shmem_file_setup[_with_mnt]().
>
> Could we please have some additional review of this series?
The series LGTM but unfortunately I'm not a shmem expert. I asked
Minchan to take a look as well.
Thanks!
>
> Thanks.
On Tue, Feb 14, 2023 at 4:53 AM Charan Teja Kalla
<[email protected]> wrote:
>
> This patch set implements the POSIX_FADV_WILLNEED and POSIX_FADV_DONTNEED
> advice for shmem files, which can be helpful for drivers that want to
> manage the pages of shmem files on their own, e.g. files created through
> shmem_file_setup[_with_mnt]().
I didn't see this patch before, so I looked a bit at the history. At
some point, in v3, dealing with mapped pages for DONTNEED was left
out; they are now skipped. Unfortunately, that makes this patch no
longer usable for a case that we have: restoring the (approximate)
swap state of a tmpfs file. This involves walking a potentially large
number of regions, and explicitly pushing them out to swap. This can
be used to e.g. restore the state of VM memory that is backed by a
tmpfs file, avoiding memory usage by cold VM pages after resume.
If DONTNEED also reclaims mapped pages (e.g. they get pushed out to
swap, if any), implementing the above use case efficiently is simple:
use io_uring with a vector that contains each region and the fadvise
method.
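As a rough sketch of that, assuming liburing (struct region, the
fadvise_regions() helper, the ring size and the error handling are all
made up for illustration):

#include <fcntl.h>
#include <liburing.h>

struct region { off_t off; off_t len; };

/* Queue one POSIX_FADV_DONTNEED per region, then reap the completions. */
static int fadvise_regions(int fd, const struct region *regs, unsigned nr)
{
	struct io_uring ring;
	struct io_uring_cqe *cqe;
	unsigned i;

	if (io_uring_queue_init(256, &ring, 0) < 0)
		return -1;

	for (i = 0; i < nr; i++) {
		struct io_uring_sqe *sqe = io_uring_get_sqe(&ring);

		if (!sqe) {			/* SQ ring full: flush it */
			io_uring_submit(&ring);
			sqe = io_uring_get_sqe(&ring);
		}
		io_uring_prep_fadvise(sqe, fd, regs[i].off, regs[i].len,
				      POSIX_FADV_DONTNEED);
	}
	io_uring_submit(&ring);

	for (i = 0; i < nr; i++) {		/* wait for every completion */
		if (io_uring_wait_cqe(&ring, &cqe))
			break;
		io_uring_cqe_seen(&ring, cqe);
	}

	io_uring_queue_exit(&ring);
	return 0;
}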
Without DONTNEED reclaiming mapped pages, you'd have to do mmap +
madvise(MADV_PAGEOUT) for each region that you want swapped out, which
is rather inefficient.
I understand that the semantics for POSIX_FADV_DONTNEED on shmem/tmpfs
files are open to interpretation, as it is a special case. And you can
certainly make the argument that relying on behavior caused by what
can be considered an implementation detail is bad.
So, is there any way we can make this use case work efficiently using
this patch?
You state in the commit message:
> So, FADV_DONTNEED also covers the semantics of MADV_PAGEOUT for file pages
> and there is no purpose of PAGEOUT for file pages.
But that doesn't seem correct: for shmem file pages, there actually
can be a purpose, and the FADV_DONTNEED as implemented for shmem in
this patch set does not cover the semantics.
You can say that it doesn't need to cover the pageout case of mapped
shmem pages, and that's fair. But I don't think you can claim that it
covers the case as currently implemented.
I suppose there are three options here:
1) Do nothing; this use case will just have to spend more time doing
mmap+madvise
2) Don't skip mapped pages for POSIX_FADV_DONTNEED in shmem_fadvise
3) Implement something like POSIX_FADV_PAGEOUT_NP, which would include
mapped pages.
What do people think?
- Frank
On Thu, Apr 13, 2023 at 12:45 PM Frank van der Linden <[email protected]> wrote:
Hmm, actually, looking at it a bit more, there are several issues
here. One is that with fadvise, you can't be sure if you are the only
one dealing with the page in a mapped way (with madvise, if mapcount ==
1, that means it's just you, but you don't know that for fadvise, so
that makes correctly dealing with mapped pages harder).
Also, the madvise loop issue can probably also be dealt with via io_uring.
So, I think we can deal with the use case I mentioned in a different
way, that is mostly unrelated to this patchset, so basically:
disregard my previous email.
- Frank
Thanks Frank!!
On 4/14/2023 11:14 PM, Frank van der Linden wrote:
>> I didn't see this patch before, so I looked a bit at the history. At
>> some point, in v3, dealing with mapped pages for DONTNEED was left
>> out; they are now skipped. Unfortunately, that makes this patch no
>> longer usable for a case that we have: restoring the (approximate)
>> swap state of a tmpfs file. This involves walking a potentially large
>> number of regions, and explicitly pushing them out to swap. This can
>> be used to e.g. restore the state of VM memory that is backed by a
>> tmpfs file, avoiding memory usage by cold VM pages after resume.
>>
This is an interesting use case and I feel it really supports this
patchset. IIUC, supporting the reclaim of mapped file pages through
fadvise() helps this use case: you can avoid traversing the large
number of VMA regions and instead directly issue fadvise() on the
shmem file's fd, which takes care of the rest. Am I correct?
> Hmm, actually, looking at it a bit more, there are several issues
> here. One is that with fadvise, you can't be sure if you are the only
> one dealing with the page in a mapped way (with madvise, if mapcount ==
> 1, that means it's just you, but you don't know that for fadvise, so
> that makes correctly dealing with mapped pages harder).
>
Sorry, why not for fadvise()? I can still attempt it only if the page is
mapped and its mapcount == 1, but then we already have madvise() for
such pages, so why not simply use that?
> Also, the madvise loop issue can probably also be dealt with via io_uring.
>
> So, I think we can deal with the use case I mentioned in a different
> way, that is mostly unrelated to this patchset, so basically:
> disregard my previous email.
Sure! If this patch looks good to you, Reviewed-by/Acked-by tags would help me.
Thanks,
Charan
On Fri, Apr 14, 2023 at 12:10 PM Charan Teja Kalla
<[email protected]> wrote:
>
> Thanks Frank!!
>
> This is an interesting use case and I feel it really supports this
> patchset. IIUC, supporting the reclaim of mapped file pages through
> fadvise() helps this use case: you can avoid traversing the large
> number of VMA regions and instead directly issue fadvise() on the
> shmem file's fd, which takes care of the rest. Am I correct?
Right, that's correct. The only snag here is that fadvise, with your
patch set, will skip mapped pages, which might be an issue for this
case.
>
> > Hmm, actually, looking at it a bit more, there are several issues
> > here. One is that with fadvise, you can't be sure if you are the only
> > one dealing with the page in a mapped way (with madvise, if mapcount ==
> > 1, that means it's just you, but you don't know that for fadvise, so
> > that makes correctly dealing with mapped pages harder).
> >
> Sorry, why not for fadvise()? I can still attempt it only if the page is
> mapped and its mapcount == 1, but then we already have madvise() for
> such pages, so why not simply use that?
Yes, you could use madvise (as I was thinking). One issue with that,
though, is that madvise(MADV_PAGEOUT) is based on a page table walk of
present PTEs. So you actually need to map and touch the page before
madvise will work, which is suboptimal. A direct fadvise solution
would be nicer, since that does a file mapping walk. However, that can
be addressed later; my comments weren't intended to raise an objection
- merely that there is a chance here to address this use case. But that
shouldn't block anything. It's something to keep in mind, though.
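For reference, the per-region dance that implies on an unpatched kernel
looks roughly like this (a sketch only: pageout_region() is a made-up
helper, error handling is minimal, and MADV_PAGEOUT needs Linux 5.4+):

#define _GNU_SOURCE
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/*
 * Push one file region out to swap the long way around: the pages must
 * be mapped and present in our page table before MADV_PAGEOUT will
 * consider them.
 */
static int pageout_region(int fd, off_t off, size_t len)
{
	size_t i, page = sysconf(_SC_PAGESIZE);
	volatile const unsigned char *p;
	void *map;
	int ret;

	map = mmap(NULL, len, PROT_READ, MAP_SHARED, fd, off);
	if (map == MAP_FAILED)
		return -1;

	p = map;
	for (i = 0; i < len; i += page)		/* touch to populate the PTEs */
		(void)p[i];

	ret = madvise(map, len, MADV_PAGEOUT);	/* now ask for reclaim */
	munmap(map, len);
	return ret;
}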
I'll do some experiments to see what the best solution is here. But,
any follow-ups would be on top of your patch, so I'm not raising
objections, merely hoping that this, and possible extra work, could
solve this case.
- Frank
On 4/15/2023 3:32 AM, Frank van der Linden wrote:
> However, that can
> be addressed later; my comments weren't intended to raise an objection
> - merely that there is a chance here to address this use case. But that
> shouldn't block anything. It's something to keep in mind, though.
>
Noted. Thanks!!
> I'll do some experiments to see what the best solution is here
Sure. We can work on adding support for mapped pages too, which will go
on top of these patches.
Thanks,
Charan
Below is a quick patch to allow FADV_DONTNEED for shmem to reclaim
mapped pages too. This would fit our use case, and matches MADV_PAGEOUT
more closely.
The patch series as posted skips mapped pages even if you remove
the folio_mapped() check, because page_referenced() in
shrink_page_list() will find page tables with the page mapped,
and ignore_references is not set when called from reclaim_pages().
You can make this work in a similar fashion to MADV_PAGEOUT by
first unmapping a page, but only if the mapping belongs to
the caller. You just have to change the check for "is there
only one mapping and am I the owner". To do that, change a few
lines in try_to_unmap to allow for checking the mm the mapping
belongs to, and pass in current->mm (NULL still unmaps all mappings).
I lightly tested this in a somewhat older codebase, so the diff
below isn't fully tested. But if there are no objections to
this approach, we could add it on top of your patchset after
better testing.
- Frank
diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b87d01660412..4403cc2ccc4c 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -368,6 +368,8 @@ int folio_referenced(struct folio *, int is_locked,
void try_to_migrate(struct folio *folio, enum ttu_flags flags);
void try_to_unmap(struct folio *, enum ttu_flags flags);
+void try_to_unmap_mm(struct mm_struct *mm, struct folio *folio,
+ enum ttu_flags flags);
int make_device_exclusive_range(struct mm_struct *mm, unsigned long start,
unsigned long end, struct page **pages,
diff --git a/mm/rmap.c b/mm/rmap.c
index 8632e02661ac..4d30e8f5afe2 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -1443,6 +1443,11 @@ void page_remove_rmap(struct page *page, struct vm_area_struct *vma,
munlock_vma_folio(folio, vma, compound);
}
+struct unmap_arg {
+ enum ttu_flags flags;
+ struct mm_struct *mm;
+};
+
/*
* @arg: enum ttu_flags will be passed to this argument
*/
@@ -1455,7 +1460,11 @@ static bool try_to_unmap_one(struct folio *folio, struct vm_area_struct *vma,
struct page *subpage;
bool anon_exclusive, ret = true;
struct mmu_notifier_range range;
- enum ttu_flags flags = (enum ttu_flags)(long)arg;
+ struct unmap_arg *uap = (struct unmap_arg *)arg;
+ enum ttu_flags flags = uap->flags;
+
+ if (uap->mm && uap->mm != mm)
+ return true;
/*
* When racing against e.g. zap_pte_range() on another cpu,
@@ -1776,6 +1785,7 @@ static int folio_not_mapped(struct folio *folio)
/**
* try_to_unmap - Try to remove all page table mappings to a folio.
+ * @mm: mm to unmap from (NULL to unmap from all)
* @folio: The folio to unmap.
* @flags: action and flags
*
@@ -1785,11 +1795,16 @@ static int folio_not_mapped(struct folio *folio)
*
* Context: Caller must hold the folio lock.
*/
-void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+void try_to_unmap_mm(struct mm_struct *mm, struct folio *folio,
+ enum ttu_flags flags)
{
+ struct unmap_arg ua = {
+ .flags = flags,
+ .mm = mm,
+ };
struct rmap_walk_control rwc = {
.rmap_one = try_to_unmap_one,
- .arg = (void *)flags,
+ .arg = (void *)&ua,
.done = folio_not_mapped,
.anon_lock = folio_lock_anon_vma_read,
};
@@ -1800,6 +1815,11 @@ void try_to_unmap(struct folio *folio, enum ttu_flags flags)
rmap_walk(folio, &rwc);
}
+void try_to_unmap(struct folio *folio, enum ttu_flags flags)
+{
+ try_to_unmap_mm(NULL, folio, flags);
+}
+
/*
* @arg: enum ttu_flags will be passed to this argument.
*
diff --git a/mm/shmem.c b/mm/shmem.c
index 1af85259b6fc..b24af2fb3378 100644
--- a/mm/shmem.c
+++ b/mm/shmem.c
@@ -2362,8 +2362,24 @@ static void shmem_isolate_pages_range(struct address_space *mapping, loff_t star
if (!folio_try_get(folio))
continue;
- if (folio_test_unevictable(folio) || folio_mapped(folio) ||
- folio_isolate_lru(folio)) {
+
+ if (folio_test_unevictable(folio)) {
+ folio_put(folio);
+ continue;
+ }
+
+ /*
+ * If the folio is mapped once, try to unmap it from the
+ * caller's page table. If it's still mapped afterwards,
+ * it belongs to someone else, and we're not going to
+ * change someone else's mapping.
+ */
+ if (folio_mapcount(folio) == 1 && folio_trylock(folio)) {
+ try_to_unmap_mm(current->mm, folio, TTU_BATCH_FLUSH);
+ folio_unlock(folio);
+ }
+
+ if (folio_mapped(folio) || folio_isolate_lru(folio)) {
folio_put(folio);
continue;
}
@@ -2383,6 +2399,8 @@ static void shmem_isolate_pages_range(struct address_space *mapping, loff_t star
}
}
rcu_read_unlock();
+
+ try_to_unmap_flush();
}
static int shmem_fadvise_dontneed(struct address_space *mapping, loff_t start,
On Tue, Apr 18, 2023 at 05:29:42PM +0000, Frank van der Linden wrote:
> @@ -2362,8 +2362,24 @@ static void shmem_isolate_pages_range(struct address_space *mapping, loff_t star
>
> if (!folio_try_get(folio))
> continue;
> - if (folio_test_unevictable(folio) || folio_mapped(folio) ||
> - folio_isolate_lru(folio)) {
> +
> + if (folio_test_unevictable(folio)) {
> + folio_put(folio);
> + continue;
> + }
> +
> + /*
> + * If the folio is mapped once, try to unmap it from the
> + * caller's page table. If it's still mapped afterwards,
> + * it belongs to someone else, and we're not going to
> + * change someone else's mapping.
> + */
> + if (folio_mapcount(folio) == 1 && folio_trylock(folio)) {
> + try_to_unmap_mm(current->mm, folio, TTU_BATCH_FLUSH);
> + folio_unlock(folio);
> + }
Can the rmap walk be done from an RCU read critical section, which does
not allow explicit blocking?
Thanks,
Pavan
Hi Frank,
Let's start a separate thread to add the support for mapped pages. I
think one question that can come up during review is: "what is the
overhead an application has in collecting the memory regions and issuing
MADV_PAGEOUT, that motivated adding this support?". Let me try to get
details on this from my side too.
BTW, thanks for this POC code!!
Thanks,
Charan
On Tue, Apr 18, 2023 at 9:19 PM Pavan Kondeti <[email protected]> wrote:
>
> Can the rmap walk be done from an RCU read critical section, which does
> not allow explicit blocking?
True, yes, rmap_walk may block, so the try_to_unmap calls should be
outside the loop. The easiest thing to do there is to add all mapped
pages to a separate list, walk that list outside of the rcu lock for
i_mapping, and add all pages that could be unmapped to the return
list.
- Frank
Sure, we can add a separate thread. I can send a more formal version
of my quick diff (one that has actually been tested more thoroughly)
to kick it off.
Basically, using madvise() would work, but then I realized you'd have
to first explicitly map all regions and touch them to get them in your
page table, after which you then call MADV_PAGEOUT to get rid of them
again. Which is just .. silly :)
- Frank