From: wangyong <[email protected]>
After enabling transparent hugepage support for tmpfs with the
following command:
echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
the docker program fails with EBUSY when it tries to add F_SEAL_WRITE:
fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
It turns out that in the memfd_wait_for_pins() function, the hugepage's
page_count is 512 while its page_mapcount is 0, so it fails the check:
page_count(page) - page_mapcount(page) != 1.
But the page is not actually busy at this time; therefore the compound
page order of the hugepage should be taken into account in the
calculation.
Reported-by: Zeal Robot <[email protected]>
Signed-off-by: wangyong <[email protected]>
---
mm/memfd.c | 16 +++++++++++++---
1 file changed, 13 insertions(+), 3 deletions(-)
diff --git a/mm/memfd.c b/mm/memfd.c
index 9f80f162791a..26d1d390a22a 100644
--- a/mm/memfd.c
+++ b/mm/memfd.c
@@ -31,6 +31,7 @@
static void memfd_tag_pins(struct xa_state *xas)
{
struct page *page;
+ int count = 0;
unsigned int tagged = 0;
lru_add_drain();
@@ -39,8 +40,12 @@ static void memfd_tag_pins(struct xa_state *xas)
xas_for_each(xas, page, ULONG_MAX) {
if (xa_is_value(page))
continue;
+
page = find_subpage(page, xas->xa_index);
- if (page_count(page) - page_mapcount(page) > 1)
+ count = page_count(page);
+ if (PageTransCompound(page))
+ count -= (1 << compound_order(compound_head(page))) - 1;
+ if (count - page_mapcount(page) > 1)
xas_set_mark(xas, MEMFD_TAG_PINNED);
if (++tagged % XA_CHECK_SCHED)
@@ -67,11 +72,12 @@ static int memfd_wait_for_pins(struct address_space *mapping)
{
XA_STATE(xas, &mapping->i_pages, 0);
struct page *page;
- int error, scan;
+ int error, scan, count;
memfd_tag_pins(&xas);
error = 0;
+ count = 0;
for (scan = 0; scan <= LAST_SCAN; scan++) {
unsigned int tagged = 0;
@@ -89,8 +95,12 @@ static int memfd_wait_for_pins(struct address_space *mapping)
bool clear = true;
if (xa_is_value(page))
continue;
+
page = find_subpage(page, xas.xa_index);
- if (page_count(page) - page_mapcount(page) != 1) {
+ count = page_count(page);
+ if (PageTransCompound(page))
+ count -= (1 << compound_order(compound_head(page))) - 1;
+ if (count - page_mapcount(page) != 1) {
/*
* On the last scan, we clean up all those tags
* we inserted; but make a note that we still
--
2.15.2
On Tue, 15 Feb 2022 07:37:43 +0000 [email protected] wrote:
> From: wangyong <[email protected]>
>
> After enabling tmpfs filesystem to support transparent hugepage with the
> following command:
> echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> The docker program adds F_SEAL_WRITE through the following command will
> prompt EBUSY.
> fcntl(5, F_ADD_SEALS, F_SEAL_WRITE)=-1.
>
> It is found that in memfd_wait_for_pins function, the page_count of
> hugepage is 512 and page_mapcount is 0, which does not meet the
> conditions:
> page_count(page) - page_mapcount(page) != 1.
> But the page is not busy at this time, therefore, the page_order of
> hugepage should be taken into account in the calculation.
What are the real-world runtime effects of this?
Do we think that this fix (or one similar to it) should be backported
into -stable kernels?
If "yes" then Mike's 5d752600a8c373 ("mm: restructure memfd code") will
get in the way because it moved lots of code around.
But then, that's four years old and perhaps that's far enough back in
time.
On Tue, Feb 15, 2022 at 02:12:36PM -0800, Andrew Morton wrote:
> On Tue, 15 Feb 2022 07:37:43 +0000 [email protected] wrote:
>
> > From: wangyong <[email protected]>
> >
> > After enabling tmpfs filesystem to support transparent hugepage with the
> > following command:
> > echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> > The docker program adds F_SEAL_WRITE through the following command will
> > prompt EBUSY.
> > fcntl(5, F_ADD_SEALS, F_SEAL_WRITE)=-1.
> >
> > It is found that in memfd_wait_for_pins function, the page_count of
> > hugepage is 512 and page_mapcount is 0, which does not meet the
> > conditions:
> > page_count(page) - page_mapcount(page) != 1.
> > But the page is not busy at this time, therefore, the page_order of
> > hugepage should be taken into account in the calculation.
>
> What are the real-world runtime effects of this?
>
The problem I encountered is that the "docker-runc run busybox" command
fails, and then the container cannot be started. strace shows the
following errors:
[pid 1412] fcntl(5, F_ADD_SEALS,F_SEAL_SEAL|F_SEAL_SHRINK|F_SEAL_GROW|F_SEAL_WRITE) = -1 EBUSY (Device or resource busy)
[pid 1412] close(5) = 0
[pid 1412] write(2, "nsenter: could not ensure we are"..., 74) = 74
...
[pid 1491] write(3, "\33[31mERRO\33[0m[0005] container_li"..., 166) = 166
[pid 1491] write(2, "container_linux.go:299: starting"..., 144container_linux.go:299: starting container process caused
"process_linux.go:245: running exec setns process for init caused \"exit statu" ) = 144
I'm not sure how this will affect other situations.
> Do we think that this fix (or one similar to it) should be backported
> into -stable kernels?
>
> If "yes" then Mike's 5d752600a8c373 ("mm: restructure memfd code") will
> get in the way because it moved lots of code around.
>
Yes, 4.14 does not have this patch, but 4.19 does.
In addition, Kirill A. Shutemov's 800d8c63b2e989c2e349632d1648119bf5862f01
(shmem: add huge pages support) is not included in 4.4, but it is available in 4.14.
> But then, that's four years old and perhaps that's far enough back in
> time.
Thanks.
On 2/14/22 23:37, [email protected] wrote:
> From: wangyong <[email protected]>
>
> After enabling tmpfs filesystem to support transparent hugepage with the
> following command:
> echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> The docker program adds F_SEAL_WRITE through the following command will
> prompt EBUSY.
> fcntl(5, F_ADD_SEALS, F_SEAL_WRITE)=-1.
>
> It is found that in memfd_wait_for_pins function, the page_count of
> hugepage is 512 and page_mapcount is 0, which does not meet the
> conditions:
> page_count(page) - page_mapcount(page) != 1.
> But the page is not busy at this time, therefore, the page_order of
> hugepage should be taken into account in the calculation.
>
> Reported-by: Zeal Robot <[email protected]>
> Signed-off-by: wangyong <[email protected]>
> ---
> mm/memfd.c | 16 +++++++++++++---
> 1 file changed, 13 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memfd.c b/mm/memfd.c
> index 9f80f162791a..26d1d390a22a 100644
> --- a/mm/memfd.c
> +++ b/mm/memfd.c
> @@ -31,6 +31,7 @@
> static void memfd_tag_pins(struct xa_state *xas)
> {
> struct page *page;
> + int count = 0;
> unsigned int tagged = 0;
>
> lru_add_drain();
> @@ -39,8 +40,12 @@ static void memfd_tag_pins(struct xa_state *xas)
> xas_for_each(xas, page, ULONG_MAX) {
> if (xa_is_value(page))
> continue;
> +
> page = find_subpage(page, xas->xa_index);
> - if (page_count(page) - page_mapcount(page) > 1)
> + count = page_count(page);
> + if (PageTransCompound(page))
PageTransCompound() is true for hugetlb pages as well as THP. And, hugetlb
pages will not have a ref per subpage as THP does. So, I believe this will
break hugetlb seal usage.
> +		count -= (1 << compound_order(compound_head(page))) - 1;
> +		if (count - page_mapcount(page) > 1)
>  			xas_set_mark(xas, MEMFD_TAG_PINNED);
>  
>  		if (++tagged % XA_CHECK_SCHED)
> @@ -67,11 +72,12 @@ static int memfd_wait_for_pins(struct address_space *mapping)
>  {
>  	XA_STATE(xas, &mapping->i_pages, 0);
>  	struct page *page;
> -	int error, scan;
> +	int error, scan, count;
>  
>  	memfd_tag_pins(&xas);
>  
>  	error = 0;
> +	count = 0;
>  	for (scan = 0; scan <= LAST_SCAN; scan++) {
>  		unsigned int tagged = 0;
>  
> @@ -89,8 +95,12 @@ static int memfd_wait_for_pins(struct address_space *mapping)
>  			bool clear = true;
>  			if (xa_is_value(page))
>  				continue;
> +
>  			page = find_subpage(page, xas.xa_index);
> -			if (page_count(page) - page_mapcount(page) != 1) {
> +			count = page_count(page);
> +			if (PageTransCompound(page))
> +				count -= (1 << compound_order(compound_head(page))) - 1;
> +			if (count - page_mapcount(page) != 1) {
>  				/*
>  				 * On the last scan, we clean up all those tags
>  				 * we inserted; but make a note that we still
I was trying to do some testing via the memfd selftests, but those have some
other issues for hugetlb that need to be fixed. :(
--
Mike Kravetz
On Wed, 16 Feb 2022, Mike Kravetz wrote:
> On 2/14/22 23:37, [email protected] wrote:
> > From: wangyong <[email protected]>
> >
> > After enabling tmpfs filesystem to support transparent hugepage with the
> > following command:
> > echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> > The docker program adds F_SEAL_WRITE through the following command will
> > prompt EBUSY.
> > fcntl(5, F_ADD_SEALS, F_SEAL_WRITE)=-1.
> >
> > It is found that in memfd_wait_for_pins function, the page_count of
> > hugepage is 512 and page_mapcount is 0, which does not meet the
> > conditions:
> > page_count(page) - page_mapcount(page) != 1.
> > But the page is not busy at this time, therefore, the page_order of
> > hugepage should be taken into account in the calculation.
> >
> > Reported-by: Zeal Robot <[email protected]>
> > Signed-off-by: wangyong <[email protected]>
> > ---
> > mm/memfd.c | 16 +++++++++++++---
> > 1 file changed, 13 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/memfd.c b/mm/memfd.c
> > index 9f80f162791a..26d1d390a22a 100644
> > --- a/mm/memfd.c
> > +++ b/mm/memfd.c
> > @@ -31,6 +31,7 @@
> > static void memfd_tag_pins(struct xa_state *xas)
> > {
> > struct page *page;
> > + int count = 0;
> > unsigned int tagged = 0;
> >
> > lru_add_drain();
> > @@ -39,8 +40,12 @@ static void memfd_tag_pins(struct xa_state *xas)
> > xas_for_each(xas, page, ULONG_MAX) {
> > if (xa_is_value(page))
> > continue;
> > +
> > page = find_subpage(page, xas->xa_index);
> > - if (page_count(page) - page_mapcount(page) > 1)
> > + count = page_count(page);
> > + if (PageTransCompound(page))
>
> PageTransCompound() is true for hugetlb pages as well as THP. And, hugetlb
> pages will not have a ref per subpage as THP does. So, I believe this will
> break hugetlb seal usage.
Yes, I think so too; and that is not the only issue with the patch
(I don't think page_mapcount is enough, I had to use total_mapcount).
It's a good find, and thank you WangYong for the report.
I found the same issue when testing my MFD_HUGEPAGE patch last year,
and devised a patch to fix it (and keep MFD_HUGETLB working) then; but
never sent that in because there wasn't time to re-present MFD_HUGEPAGE.
I'm currently retesting my patch: just found something failing which
I thought should pass; but maybe I'm confused, or maybe the xarray is
working differently now. I'm rushing to reply now because I don't want
others to waste their own time on it.
Andrew, please expect a replacement patch for this issue, but
I certainly have more testing and checking to do before sending.
Hugh
>
> I was trying to do some testing via the memfd selftests, but those have some
> other issues for hugetlb that need to be fixed. :(
> --
> Mike Kravetz
On Wed, Feb 16, 2022 at 05:25:17PM -0800, Hugh Dickins wrote:
> On Wed, 16 Feb 2022, Mike Kravetz wrote:
> > On 2/14/22 23:37, [email protected] wrote:
> > > From: wangyong <[email protected]>
> > >
> > > After enabling tmpfs filesystem to support transparent hugepage with the
> > > following command:
> > > echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
> > > The docker program adds F_SEAL_WRITE through the following command will
> > > prompt EBUSY.
> > > fcntl(5, F_ADD_SEALS, F_SEAL_WRITE)=-1.
> > >
> > > It is found that in memfd_wait_for_pins function, the page_count of
> > > hugepage is 512 and page_mapcount is 0, which does not meet the
> > > conditions:
> > > page_count(page) - page_mapcount(page) != 1.
> > > But the page is not busy at this time, therefore, the page_order of
> > > hugepage should be taken into account in the calculation.
> > >
> > > Reported-by: Zeal Robot <[email protected]>
> > > Signed-off-by: wangyong <[email protected]>
> > > ---
> > > mm/memfd.c | 16 +++++++++++++---
> > > 1 file changed, 13 insertions(+), 3 deletions(-)
> > >
> > > diff --git a/mm/memfd.c b/mm/memfd.c
> > > index 9f80f162791a..26d1d390a22a 100644
> > > --- a/mm/memfd.c
> > > +++ b/mm/memfd.c
> > > @@ -31,6 +31,7 @@
> > > static void memfd_tag_pins(struct xa_state *xas)
> > > {
> > > struct page *page;
> > > + int count = 0;
> > > unsigned int tagged = 0;
> > >
> > > lru_add_drain();
> > > @@ -39,8 +40,12 @@ static void memfd_tag_pins(struct xa_state *xas)
> > > xas_for_each(xas, page, ULONG_MAX) {
> > > if (xa_is_value(page))
> > > continue;
> > > +
> > > page = find_subpage(page, xas->xa_index);
> > > - if (page_count(page) - page_mapcount(page) > 1)
> > > + count = page_count(page);
> > > + if (PageTransCompound(page))
> >
> > PageTransCompound() is true for hugetlb pages as well as THP. And, hugetlb
> > pages will not have a ref per subpage as THP does. So, I believe this will
> > break hugetlb seal usage.
>
> Yes, I think so too; and that is not the only issue with the patch
> (I don't think page_mapcount is enough, I had to use total_mapcount).
>
> It's a good find, and thank you WangYong for the report.
> I found the same issue when testing my MFD_HUGEPAGE patch last year,
> and devised a patch to fix it (and keep MFD_HUGETLB working) then; but
> never sent that in because there wasn't time to re-present MFD_HUGEPAGE.
>
> I'm currently retesting my patch: just found something failing which
> I thought should pass; but maybe I'm confused, or maybe the xarray is
> working differently now. I'm rushing to reply now because I don't want
> others to waste their own time on it.
I did change how the XArray works for THP recently.
Kirill's original patch stored:
512: p
513: p+1
514: p+2
...
1023: p+511
A couple of years ago, I changed it to store:
512: p
513: p
514: p
...
1023: p
And in January, Linus merged the commit which changes it to:
512-575: p
576-639: (sibling of 512)
640-703: (sibling of 512)
...
960-1023: (sibling of 512)
That is, I removed a level of the tree and store sibling entries
rather than duplicate entries. That wasn't for fun; I needed to do
that in order to make msync() work with large folios. Commit
6b24ca4a1a8d has more detail and hopefully can inspire whatever
changes you need to make to your patch.
On Thu, 17 Feb 2022, Matthew Wilcox wrote:
> On Wed, Feb 16, 2022 at 05:25:17PM -0800, Hugh Dickins wrote:
> > On Wed, 16 Feb 2022, Mike Kravetz wrote:
> > > On 2/14/22 23:37, [email protected] wrote:
...
> > > > @@ -39,8 +40,12 @@ static void memfd_tag_pins(struct xa_state *xas)
> > > > xas_for_each(xas, page, ULONG_MAX) {
> > > > if (xa_is_value(page))
> > > > continue;
> > > > +
> > > > page = find_subpage(page, xas->xa_index);
> > > > - if (page_count(page) - page_mapcount(page) > 1)
> > > > + count = page_count(page);
> > > > + if (PageTransCompound(page))
> > >
> > > PageTransCompound() is true for hugetlb pages as well as THP. And, hugetlb
> > > pages will not have a ref per subpage as THP does. So, I believe this will
> > > break hugetlb seal usage.
> >
> > Yes, I think so too; and that is not the only issue with the patch
> > (I don't think page_mapcount is enough, I had to use total_mapcount).
Mike, we had the same instinctive reaction to seeing a PageTransCompound
check in code also exposed to PageHuge pages; but in fact that seems to
have worked correctly - those hugetlbfs pages are hard to predict!
But it was not working on pte maps of THPs.
> >
> > It's a good find, and thank you WangYong for the report.
> > I found the same issue when testing my MFD_HUGEPAGE patch last year,
> > and devised a patch to fix it (and keep MFD_HUGETLB working) then; but
> > never sent that in because there wasn't time to re-present MFD_HUGEPAGE.
> >
> > I'm currently retesting my patch: just found something failing which
> > I thought should pass; but maybe I'm confused, or maybe the xarray is
> > working differently now. I'm rushing to reply now because I don't want
> > others to waste their own time on it.
>
> I did change how the XArray works for THP recently.
>
> Kirill's original patch stored:
>
> 512: p
> 513: p+1
> 514: p+2
> ...
> 1023: p+511
>
> A couple of years ago, I changed it to store:
>
> 512: p
> 513: p
> 514: p
> ...
> 1023: p
>
> And in January, Linus merged the commit which changes it to:
>
> 512-575: p
> 576-639: (sibling of 512)
> 640-703: (sibling of 512)
> ...
> 960-1023: (sibling of 512)
>
> That is, I removed a level of the tree and store sibling entries
> rather than duplicate entries. That wasn't for fun; I needed to do
> that in order to make msync() work with large folios. Commit
> 6b24ca4a1a8d has more detail and hopefully can inspire whatever
> changes you need to make to your patch.
Matthew, thanks for the very detailed info, you shouldn't have taken
so much trouble over it: I knew you had done something of that kind,
and yes, that's where my suspicion lay at the time of writing. But
you'll be relieved to know that the patch I wrote before your changes
turned out to be unaffected, and just as valid after your changes.
"just found something failing which I thought should pass" was me
forgetting, again and again, just how limited are the allowed
possibilities for F_SEAL_WRITE when mmaps are outstanding.
One thinks that a PROT_READ, MAP_SHARED mapping would be allowed;
but of course all the memfds are automatically O_RDWR, so mprotect
(no sealing hook) allows it to be changed to PROT_READ|PROT_WRITE,
so F_SEAL_WRITE is forbidden on any MAP_SHARED mapping: only allowed
on MAP_PRIVATEs.
I'll now re-read the commit message I wrote before, update if necessary,
and then send to Andrew, asking him to replace the one in this thread.
Hugh
Wangyong reports: after enabling tmpfs filesystem to support
transparent hugepage with the following command:
echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
the docker program tries to add F_SEAL_WRITE through the following
command, but it fails unexpectedly with errno EBUSY:
fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
That is because memfd_tag_pins() and memfd_wait_for_pins() were never
updated for shmem huge pages: checking page_mapcount() against
page_count() is hopeless on THP subpages - they need to check
total_mapcount() against page_count() on THP heads only.
Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
(compared != 1): either can be justified, but given the non-atomic
total_mapcount() calculation, it is better now to be strict. Bear in
mind that total_mapcount() itself scans all of the THP subpages, when
choosing to take an XA_CHECK_SCHED latency break.
Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
page has been swapped out since memfd_tag_pins(), then its refcount must
have fallen, and so it can safely be untagged.
Reported-by: Zeal Robot <[email protected]>
Reported-by: wangyong <[email protected]>
Signed-off-by: Hugh Dickins <[email protected]>
Cc: <[email protected]>
---
Andrew, please remove
fix-shmem-huge-page-failed-to-set-f_seal_write-attribute-problem.patch
fix-shmem-huge-page-failed-to-set-f_seal_write-attribute-problem-fix.patch
from mmotm, and replace them by this patch against 5.17-rc5:
wangyong's patch did not handle the case of pte-mapped huge pages, and I
had this one from earlier, when I found the same issue with MFD_HUGEPAGE
(but MFD_HUGEPAGE did not go in, so I didn't post this one, forgetting
the transparent_hugepage/shmem_enabled case).
mm/memfd.c | 40 ++++++++++++++++++++++++++++------------
1 file changed, 28 insertions(+), 12 deletions(-)
--- 5.17-rc5/mm/memfd.c
+++ linux/mm/memfd.c
@@ -31,20 +31,28 @@
static void memfd_tag_pins(struct xa_state *xas)
{
struct page *page;
- unsigned int tagged = 0;
+ int latency = 0;
+ int cache_count;
lru_add_drain();
xas_lock_irq(xas);
xas_for_each(xas, page, ULONG_MAX) {
- if (xa_is_value(page))
- continue;
- page = find_subpage(page, xas->xa_index);
- if (page_count(page) - page_mapcount(page) > 1)
+ cache_count = 1;
+ if (!xa_is_value(page) &&
+ PageTransHuge(page) && !PageHuge(page))
+ cache_count = HPAGE_PMD_NR;
+
+ if (!xa_is_value(page) &&
+ page_count(page) - total_mapcount(page) != cache_count)
xas_set_mark(xas, MEMFD_TAG_PINNED);
+ if (cache_count != 1)
+ xas_set(xas, page->index + cache_count);
- if (++tagged % XA_CHECK_SCHED)
+ latency += cache_count;
+ if (latency < XA_CHECK_SCHED)
continue;
+ latency = 0;
xas_pause(xas);
xas_unlock_irq(xas);
@@ -73,7 +81,8 @@ static int memfd_wait_for_pins(struct ad
error = 0;
for (scan = 0; scan <= LAST_SCAN; scan++) {
- unsigned int tagged = 0;
+ int latency = 0;
+ int cache_count;
if (!xas_marked(&xas, MEMFD_TAG_PINNED))
break;
@@ -87,10 +96,14 @@ static int memfd_wait_for_pins(struct ad
xas_lock_irq(&xas);
xas_for_each_marked(&xas, page, ULONG_MAX, MEMFD_TAG_PINNED) {
bool clear = true;
- if (xa_is_value(page))
- continue;
- page = find_subpage(page, xas.xa_index);
- if (page_count(page) - page_mapcount(page) != 1) {
+
+ cache_count = 1;
+ if (!xa_is_value(page) &&
+ PageTransHuge(page) && !PageHuge(page))
+ cache_count = HPAGE_PMD_NR;
+
+ if (!xa_is_value(page) && cache_count !=
+ page_count(page) - total_mapcount(page)) {
/*
* On the last scan, we clean up all those tags
* we inserted; but make a note that we still
@@ -103,8 +116,11 @@ static int memfd_wait_for_pins(struct ad
}
if (clear)
xas_clear_mark(&xas, MEMFD_TAG_PINNED);
- if (++tagged % XA_CHECK_SCHED)
+
+ latency += cache_count;
+ if (latency < XA_CHECK_SCHED)
continue;
+ latency = 0;
xas_pause(&xas);
xas_unlock_irq(&xas);
Hello,
this patch does not apply to the 4.19 kernel.
Is it necessary to make corresponding patches for each stable version?
Thanks.
Hugh Dickins <[email protected]> 于2022年2月27日周日 14:41写道:
>
> Wangyong reports: after enabling tmpfs filesystem to support
> transparent hugepage with the following command:
>
> echo always > /sys/kernel/mm/transparent_hugepage/shmem_enabled
>
> the docker program tries to add F_SEAL_WRITE through the following
> command, but it fails unexpectedly with errno EBUSY:
>
> fcntl(5, F_ADD_SEALS, F_SEAL_WRITE) = -1.
>
> That is because memfd_tag_pins() and memfd_wait_for_pins() were never
> updated for shmem huge pages: checking page_mapcount() against
> page_count() is hopeless on THP subpages - they need to check
> total_mapcount() against page_count() on THP heads only.
>
> Make memfd_tag_pins() (compared > 1) as strict as memfd_wait_for_pins()
> (compared != 1): either can be justified, but given the non-atomic
> total_mapcount() calculation, it is better now to be strict. Bear in
> mind that total_mapcount() itself scans all of the THP subpages, when
> choosing to take an XA_CHECK_SCHED latency break.
>
> Also fix the unlikely xa_is_value() case in memfd_wait_for_pins(): if a
> page has been swapped out since memfd_tag_pins(), then its refcount must
> have fallen, and so it can safely be untagged.
>
> Reported-by: Zeal Robot <[email protected]>
> Reported-by: wangyong <[email protected]>
> Signed-off-by: Hugh Dickins <[email protected]>
> Cc: <[email protected]>
> ---
> Andrew, please remove
> fix-shmem-huge-page-failed-to-set-f_seal_write-attribute-problem.patch
> fix-shmem-huge-page-failed-to-set-f_seal_write-attribute-problem-fix.patch
> from mmotm, and replace them by this patch against 5.17-rc5:
> wangyong's patch did not handle the case of pte-mapped huge pages, and I
> had this one from earlier, when I found the same issue with MFD_HUGEPAGE
> (but MFD_HUGEPAGE did not go in, so I didn't post this one, forgetting
> the transparent_hugepage/shmem_enabled case).
>
> mm/memfd.c | 40 ++++++++++++++++++++++++++++------------
> 1 file changed, 28 insertions(+), 12 deletions(-)
>
> --- 5.17-rc5/mm/memfd.c
> +++ linux/mm/memfd.c
> @@ -31,20 +31,28 @@
> static void memfd_tag_pins(struct xa_state *xas)
> {
> struct page *page;
> - unsigned int tagged = 0;
> + int latency = 0;
> + int cache_count;
>
> lru_add_drain();
>
> xas_lock_irq(xas);
> xas_for_each(xas, page, ULONG_MAX) {
> - if (xa_is_value(page))
> - continue;
> - page = find_subpage(page, xas->xa_index);
> - if (page_count(page) - page_mapcount(page) > 1)
> + cache_count = 1;
> + if (!xa_is_value(page) &&
> + PageTransHuge(page) && !PageHuge(page))
> + cache_count = HPAGE_PMD_NR;
> +
> + if (!xa_is_value(page) &&
> + page_count(page) - total_mapcount(page) != cache_count)
> xas_set_mark(xas, MEMFD_TAG_PINNED);
> + if (cache_count != 1)
> + xas_set(xas, page->index + cache_count);
>
> - if (++tagged % XA_CHECK_SCHED)
> + latency += cache_count;
> + if (latency < XA_CHECK_SCHED)
> continue;
> + latency = 0;
>
> xas_pause(xas);
> xas_unlock_irq(xas);
> @@ -73,7 +81,8 @@ static int memfd_wait_for_pins(struct ad
>
> error = 0;
> for (scan = 0; scan <= LAST_SCAN; scan++) {
> - unsigned int tagged = 0;
> + int latency = 0;
> + int cache_count;
>
> if (!xas_marked(&xas, MEMFD_TAG_PINNED))
> break;
> @@ -87,10 +96,14 @@ static int memfd_wait_for_pins(struct ad
> xas_lock_irq(&xas);
> xas_for_each_marked(&xas, page, ULONG_MAX, MEMFD_TAG_PINNED) {
> bool clear = true;
> - if (xa_is_value(page))
> - continue;
> - page = find_subpage(page, xas.xa_index);
> - if (page_count(page) - page_mapcount(page) != 1) {
> +
> + cache_count = 1;
> + if (!xa_is_value(page) &&
> + PageTransHuge(page) && !PageHuge(page))
> + cache_count = HPAGE_PMD_NR;
> +
> + if (!xa_is_value(page) && cache_count !=
> + page_count(page) - total_mapcount(page)) {
> /*
> * On the last scan, we clean up all those tags
> * we inserted; but make a note that we still
> @@ -103,8 +116,11 @@ static int memfd_wait_for_pins(struct ad
> }
> if (clear)
> xas_clear_mark(&xas, MEMFD_TAG_PINNED);
> - if (++tagged % XA_CHECK_SCHED)
> +
> + latency += cache_count;
> + if (latency < XA_CHECK_SCHED)
> continue;
> + latency = 0;
>
> xas_pause(&xas);
> xas_unlock_irq(&xas);
On Wed, 2 Mar 2022, yong w wrote:
> Hello,
> this patch does not apply to the 4.19 kernel.
> Is it necessary to make corresponding patches for each stable version?
I expect there will be three variants (if it's worth porting back
to older stables: you make it clear that you do want 4.19, thanks):
one for xarray kernels, one for radix-tree kernels, and one for
old shmem.c-not-memfd.c kernels; or perhaps I've missed a variant.
Once this patch has gone to Linus, then been picked up by GregKH
for recent kernels, I'll respond to the mail when it goes into his
tree to provide the others (or maybe I won't bother with the oldest).
We didn't research a "Fixes:" tag for the patch, so Greg may quietly
stop at the oldest kernel to which the patch does not apply, instead
of sending out explicit pleas for substitute patches as he usually does;
but I'll know anyway when the recent ones go in, and respond then.
Hugh