LinuxLists.cc - [PATCH 0/2] mm, thp: fix file-backed THP race in collapse

[permalink] [raw]

Subject: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

Transparent huge page has supported read-only non-shmem files. The file-
backed THP is collapsed by khugepaged and truncated when written (for
shared libraries).

However, there is race in two possible places.

1) multiple writers truncate the same page cache concurrently;
2) collapse_file rolls back when writer truncates the page cache;

In both cases, subpage(s) of file THP can be revealed by find_get_entry
in truncate_inode_pages_range, which will trigger PageTail BUG_ON in
truncate_inode_page, as follows.

[40326.247034] page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
[40326.247041] head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
[40326.247046] flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
[40326.247051] raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
[40326.247053] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[40326.247055] head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
[40326.247057] head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
[40326.247058] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[40326.247077] ------------[ cut here ]------------
[40326.247080] kernel BUG at mm/truncate.c:213!
[40326.280581] Internal error: Oops - BUG: 0 [#1] SMP
[40326.281077] Modules linked in: xfs(E) libcrc32c(E) rfkill(E) ...
[40326.285130] CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: ...
[40326.286202] Hardware name: ECS, BIOS 0.0.0 02/06/2015
[40326.286968] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[40326.287584] pc : truncate_inode_page+0x64/0x70
[40326.288040] lr : truncate_inode_page+0x64/0x70
[40326.288498] sp : ffff80001b60b900
[40326.288837] x29: ffff80001b60b900 x28: 00000000000007ff
[40326.289377] x27: ffff80001b60b9a0 x26: 0000000000000000
[40326.289943] x25: 000000000000000f x24: ffff80001b60b9a0
[40326.290485] x23: ffff80001b60ba18 x22: ffff0001e0999ea8
[40326.291027] x21: ffff0000c21db300 x20: ffffffffffffffff
[40326.291566] x19: fffffe001310ffc0 x18: 0000000000000020
[40326.292106] x17: 0000000000000000 x16: 0000000000000000
[40326.292655] x15: ffff0000c21db960 x14: 3030306666666620
[40326.293197] x13: 6666666666666666 x12: 3130303030303030
[40326.293746] x11: ffff8000117b69b8 x10: 00000000ffff8000
[40326.294313] x9 : ffff80001012690c x8 : 0000000000000000
[40326.294851] x7 : ffff8000114f69b8 x6 : 0000000000017ffd
[40326.295392] x5 : ffff0007fffbcbc8 x4 : ffff80001b60b5c0
[40326.295942] x3 : 0000000000000001 x2 : 0000000000000000
[40326.296497] x1 : 0000000000000000 x0 : 0000000000000000
[40326.297047] Call trace:
[40326.297304] truncate_inode_page+0x64/0x70
[40326.297724] truncate_inode_pages_range+0x550/0x7e4
[40326.298251] truncate_pagecache+0x58/0x80
[40326.298662] do_dentry_open+0x1e4/0x3c0
[40326.299052] vfs_open+0x38/0x44
[40326.299377] do_open+0x1f0/0x310
[40326.299709] path_openat+0x114/0x1dc
[40326.300077] do_filp_open+0x84/0x134
[40326.300444] do_sys_openat2+0xbc/0x164
[40326.300825] __arm64_sys_openat+0x74/0xc0
[40326.301236] el0_svc_common.constprop.0+0x88/0x220
[40326.301723] do_el0_svc+0x30/0xa0
[40326.302089] el0_svc+0x20/0x30
[40326.302404] el0_sync_handler+0x1a4/0x1b0
[40326.302814] el0_sync+0x180/0x1c0
[40326.303157] Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)
[40326.303775] ---[ end trace f70cdb42cb7c2d42 ]---
[40326.304244] Kernel panic - not syncing: Oops - BUG: Fatal exception

This checks the page mapping and retries when subpage of file THP is
found, in truncate_inode_pages_range.

Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
mm/filemap.c | 7 ++++++-
mm/truncate.c | 17 ++++++++++++++++-
2 files changed, 22 insertions(+), 2 deletions(-)

diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293..a3af2ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
if (!xa_is_value(page)) {
if (page->index < start)
goto put;
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
if (page->index + thp_nr_pages(page) - 1 > end)
goto put;
if (!trylock_page(page))
@@ -2102,6 +2101,12 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
goto unlock;
VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
page);
+ /*
+ * We can find and get head page of file THP with
+ * non-head index. The head page should have already
+ * be truncated with page->mapping reset to NULL.
+ */
+ VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
}
indices[pvec->nr] = xas.xa_index;
if (!pagevec_add(pvec, page))
diff --git a/mm/truncate.c b/mm/truncate.c
index 714eaf1..3f47190 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -319,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
index = start;
while (index < end && find_lock_entries(mapping, index, end - 1,
&pvec, indices)) {
- index = indices[pagevec_count(&pvec) - 1] + 1;
+ index = indices[pagevec_count(&pvec) - 1] +
+ thp_nr_pages(pvec.pages[pagevec_count(&pvec) - 1]);
truncate_exceptional_pvec_entries(mapping, &pvec, indices);
for (i = 0; i < pagevec_count(&pvec); i++)
truncate_cleanup_page(pvec.pages[i]);
@@ -392,6 +393,20 @@ void truncate_inode_pages_range(struct address_space *mapping,
continue;

lock_page(page);
+ /*
+ * Already truncated? We can find and get subpage
+ * of file THP, of which the head page is truncated.
+ *
+ * In addition, another race will be avoided, where
+ * collapse_file rolls back when writer truncates the
+ * page cache.
+ */
+ if (page_mapping(page) != mapping) {
+ unlock_page(page);
+ /* Restart to make sure all gone */
+ index = start - 1;
+ break;
+ }
WARN_ON(page_to_index(page) != index);
wait_on_page_writeback(page);
truncate_inode_page(mapping, page);
--
1.8.3.1

2021-09-22 07:09:24

[permalink] [raw]

Subject: [PATCH v2 2/2] mm, thp: bail out early in collapse_file for writeback page

Currently collapse_file does not explicitly check PG_writeback, instead,
page_has_private and try_to_release_page are used to filter writeback
pages. This does not work for xfs with blocksize equal to or larger
than pagesize, because in such case xfs has no page->private.

This makes collapse_file bail out early for writeback page. Otherwise,
xfs end_page_writeback will panic as follows.

[ 6411.448211] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32
[ 6411.448304] aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so"
[ 6411.448312] flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback)
[ 6411.448317] raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8
[ 6411.448321] raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000
[ 6411.448324] page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
[ 6411.448327] page->mem_cgroup:ffff0000c3e9a000
[ 6411.448340] ------------[ cut here ]------------
[ 6411.448343] kernel BUG at include/linux/mm.h:1212!
[ 6411.449288] Internal error: Oops - BUG: 0 [#1] SMP
[ 6411.449786] Modules linked in:
[ 6411.449790] BUG: Bad page state in process khugepaged pfn:84ef32
[ 6411.450143] xfs(E)
[ 6411.450459] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32
[ 6411.451361] libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ...
[ 6411.451387] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ...
[ 6411.451389] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[ 6411.451393] pc : end_page_writeback+0x1c0/0x214
[ 6411.451394] lr : end_page_writeback+0x1c0/0x214
[ 6411.451395] sp : ffff800011ce3cc0
[ 6411.451396] x29: ffff800011ce3cc0 x28: 0000000000000000
[ 6411.451398] x27: ffff000c04608040 x26: 0000000000000000
[ 6411.451399] x25: ffff000c04608040 x24: 0000000000001000
[ 6411.451401] x23: ffff0003f88c8530 x22: 0000000000001000
[ 6411.451403] x21: ffff0003f88c8530 x20: 0000000000000000
[ 6411.451404] x19: fffffe00201bcc80 x18: 0000000000000030
[ 6411.451406] x17: 0000000000000000 x16: 0000000000000000
[ 6411.451407] x15: ffff000c018f9760 x14: ffffffffffffffff
[ 6411.451409] x13: ffff8000119d72b0 x12: ffff8000119d6ee3
[ 6411.451410] x11: ffff8000117b69b8 x10: 00000000ffff8000
[ 6411.451412] x9 : ffff800010617534 x8 : 0000000000000000
[ 6411.451413] x7 : ffff8000114f69b8 x6 : 000000000000000f
[ 6411.451415] x5 : 0000000000000000 x4 : 0000000000000000
[ 6411.451416] x3 : 0000000000000400 x2 : 0000000000000000
[ 6411.451418] x1 : 0000000000000000 x0 : 0000000000000000
[ 6411.451420] Call trace:
[ 6411.451421] end_page_writeback+0x1c0/0x214
[ 6411.451424] iomap_finish_page_writeback+0x13c/0x204
[ 6411.451425] iomap_finish_ioend+0xe8/0x19c
[ 6411.451426] iomap_writepage_end_bio+0x38/0x50
[ 6411.451427] bio_endio+0x168/0x1ec
[ 6411.451430] blk_update_request+0x278/0x3f0
[ 6411.451432] blk_mq_end_request+0x34/0x15c
[ 6411.451435] virtblk_request_done+0x38/0x74 [virtio_blk]
[ 6411.451437] blk_done_softirq+0xc4/0x110
[ 6411.451439] __do_softirq+0x128/0x38c
[ 6411.451441] __irq_exit_rcu+0x118/0x150
[ 6411.451442] irq_exit+0x1c/0x30
[ 6411.451445] __handle_domain_irq+0x8c/0xf0
[ 6411.451446] gic_handle_irq+0x84/0x108
[ 6411.451447] el1_irq+0xcc/0x180
[ 6411.451448] arch_cpu_idle+0x18/0x40
[ 6411.451450] default_idle_call+0x4c/0x1a0
[ 6411.451453] cpuidle_idle_call+0x168/0x1e0
[ 6411.451454] do_idle+0xb4/0x104
[ 6411.451455] cpu_startup_entry+0x30/0x9c
[ 6411.451458] secondary_start_kernel+0x104/0x180
[ 6411.451460] Code: d4210000 b0006161 910c8021 94013f4d (d4210000)
[ 6411.451462] ---[ end trace 4a88c6a074082f8c ]---
[ 6411.451464] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt

Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc57..48de4e1 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1763,6 +1763,10 @@ static void collapse_file(struct mm_struct *mm,
filemap_flush(mapping);
result = SCAN_FAIL;
goto xa_unlocked;
+ } else if (PageWriteback(page)) {
+ xas_unlock_irq(&xas);
+ result = SCAN_FAIL;
+ goto xa_unlocked;
} else if (trylock_page(page)) {
get_page(page);
xas_unlock_irq(&xas);
@@ -1798,7 +1802,8 @@ static void collapse_file(struct mm_struct *mm,
goto out_unlock;
}

- if (!is_shmem && PageDirty(page)) {
+ if (!is_shmem && (PageDirty(page) ||
+ PageWriteback(page))) {
/*
* khugepaged only works on read-only fd, so this
* page is dirty because it hasn't been flushed
--
1.8.3.1

2021-09-22 11:41:26

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> Transparent huge page has supported read-only non-shmem files. The file-
> backed THP is collapsed by khugepaged and truncated when written (for
> shared libraries).
>
> However, there is race in two possible places.
>
> 1) multiple writers truncate the same page cache concurrently;
> 2) collapse_file rolls back when writer truncates the page cache;

As I've said before, the bug here is that somehow there is a writable fd
to a file with THPs. That's what we need to track down and fix.

https://lore.kernel.org/linux-mm/[email protected]/

2021-09-22 17:06:43

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>> Transparent huge page has supported read-only non-shmem files. The file-
>> backed THP is collapsed by khugepaged and truncated when written (for
>> shared libraries).
>>
>> However, there is race in two possible places.
>>
>> 1) multiple writers truncate the same page cache concurrently;
>> 2) collapse_file rolls back when writer truncates the page cache;
>
> As I've said before, the bug here is that somehow there is a writable fd
> to a file with THPs. That's what we need to track down and fix.
Hi, Matthew
I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way. So, the
Pagecache of DSO need to be cleaned when someone opened this DSO in a writeable way. The process of
Pagecache cleaned mainly refers to ’truncate_inode_pages_range’.

Of course, the above is consensus for us.

The ’somehow’ that your said is sure for us. Maybe I can try to describe roughly here:

In collapse_file <khugepaged>, when PTEs scan failed (e.g. SCAN_FAIL, SCAN_TRUNCATED..), the following
code will be run:

} else {
struct page *page;

/* Something went wrong: roll back page cache changes */
xas_lock_irq(&xas);
mapping->nrpages -= nr_none;

if (is_shmem)
shmem_uncharge(mapping->host, nr_none);

xas_set(&xas, start);
xas_for_each(&xas, page, end - 1) {
page = list_first_entry_or_null(&pagelist,
struct page, lru);
if (!page || xas.xa_index < page->index) {
if (!nr_none)
break;
nr_none--;
/* Put holes back where they were */
xas_store(&xas, NULL);
continue;
}

VM_BUG_ON_PAGE(page->index != xas.xa_index, page);

/* Unfreeze the page. */
list_del(&page->lru);
page_ref_unfreeze(page, 2);
line1 xas_store(&xas, page);
xas_pause(&xas);
xas_unlock_irq(&xas);
unlock_page(page);
putback_lru_page(page);
xas_lock_irq(&xas);
}
….
new_page->mapping = NULL;
}
line2 unlock_page(new_page);
---

The above code refers to roll back. When someone starts open a DSO in a writeable way, and the collapse_file
is scanning PTEs. The hugepage had been locked and was mapped in page cache before line1. And the hugepage
not be in pagecache and be unlocked at line2. The race that between ‘collapse_file’ and ’truncate_inode_pages_range’
will happen when ‘collapse_file' is executing line1 and line2.
This race can be shown concisely:

Khugepaged: writer
Collapse_file: truncate_inode_pages_range

lock_page(hugepage) skip hugepage because locked by others

scan_left_pages() and wait_lock_page(hugepage)
scan_fail() & xas_store(&xas, 4k_page)

unlock_page(hugepage)
get_lock_page(hugepage)
hugepage is not in pagecache here, but not be checked

This situation triggers easily if we setting a very small ‘scan_sleep_millisecs’.

The above descriptions are roughly my understanding of ’somehow’. It says a lot of nonsense, maybe it helps!
>
> https://lore.kernel.org/linux-mm/[email protected]/
All in all, what you mean is that we should solve this race at the source? However, a writer happens to appear in the middle
of a process in ‘collapse_file', so this bug solved when roll back. The above is my understanding, but it may be
wrong.

Thanks

2021-09-24 02:45:30

by Andrew Morton

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:

>
>
> > On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
> >
> > On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >> Transparent huge page has supported read-only non-shmem files. The file-
> >> backed THP is collapsed by khugepaged and truncated when written (for
> >> shared libraries).
> >>
> >> However, there is race in two possible places.
> >>
> >> 1) multiple writers truncate the same page cache concurrently;
> >> 2) collapse_file rolls back when writer truncates the page cache;
> >
> > As I've said before, the bug here is that somehow there is a writable fd
> > to a file with THPs. That's what we need to track down and fix.
> Hi, Matthew
> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>
> ...
>
> > https://lore.kernel.org/linux-mm/[email protected]/
> All in all, what you mean is that we should solve this race at the source?

Matthew is being pretty clear here: we shouldn't be permitting
userspace to get a writeable fd for a thp-backed file.

Why are we permitting the DSO to be opened writeably? If there's a
legitimate case for doing this then presumably "mm, thp: relax the
VM_DENYWRITE constraint on file-backed THPs: should be fixed or
reverted.

If there is no legitimate use case for returning a writeable fd for a
thp-backed file then we should fail such an attempt at open(). This
approach has back-compatibility issues which need to be thought about.
Perhaps we should permit the open-writeably attempt to appear to
succeed, but to really return a read-only fd?

2021-09-24 03:10:37

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Thu, Sep 23, 2021 at 7:43 PM Andrew Morton <[email protected]> wrote:
>
> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
>
> >
> >
> > > On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
> > >
> > > On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> > >> Transparent huge page has supported read-only non-shmem files. The file-
> > >> backed THP is collapsed by khugepaged and truncated when written (for
> > >> shared libraries).
> > >>
> > >> However, there is race in two possible places.
> > >>
> > >> 1) multiple writers truncate the same page cache concurrently;
> > >> 2) collapse_file rolls back when writer truncates the page cache;
> > >
> > > As I've said before, the bug here is that somehow there is a writable fd
> > > to a file with THPs. That's what we need to track down and fix.
> > Hi, Matthew
> > I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> > Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >
> > ...
> >
> > > https://lore.kernel.org/linux-mm/[email protected]/
> > All in all, what you mean is that we should solve this race at the source?
>
> Matthew is being pretty clear here: we shouldn't be permitting
> userspace to get a writeable fd for a thp-backed file.

No, he doesn't mean it IIRC. Actually we had the same conversation for
another patch. Quoted below:

" > > Things have already gone wrong before we get to this point. See
> > do_dentry_open(). You aren't supposed to be able to get a writable file
> > descriptor on a file which has had huge pages added to the page cache
> > without the filesystem's knowledge. That's the problem that needs to
> > be fixed.
>
> I don't quite understand your point here. Do you mean do_dentry_open()
> should fail for such cases instead of truncating the page cache?

No, do_dentry_open() should have truncated the page cache when it was
called and found that there were THPs in the cache. Then khugepaged
should see that someone has the file open for write and decline to create
new THPs. So it shouldn't be possible to get here with THPs in the cache."

Please see https://lore.kernel.org/linux-mm/YUkCI2I085Sos%[email protected]/

But actually "mm, thp: relax the VM_DENYWRITE constraint on
file-backed THPs" did so exactly.

>
> Why are we permitting the DSO to be opened writeably? If there's a
> legitimate case for doing this then presumably "mm, thp: relax the
> VM_DENYWRITE constraint on file-backed THPs: should be fixed or
> reverted.

Unfortunately we can't revert this commit anymore since VM_DENYWRITE
is gone due to commit 8d0920bde5eb ("mm: remove VM_DENYWRITE")

>
> If there is no legitimate use case for returning a writeable fd for a
> thp-backed file then we should fail such an attempt at open(). This
> approach has back-compatibility issues which need to be thought about.
> Perhaps we should permit the open-writeably attempt to appear to
> succeed, but to really return a read-only fd?
>
>

2021-09-24 03:36:34

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 9/24/21 10:43 AM, Andrew Morton wrote:
> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
>
>>
>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
>>>
>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>> shared libraries).
>>>>
>>>> However, there is race in two possible places.
>>>>
>>>> 1) multiple writers truncate the same page cache concurrently;
>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>> As I've said before, the bug here is that somehow there is a writable fd
>>> to a file with THPs. That's what we need to track down and fix.
>> Hi, Matthew
>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>
>> ...
>>
>>> https://lore.kernel.org/linux-mm/[email protected]/
>> All in all, what you mean is that we should solve this race at the source?
> Matthew is being pretty clear here: we shouldn't be permitting
> userspace to get a writeable fd for a thp-backed file.
>
> Why are we permitting the DSO to be opened writeably? If there's a
> legitimate case for doing this then presumably "mm, thp: relax the

Hi, we have written a user case to trigger this race mentioned above.
this case just create one reader

to open DSO in RDONLY mode, and keep making the mapping page of DSO use
huge pages by madvise,

then multiple writer to open and write the same DSO.

I will send it later after simple adjust.

Thanks

> VM_DENYWRITE constraint on file-backed THPs: should be fixed or
> reverted.
>
> If there is no legitimate use case for returning a writeable fd for a
> thp-backed file then we should fail such an attempt at open(). This
> approach has back-compatibility issues which need to be thought about.
> Perhaps we should permit the open-writeably attempt to appear to
> succeed, but to really return a read-only fd?

2021-09-24 07:17:18

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 9/24/21 10:43 AM, Andrew Morton wrote:
> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
>
>>
>>
>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
>>>
>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>> shared libraries).
>>>>
>>>> However, there is race in two possible places.
>>>>
>>>> 1) multiple writers truncate the same page cache concurrently;
>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>>
>>> As I've said before, the bug here is that somehow there is a writable fd
>>> to a file with THPs. That's what we need to track down and fix.
>> Hi, Matthew
>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>
>> ...
>>
>>> https://lore.kernel.org/linux-mm/[email protected]/
>> All in all, what you mean is that we should solve this race at the source?
>
> Matthew is being pretty clear here: we shouldn't be permitting
> userspace to get a writeable fd for a thp-backed file.
>
> Why are we permitting the DSO to be opened writeably? If there's a
> legitimate case for doing this then presumably "mm, thp: relax the
There is a use case to stress file-backed THP within attachment.
I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:

$ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
$ ulimit -s unlimited
$ ./stress_madvise_dso 10000 <libtest.so>

the meaning of above parameters:
10000: the max test time;
<libtest.so>: the DSO that will been mapped into file-backed THP by
madvise. It recommended that the text segment of DSO to be tested is
greater than 2M.

The crash will been triggered at once in the latest kernel. And this
case also can used to trigger the bug that mentioned in our another patch.

> VM_DENYWRITE constraint on file-backed THPs: should be fixed or
> reverted.
>
> If there is no legitimate use case for returning a writeable fd for a
> thp-backed file then we should fail such an attempt at open(). This
> approach has back-compatibility issues which need to be thought about.
> Perhaps we should permit the open-writeably attempt to appear to
> succeed, but to really return a read-only fd?
>

Attachments:

stress_madvise_dso.c (6.79 kB)

2021-09-27 22:26:01

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
<[email protected]> wrote:
>
>
>
> On 9/24/21 10:43 AM, Andrew Morton wrote:
> > On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
> >
> >>
> >>
> >>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
> >>>
> >>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >>>> Transparent huge page has supported read-only non-shmem files. The file-
> >>>> backed THP is collapsed by khugepaged and truncated when written (for
> >>>> shared libraries).
> >>>>
> >>>> However, there is race in two possible places.
> >>>>
> >>>> 1) multiple writers truncate the same page cache concurrently;
> >>>> 2) collapse_file rolls back when writer truncates the page cache;
> >>>
> >>> As I've said before, the bug here is that somehow there is a writable fd
> >>> to a file with THPs. That's what we need to track down and fix.
> >> Hi, Matthew
> >> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> >> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >>
> >> ...
> >>
> >>> https://lore.kernel.org/linux-mm/[email protected]/
> >> All in all, what you mean is that we should solve this race at the source?
> >
> > Matthew is being pretty clear here: we shouldn't be permitting
> > userspace to get a writeable fd for a thp-backed file.
> >
> > Why are we permitting the DSO to be opened writeably? If there's a
> > legitimate case for doing this then presumably "mm, thp: relax the
> There is a use case to stress file-backed THP within attachment.
> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
>
> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> $ ulimit -s unlimited
> $ ./stress_madvise_dso 10000 <libtest.so>
>
> the meaning of above parameters:
> 10000: the max test time;
> <libtest.so>: the DSO that will been mapped into file-backed THP by
> madvise. It recommended that the text segment of DSO to be tested is
> greater than 2M.
>
> The crash will been triggered at once in the latest kernel. And this
> case also can used to trigger the bug that mentioned in our another patch.

Hmm.. I am not able to use the repro program to crash the system. Not
sure what I did wrong.

OTOH, does it make sense to block writes within khugepaged, like:

diff --git i/mm/khugepaged.c w/mm/khugepaged.c
index 045cc579f724e..ad7c41ec15027 100644
--- i/mm/khugepaged.c
+++ w/mm/khugepaged.c
@@ -51,6 +51,7 @@ enum scan_result {
SCAN_CGROUP_CHARGE_FAIL,
SCAN_TRUNCATED,
SCAN_PAGE_HAS_PRIVATE,
+ SCAN_BUSY_WRITE,
};

#define CREATE_TRACE_POINTS
@@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
/* Only allocate from the target node */
gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;

+ if (deny_write_access(file)) {
+ result = SCAN_BUSY_WRITE;
+ return;
+ }
+
new_page = khugepaged_alloc_page(hpage, gfp, node);
if (!new_page) {
result = SCAN_ALLOC_HUGE_PAGE_FAIL;
@@ -1863,19 +1869,6 @@ static void collapse_file(struct mm_struct *mm,
else {
__mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
filemap_nr_thps_inc(mapping);
- /*
- * Paired with smp_mb() in do_dentry_open() to ensure
- * i_writecount is up to date and the update to nr_thps is
- * visible. Ensures the page cache will be truncated if the
- * file is opened writable.
- */
- smp_mb();
- if (inode_is_open_for_write(mapping->host)) {
- result = SCAN_FAIL;
- __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
- filemap_nr_thps_dec(mapping);
- goto xa_locked;
- }
}

if (nr_none) {
@@ -1976,6 +1969,7 @@ static void collapse_file(struct mm_struct *mm,
VM_BUG_ON(!list_empty(&pagelist));
if (!IS_ERR_OR_NULL(*hpage))
mem_cgroup_uncharge(*hpage);
+ allow_write_access(file);
/* TODO: tracepoints */
}

2021-09-28 12:11:14

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Mon, Sep 27, 2021 at 03:24:47PM -0700, Song Liu wrote:
> OTOH, does it make sense to block writes within khugepaged, like:
> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> /* Only allocate from the target node */
> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>
> + if (deny_write_access(file)) {
> + result = SCAN_BUSY_WRITE;
> + return;
> + }

The problem is that it denies, rather than blocking. That means that the
writer gets ETXTBSY instead of waiting until khugepaged is done.

2021-09-28 16:22:36

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 9/28/21 6:24 AM, Song Liu wrote:
> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> <[email protected]> wrote:
>>
>>
>>
>> On 9/24/21 10:43 AM, Andrew Morton wrote:
>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
>>>
>>>>
>>>>
>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
>>>>>
>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>>>> shared libraries).
>>>>>>
>>>>>> However, there is race in two possible places.
>>>>>>
>>>>>> 1) multiple writers truncate the same page cache concurrently;
>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>>>>
>>>>> As I've said before, the bug here is that somehow there is a writable fd
>>>>> to a file with THPs. That's what we need to track down and fix.
>>>> Hi, Matthew
>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>>>
>>>> ...
>>>>
>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>> All in all, what you mean is that we should solve this race at the source?
>>>
>>> Matthew is being pretty clear here: we shouldn't be permitting
>>> userspace to get a writeable fd for a thp-backed file.
>>>
>>> Why are we permitting the DSO to be opened writeably? If there's a
>>> legitimate case for doing this then presumably "mm, thp: relax the
>> There is a use case to stress file-backed THP within attachment.
>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
>>
>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
>> $ ulimit -s unlimited
>> $ ./stress_madvise_dso 10000 <libtest.so>
>>
>> the meaning of above parameters:
>> 10000: the max test time;
>> <libtest.so>: the DSO that will been mapped into file-backed THP by
>> madvise. It recommended that the text segment of DSO to be tested is
>> greater than 2M.
>>
>> The crash will been triggered at once in the latest kernel. And this
>> case also can used to trigger the bug that mentioned in our another patch.
>
> Hmm.. I am not able to use the repro program to crash the system. Not
> sure what I did wrong.
>
Hi
I have tried to check my test case again. Can you make sure the DSO that
you test have THP mapping?

If you are willing to try again, I can send my libtest.c which is used
to test by myself (actually, it shouldn't be target DSO problem).

Thanks very much!
> OTOH, does it make sense to block writes within khugepaged, like:
>
> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> index 045cc579f724e..ad7c41ec15027 100644
> --- i/mm/khugepaged.c
> +++ w/mm/khugepaged.c
> @@ -51,6 +51,7 @@ enum scan_result {
> SCAN_CGROUP_CHARGE_FAIL,
> SCAN_TRUNCATED,
> SCAN_PAGE_HAS_PRIVATE,
> + SCAN_BUSY_WRITE,
> };
>
> #define CREATE_TRACE_POINTS
> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> /* Only allocate from the target node */
> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>
> + if (deny_write_access(file)) {
> + result = SCAN_BUSY_WRITE;
> + return;
> + }
> +
This can indeed avoid some possible races from source.

But, I am thinking about whether this will lead to DDoS attack?
I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
is that DDoS attack. In addition, 'deny_write_access' will change
the behavior, such as user will get 'Text file busy' during
collapse_file. I am not sure whether the behavior changing is acceptable
in user space.

If it is acceptable, I am very willing to fix the races like your way.

Thanks!
> new_page = khugepaged_alloc_page(hpage, gfp, node);
> if (!new_page) {
> result = SCAN_ALLOC_HUGE_PAGE_FAIL;
> @@ -1863,19 +1869,6 @@ static void collapse_file(struct mm_struct *mm,
> else {
> __mod_lruvec_page_state(new_page, NR_FILE_THPS, nr);
> filemap_nr_thps_inc(mapping);
> - /*
> - * Paired with smp_mb() in do_dentry_open() to ensure
> - * i_writecount is up to date and the update to nr_thps is
> - * visible. Ensures the page cache will be truncated if the
> - * file is opened writable.
> - */
> - smp_mb();
> - if (inode_is_open_for_write(mapping->host)) {
> - result = SCAN_FAIL;
> - __mod_lruvec_page_state(new_page, NR_FILE_THPS, -nr);
> - filemap_nr_thps_dec(mapping);
> - goto xa_locked;
> - }
> }
>
> if (nr_none) {
> @@ -1976,6 +1969,7 @@ static void collapse_file(struct mm_struct *mm,
> VM_BUG_ON(!list_empty(&pagelist));
> if (!IS_ERR_OR_NULL(*hpage))
> mem_cgroup_uncharge(*hpage);
> + allow_write_access(file);
> /* TODO: tracepoints */
> }
>

Attachments:

libtest.c (976.00 B)

2021-09-28 17:03:19

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Tue, Sep 28, 2021 at 5:07 AM Matthew Wilcox <[email protected]> wrote:
>
> On Mon, Sep 27, 2021 at 03:24:47PM -0700, Song Liu wrote:
> > OTOH, does it make sense to block writes within khugepaged, like:
> > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> > /* Only allocate from the target node */
> > gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >
> > + if (deny_write_access(file)) {
> > + result = SCAN_BUSY_WRITE;
> > + return;
> > + }
>
> The problem is that it denies, rather than blocking. That means that the
> writer gets ETXTBSY instead of waiting until khugepaged is done.
>

Yes, I was thinking about the same problem last night and this morning.
Unfortunately, I haven't got a good solution yet. Do you have some
suggestions on this?

Thanks,
Song

2021-09-29 07:18:04

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
<[email protected]> wrote:
>
>
>
> On 9/28/21 6:24 AM, Song Liu wrote:
> > On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> On 9/24/21 10:43 AM, Andrew Morton wrote:
> >>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
> >>>
> >>>>
> >>>>
> >>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
> >>>>>
> >>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >>>>>> Transparent huge page has supported read-only non-shmem files. The file-
> >>>>>> backed THP is collapsed by khugepaged and truncated when written (for
> >>>>>> shared libraries).
> >>>>>>
> >>>>>> However, there is race in two possible places.
> >>>>>>
> >>>>>> 1) multiple writers truncate the same page cache concurrently;
> >>>>>> 2) collapse_file rolls back when writer truncates the page cache;
> >>>>>
> >>>>> As I've said before, the bug here is that somehow there is a writable fd
> >>>>> to a file with THPs. That's what we need to track down and fix.
> >>>> Hi, Matthew
> >>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> >>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >>>>
> >>>> ...
> >>>>
> >>>>> https://lore.kernel.org/linux-mm/[email protected]/
> >>>> All in all, what you mean is that we should solve this race at the source?
> >>>
> >>> Matthew is being pretty clear here: we shouldn't be permitting
> >>> userspace to get a writeable fd for a thp-backed file.
> >>>
> >>> Why are we permitting the DSO to be opened writeably? If there's a
> >>> legitimate case for doing this then presumably "mm, thp: relax the
> >> There is a use case to stress file-backed THP within attachment.
> >> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
> >>
> >> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> >> $ ulimit -s unlimited
> >> $ ./stress_madvise_dso 10000 <libtest.so>
> >>
> >> the meaning of above parameters:
> >> 10000: the max test time;
> >> <libtest.so>: the DSO that will been mapped into file-backed THP by
> >> madvise. It recommended that the text segment of DSO to be tested is
> >> greater than 2M.
> >>
> >> The crash will been triggered at once in the latest kernel. And this
> >> case also can used to trigger the bug that mentioned in our another patch.
> >
> > Hmm.. I am not able to use the repro program to crash the system. Not
> > sure what I did wrong.
> >
> Hi
> I have tried to check my test case again. Can you make sure the DSO that
> you test have THP mapping?
>
> If you are willing to try again, I can send my libtest.c which is used
> to test by myself (actually, it shouldn't be target DSO problem).
>
> Thanks very much!
> > OTOH, does it make sense to block writes within khugepaged, like:
> >
> > diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> > index 045cc579f724e..ad7c41ec15027 100644
> > --- i/mm/khugepaged.c
> > +++ w/mm/khugepaged.c
> > @@ -51,6 +51,7 @@ enum scan_result {
> > SCAN_CGROUP_CHARGE_FAIL,
> > SCAN_TRUNCATED,
> > SCAN_PAGE_HAS_PRIVATE,
> > + SCAN_BUSY_WRITE,
> > };
> >
> > #define CREATE_TRACE_POINTS
> > @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> > /* Only allocate from the target node */
> > gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >
> > + if (deny_write_access(file)) {
> > + result = SCAN_BUSY_WRITE;
> > + return;
> > + }
> > +
> This can indeed avoid some possible races from source.
>
> But, I am thinking about whether this will lead to DDoS attack?
> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
> is that DDoS attack. In addition, 'deny_write_access' will change
> the behavior, such as user will get 'Text file busy' during
> collapse_file. I am not sure whether the behavior changing is acceptable
> in user space.
>
> If it is acceptable, I am very willing to fix the races like your way.

I guess we should not let the write get ETXTBUSY for khugepaged work.

I am getting some segfault on stress_madvise_dso. And it doesn't really
generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
version of it?

Thanks,
Song

2021-09-29 07:59:49

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 9/29/21 3:14 PM, Song Liu wrote:
> On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
> <[email protected]> wrote:
>>
>>
>>
>> On 9/28/21 6:24 AM, Song Liu wrote:
>>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
>>> <[email protected]> wrote:
>>>>
>>>>
>>>>
>>>> On 9/24/21 10:43 AM, Andrew Morton wrote:
>>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
>>>>>
>>>>>>
>>>>>>
>>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
>>>>>>>
>>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
>>>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
>>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
>>>>>>>> shared libraries).
>>>>>>>>
>>>>>>>> However, there is race in two possible places.
>>>>>>>>
>>>>>>>> 1) multiple writers truncate the same page cache concurrently;
>>>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
>>>>>>>
>>>>>>> As I've said before, the bug here is that somehow there is a writable fd
>>>>>>> to a file with THPs. That's what we need to track down and fix.
>>>>>> Hi, Matthew
>>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
>>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
>>>>>>
>>>>>> ...
>>>>>>
>>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
>>>>>> All in all, what you mean is that we should solve this race at the source?
>>>>>
>>>>> Matthew is being pretty clear here: we shouldn't be permitting
>>>>> userspace to get a writeable fd for a thp-backed file.
>>>>>
>>>>> Why are we permitting the DSO to be opened writeably? If there's a
>>>>> legitimate case for doing this then presumably "mm, thp: relax the
>>>> There is a use case to stress file-backed THP within attachment.
>>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
>>>>
>>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
>>>> $ ulimit -s unlimited
>>>> $ ./stress_madvise_dso 10000 <libtest.so>
>>>>
>>>> the meaning of above parameters:
>>>> 10000: the max test time;
>>>> <libtest.so>: the DSO that will been mapped into file-backed THP by
>>>> madvise. It recommended that the text segment of DSO to be tested is
>>>> greater than 2M.
>>>>
>>>> The crash will been triggered at once in the latest kernel. And this
>>>> case also can used to trigger the bug that mentioned in our another patch.
>>>
>>> Hmm.. I am not able to use the repro program to crash the system. Not
>>> sure what I did wrong.
>>>
>> Hi
>> I have tried to check my test case again. Can you make sure the DSO that
>> you test have THP mapping?
>>
>> If you are willing to try again, I can send my libtest.c which is used
>> to test by myself (actually, it shouldn't be target DSO problem).
>>
>> Thanks very much!
>>> OTOH, does it make sense to block writes within khugepaged, like:
>>>
>>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
>>> index 045cc579f724e..ad7c41ec15027 100644
>>> --- i/mm/khugepaged.c
>>> +++ w/mm/khugepaged.c
>>> @@ -51,6 +51,7 @@ enum scan_result {
>>> SCAN_CGROUP_CHARGE_FAIL,
>>> SCAN_TRUNCATED,
>>> SCAN_PAGE_HAS_PRIVATE,
>>> + SCAN_BUSY_WRITE,
>>> };
>>>
>>> #define CREATE_TRACE_POINTS
>>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
>>> /* Only allocate from the target node */
>>> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
>>>
>>> + if (deny_write_access(file)) {
>>> + result = SCAN_BUSY_WRITE;
>>> + return;
>>> + }
>>> +
>> This can indeed avoid some possible races from source.
>>
>> But, I am thinking about whether this will lead to DDoS attack?
>> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
>> is that DDoS attack. In addition, 'deny_write_access' will change
>> the behavior, such as user will get 'Text file busy' during
>> collapse_file. I am not sure whether the behavior changing is acceptable
>> in user space.
>>
>> If it is acceptable, I am very willing to fix the races like your way.
>
> I guess we should not let the write get ETXTBUSY for khugepaged work.
>
> I am getting some segfault on stress_madvise_dso. And it doesn't really
> generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
Hi, I can sure I am not update the stress_madvise_dso.c.

My test environment is vm (qemu-system-aarch64, 32 cores). And I can
think of the following possibilities:

(1) in thread_read()

printf("read %s\n", dso_path);
d = open(dso_path, O_RDONLY);
/* The start addr must be alignment with 2M */
void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ |
PROT_EXEC,MAP_PRIVATE, fd, 0);
if (p == MAP_FAILED) {
perror("mmap");
goto out;
}

0x40000dc00000 is random setting by myself. I am not sure this address
is available in your vm.

(2) in thread_write()
int fd = open(dso_path, O_RDWR);
p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
if (p == MAP_FAILED) {
perror("mmap");
goto out; /* fail */
}

because of I am sure the DSO is bigger than 0x800000, so directly map
the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to
compile the DSO? likes:
$ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o

If you don't mind, you can send the segment fault log to me. And I will
find x86 environment to test.

Thanks!
> version of it?
>
> Thanks,
> Song
>

2021-09-29 17:52:30

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 29, 2021 at 12:50 AM Rongwei Wang
<[email protected]> wrote:
>
>
>
> On 9/29/21 3:14 PM, Song Liu wrote:
> > On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
> > <[email protected]> wrote:
> >>
> >>
> >>
> >> On 9/28/21 6:24 AM, Song Liu wrote:
> >>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> >>> <[email protected]> wrote:
> >>>>
> >>>>
> >>>>
> >>>> On 9/24/21 10:43 AM, Andrew Morton wrote:
> >>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
> >>>>>
> >>>>>>
> >>>>>>
> >>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
> >>>>>>>
> >>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> >>>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
> >>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
> >>>>>>>> shared libraries).
> >>>>>>>>
> >>>>>>>> However, there is race in two possible places.
> >>>>>>>>
> >>>>>>>> 1) multiple writers truncate the same page cache concurrently;
> >>>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
> >>>>>>>
> >>>>>>> As I've said before, the bug here is that somehow there is a writable fd
> >>>>>>> to a file with THPs. That's what we need to track down and fix.
> >>>>>> Hi, Matthew
> >>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> >>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> >>>>>>
> >>>>>> ...
> >>>>>>
> >>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
> >>>>>> All in all, what you mean is that we should solve this race at the source?
> >>>>>
> >>>>> Matthew is being pretty clear here: we shouldn't be permitting
> >>>>> userspace to get a writeable fd for a thp-backed file.
> >>>>>
> >>>>> Why are we permitting the DSO to be opened writeably? If there's a
> >>>>> legitimate case for doing this then presumably "mm, thp: relax the
> >>>> There is a use case to stress file-backed THP within attachment.
> >>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
> >>>>
> >>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> >>>> $ ulimit -s unlimited
> >>>> $ ./stress_madvise_dso 10000 <libtest.so>
> >>>>
> >>>> the meaning of above parameters:
> >>>> 10000: the max test time;
> >>>> <libtest.so>: the DSO that will been mapped into file-backed THP by
> >>>> madvise. It recommended that the text segment of DSO to be tested is
> >>>> greater than 2M.
> >>>>
> >>>> The crash will been triggered at once in the latest kernel. And this
> >>>> case also can used to trigger the bug that mentioned in our another patch.
> >>>
> >>> Hmm.. I am not able to use the repro program to crash the system. Not
> >>> sure what I did wrong.
> >>>
> >> Hi
> >> I have tried to check my test case again. Can you make sure the DSO that
> >> you test have THP mapping?
> >>
> >> If you are willing to try again, I can send my libtest.c which is used
> >> to test by myself (actually, it shouldn't be target DSO problem).
> >>
> >> Thanks very much!
> >>> OTOH, does it make sense to block writes within khugepaged, like:
> >>>
> >>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> >>> index 045cc579f724e..ad7c41ec15027 100644
> >>> --- i/mm/khugepaged.c
> >>> +++ w/mm/khugepaged.c
> >>> @@ -51,6 +51,7 @@ enum scan_result {
> >>> SCAN_CGROUP_CHARGE_FAIL,
> >>> SCAN_TRUNCATED,
> >>> SCAN_PAGE_HAS_PRIVATE,
> >>> + SCAN_BUSY_WRITE,
> >>> };
> >>>
> >>> #define CREATE_TRACE_POINTS
> >>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> >>> /* Only allocate from the target node */
> >>> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> >>>
> >>> + if (deny_write_access(file)) {
> >>> + result = SCAN_BUSY_WRITE;
> >>> + return;
> >>> + }
> >>> +
> >> This can indeed avoid some possible races from source.
> >>
> >> But, I am thinking about whether this will lead to DDoS attack?
> >> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
> >> is that DDoS attack. In addition, 'deny_write_access' will change
> >> the behavior, such as user will get 'Text file busy' during
> >> collapse_file. I am not sure whether the behavior changing is acceptable
> >> in user space.
> >>
> >> If it is acceptable, I am very willing to fix the races like your way.
> >
> > I guess we should not let the write get ETXTBUSY for khugepaged work.
> >
> > I am getting some segfault on stress_madvise_dso. And it doesn't really
> > generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
> Hi, I can sure I am not update the stress_madvise_dso.c.
>
> My test environment is vm (qemu-system-aarch64, 32 cores). And I can
> think of the following possibilities:
>
> (1) in thread_read()
>
> printf("read %s\n", dso_path);
> d = open(dso_path, O_RDONLY);
> /* The start addr must be alignment with 2M */
> void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ |
> PROT_EXEC,MAP_PRIVATE, fd, 0);
> if (p == MAP_FAILED) {
> perror("mmap");
> goto out;
> }
>
> 0x40000dc00000 is random setting by myself. I am not sure this address
> is available in your vm.
>
> (2) in thread_write()
> int fd = open(dso_path, O_RDWR);
> p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> if (p == MAP_FAILED) {
> perror("mmap");
> goto out; /* fail */
> }
>
> because of I am sure the DSO is bigger than 0x800000, so directly map
> the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to
> compile the DSO? likes:
> $ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o
>
> If you don't mind, you can send the segment fault log to me. And I will
> find x86 environment to test.

I fixed the segfault with
1. malloc buf (as it is too big for stack) in thread_read
2. reduce memcpy() size in thread_read.

Now, I am able to crash the system on
find_lock_entries () {
...
VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
}
I guess it is related. I will test more.

Thanks,
Song

2021-09-29 19:24:48

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 29, 2021 at 09:59:38AM -0700, Song Liu wrote:
> On Wed, Sep 29, 2021 at 12:50 AM Rongwei Wang
> <[email protected]> wrote:
> >
> >
> >
> > On 9/29/21 3:14 PM, Song Liu wrote:
> > > On Tue, Sep 28, 2021 at 9:20 AM Rongwei Wang
> > > <[email protected]> wrote:
> > >>
> > >>
> > >>
> > >> On 9/28/21 6:24 AM, Song Liu wrote:
> > >>> On Fri, Sep 24, 2021 at 12:12 AM Rongwei Wang
> > >>> <[email protected]> wrote:
> > >>>>
> > >>>>
> > >>>>
> > >>>> On 9/24/21 10:43 AM, Andrew Morton wrote:
> > >>>>> On Thu, 23 Sep 2021 01:04:54 +0800 Rongwei Wang <[email protected]> wrote:
> > >>>>>
> > >>>>>>
> > >>>>>>
> > >>>>>>> On Sep 22, 2021, at 7:37 PM, Matthew Wilcox <[email protected]> wrote:
> > >>>>>>>
> > >>>>>>> On Wed, Sep 22, 2021 at 03:06:44PM +0800, Rongwei Wang wrote:
> > >>>>>>>> Transparent huge page has supported read-only non-shmem files. The file-
> > >>>>>>>> backed THP is collapsed by khugepaged and truncated when written (for
> > >>>>>>>> shared libraries).
> > >>>>>>>>
> > >>>>>>>> However, there is race in two possible places.
> > >>>>>>>>
> > >>>>>>>> 1) multiple writers truncate the same page cache concurrently;
> > >>>>>>>> 2) collapse_file rolls back when writer truncates the page cache;
> > >>>>>>>
> > >>>>>>> As I've said before, the bug here is that somehow there is a writable fd
> > >>>>>>> to a file with THPs. That's what we need to track down and fix.
> > >>>>>> Hi, Matthew
> > >>>>>> I am not sure get your means. We know “mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs"
> > >>>>>> Introduced file-backed THPs for DSO. It is possible {very rarely} for DSO to be opened in writeable way.
> > >>>>>>
> > >>>>>> ...
> > >>>>>>
> > >>>>>>> https://lore.kernel.org/linux-mm/[email protected]/
> > >>>>>> All in all, what you mean is that we should solve this race at the source?
> > >>>>>
> > >>>>> Matthew is being pretty clear here: we shouldn't be permitting
> > >>>>> userspace to get a writeable fd for a thp-backed file.
> > >>>>>
> > >>>>> Why are we permitting the DSO to be opened writeably? If there's a
> > >>>>> legitimate case for doing this then presumably "mm, thp: relax the
> > >>>> There is a use case to stress file-backed THP within attachment.
> > >>>> I test this case in a system which has enabled CONFIG_READ_ONLY_THP_FOR_FS:
> > >>>>
> > >>>> $ gcc -Wall -g -o stress_madvise_dso stress_madvise_dso.c
> > >>>> $ ulimit -s unlimited
> > >>>> $ ./stress_madvise_dso 10000 <libtest.so>
> > >>>>
> > >>>> the meaning of above parameters:
> > >>>> 10000: the max test time;
> > >>>> <libtest.so>: the DSO that will been mapped into file-backed THP by
> > >>>> madvise. It recommended that the text segment of DSO to be tested is
> > >>>> greater than 2M.
> > >>>>
> > >>>> The crash will been triggered at once in the latest kernel. And this
> > >>>> case also can used to trigger the bug that mentioned in our another patch.
> > >>>
> > >>> Hmm.. I am not able to use the repro program to crash the system. Not
> > >>> sure what I did wrong.
> > >>>
> > >> Hi
> > >> I have tried to check my test case again. Can you make sure the DSO that
> > >> you test have THP mapping?
> > >>
> > >> If you are willing to try again, I can send my libtest.c which is used
> > >> to test by myself (actually, it shouldn't be target DSO problem).
> > >>
> > >> Thanks very much!
> > >>> OTOH, does it make sense to block writes within khugepaged, like:
> > >>>
> > >>> diff --git i/mm/khugepaged.c w/mm/khugepaged.c
> > >>> index 045cc579f724e..ad7c41ec15027 100644
> > >>> --- i/mm/khugepaged.c
> > >>> +++ w/mm/khugepaged.c
> > >>> @@ -51,6 +51,7 @@ enum scan_result {
> > >>> SCAN_CGROUP_CHARGE_FAIL,
> > >>> SCAN_TRUNCATED,
> > >>> SCAN_PAGE_HAS_PRIVATE,
> > >>> + SCAN_BUSY_WRITE,
> > >>> };
> > >>>
> > >>> #define CREATE_TRACE_POINTS
> > >>> @@ -1652,6 +1653,11 @@ static void collapse_file(struct mm_struct *mm,
> > >>> /* Only allocate from the target node */
> > >>> gfp = alloc_hugepage_khugepaged_gfpmask() | __GFP_THISNODE;
> > >>>
> > >>> + if (deny_write_access(file)) {
> > >>> + result = SCAN_BUSY_WRITE;
> > >>> + return;
> > >>> + }
> > >>> +
> > >> This can indeed avoid some possible races from source.
> > >>
> > >> But, I am thinking about whether this will lead to DDoS attack?
> > >> I remember the reason of DSO has ignored MAP_DENYWRITE in kernel
> > >> is that DDoS attack. In addition, 'deny_write_access' will change
> > >> the behavior, such as user will get 'Text file busy' during
> > >> collapse_file. I am not sure whether the behavior changing is acceptable
> > >> in user space.
> > >>
> > >> If it is acceptable, I am very willing to fix the races like your way.
> > >
> > > I guess we should not let the write get ETXTBUSY for khugepaged work.
> > >
> > > I am getting some segfault on stress_madvise_dso. And it doesn't really
> > > generate the bug stack in my vm (qemu-system-x86_64). Is there an newer
> > Hi, I can sure I am not update the stress_madvise_dso.c.
> >
> > My test environment is vm (qemu-system-aarch64, 32 cores). And I can
> > think of the following possibilities:
> >
> > (1) in thread_read()
> >
> > printf("read %s\n", dso_path);
> > d = open(dso_path, O_RDONLY);
> > /* The start addr must be alignment with 2M */
> > void *p = mmap((void *)0x40000dc00000UL, 0x800000, PROT_READ |
> > PROT_EXEC,MAP_PRIVATE, fd, 0);
> > if (p == MAP_FAILED) {
> > perror("mmap");
> > goto out;
> > }
> >
> > 0x40000dc00000 is random setting by myself. I am not sure this address
> > is available in your vm.
> >
> > (2) in thread_write()
> > int fd = open(dso_path, O_RDWR);
> > p = mmap(NULL, 0x800000, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
> > if (p == MAP_FAILED) {
> > perror("mmap");
> > goto out; /* fail */
> > }
> >
> > because of I am sure the DSO is bigger than 0x800000, so directly map
> > the DSO using 0x800000. Maybe I had use '-z max-page-size=0x200000' to
> > compile the DSO? likes:
> > $ gcc -z max-page-size=0x200000 -o libtest.so -shared libtest.o
> >
> > If you don't mind, you can send the segment fault log to me. And I will
> > find x86 environment to test.
>
> I fixed the segfault with
> 1. malloc buf (as it is too big for stack) in thread_read
> 2. reduce memcpy() size in thread_read.
>
> Now, I am able to crash the system on
> find_lock_entries () {
> ...
> VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> }
> I guess it is related. I will test more.

That's a bogus VM_BUG_ON. I have a patch in my tree to delete it.
Andrew has it too, but for some reason, he hasn't sent it on to Linus.

+++ b/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
if (!xa_is_value(page)) {
if (page->index < start)
goto put;
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
if (page->index + thp_nr_pages(page) - 1 > end)
goto put;
if (!trylock_page(page))

2021-09-30 00:04:38

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <[email protected]> wrote:
>
[...]
> > Now, I am able to crash the system on
> > find_lock_entries () {
> > ...
> > VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > }
> > I guess it is related. I will test more.
>
> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it.
> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
>
> +++ b/mm/filemap.c
> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> if (!xa_is_value(page)) {
> if (page->index < start)
> goto put;
> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> if (page->index + thp_nr_pages(page) - 1 > end)
> goto put;
> if (!trylock_page(page))

Yes, after removing this line, I am able to see the same bug.

Here is my finding so far:

The issue is NOT caused by concurrent khugepaged:collapse_file() and
truncate_pagecache(inode, 0). With some printks, we can see a clear
time gap (>2 second ) between collapse_file() finishes, and
truncate_pagecache() (which crashes soon). Therefore, my earlier
suggestion that adds deny_write_access() to collapse_file() does NOT
work.

The crash is actually caused by concurrent truncate_pagecache(inode, 0).
If I change the number of write thread in stress_madvise_dso.c to one,
(IOW, one thread_read and one thread_write), I cannot reproduce the
crash anymore.

I think this means we cannot fix this issue in collapse_file(), because it
finishes long before the crash.

Thanks,
Song

2021-09-30 00:06:50

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
> The issue is NOT caused by concurrent khugepaged:collapse_file() and
> truncate_pagecache(inode, 0). With some printks, we can see a clear
> time gap (>2 second ) between collapse_file() finishes, and
> truncate_pagecache() (which crashes soon). Therefore, my earlier
> suggestion that adds deny_write_access() to collapse_file() does NOT
> work.
>
> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> If I change the number of write thread in stress_madvise_dso.c to one,
> (IOW, one thread_read and one thread_write), I cannot reproduce the
> crash anymore.
>
> I think this means we cannot fix this issue in collapse_file(), because it
> finishes long before the crash.

Ah! So are we missing one or more of these locks:

inode_lock(inode);
filemap_invalidate_lock(mapping);

in the open path?

2021-09-30 00:46:49

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <[email protected]> wrote:
>
> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
> > The issue is NOT caused by concurrent khugepaged:collapse_file() and
> > truncate_pagecache(inode, 0). With some printks, we can see a clear
> > time gap (>2 second ) between collapse_file() finishes, and
> > truncate_pagecache() (which crashes soon). Therefore, my earlier
> > suggestion that adds deny_write_access() to collapse_file() does NOT
> > work.
> >
> > The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> > If I change the number of write thread in stress_madvise_dso.c to one,
> > (IOW, one thread_read and one thread_write), I cannot reproduce the
> > crash anymore.
> >
> > I think this means we cannot fix this issue in collapse_file(), because it
> > finishes long before the crash.
>
> Ah! So are we missing one or more of these locks:
>
> inode_lock(inode);
> filemap_invalidate_lock(mapping);
>
> in the open path?

The following fixes the crash in my test. But I am not sure whether this is the
best fix.

Rongwei, could you please run more tests on it?

Thanks,
Song

diff --git i/fs/open.c w/fs/open.c
index daa324606a41f..d13c4668b2e53 100644
--- i/fs/open.c
+++ w/fs/open.c
@@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f,
* of THPs into the page cache will fail.
*/
smp_mb();
- if (filemap_nr_thps(inode->i_mapping))
+ if (filemap_nr_thps(inode->i_mapping)) {
+ filemap_invalidate_lock(inode->i_mapping);
truncate_pagecache(inode, 0);
+ filemap_invalidate_unlock(inode->i_mapping);
+ }
}

return 0;

2021-09-30 02:19:20

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 9/30/21 8:41 AM, Song Liu wrote:
> On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <[email protected]> wrote:
>>
>> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
>>> The issue is NOT caused by concurrent khugepaged:collapse_file() and
>>> truncate_pagecache(inode, 0). With some printks, we can see a clear
>>> time gap (>2 second ) between collapse_file() finishes, and
>>> truncate_pagecache() (which crashes soon). Therefore, my earlier
>>> suggestion that adds deny_write_access() to collapse_file() does NOT
>>> work.
>>>
>>> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
>>> If I change the number of write thread in stress_madvise_dso.c to one,
>>> (IOW, one thread_read and one thread_write), I cannot reproduce the
>>> crash anymore.
>>>
>>> I think this means we cannot fix this issue in collapse_file(), because it
>>> finishes long before the crash.
>>
>> Ah! So are we missing one or more of these locks:
>>
>> inode_lock(inode);
>> filemap_invalidate_lock(mapping);
>>
>> in the open path?
>
> The following fixes the crash in my test. But I am not sure whether this is the
> best fix.
>
> Rongwei, could you please run more tests on it?
Yes, I'd like to.
>
> Thanks,
> Song
>
>
> diff --git i/fs/open.c w/fs/open.c
> index daa324606a41f..d13c4668b2e53 100644
> --- i/fs/open.c
> +++ w/fs/open.c
> @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f,
> * of THPs into the page cache will fail.
> */
> smp_mb();
> - if (filemap_nr_thps(inode->i_mapping))
> + if (filemap_nr_thps(inode->i_mapping)) {
> + filemap_invalidate_lock(inode->i_mapping);
Learned something, Thanks!

But, the race between collapse_file and truncate_pagecache, I am not
sure whether it exists or not. If exists, whether this patch only can
fix truncate_pagecache concurrent?

Anyway, I will run more tests on it at first.

Thanks!
> truncate_pagecache(inode, 0);
> + filemap_invalidate_unlock(inode->i_mapping);
> + }
> }
>
> return 0;
>

2021-09-30 03:25:20

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 9/30/21 7:41 AM, Song Liu wrote:
> On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <[email protected]> wrote:
>>
> [...]
>>> Now, I am able to crash the system on
>>> find_lock_entries () {
>>> ...
>>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>>> }
>>> I guess it is related. I will test more.
>>
>> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it.
>> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
>>
>> +++ b/mm/filemap.c
>> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
>> if (!xa_is_value(page)) {
>> if (page->index < start)
>> goto put;
>> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
>> if (page->index + thp_nr_pages(page) - 1 > end)
>> goto put;
>> if (!trylock_page(page))
>
> Yes, after removing this line, I am able to see the same bug.
>
> Here is my finding so far:
>
> The issue is NOT caused by concurrent khugepaged:collapse_file() and
> truncate_pagecache(inode, 0). With some printks, we can see a clear
> time gap (>2 second ) between collapse_file() finishes, and
> truncate_pagecache() (which crashes soon). Therefore, my earlier
> suggestion that adds deny_write_access() to collapse_file() does NOT
> work.
>
> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> If I change the number of write thread in stress_madvise_dso.c to one,
> (IOW, one thread_read and one thread_write), I cannot reproduce the
> crash anymore.
Whether CONFIG_DEBUG_VM is enabled in your vm?

I think the second possibility mentioned above will been found if you
enable CONFIG_DEBUG_VM:

1) multiple writers truncate the same page cache concurrently;
2) collapse_file rolls back when writer truncates the page cache;

The following log will be print after enable CONFIG_DEBUG_VM:

[22216.789904] do_idle+0xb4/0x104
[22216.789906] cpu_startup_entry+0x34/0x9c
[22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
0.0.0 02/06/2015
[22216.790553] secondary_start_kernel+0x104/0x180
[22216.790778] Call trace:
[22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
[22216.791662] dump_backtrace+0x0/0x1ec
[22216.791664] show_stack+0x24/0x30
[22216.791956] ---[ end trace dc769a61c1af087b ]---
[22216.792295] dump_stack+0xd0/0x128
[22216.792299] bad_page+0xe4/0x110
[22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception
in interrupt
[22216.792937] check_free_page_bad+0x84/0x90
[22216.792940] free_pcp_prepare+0x1fc/0x21c
[22216.793253] SMP: stopping secondary CPUs
[22216.793525] free_unref_page+0x2c/0xec
[22216.805537] __put_page+0x60/0x70
[22216.805931] collapse_file+0xdc8/0x12f0
[22216.806385] khugepaged_scan_file+0x2dc/0x37c
[22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380
[22216.807450] khugepaged_do_scan+0x2dc/0x2fc
[22216.807946] khugepaged+0x38/0x100
[22216.808342] kthread+0x11c/0x120
[22216.808735] Kernel Offset: disabled
[22216.809153] CPU features: 0x0040002,62208238
[22216.809681] Memory Limit: none
[22216.813477] Starting crashdump kernel...

So I think the race also exists between collapse_file and
truncate_pagecache.

>
> I think this means we cannot fix this issue in collapse_file(), because it
> finishes long before the crash.
>
> Thanks,
> Song
>

2021-09-30 03:38:53

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang
<[email protected]> wrote:
>
>
>
> On 9/30/21 7:41 AM, Song Liu wrote:
> > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <[email protected]> wrote:
> >>
> > [...]
> >>> Now, I am able to crash the system on
> >>> find_lock_entries () {
> >>> ...
> >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> >>> }
> >>> I guess it is related. I will test more.
> >>
> >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it.
> >> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
> >>
> >> +++ b/mm/filemap.c
> >> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> >> if (!xa_is_value(page)) {
> >> if (page->index < start)
> >> goto put;
> >> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> >> if (page->index + thp_nr_pages(page) - 1 > end)
> >> goto put;
> >> if (!trylock_page(page))
> >
> > Yes, after removing this line, I am able to see the same bug.
> >
> > Here is my finding so far:
> >
> > The issue is NOT caused by concurrent khugepaged:collapse_file() and
> > truncate_pagecache(inode, 0). With some printks, we can see a clear
> > time gap (>2 second ) between collapse_file() finishes, and
> > truncate_pagecache() (which crashes soon). Therefore, my earlier
> > suggestion that adds deny_write_access() to collapse_file() does NOT
> > work.
> >
> > The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> > If I change the number of write thread in stress_madvise_dso.c to one,
> > (IOW, one thread_read and one thread_write), I cannot reproduce the
> > crash anymore.
> Whether CONFIG_DEBUG_VM is enabled in your vm?
>
> I think the second possibility mentioned above will been found if you
> enable CONFIG_DEBUG_VM:
>
> 1) multiple writers truncate the same page cache concurrently;
> 2) collapse_file rolls back when writer truncates the page cache;
>
> The following log will be print after enable CONFIG_DEBUG_VM:
>
> [22216.789904] do_idle+0xb4/0x104
> [22216.789906] cpu_startup_entry+0x34/0x9c
> [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
> 0.0.0 02/06/2015
> [22216.790553] secondary_start_kernel+0x104/0x180
> [22216.790778] Call trace:
> [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
> [22216.791662] dump_backtrace+0x0/0x1ec
> [22216.791664] show_stack+0x24/0x30
> [22216.791956] ---[ end trace dc769a61c1af087b ]---
> [22216.792295] dump_stack+0xd0/0x128
> [22216.792299] bad_page+0xe4/0x110
> [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception
> in interrupt
> [22216.792937] check_free_page_bad+0x84/0x90
> [22216.792940] free_pcp_prepare+0x1fc/0x21c
> [22216.793253] SMP: stopping secondary CPUs
> [22216.793525] free_unref_page+0x2c/0xec
> [22216.805537] __put_page+0x60/0x70
> [22216.805931] collapse_file+0xdc8/0x12f0
> [22216.806385] khugepaged_scan_file+0x2dc/0x37c
> [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380
> [22216.807450] khugepaged_do_scan+0x2dc/0x2fc
> [22216.807946] khugepaged+0x38/0x100
> [22216.808342] kthread+0x11c/0x120
> [22216.808735] Kernel Offset: disabled
> [22216.809153] CPU features: 0x0040002,62208238
> [22216.809681] Memory Limit: none
> [22216.813477] Starting crashdump kernel...
>
> So I think the race also exists between collapse_file and
> truncate_pagecache.

I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet.

Thanks,
Song

2021-09-30 05:27:50

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, 29 Sep 2021, Song Liu wrote:
> On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang
> <[email protected]> wrote:
> > On 9/30/21 7:41 AM, Song Liu wrote:
> > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <[email protected]> wrote:
> > >>
> > > [...]
> > >>> Now, I am able to crash the system on
> > >>> find_lock_entries () {
> > >>> ...
> > >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > >>> }
> > >>> I guess it is related. I will test more.
> > >>
> > >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it.
> > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus.

I think Andrew has held back from sending it on to Linus because I pointed
out that the example Matthew cited (shmem and swap cache) was wrong, and
could not explain it: we wanted to understand what syzbot had actually
hit before sending on.

Would this bug be a good explanation for it?

In the meantime, independently, I was looking at fuse_try_move_page(),
which appears to do the old splice thievery that got disabled from splice
itself, stealing a page from one mapping to put into another. I suspect
that could result in a page->index != xas.xa_index too (outside page lock).

> > >>
> > >> +++ b/mm/filemap.c
> > >> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> > >> if (!xa_is_value(page)) {
> > >> if (page->index < start)
> > >> goto put;
> > >> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > >> if (page->index + thp_nr_pages(page) - 1 > end)
> > >> goto put;
> > >> if (!trylock_page(page))
> > >
> > > Yes, after removing this line, I am able to see the same bug.
> > >
> > > Here is my finding so far:
> > >
> > > The issue is NOT caused by concurrent khugepaged:collapse_file() and
> > > truncate_pagecache(inode, 0). With some printks, we can see a clear
> > > time gap (>2 second ) between collapse_file() finishes, and
> > > truncate_pagecache() (which crashes soon). Therefore, my earlier
> > > suggestion that adds deny_write_access() to collapse_file() does NOT
> > > work.
> > >
> > > The crash is actually caused by concurrent truncate_pagecache(inode, 0).
> > > If I change the number of write thread in stress_madvise_dso.c to one,
> > > (IOW, one thread_read and one thread_write), I cannot reproduce the
> > > crash anymore.
> > Whether CONFIG_DEBUG_VM is enabled in your vm?
> >
> > I think the second possibility mentioned above will been found if you
> > enable CONFIG_DEBUG_VM:
> >
> > 1) multiple writers truncate the same page cache concurrently;
> > 2) collapse_file rolls back when writer truncates the page cache;
> >
> > The following log will be print after enable CONFIG_DEBUG_VM:
> >
> > [22216.789904] do_idle+0xb4/0x104
> > [22216.789906] cpu_startup_entry+0x34/0x9c
> > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
> > 0.0.0 02/06/2015
> > [22216.790553] secondary_start_kernel+0x104/0x180
> > [22216.790778] Call trace:
> > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
> > [22216.791662] dump_backtrace+0x0/0x1ec
> > [22216.791664] show_stack+0x24/0x30
> > [22216.791956] ---[ end trace dc769a61c1af087b ]---
> > [22216.792295] dump_stack+0xd0/0x128
> > [22216.792299] bad_page+0xe4/0x110
> > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > in interrupt
> > [22216.792937] check_free_page_bad+0x84/0x90
> > [22216.792940] free_pcp_prepare+0x1fc/0x21c
> > [22216.793253] SMP: stopping secondary CPUs
> > [22216.793525] free_unref_page+0x2c/0xec
> > [22216.805537] __put_page+0x60/0x70
> > [22216.805931] collapse_file+0xdc8/0x12f0
> > [22216.806385] khugepaged_scan_file+0x2dc/0x37c
> > [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380
> > [22216.807450] khugepaged_do_scan+0x2dc/0x2fc
> > [22216.807946] khugepaged+0x38/0x100
> > [22216.808342] kthread+0x11c/0x120
> > [22216.808735] Kernel Offset: disabled
> > [22216.809153] CPU features: 0x0040002,62208238
> > [22216.809681] Memory Limit: none
> > [22216.813477] Starting crashdump kernel...
> >
> > So I think the race also exists between collapse_file and
> > truncate_pagecache.
>
> I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet.

Sorry, it's taken me a long time to get into this bug(s).

I haven't tried to reproduce it, but I do think that Rongwei will
be proved right.

In doing a lockless lookup of the page cache, we tend to imagine
that the THP head page will be encountered first, and the special
treatment for the head will do the right thing to skip the tails.

But when racing against collapse_file()'s rewind, I agree with
Rongwei that it is possible for truncation (or page cache cleanup
in this instance) to collect a pagevec which starts with a PageTail
some way into the compound page.

shmem_undo_range() (which shmem uses rather than truncate_pagecache())
would not call truncate_inode_page() on a THP tail: if it encounters a
tail, it will try to split the THP, and not call truncate_inode_page()
if it cannot. Unless I'm inventing the memory, I now think that I did
encounter this race between truncate and collapse on huge shmem, and
had to re-craft my shmem_punch_compound() to handle it correctly.

If it is just a matter of collapse_file() rewind, I suppose we could
reverse the direction in which that proceeds; but I'm not convinced
that would be enough, and don't think we need such a "big" change.

Aside from the above page->index mischeck in find_lock_entries(),
I now think this bug needs nothing more than simply removing the
VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().

(You could say move its page->mapping check before its PageTail
check; but the PageTail check would then never catch anything.)

Rongwei's patch went in the right direction, but why add N lines
if deleting one is good? And for a while I thought that idea of
using filemap_invalidate_lock() was very good; but now think
the page lock we already have is better, and solve both races.

Hugh

2021-09-30 17:42:10

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <[email protected]> wrote:
>
> On Thu, 30 Sep 2021, Matthew Wilcox wrote:
> > On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
> > >
> > > Aside from the above page->index mischeck in find_lock_entries(),
> > > I now think this bug needs nothing more than simply removing the
> > > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().
> >
> > I don't think that's right. This bug was also observed when calling
> > truncate(), so there's clearly a situation where two concurrent calls
> > to truncate_pagecache() leaves a THP in the cache.
>
> I assume you're thinking of one of the fuzzer blkdev ones:
> https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> or
> https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
>
> I haven't started on those ones yet: yes, I imagine one or both of those
> will need a further fix (S_ISREG() check somewhere if we're lucky; but
> could well be nastier); but for the bug in this thread, I expect

Makes sense to me. We should be able to check S_ISREG() in khugepaged,
if it is not a regular file, just bail out. Sounds not that nasty to
me AFAIU.

> removing the VM_BUG_ON_PAGE(PageTail) to be enough.
>
> If you're thinking of something else, please send a link if you can - thanks.
>
> Hugh
>

2021-09-30 18:52:35

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
> On Wed, 29 Sep 2021, Song Liu wrote:
> > On Wed, Sep 29, 2021 at 6:54 PM Rongwei Wang
> > <[email protected]> wrote:
> > > On 9/30/21 7:41 AM, Song Liu wrote:
> > > > On Wed, Sep 29, 2021 at 10:56 AM Matthew Wilcox <[email protected]> wrote:
> > > >>
> > > > [...]
> > > >>> Now, I am able to crash the system on
> > > >>> find_lock_entries () {
> > > >>> ...
> > > >>> VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> > > >>> }
> > > >>> I guess it is related. I will test more.
> > > >>
> > > >> That's a bogus VM_BUG_ON. I have a patch in my tree to delete it.
> > > >> Andrew has it too, but for some reason, he hasn't sent it on to Linus.
>
> I think Andrew has held back from sending it on to Linus because I pointed
> out that the example Matthew cited (shmem and swap cache) was wrong, and
> could not explain it: we wanted to understand what syzbot had actually
> hit before sending on.
>
> Would this bug be a good explanation for it?

I don't think so? Even if that specific example is not what's happening,
the general principle is that you can't verify page->index until you're
holding the page lock. So the VM_BUG_ON is definitely in the wrong place.

> In the meantime, independently, I was looking at fuse_try_move_page(),
> which appears to do the old splice thievery that got disabled from splice
> itself, stealing a page from one mapping to put into another. I suspect
> that could result in a page->index != xas.xa_index too (outside page lock).

I think there are other examples too where page->index gets changed ...
they're not springing to mind right now, but I have some other intricate
details occupying that bit of my brain at the moment.

> > > I think the second possibility mentioned above will been found if you
> > > enable CONFIG_DEBUG_VM:
> > >
> > > 1) multiple writers truncate the same page cache concurrently;
> > > 2) collapse_file rolls back when writer truncates the page cache;
> > >
> > > The following log will be print after enable CONFIG_DEBUG_VM:
> > >
> > > [22216.789904] do_idle+0xb4/0x104
> > > [22216.789906] cpu_startup_entry+0x34/0x9c
> > > [22216.790144] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS
> > > 0.0.0 02/06/2015
> > > [22216.790553] secondary_start_kernel+0x104/0x180
> > > [22216.790778] Call trace:
> > > [22216.791300] Code: d4210000 b0006161 910d4021 94013b45 (d4210000)
> > > [22216.791662] dump_backtrace+0x0/0x1ec
> > > [22216.791664] show_stack+0x24/0x30
> > > [22216.791956] ---[ end trace dc769a61c1af087b ]---
> > > [22216.792295] dump_stack+0xd0/0x128
> > > [22216.792299] bad_page+0xe4/0x110
> > > [22216.792579] Kernel panic - not syncing: Oops - BUG: Fatal exception
> > > in interrupt
> > > [22216.792937] check_free_page_bad+0x84/0x90
> > > [22216.792940] free_pcp_prepare+0x1fc/0x21c
> > > [22216.793253] SMP: stopping secondary CPUs
> > > [22216.793525] free_unref_page+0x2c/0xec
> > > [22216.805537] __put_page+0x60/0x70
> > > [22216.805931] collapse_file+0xdc8/0x12f0
> > > [22216.806385] khugepaged_scan_file+0x2dc/0x37c
> > > [22216.806900] khugepaged_scan_mm_slot+0x2e0/0x380
> > > [22216.807450] khugepaged_do_scan+0x2dc/0x2fc
> > > [22216.807946] khugepaged+0x38/0x100
> > > [22216.808342] kthread+0x11c/0x120
> > > [22216.808735] Kernel Offset: disabled
> > > [22216.809153] CPU features: 0x0040002,62208238
> > > [22216.809681] Memory Limit: none
> > > [22216.813477] Starting crashdump kernel...
> > >
> > > So I think the race also exists between collapse_file and
> > > truncate_pagecache.
> >
> > I do have CONFIG_DEBUG_VM, but I haven't hit this issue yet.
>
> Sorry, it's taken me a long time to get into this bug(s).
>
> I haven't tried to reproduce it, but I do think that Rongwei will
> be proved right.
>
> In doing a lockless lookup of the page cache, we tend to imagine
> that the THP head page will be encountered first, and the special
> treatment for the head will do the right thing to skip the tails.
>
> But when racing against collapse_file()'s rewind, I agree with
> Rongwei that it is possible for truncation (or page cache cleanup
> in this instance) to collect a pagevec which starts with a PageTail
> some way into the compound page.
>
> shmem_undo_range() (which shmem uses rather than truncate_pagecache())
> would not call truncate_inode_page() on a THP tail: if it encounters a
> tail, it will try to split the THP, and not call truncate_inode_page()
> if it cannot. Unless I'm inventing the memory, I now think that I did
> encounter this race between truncate and collapse on huge shmem, and
> had to re-craft my shmem_punch_compound() to handle it correctly.
>
> If it is just a matter of collapse_file() rewind, I suppose we could
> reverse the direction in which that proceeds; but I'm not convinced
> that would be enough, and don't think we need such a "big" change.
>
> Aside from the above page->index mischeck in find_lock_entries(),
> I now think this bug needs nothing more than simply removing the
> VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().

I don't think that's right. This bug was also observed when calling
truncate(), so there's clearly a situation where two concurrent calls
to truncate_pagecache() leaves a THP in the cache.

Unless there's a case where mapping->nr_thps gets corrupted, so the
open() thinks there's no THPs in the page cache, when there's actually
one or more? That might be a problem that we're solving by locking
around the truncate_pagecache() call?

> (You could say move its page->mapping check before its PageTail
> check; but the PageTail check would then never catch anything.)
>
> Rongwei's patch went in the right direction, but why add N lines
> if deleting one is good? And for a while I thought that idea of
> using filemap_invalidate_lock() was very good; but now think
> the page lock we already have is better, and solve both races.
>
> Hugh

2021-09-30 18:58:02

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Thu, 30 Sep 2021, Matthew Wilcox wrote:
> On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
> >
> > Aside from the above page->index mischeck in find_lock_entries(),
> > I now think this bug needs nothing more than simply removing the
> > VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().
>
> I don't think that's right. This bug was also observed when calling
> truncate(), so there's clearly a situation where two concurrent calls
> to truncate_pagecache() leaves a THP in the cache.

I assume you're thinking of one of the fuzzer blkdev ones:
https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
or
https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/

I haven't started on those ones yet: yes, I imagine one or both of those
will need a further fix (S_ISREG() check somewhere if we're lucky; but
could well be nastier); but for the bug in this thread, I expect
removing the VM_BUG_ON_PAGE(PageTail) to be enough.

If you're thinking of something else, please send a link if you can - thanks.

Hugh

2021-10-02 02:37:32

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 10/1/21 12:49 AM, Hugh Dickins wrote:
> On Thu, 30 Sep 2021, Matthew Wilcox wrote:
>> On Wed, Sep 29, 2021 at 10:24:44PM -0700, Hugh Dickins wrote:
>>>
>>> Aside from the above page->index mischeck in find_lock_entries(),
>>> I now think this bug needs nothing more than simply removing the
>>> VM_BUG_ON_PAGE(PageTail(page), page) from truncate_inode_page().
>>
>> I don't think that's right. This bug was also observed when calling
>> truncate(), so there's clearly a situation where two concurrent calls
>> to truncate_pagecache() leaves a THP in the cache.
>
> I assume you're thinking of one of the fuzzer blkdev ones:
> https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> or
> https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
>
> I haven't started on those ones yet: yes, I imagine one or both of those
> will need a further fix (S_ISREG() check somewhere if we're lucky; but
> could well be nastier); but for the bug in this thread, I expect
> removing the VM_BUG_ON_PAGE(PageTail) to be enough.
Thanks for your advices!

I'm so sorry that delay the test due to my recent vacation. I plan to
start further test tomorrow. I think removing the
VM_BUG_ON_PAGE(PageTail) is a good idea, but also think using
filemap_invalidate_lock is safer and necessary. And of course, this is
just my own view! So, now, I tend to use filemap_invalidate_lock and
either check page mapping again or remove VM_BUG_ON_PAGE(PageTail).

Anyway, I will run more tests and then give feedback.

Thanks!
>
> If you're thinking of something else, please send a link if you can - thanks.
>
> Hugh
>

2021-10-02 17:20:22

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <[email protected]> wrote:
> > I assume you're thinking of one of the fuzzer blkdev ones:
> > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > or
> > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> >
> > I haven't started on those ones yet: yes, I imagine one or both of those
> > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > could well be nastier); but for the bug in this thread, I expect
>
> Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> if it is not a regular file, just bail out. Sounds not that nasty to
> me AFAIU.

I don't see why we should have an S_ISREG() check. I agree it's not the
intended usecase, but it ought to work fine. Unless there's something
I'm missing?

2021-10-04 19:11:59

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <[email protected]> wrote:
>
> On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <[email protected]> wrote:
> > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > > or
> > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > >
> > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > could well be nastier); but for the bug in this thread, I expect
> >
> > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > if it is not a regular file, just bail out. Sounds not that nasty to
> > me AFAIU.
>
> I don't see why we should have an S_ISREG() check. I agree it's not the
> intended usecase, but it ought to work fine. Unless there's something
> I'm missing?

Check out this bug report:
https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
and the patch from me:
https://lore.kernel.org/linux-mm/[email protected]/

I don't think we handle buffers correctly for file THP, right? My
patch is ad hoc, so I thought Hugh's suggestion makes some sense to
me. Why do we have THP collapsed for unintended usecase in the first
place?

2021-10-04 19:38:18

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote:
> On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <[email protected]> wrote:
> >
> > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <[email protected]> wrote:
> > > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > > > or
> > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > > >
> > > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > > could well be nastier); but for the bug in this thread, I expect
> > >
> > > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > > if it is not a regular file, just bail out. Sounds not that nasty to
> > > me AFAIU.
> >
> > I don't see why we should have an S_ISREG() check. I agree it's not the
> > intended usecase, but it ought to work fine. Unless there's something
> > I'm missing?
>
> Check out this bug report:
> https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> and the patch from me:
> https://lore.kernel.org/linux-mm/[email protected]/
>
> I don't think we handle buffers correctly for file THP, right? My
> patch is ad hoc, so I thought Hugh's suggestion makes some sense to
> me. Why do we have THP collapsed for unintended usecase in the first
> place?

OK, I've done some more digging. I think what's going on with this
report is userspace opens the block device RO, causes the page cache to
be loaded with data, then khugepaged comes in and creates THPs.
What confuses me is that these THPs have private data attached to them.
I don't know how that happens. If it's block device specific, then
yes, something like your S_ISREG() idea should work fine. Otherwise,
we might need to track down another problem.

Hao Sun, can you try this patch and see what comes out of it?

diff --git a/fs/buffer.c b/fs/buffer.c
index c615387aedca..fefe05a9973d 100644
--- a/fs/buffer.c
+++ b/fs/buffer.c
@@ -872,6 +872,7 @@ link_dev_buffers(struct page *page, struct buffer_head *head)
bh = bh->b_this_page;
} while (bh);
tail->b_this_page = head;
+ VM_BUG_ON_PAGE(PageCompound(page), page);
attach_page_private(page, head);
}

@@ -1577,6 +1578,7 @@ void create_empty_buffers(struct page *page,
bh = bh->b_this_page;
} while (bh != head);
}
+ VM_BUG_ON_PAGE(PageCompound(page), page);
attach_page_private(page, head);
spin_unlock(&page->mapping->private_lock);
}
@@ -2565,6 +2567,7 @@ static void attach_nobh_buffers(struct page *page, struct buffer_head *head)
bh->b_this_page = head;
bh = bh->b_this_page;
} while (bh != head);
+ VM_BUG_ON_PAGE(PageCompound(page), page);
attach_page_private(page, head);
spin_unlock(&page->mapping->private_lock);
}

2021-10-04 23:03:02

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

Hi,
I have run our cases these two days to stress test new Patch #1. The new
Patch #1 mainly add filemap_invalidate_{un}lock before and after
truncate_pagecache(), basing on original Patch #1. And the crash has not
happened.

Now, I keep the original Patch #1, then adding the code below which
suggested by liu song (I'm not sure which one I should add in the next
version, Suggested-by or Signed-off-by? If you know, please remind me).

- if (filemap_nr_thps(inode->i_mapping))
+ if (filemap_nr_thps(inode->i_mapping)) {
+ filemap_invalidate_lock(inode->i_mapping);
truncate_pagecache(inode, 0);
+ filemap_invalidate_unlock(inode->i_mapping);
+ }

And the reason for keeping the original Patch #1 is mainly to fix the
race between collapse_file and truncate_pagecache. It seems necessary.
Despite the two-day test, I did not reproduce this race any more.

In addition, I also test the below method:

diff --git a/mm/truncate.c b/mm/truncate.c
index 3f47190f98a8..33604e4ce60a 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space
*mapping, struct page *page)

int truncate_inode_page(struct address_space *mapping, struct page *page)
{
- VM_BUG_ON_PAGE(PageTail(page), page);
-
if (page->mapping != mapping)
return -EIO;

I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
the test results show that only removing this VM_BUG_ON_PAGE(PageTail)
has no effect. So, I still keep the original Patch #1 to fix one race.

I plan to send Patch v3 after receiving your reply.

Thanks!

On 9/30/21 8:41 AM, Song Liu wrote:
> On Wed, Sep 29, 2021 at 5:02 PM Matthew Wilcox <[email protected]> wrote:
>>
>> On Wed, Sep 29, 2021 at 04:41:48PM -0700, Song Liu wrote:
>>> The issue is NOT caused by concurrent khugepaged:collapse_file() and
>>> truncate_pagecache(inode, 0). With some printks, we can see a clear
>>> time gap (>2 second ) between collapse_file() finishes, and
>>> truncate_pagecache() (which crashes soon). Therefore, my earlier
>>> suggestion that adds deny_write_access() to collapse_file() does NOT
>>> work.
>>>
>>> The crash is actually caused by concurrent truncate_pagecache(inode, 0).
>>> If I change the number of write thread in stress_madvise_dso.c to one,
>>> (IOW, one thread_read and one thread_write), I cannot reproduce the
>>> crash anymore.
>>>
>>> I think this means we cannot fix this issue in collapse_file(), because it
>>> finishes long before the crash.
>>
>> Ah! So are we missing one or more of these locks:
>>
>> inode_lock(inode);
>> filemap_invalidate_lock(mapping);
>>
>> in the open path?
>
> The following fixes the crash in my test. But I am not sure whether this is the
> best fix.
>
> Rongwei, could you please run more tests on it?
>
> Thanks,
> Song
>
>
> diff --git i/fs/open.c w/fs/open.c
> index daa324606a41f..d13c4668b2e53 100644
> --- i/fs/open.c
> +++ w/fs/open.c
> @@ -856,8 +856,11 @@ static int do_dentry_open(struct file *f,
> * of THPs into the page cache will fail.
> */
> smp_mb();
> - if (filemap_nr_thps(inode->i_mapping))
> + if (filemap_nr_thps(inode->i_mapping)) {
> + filemap_invalidate_lock(inode->i_mapping);
> truncate_pagecache(inode, 0);
> + filemap_invalidate_unlock(inode->i_mapping);
> + }
> }
>
> return 0;
>

2021-10-04 23:34:21

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Tue, Oct 05, 2021 at 01:26:50AM +0800, Rongwei Wang wrote:
> Hi,
> I have run our cases these two days to stress test new Patch #1. The new
> Patch #1 mainly add filemap_invalidate_{un}lock before and after
> truncate_pagecache(), basing on original Patch #1. And the crash has not
> happened.

You shouldn't need most of patch 1.

In fact, the only two patches you should need would be this:

+++ b/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
if (!xa_is_value(page)) {
if (page->index < start)
goto put;
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
if (page->index + thp_nr_pages(page) - 1 > end)
goto put;
if (!trylock_page(page))

(already in Andrew's tree) and:

> - if (filemap_nr_thps(inode->i_mapping))
> + if (filemap_nr_thps(inode->i_mapping)) {
> + filemap_invalidate_lock(inode->i_mapping);
> truncate_pagecache(inode, 0);
> + filemap_invalidate_unlock(inode->i_mapping);
> + }

If you can still hit a bug with just those two patches, then something
else is going wrong, and needs to be investigated.

2021-10-04 23:45:24

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Mon, Oct 4, 2021 at 10:26 AM Rongwei Wang
<[email protected]> wrote:
>
> Hi,
> I have run our cases these two days to stress test new Patch #1. The new
> Patch #1 mainly add filemap_invalidate_{un}lock before and after
> truncate_pagecache(), basing on original Patch #1. And the crash has not
> happened.
>
> Now, I keep the original Patch #1, then adding the code below which
> suggested by liu song (I'm not sure which one I should add in the next
> version, Suggested-by or Signed-off-by? If you know, please remind me).
>
> - if (filemap_nr_thps(inode->i_mapping))
> + if (filemap_nr_thps(inode->i_mapping)) {
> + filemap_invalidate_lock(inode->i_mapping);
> truncate_pagecache(inode, 0);
> + filemap_invalidate_unlock(inode->i_mapping);
> + }

It is mostly suggested by Matthew. If the patch goes that way, you can add

Tested-by: Song Liu <[email protected]>

2021-10-05 01:59:46

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 10/5/21 3:05 AM, Matthew Wilcox wrote:
> On Tue, Oct 05, 2021 at 01:26:50AM +0800, Rongwei Wang wrote:
>> Hi,
>> I have run our cases these two days to stress test new Patch #1. The new
>> Patch #1 mainly add filemap_invalidate_{un}lock before and after
>> truncate_pagecache(), basing on original Patch #1. And the crash has not
>> happened.
>
> You shouldn't need most of patch 1.
>
> In fact, the only two patches you should need would be this:
>
> +++ b/mm/filemap.c
> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> if (!xa_is_value(page)) {
> if (page->index < start)
> goto put;
> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> if (page->index + thp_nr_pages(page) - 1 > end)
> goto put;
> if (!trylock_page(page))
>
> (already in Andrew's tree) and:
>
>> - if (filemap_nr_thps(inode->i_mapping))
>> + if (filemap_nr_thps(inode->i_mapping)) {
>> + filemap_invalidate_lock(inode->i_mapping);
>> truncate_pagecache(inode, 0);
>> + filemap_invalidate_unlock(inode->i_mapping);
>> + }
>
> If you can still hit a bug with just those two patches, then something
> else is going wrong, and needs to be investigated.
OK, I see what your mean. I will send Patch v3 and only keep
filemap_invalidate_{un}lock in Patch #1.

Thanks!
>

2021-10-05 02:28:26

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Mon, 4 Oct 2021, Matthew Wilcox wrote:
> On Mon, Oct 04, 2021 at 11:28:45AM -0700, Yang Shi wrote:
> > On Sat, Oct 2, 2021 at 10:09 AM Matthew Wilcox <[email protected]> wrote:
> > > On Thu, Sep 30, 2021 at 10:39:14AM -0700, Yang Shi wrote:
> > > > On Thu, Sep 30, 2021 at 9:49 AM Hugh Dickins <[email protected]> wrote:
> > > > > I assume you're thinking of one of the fuzzer blkdev ones:
> > > > > https://lore.kernel.org/linux-mm/CACkBjsbtF_peC7N_4mRfHML_BeiPe+O9DahTfr84puSG_J9rcQ@mail.gmail.com/
> > > > > or
> > > > > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > > > >
> > > > > I haven't started on those ones yet: yes, I imagine one or both of those
> > > > > will need a further fix (S_ISREG() check somewhere if we're lucky; but
> > > > > could well be nastier); but for the bug in this thread, I expect
> > > >
> > > > Makes sense to me. We should be able to check S_ISREG() in khugepaged,
> > > > if it is not a regular file, just bail out. Sounds not that nasty to
> > > > me AFAIU.
> > >
> > > I don't see why we should have an S_ISREG() check. I agree it's not the
> > > intended usecase, but it ought to work fine. Unless there's something
> > > I'm missing?
> >
> > Check out this bug report:
> > https://lore.kernel.org/lkml/CACkBjsYwLYLRmX8GpsDpMthagWOjWWrNxqY6ZLNQVr6yx+f5vA@mail.gmail.com/
> > and the patch from me:
> > https://lore.kernel.org/linux-mm/[email protected]/
> >
> > I don't think we handle buffers correctly for file THP, right? My
> > patch is ad hoc, so I thought Hugh's suggestion makes some sense to
> > me. Why do we have THP collapsed for unintended usecase in the first
> > place?
>
> OK, I've done some more digging. I think what's going on with this
> report is userspace opens the block device RO, causes the page cache to
> be loaded with data, then khugepaged comes in and creates THPs.

Yes.

> What confuses me is that these THPs have private data attached to them.
> I don't know how that happens. If it's block device specific, then
> yes, something like your S_ISREG() idea should work fine. Otherwise,
> we might need to track down another problem.

Agreed, the file THP is created without PagePrivate, so the puzzle was
why the read-only cached page would later become page_has_private().

The C repro showed that it uses (a BTRFS_IOC_ADD_DEV ioctl which might
not be relevant and) a BLKRRPART ioctl 0x125f: I didn't follow BLKRRPART
all the way down, but imagine it has to attach buffer-heads to re-read
the partition table. Which would explain it.

Aside from that particular ioctl, it seems a good idea to insist on
S_ISREG just to shrink the attack surface: as Yang Shi says, executable
THP on block device was never an intended usecase, and not a usecase
anyone is likely to miss! And that fuzzer appears to delight in
tormenting /dev/nullb0, so let's just seal off that avenue.

You're right to have some doubt, as to whether there might be other
ways for buffer-heads to get attached, even on a read-only regular
file; but no way has sprung to my mind, and READ_ONLY_THP_FOR_FS has
survived well in its intended usage: so I think we should proceed on
the assumption that no further bugs remain - then fix them when found.

I wasn't able to reproduce the problem with the repro, would need to
waste many hours to do so. But here's the untested S_ISREG patch I
came up with. Sorry, I've mixed something else in: in moving the
alignment part to clarify the conditions, I was alarmed to see that
shmem with !shmem_huge_enabled was falling through to THP_FOR_FS to
give unexpected huge pages: fixed that, though later found there's
a separate shmem_huge_enabled() check which should exclude it.

--- 5.15-rc4/mm/khugepaged.c 2021-09-12 17:39:21.943438422 -0700
+++ linux/khugepaged.c 2021-10-03 20:41:13.194822795 -0700
@@ -445,22 +445,25 @@ static bool hugepage_vma_check(struct vm
if (!transhuge_vma_enabled(vma, vm_flags))
return false;

+ if (vma->vm_file && !IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) -
+ vma->vm_pgoff, HPAGE_PMD_NR))
+ return false;
+
/* Enabled via shmem mount options or sysfs settings. */
- if (shmem_file(vma->vm_file) && shmem_huge_enabled(vma)) {
- return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
- HPAGE_PMD_NR);
- }
+ if (shmem_file(vma->vm_file))
+ return shmem_huge_enabled(vma);

/* THP settings require madvise. */
if (!(vm_flags & VM_HUGEPAGE) && !khugepaged_always())
return false;

/* Read-only file mappings need to be aligned for THP to work. */
- if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) && vma->vm_file &&
- !inode_is_open_for_write(vma->vm_file->f_inode) &&
- (vm_flags & VM_EXEC)) {
- return IS_ALIGNED((vma->vm_start >> PAGE_SHIFT) - vma->vm_pgoff,
- HPAGE_PMD_NR);
+ if (IS_ENABLED(CONFIG_READ_ONLY_THP_FOR_FS) &&
+ (vm_flags & VM_EXEC) && vma->vm_file) {
+ struct inode *inode = vma->vm_file->f_inode;
+
+ return !inode_is_open_for_write(inode) &&
+ S_ISREG(inode->i_mode);
}

if (!vma->anon_vma || vma->vm_ops)

2021-10-05 02:59:58

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Tue, 5 Oct 2021, Rongwei Wang wrote:

> Hi,
> I have run our cases these two days to stress test new Patch #1. The new Patch
> #1 mainly add filemap_invalidate_{un}lock before and after
> truncate_pagecache(), basing on original Patch #1. And the crash has not
> happened.
>
> Now, I keep the original Patch #1, then adding the code below which suggested
> by liu song (I'm not sure which one I should add in the next version,
> Suggested-by or Signed-off-by? If you know, please remind me).
>
> - if (filemap_nr_thps(inode->i_mapping))
> + if (filemap_nr_thps(inode->i_mapping)) {
> + filemap_invalidate_lock(inode->i_mapping);
> truncate_pagecache(inode, 0);
> + filemap_invalidate_unlock(inode->i_mapping);
> + }

I won't NAK that patch; but I still believe it's unnecessary, and don't
see how it protects against all the races (collapse_file() does not use
that lock, whereas collapse_file() does use page lock). And if you're
hoping to fix 5.10, then you will have to backport those invalidate_lock
patches there too (they're really intended to protect hole-punching).

>
> And the reason for keeping the original Patch #1 is mainly to fix the race
> between collapse_file and truncate_pagecache. It seems necessary. Despite the
> two-day test, I did not reproduce this race any more.
>
> In addition, I also test the below method:
>
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 3f47190f98a8..33604e4ce60a 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping,
> struct page *page)
>
> int truncate_inode_page(struct address_space *mapping, struct page *page)
> {
> - VM_BUG_ON_PAGE(PageTail(page), page);
> -
> if (page->mapping != mapping)
> return -EIO;
>
> I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
> the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no
> effect. So, I still keep the original Patch #1 to fix one race.

Yes, that's exactly what I meant, and thank you for intending to try it.

But if that patch had "no effect", then I think you were not running the
kernel with that patch applied: because it deletes the BUG on line 213
of mm/truncate.c, which is what you reported in the first mail!

Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel
something else? I've been looking at 5.15-rc.

But I wasn't proposing to delete it merely to hide the BUG: as I hope
I explained, we could move it below the page->mapping check, but it
wouldn't really be of any value there since tails have NULL page->mapping
anyway (well, I didn't check first and second tails, maybe mapping gets
reused for some compound page field in those). I was proposing to delete
it because the page->mapping check then weeds out the racy case once
we're holding page lock, without the need for adding anything special.

Hugh

2021-10-05 03:13:22

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On Mon, Oct 04, 2021 at 07:58:10PM -0700, Hugh Dickins wrote:
> On Tue, 5 Oct 2021, Rongwei Wang wrote:
>
> > Hi,
> > I have run our cases these two days to stress test new Patch #1. The new Patch
> > #1 mainly add filemap_invalidate_{un}lock before and after
> > truncate_pagecache(), basing on original Patch #1. And the crash has not
> > happened.
> >
> > Now, I keep the original Patch #1, then adding the code below which suggested
> > by liu song (I'm not sure which one I should add in the next version,
> > Suggested-by or Signed-off-by? If you know, please remind me).
> >
> > - if (filemap_nr_thps(inode->i_mapping))
> > + if (filemap_nr_thps(inode->i_mapping)) {
> > + filemap_invalidate_lock(inode->i_mapping);
> > truncate_pagecache(inode, 0);
> > + filemap_invalidate_unlock(inode->i_mapping);
> > + }
>
> I won't NAK that patch; but I still believe it's unnecessary, and don't
> see how it protects against all the races (collapse_file() does not use
> that lock, whereas collapse_file() does use page lock). And if you're
> hoping to fix 5.10, then you will have to backport those invalidate_lock
> patches there too (they're really intended to protect hole-punching).

I believe all we really need to do is protect against calling
truncate_pagecache() simultaneously to avoid one of the calls seeing
a tail page. i_mutex would work for this purpose just as well as
filemap_invalidate_lock(). See, for example, ext4_zero_range() which
first takes inode_lock(), then filemap_invalidate_lock() before calling
truncate_pagecache_range().

> > And the reason for keeping the original Patch #1 is mainly to fix the race
> > between collapse_file and truncate_pagecache. It seems necessary. Despite the
> > two-day test, I did not reproduce this race any more.
> >
> > In addition, I also test the below method:
> >
> > diff --git a/mm/truncate.c b/mm/truncate.c
> > index 3f47190f98a8..33604e4ce60a 100644
> > --- a/mm/truncate.c
> > +++ b/mm/truncate.c
> > @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping,
> > struct page *page)
> >
> > int truncate_inode_page(struct address_space *mapping, struct page *page)
> > {
> > - VM_BUG_ON_PAGE(PageTail(page), page);
> > -
> > if (page->mapping != mapping)
> > return -EIO;
> >
> > I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
> > the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no
> > effect. So, I still keep the original Patch #1 to fix one race.
>
> Yes, that's exactly what I meant, and thank you for intending to try it.
>
> But if that patch had "no effect", then I think you were not running the
> kernel with that patch applied: because it deletes the BUG on line 213
> of mm/truncate.c, which is what you reported in the first mail!
>
> Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel
> something else? I've been looking at 5.15-rc.
>
> But I wasn't proposing to delete it merely to hide the BUG: as I hope
> I explained, we could move it below the page->mapping check, but it
> wouldn't really be of any value there since tails have NULL page->mapping
> anyway (well, I didn't check first and second tails, maybe mapping gets
> reused for some compound page field in those). I was proposing to delete
> it because the page->mapping check then weeds out the racy case once
> we're holding page lock, without the need for adding anything special.

I think if we remove the race with the above mutex lock then we'll never
see a tail page in this routine.

2021-10-05 09:07:31

[permalink] [raw]

Subject: Re: [PATCH v2 1/2] mm, thp: check page mapping when truncating page cache

On 10/5/21 10:58 AM, Hugh Dickins wrote:
> On Tue, 5 Oct 2021, Rongwei Wang wrote:
>
>> Hi,
>> I have run our cases these two days to stress test new Patch #1. The new Patch
>> #1 mainly add filemap_invalidate_{un}lock before and after
>> truncate_pagecache(), basing on original Patch #1. And the crash has not
>> happened.
>>
>> Now, I keep the original Patch #1, then adding the code below which suggested
>> by liu song (I'm not sure which one I should add in the next version,
>> Suggested-by or Signed-off-by? If you know, please remind me).
>>
>> - if (filemap_nr_thps(inode->i_mapping))
>> + if (filemap_nr_thps(inode->i_mapping)) {
>> + filemap_invalidate_lock(inode->i_mapping);
>> truncate_pagecache(inode, 0);
>> + filemap_invalidate_unlock(inode->i_mapping);
>> + }
>
> I won't NAK that patch; but I still believe it's unnecessary, and don't
> see how it protects against all the races (collapse_file() does not use
> that lock, whereas collapse_file() does use page lock). And if you're
> hoping to fix 5.10, then you will have to backport those invalidate_lock
> patches there too (they're really intended to protect hole-punching).
>
>>
>> And the reason for keeping the original Patch #1 is mainly to fix the race
>> between collapse_file and truncate_pagecache. It seems necessary. Despite the
>> two-day test, I did not reproduce this race any more.
>>
>> In addition, I also test the below method:
>>
>> diff --git a/mm/truncate.c b/mm/truncate.c
>> index 3f47190f98a8..33604e4ce60a 100644
>> --- a/mm/truncate.c
>> +++ b/mm/truncate.c
>> @@ -210,8 +210,6 @@ invalidate_complete_page(struct address_space *mapping,
>> struct page *page)
>>
>> int truncate_inode_page(struct address_space *mapping, struct page *page)
>> {
>> - VM_BUG_ON_PAGE(PageTail(page), page);
>> -
>> if (page->mapping != mapping)
>> return -EIO;
>>
>> I am not very sure this VM_BUG_ON_PAGE(PageTail) is what Hugh means. And
>> the test results show that only removing this VM_BUG_ON_PAGE(PageTail) has no
>> effect. So, I still keep the original Patch #1 to fix one race.
>
> Yes, that's exactly what I meant, and thank you for intending to try it.
>
> But if that patch had "no effect", then I think you were not running the
> kernel with that patch applied: because it deletes the BUG on line 213
> of mm/truncate.c, which is what you reported in the first mail!
>
> Or, is line 213 of mm/truncate.c in your 5.10.46-hugetext+ kernel
> something else? I've been looking at 5.15-rc.
Hi, Hugh

I'm sorry the confusing '5.10.46-hugetext+'. I am also look and test at
5.15-rc.
>
> But I wasn't proposing to delete it merely to hide the BUG: as I hope
> I explained, we could move it below the page->mapping check, but it
> wouldn't really be of any value there since tails have NULL page->mapping
> anyway (well, I didn't check first and second tails, maybe mapping gets
> reused for some compound page field in those). I was proposing to delete
> it because the page->mapping check then weeds out the racy case once
> we're holding page lock, without the need for adding anything special.
>
> Hugh
Today, I try again to create some cases to reproduce the race, such as
ensuring that multiple processes are always executing
‘truncate_pagecache’ and only mapping 2M DSO. In this way, I try to
ensure that the target of 'collapse_file' and 'truncate_pagecache' can
only be the same VMA, to increase the probability of reproducing that
race. But, I can't reproduce that race any more.

In fact, according to the previous experience, the current number of
attempts should be able to reproduce that race.

If you have the idea about creating this case, please tell me, and I can
try again. Or we can solve it when it appears again.

Thanks!
>

2021-10-06 02:20:25

[permalink] [raw]

Subject: [PATCH v3 v3 0/2] mm, thp: fix file-backed THP race in collapse_file and truncate pagecache

Hi,
We found two bugs related to file-backed THP in our cases, recently.
The two bugs rough description as following:

1. in truncate_inode_pages_range, subpage(s) of file-backed THP can be
revealed by find_get_entry.

2. 'collapse_file' miss the pages which in writeback but no private.
This situation will be triggered in XFS when block size is set to
PAGESIZE.

These two patches mainly fix the above mentioned bugs, and have been
tested in latest branch.

Changelog:

v2 -> v3:
- Patch "mm, thp: lock filemap when truncating page cache"
add filemap_invalidate_{un}lock before and after calling truncate_pagecache (Suggested by Song Liu and Matthew).

v1 -> v2:
- Patch "mm, thp: check page mapping when truncating page cache"
move the check of page mapping to behind lock_page.
- Patch "mm, thp: bail out early in collapse_file for writeback page"
check the writeback flag before taking page lock (Suggested by Yang Shi).

v1 link:
https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/
v2 link:
https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/

Rongwei Wang (2):
mm, thp: lock filemap when truncating page cache
mm, thp: bail out early in collapse_file for writeback page

fs/open.c | 5 ++++-
mm/khugepaged.c | 7 ++++++-
2 files changed, 10 insertions(+), 2 deletions(-)

--
2.27.0

2021-10-06 02:22:50

[permalink] [raw]

Subject: [PATCH v3 v3 2/2] mm, thp: bail out early in collapse_file for writeback page

Currently collapse_file does not explicitly check PG_writeback, instead,
page_has_private and try_to_release_page are used to filter writeback
pages. This does not work for xfs with blocksize equal to or larger
than pagesize, because in such case xfs has no page->private.

This makes collapse_file bail out early for writeback page. Otherwise,
xfs end_page_writeback will panic as follows.

[ 6411.448211] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32
[ 6411.448304] aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so"
[ 6411.448312] flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback)
[ 6411.448317] raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8
[ 6411.448321] raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000
[ 6411.448324] page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
[ 6411.448327] page->mem_cgroup:ffff0000c3e9a000
[ 6411.448340] ------------[ cut here ]------------
[ 6411.448343] kernel BUG at include/linux/mm.h:1212!
[ 6411.449288] Internal error: Oops - BUG: 0 [#1] SMP
[ 6411.449786] Modules linked in:
[ 6411.449790] BUG: Bad page state in process khugepaged pfn:84ef32
[ 6411.450143] xfs(E)
[ 6411.450459] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32
[ 6411.451361] libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ...
[ 6411.451387] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ...
[ 6411.451389] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[ 6411.451393] pc : end_page_writeback+0x1c0/0x214
[ 6411.451394] lr : end_page_writeback+0x1c0/0x214
[ 6411.451395] sp : ffff800011ce3cc0
[ 6411.451396] x29: ffff800011ce3cc0 x28: 0000000000000000
[ 6411.451398] x27: ffff000c04608040 x26: 0000000000000000
[ 6411.451399] x25: ffff000c04608040 x24: 0000000000001000
[ 6411.451401] x23: ffff0003f88c8530 x22: 0000000000001000
[ 6411.451403] x21: ffff0003f88c8530 x20: 0000000000000000
[ 6411.451404] x19: fffffe00201bcc80 x18: 0000000000000030
[ 6411.451406] x17: 0000000000000000 x16: 0000000000000000
[ 6411.451407] x15: ffff000c018f9760 x14: ffffffffffffffff
[ 6411.451409] x13: ffff8000119d72b0 x12: ffff8000119d6ee3
[ 6411.451410] x11: ffff8000117b69b8 x10: 00000000ffff8000
[ 6411.451412] x9 : ffff800010617534 x8 : 0000000000000000
[ 6411.451413] x7 : ffff8000114f69b8 x6 : 000000000000000f
[ 6411.451415] x5 : 0000000000000000 x4 : 0000000000000000
[ 6411.451416] x3 : 0000000000000400 x2 : 0000000000000000
[ 6411.451418] x1 : 0000000000000000 x0 : 0000000000000000
[ 6411.451420] Call trace:
[ 6411.451421] end_page_writeback+0x1c0/0x214
[ 6411.451424] iomap_finish_page_writeback+0x13c/0x204
[ 6411.451425] iomap_finish_ioend+0xe8/0x19c
[ 6411.451426] iomap_writepage_end_bio+0x38/0x50
[ 6411.451427] bio_endio+0x168/0x1ec
[ 6411.451430] blk_update_request+0x278/0x3f0
[ 6411.451432] blk_mq_end_request+0x34/0x15c
[ 6411.451435] virtblk_request_done+0x38/0x74 [virtio_blk]
[ 6411.451437] blk_done_softirq+0xc4/0x110
[ 6411.451439] __do_softirq+0x128/0x38c
[ 6411.451441] __irq_exit_rcu+0x118/0x150
[ 6411.451442] irq_exit+0x1c/0x30
[ 6411.451445] __handle_domain_irq+0x8c/0xf0
[ 6411.451446] gic_handle_irq+0x84/0x108
[ 6411.451447] el1_irq+0xcc/0x180
[ 6411.451448] arch_cpu_idle+0x18/0x40
[ 6411.451450] default_idle_call+0x4c/0x1a0
[ 6411.451453] cpuidle_idle_call+0x168/0x1e0
[ 6411.451454] do_idle+0xb4/0x104
[ 6411.451455] cpu_startup_entry+0x30/0x9c
[ 6411.451458] secondary_start_kernel+0x104/0x180
[ 6411.451460] Code: d4210000 b0006161 910c8021 94013f4d (d4210000)
[ 6411.451462] ---[ end trace 4a88c6a074082f8c ]---
[ 6411.451464] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt

Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Suggested-by: Yang Shi <[email protected]>
Signed-off-by: Xu Yu <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc579f724..48de4e1b0783 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1763,6 +1763,10 @@ static void collapse_file(struct mm_struct *mm,
filemap_flush(mapping);
result = SCAN_FAIL;
goto xa_unlocked;
+ } else if (PageWriteback(page)) {
+ xas_unlock_irq(&xas);
+ result = SCAN_FAIL;
+ goto xa_unlocked;
} else if (trylock_page(page)) {
get_page(page);
xas_unlock_irq(&xas);
@@ -1798,7 +1802,8 @@ static void collapse_file(struct mm_struct *mm,
goto out_unlock;
}

- if (!is_shmem && PageDirty(page)) {
+ if (!is_shmem && (PageDirty(page) ||
+ PageWriteback(page))) {
/*
* khugepaged only works on read-only fd, so this
* page is dirty because it hasn't been flushed
--
2.27.0

2021-10-06 02:44:55

[permalink] [raw]

Subject: Re: [PATCH v3 v3 2/2] mm, thp: bail out early in collapse_file for writeback page

On Wed, Oct 06, 2021 at 10:18:37AM +0800, Rongwei Wang wrote:
> Currently collapse_file does not explicitly check PG_writeback, instead,
> page_has_private and try_to_release_page are used to filter writeback
> pages. This does not work for xfs with blocksize equal to or larger
> than pagesize, because in such case xfs has no page->private.
>
> This makes collapse_file bail out early for writeback page. Otherwise,
> xfs end_page_writeback will panic as follows.

Could you cut the timestamps out? They don't add any value.

> [ 6411.448211] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32
> [ 6411.448304] aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so"
> [ 6411.448312] flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback)
> [ 6411.448317] raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8
> [ 6411.448321] raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000
> [ 6411.448324] page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
> [ 6411.448327] page->mem_cgroup:ffff0000c3e9a000
> [ 6411.448340] ------------[ cut here ]------------
> [ 6411.448343] kernel BUG at include/linux/mm.h:1212!
> [ 6411.449288] Internal error: Oops - BUG: 0 [#1] SMP
> [ 6411.449786] Modules linked in:
> [ 6411.449790] BUG: Bad page state in process khugepaged pfn:84ef32
> [ 6411.450143] xfs(E)
> [ 6411.450459] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32
> [ 6411.451361] libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ...
> [ 6411.451387] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ...
> [ 6411.451389] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> [ 6411.451393] pc : end_page_writeback+0x1c0/0x214
> [ 6411.451394] lr : end_page_writeback+0x1c0/0x214
> [ 6411.451395] sp : ffff800011ce3cc0
> [ 6411.451396] x29: ffff800011ce3cc0 x28: 0000000000000000
> [ 6411.451398] x27: ffff000c04608040 x26: 0000000000000000
> [ 6411.451399] x25: ffff000c04608040 x24: 0000000000001000
> [ 6411.451401] x23: ffff0003f88c8530 x22: 0000000000001000
> [ 6411.451403] x21: ffff0003f88c8530 x20: 0000000000000000
> [ 6411.451404] x19: fffffe00201bcc80 x18: 0000000000000030
> [ 6411.451406] x17: 0000000000000000 x16: 0000000000000000
> [ 6411.451407] x15: ffff000c018f9760 x14: ffffffffffffffff
> [ 6411.451409] x13: ffff8000119d72b0 x12: ffff8000119d6ee3
> [ 6411.451410] x11: ffff8000117b69b8 x10: 00000000ffff8000
> [ 6411.451412] x9 : ffff800010617534 x8 : 0000000000000000
> [ 6411.451413] x7 : ffff8000114f69b8 x6 : 000000000000000f
> [ 6411.451415] x5 : 0000000000000000 x4 : 0000000000000000
> [ 6411.451416] x3 : 0000000000000400 x2 : 0000000000000000
> [ 6411.451418] x1 : 0000000000000000 x0 : 0000000000000000
> [ 6411.451420] Call trace:
> [ 6411.451421] end_page_writeback+0x1c0/0x214
> [ 6411.451424] iomap_finish_page_writeback+0x13c/0x204
> [ 6411.451425] iomap_finish_ioend+0xe8/0x19c
> [ 6411.451426] iomap_writepage_end_bio+0x38/0x50
> [ 6411.451427] bio_endio+0x168/0x1ec
> [ 6411.451430] blk_update_request+0x278/0x3f0
> [ 6411.451432] blk_mq_end_request+0x34/0x15c
> [ 6411.451435] virtblk_request_done+0x38/0x74 [virtio_blk]
> [ 6411.451437] blk_done_softirq+0xc4/0x110
> [ 6411.451439] __do_softirq+0x128/0x38c
> [ 6411.451441] __irq_exit_rcu+0x118/0x150
> [ 6411.451442] irq_exit+0x1c/0x30
> [ 6411.451445] __handle_domain_irq+0x8c/0xf0
> [ 6411.451446] gic_handle_irq+0x84/0x108
> [ 6411.451447] el1_irq+0xcc/0x180
> [ 6411.451448] arch_cpu_idle+0x18/0x40
> [ 6411.451450] default_idle_call+0x4c/0x1a0
> [ 6411.451453] cpuidle_idle_call+0x168/0x1e0
> [ 6411.451454] do_idle+0xb4/0x104
> [ 6411.451455] cpu_startup_entry+0x30/0x9c
> [ 6411.451458] secondary_start_kernel+0x104/0x180
> [ 6411.451460] Code: d4210000 b0006161 910c8021 94013f4d (d4210000)
> [ 6411.451462] ---[ end trace 4a88c6a074082f8c ]---
> [ 6411.451464] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
>
> Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
> Suggested-by: Yang Shi <[email protected]>
> Signed-off-by: Xu Yu <[email protected]>
> Signed-off-by: Rongwei Wang <[email protected]>

Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>

Although I think the Fixes: line is wrong. This bug goes all the
way back to 99cb0dbd47a1.

2021-10-06 08:44:09

[permalink] [raw]

Subject: Re: [PATCH v3 v3 2/2] mm, thp: bail out early in collapse_file for writeback page

On 10/6/21 10:41 AM, Matthew Wilcox wrote:
> On Wed, Oct 06, 2021 at 10:18:37AM +0800, Rongwei Wang wrote:
>> Currently collapse_file does not explicitly check PG_writeback, instead,
>> page_has_private and try_to_release_page are used to filter writeback
>> pages. This does not work for xfs with blocksize equal to or larger
>> than pagesize, because in such case xfs has no page->private.
>>
>> This makes collapse_file bail out early for writeback page. Otherwise,
>> xfs end_page_writeback will panic as follows.
>
> Could you cut the timestamps out? They don't add any value.
OK, no problem!
>
>> [ 6411.448211] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32
>> [ 6411.448304] aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so"
>> [ 6411.448312] flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback)
>> [ 6411.448317] raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8
>> [ 6411.448321] raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000
>> [ 6411.448324] page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
>> [ 6411.448327] page->mem_cgroup:ffff0000c3e9a000
>> [ 6411.448340] ------------[ cut here ]------------
>> [ 6411.448343] kernel BUG at include/linux/mm.h:1212!
>> [ 6411.449288] Internal error: Oops - BUG: 0 [#1] SMP
>> [ 6411.449786] Modules linked in:
>> [ 6411.449790] BUG: Bad page state in process khugepaged pfn:84ef32
>> [ 6411.450143] xfs(E)
>> [ 6411.450459] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32
>> [ 6411.451361] libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ...
>> [ 6411.451387] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ...
>> [ 6411.451389] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
>> [ 6411.451393] pc : end_page_writeback+0x1c0/0x214
>> [ 6411.451394] lr : end_page_writeback+0x1c0/0x214
>> [ 6411.451395] sp : ffff800011ce3cc0
>> [ 6411.451396] x29: ffff800011ce3cc0 x28: 0000000000000000
>> [ 6411.451398] x27: ffff000c04608040 x26: 0000000000000000
>> [ 6411.451399] x25: ffff000c04608040 x24: 0000000000001000
>> [ 6411.451401] x23: ffff0003f88c8530 x22: 0000000000001000
>> [ 6411.451403] x21: ffff0003f88c8530 x20: 0000000000000000
>> [ 6411.451404] x19: fffffe00201bcc80 x18: 0000000000000030
>> [ 6411.451406] x17: 0000000000000000 x16: 0000000000000000
>> [ 6411.451407] x15: ffff000c018f9760 x14: ffffffffffffffff
>> [ 6411.451409] x13: ffff8000119d72b0 x12: ffff8000119d6ee3
>> [ 6411.451410] x11: ffff8000117b69b8 x10: 00000000ffff8000
>> [ 6411.451412] x9 : ffff800010617534 x8 : 0000000000000000
>> [ 6411.451413] x7 : ffff8000114f69b8 x6 : 000000000000000f
>> [ 6411.451415] x5 : 0000000000000000 x4 : 0000000000000000
>> [ 6411.451416] x3 : 0000000000000400 x2 : 0000000000000000
>> [ 6411.451418] x1 : 0000000000000000 x0 : 0000000000000000
>> [ 6411.451420] Call trace:
>> [ 6411.451421] end_page_writeback+0x1c0/0x214
>> [ 6411.451424] iomap_finish_page_writeback+0x13c/0x204
>> [ 6411.451425] iomap_finish_ioend+0xe8/0x19c
>> [ 6411.451426] iomap_writepage_end_bio+0x38/0x50
>> [ 6411.451427] bio_endio+0x168/0x1ec
>> [ 6411.451430] blk_update_request+0x278/0x3f0
>> [ 6411.451432] blk_mq_end_request+0x34/0x15c
>> [ 6411.451435] virtblk_request_done+0x38/0x74 [virtio_blk]
>> [ 6411.451437] blk_done_softirq+0xc4/0x110
>> [ 6411.451439] __do_softirq+0x128/0x38c
>> [ 6411.451441] __irq_exit_rcu+0x118/0x150
>> [ 6411.451442] irq_exit+0x1c/0x30
>> [ 6411.451445] __handle_domain_irq+0x8c/0xf0
>> [ 6411.451446] gic_handle_irq+0x84/0x108
>> [ 6411.451447] el1_irq+0xcc/0x180
>> [ 6411.451448] arch_cpu_idle+0x18/0x40
>> [ 6411.451450] default_idle_call+0x4c/0x1a0
>> [ 6411.451453] cpuidle_idle_call+0x168/0x1e0
>> [ 6411.451454] do_idle+0xb4/0x104
>> [ 6411.451455] cpu_startup_entry+0x30/0x9c
>> [ 6411.451458] secondary_start_kernel+0x104/0x180
>> [ 6411.451460] Code: d4210000 b0006161 910c8021 94013f4d (d4210000)
>> [ 6411.451462] ---[ end trace 4a88c6a074082f8c ]---
>> [ 6411.451464] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
>>
>> Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
>> Suggested-by: Yang Shi <[email protected]>
>> Signed-off-by: Xu Yu <[email protected]>
>> Signed-off-by: Rongwei Wang <[email protected]>
>
> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>
> Although I think the Fixes: line is wrong. This bug goes all the
> way back to 99cb0dbd47a1.
In fact, I understand it this way: the 99cb0dbd47a1 only introduce
file-backed THP for VMA with MAP_DENYWRITE, denying write directly not
to truncate pagecache when it is opened. And eb6ecbed0aa2 will to
truncate pagecache when DSO is opened by a writer, then will trigger ...

It seems 'Fixes: eb6ecbed0aa2' is more reasonable.
>

2021-10-06 18:01:22

[permalink] [raw]

Subject: Re: [PATCH v3 v3 2/2] mm, thp: bail out early in collapse_file for writeback page

On Tue, Oct 5, 2021 at 7:18 PM Rongwei Wang
<[email protected]> wrote:
>
> Currently collapse_file does not explicitly check PG_writeback, instead,
> page_has_private and try_to_release_page are used to filter writeback
> pages. This does not work for xfs with blocksize equal to or larger
> than pagesize, because in such case xfs has no page->private.
>
> This makes collapse_file bail out early for writeback page. Otherwise,
> xfs end_page_writeback will panic as follows.
>
> [ 6411.448211] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32
> [ 6411.448304] aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so"
> [ 6411.448312] flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback)
> [ 6411.448317] raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8
> [ 6411.448321] raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000
> [ 6411.448324] page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
> [ 6411.448327] page->mem_cgroup:ffff0000c3e9a000
> [ 6411.448340] ------------[ cut here ]------------
> [ 6411.448343] kernel BUG at include/linux/mm.h:1212!
> [ 6411.449288] Internal error: Oops - BUG: 0 [#1] SMP
> [ 6411.449786] Modules linked in:
> [ 6411.449790] BUG: Bad page state in process khugepaged pfn:84ef32
> [ 6411.450143] xfs(E)
> [ 6411.450459] page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32
> [ 6411.451361] libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ...
> [ 6411.451387] CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ...
> [ 6411.451389] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
> [ 6411.451393] pc : end_page_writeback+0x1c0/0x214
> [ 6411.451394] lr : end_page_writeback+0x1c0/0x214
> [ 6411.451395] sp : ffff800011ce3cc0
> [ 6411.451396] x29: ffff800011ce3cc0 x28: 0000000000000000
> [ 6411.451398] x27: ffff000c04608040 x26: 0000000000000000
> [ 6411.451399] x25: ffff000c04608040 x24: 0000000000001000
> [ 6411.451401] x23: ffff0003f88c8530 x22: 0000000000001000
> [ 6411.451403] x21: ffff0003f88c8530 x20: 0000000000000000
> [ 6411.451404] x19: fffffe00201bcc80 x18: 0000000000000030
> [ 6411.451406] x17: 0000000000000000 x16: 0000000000000000
> [ 6411.451407] x15: ffff000c018f9760 x14: ffffffffffffffff
> [ 6411.451409] x13: ffff8000119d72b0 x12: ffff8000119d6ee3
> [ 6411.451410] x11: ffff8000117b69b8 x10: 00000000ffff8000
> [ 6411.451412] x9 : ffff800010617534 x8 : 0000000000000000
> [ 6411.451413] x7 : ffff8000114f69b8 x6 : 000000000000000f
> [ 6411.451415] x5 : 0000000000000000 x4 : 0000000000000000
> [ 6411.451416] x3 : 0000000000000400 x2 : 0000000000000000
> [ 6411.451418] x1 : 0000000000000000 x0 : 0000000000000000
> [ 6411.451420] Call trace:
> [ 6411.451421] end_page_writeback+0x1c0/0x214
> [ 6411.451424] iomap_finish_page_writeback+0x13c/0x204
> [ 6411.451425] iomap_finish_ioend+0xe8/0x19c
> [ 6411.451426] iomap_writepage_end_bio+0x38/0x50
> [ 6411.451427] bio_endio+0x168/0x1ec
> [ 6411.451430] blk_update_request+0x278/0x3f0
> [ 6411.451432] blk_mq_end_request+0x34/0x15c
> [ 6411.451435] virtblk_request_done+0x38/0x74 [virtio_blk]
> [ 6411.451437] blk_done_softirq+0xc4/0x110
> [ 6411.451439] __do_softirq+0x128/0x38c
> [ 6411.451441] __irq_exit_rcu+0x118/0x150
> [ 6411.451442] irq_exit+0x1c/0x30
> [ 6411.451445] __handle_domain_irq+0x8c/0xf0
> [ 6411.451446] gic_handle_irq+0x84/0x108
> [ 6411.451447] el1_irq+0xcc/0x180
> [ 6411.451448] arch_cpu_idle+0x18/0x40
> [ 6411.451450] default_idle_call+0x4c/0x1a0
> [ 6411.451453] cpuidle_idle_call+0x168/0x1e0
> [ 6411.451454] do_idle+0xb4/0x104
> [ 6411.451455] cpu_startup_entry+0x30/0x9c
> [ 6411.451458] secondary_start_kernel+0x104/0x180
> [ 6411.451460] Code: d4210000 b0006161 910c8021 94013f4d (d4210000)
> [ 6411.451462] ---[ end trace 4a88c6a074082f8c ]---
> [ 6411.451464] Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt
>
> Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
> Suggested-by: Yang Shi <[email protected]>
> Signed-off-by: Xu Yu <[email protected]>
> Signed-off-by: Rongwei Wang <[email protected]>

Reviewed-by: Yang Shi <[email protected]>

> ---
> mm/khugepaged.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 045cc579f724..48de4e1b0783 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1763,6 +1763,10 @@ static void collapse_file(struct mm_struct *mm,
> filemap_flush(mapping);
> result = SCAN_FAIL;
> goto xa_unlocked;
> + } else if (PageWriteback(page)) {
> + xas_unlock_irq(&xas);
> + result = SCAN_FAIL;
> + goto xa_unlocked;
> } else if (trylock_page(page)) {
> get_page(page);
> xas_unlock_irq(&xas);
> @@ -1798,7 +1802,8 @@ static void collapse_file(struct mm_struct *mm,
> goto out_unlock;
> }
>
> - if (!is_shmem && PageDirty(page)) {
> + if (!is_shmem && (PageDirty(page) ||
> + PageWriteback(page))) {
> /*
> * khugepaged only works on read-only fd, so this
> * page is dirty because it hasn't been flushed
> --
> 2.27.0
>
>

2021-10-11 03:10:28

[permalink] [raw]

Subject: [PATCH v4 0/2] mm, thp: fix file-backed THP race in collapse_file and truncate pagecache

Hi
We found two bugs related to file-backed THP in our cases, recently.
The two bugs rough description as following:

1. in truncate_inode_pages_range, subpage(s) of file-backed THP can be
revealed by find_get_entry.

2. 'collapse_file' miss the pages which in writeback but no private.
This situation will be triggered in XFS when block size is set to
PAGESIZE.

These two patches mainly fix the above mentioned bugs, and have been
tested in latest branch.

Changelog:

v3 -> v4:
- Patch "mm, thp: lock filemap when truncating page cache"
- Patch "mm, thp: bail out early in collapse_file for writeback page"
remove the timestamps from the commit log.

v2 -> v3:
- Patch "mm, thp: lock filemap when truncating page cache"
add filemap_invalidate_{un}lock before and after calling truncate_pagecache (Suggested by Song Liu and Matthew).

v1 -> v2:
- Patch "mm, thp: check page mapping when truncating page cache"
move the check of page mapping to behind lock_page.
- Patch "mm, thp: bail out early in collapse_file for writeback page"
check the writeback flag before taking page lock (Suggested by Yang Shi).

v1 link:
https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/
v2 link:
https://patchwork.kernel.org/project/linux-mm/patch/[email protected]/
v3 link:
https://patchwork.kernel.org/project/linux-mm/cover/[email protected]/

Rongwei Wang (2):
mm, thp: lock filemap when truncating page cache
mm, thp: bail out early in collapse_file for writeback page

fs/open.c | 5 ++++-
mm/khugepaged.c | 7 ++++++-
2 files changed, 10 insertions(+), 2 deletions(-)

--
2.27.0

2021-10-11 03:10:38

[permalink] [raw]

Subject: [PATCH v4 2/2] mm, thp: bail out early in collapse_file for writeback page

Currently collapse_file does not explicitly check PG_writeback, instead,
page_has_private and try_to_release_page are used to filter writeback
pages. This does not work for xfs with blocksize equal to or larger
than pagesize, because in such case xfs has no page->private.

This makes collapse_file bail out early for writeback page. Otherwise,
xfs end_page_writeback will panic as follows.

page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:ffff0003f88c86a8 index:0x0 pfn:0x84ef32
aops:xfs_address_space_operations [xfs] ino:30000b7 dentry name:"libtest.so"
flags: 0x57fffe0000008027(locked|referenced|uptodate|active|writeback)
raw: 57fffe0000008027 ffff80001b48bc28 ffff80001b48bc28 ffff0003f88c86a8
raw: 0000000000000000 0000000000000000 00000000ffffffff ffff0000c3e9a000
page dumped because: VM_BUG_ON_PAGE(((unsigned int) page_ref_count(page) + 127u <= 127u))
page->mem_cgroup:ffff0000c3e9a000
------------[ cut here ]------------
kernel BUG at include/linux/mm.h:1212!
Internal error: Oops - BUG: 0 [#1] SMP
Modules linked in:
BUG: Bad page state in process khugepaged pfn:84ef32
xfs(E)
page:fffffe00201bcc80 refcount:0 mapcount:0 mapping:0 index:0x0 pfn:0x84ef32
libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) ...
CPU: 25 PID: 0 Comm: swapper/25 Kdump: loaded Tainted: ...
pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
pc : end_page_writeback+0x1c0/0x214
lr : end_page_writeback+0x1c0/0x214
sp : ffff800011ce3cc0
x29: ffff800011ce3cc0 x28: 0000000000000000
x27: ffff000c04608040 x26: 0000000000000000
x25: ffff000c04608040 x24: 0000000000001000
x23: ffff0003f88c8530 x22: 0000000000001000
x21: ffff0003f88c8530 x20: 0000000000000000
x19: fffffe00201bcc80 x18: 0000000000000030
x17: 0000000000000000 x16: 0000000000000000
x15: ffff000c018f9760 x14: ffffffffffffffff
x13: ffff8000119d72b0 x12: ffff8000119d6ee3
x11: ffff8000117b69b8 x10: 00000000ffff8000
x9 : ffff800010617534 x8 : 0000000000000000
x7 : ffff8000114f69b8 x6 : 000000000000000f
x5 : 0000000000000000 x4 : 0000000000000000
x3 : 0000000000000400 x2 : 0000000000000000
x1 : 0000000000000000 x0 : 0000000000000000
Call trace:
end_page_writeback+0x1c0/0x214
iomap_finish_page_writeback+0x13c/0x204
iomap_finish_ioend+0xe8/0x19c
iomap_writepage_end_bio+0x38/0x50
bio_endio+0x168/0x1ec
blk_update_request+0x278/0x3f0
blk_mq_end_request+0x34/0x15c
virtblk_request_done+0x38/0x74 [virtio_blk]
blk_done_softirq+0xc4/0x110
__do_softirq+0x128/0x38c
__irq_exit_rcu+0x118/0x150
irq_exit+0x1c/0x30
__handle_domain_irq+0x8c/0xf0
gic_handle_irq+0x84/0x108
el1_irq+0xcc/0x180
arch_cpu_idle+0x18/0x40
default_idle_call+0x4c/0x1a0
cpuidle_idle_call+0x168/0x1e0
do_idle+0xb4/0x104
cpu_startup_entry+0x30/0x9c
secondary_start_kernel+0x104/0x180
Code: d4210000 b0006161 910c8021 94013f4d (d4210000)
---[ end trace 4a88c6a074082f8c ]---
Kernel panic - not syncing: Oops - BUG: Fatal exception in interrupt

Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Suggested-by: Yang Shi <[email protected]>
Signed-off-by: Xu Yu <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
Reviewed-by: Yang Shi <[email protected]>
---
mm/khugepaged.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/mm/khugepaged.c b/mm/khugepaged.c
index 045cc579f724..48de4e1b0783 100644
--- a/mm/khugepaged.c
+++ b/mm/khugepaged.c
@@ -1763,6 +1763,10 @@ static void collapse_file(struct mm_struct *mm,
filemap_flush(mapping);
result = SCAN_FAIL;
goto xa_unlocked;
+ } else if (PageWriteback(page)) {
+ xas_unlock_irq(&xas);
+ result = SCAN_FAIL;
+ goto xa_unlocked;
} else if (trylock_page(page)) {
get_page(page);
xas_unlock_irq(&xas);
@@ -1798,7 +1802,8 @@ static void collapse_file(struct mm_struct *mm,
goto out_unlock;
}

- if (!is_shmem && PageDirty(page)) {
+ if (!is_shmem && (PageDirty(page) ||
+ PageWriteback(page))) {
/*
* khugepaged only works on read-only fd, so this
* page is dirty because it hasn't been flushed
--
2.27.0

2021-10-11 05:34:47

[permalink] [raw]

Subject: Re: [PATCH v4 2/2] mm, thp: bail out early in collapse_file for writeback page

On Mon, Oct 11, 2021 at 10:22:41AM +0800, Rongwei Wang wrote:
> Currently collapse_file does not explicitly check PG_writeback, instead,
> page_has_private and try_to_release_page are used to filter writeback
> pages. This does not work for xfs with blocksize equal to or larger
> than pagesize, because in such case xfs has no page->private.
>
> This makes collapse_file bail out early for writeback page. Otherwise,
> xfs end_page_writeback will panic as follows.
>
> Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")

This is the wrong Fixes line. This same bug exists earlier than this.
Your testing may not show it before then, but if you mmap something
that isn't an executable, you can provoke it. It should be:

Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")

(unless there's something I'm missing?)

Also, this should surely have a Cc: [email protected] in the
tags section? It's a user-visible bug, we want it backported.

> Suggested-by: Yang Shi <[email protected]>
> Signed-off-by: Xu Yu <[email protected]>
> Signed-off-by: Rongwei Wang <[email protected]>
> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
> Reviewed-by: Yang Shi <[email protected]>
> ---
> mm/khugepaged.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
> index 045cc579f724..48de4e1b0783 100644
> --- a/mm/khugepaged.c
> +++ b/mm/khugepaged.c
> @@ -1763,6 +1763,10 @@ static void collapse_file(struct mm_struct *mm,
> filemap_flush(mapping);
> result = SCAN_FAIL;
> goto xa_unlocked;
> + } else if (PageWriteback(page)) {
> + xas_unlock_irq(&xas);
> + result = SCAN_FAIL;
> + goto xa_unlocked;
> } else if (trylock_page(page)) {
> get_page(page);
> xas_unlock_irq(&xas);
> @@ -1798,7 +1802,8 @@ static void collapse_file(struct mm_struct *mm,
> goto out_unlock;
> }
>
> - if (!is_shmem && PageDirty(page)) {
> + if (!is_shmem && (PageDirty(page) ||
> + PageWriteback(page))) {
> /*
> * khugepaged only works on read-only fd, so this
> * page is dirty because it hasn't been flushed
> --
> 2.27.0
>

2021-10-11 05:38:07

[permalink] [raw]

Subject: Re: [PATCH v4 2/2] mm, thp: bail out early in collapse_file for writeback page

On 10/11/21 11:08 AM, Matthew Wilcox wrote:
> On Mon, Oct 11, 2021 at 10:22:41AM +0800, Rongwei Wang wrote:
>> Currently collapse_file does not explicitly check PG_writeback, instead,
>> page_has_private and try_to_release_page are used to filter writeback
>> pages. This does not work for xfs with blocksize equal to or larger
>> than pagesize, because in such case xfs has no page->private.
>>
>> This makes collapse_file bail out early for writeback page. Otherwise,
>> xfs end_page_writeback will panic as follows.
>>
>> Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
>
> This is the wrong Fixes line. This same bug exists earlier than this.
> Your testing may not show it before then, but if you mmap something
> that isn't an executable, you can provoke it. It should be:
>
> Fixes: 99cb0dbd47a1 ("mm,thp: add read-only THP support for (non-shmem) FS")
>
> (unless there's something I'm missing?)
Hi, Matthew
I forget the Patch #2 fix the bug that is different with Patch #1. I
will update this.

Thanks for your remind!
>
> Also, this should surely have a Cc: [email protected] in the
> tags section? It's a user-visible bug, we want it backported.
OK, Thanks!
>
>> Suggested-by: Yang Shi <[email protected]>
>> Signed-off-by: Xu Yu <[email protected]>
>> Signed-off-by: Rongwei Wang <[email protected]>
>> Reviewed-by: Matthew Wilcox (Oracle) <[email protected]>
>> Reviewed-by: Yang Shi <[email protected]>
>> ---
>> mm/khugepaged.c | 7 ++++++-
>> 1 file changed, 6 insertions(+), 1 deletion(-)
>>
>> diff --git a/mm/khugepaged.c b/mm/khugepaged.c
>> index 045cc579f724..48de4e1b0783 100644
>> --- a/mm/khugepaged.c
>> +++ b/mm/khugepaged.c
>> @@ -1763,6 +1763,10 @@ static void collapse_file(struct mm_struct *mm,
>> filemap_flush(mapping);
>> result = SCAN_FAIL;
>> goto xa_unlocked;
>> + } else if (PageWriteback(page)) {
>> + xas_unlock_irq(&xas);
>> + result = SCAN_FAIL;
>> + goto xa_unlocked;
>> } else if (trylock_page(page)) {
>> get_page(page);
>> xas_unlock_irq(&xas);
>> @@ -1798,7 +1802,8 @@ static void collapse_file(struct mm_struct *mm,
>> goto out_unlock;
>> }
>>
>> - if (!is_shmem && PageDirty(page)) {
>> + if (!is_shmem && (PageDirty(page) ||
>> + PageWriteback(page))) {
>> /*
>> * khugepaged only works on read-only fd, so this
>> * page is dirty because it hasn't been flushed
>> --
>> 2.27.0
>>

2021-10-11 11:22:31