Transparent huge pages (THP) now support read-only non-shmem files. A
file-backed THP is collapsed by khugepaged and truncated when the file is
written (for shared libraries).
However, there are races in two places:
1) multiple writers truncate the same page cache concurrently;
2) collapse_file rolls back while a writer truncates the page cache.
In both cases, subpage(s) of a file THP can be returned by find_get_entry
in truncate_inode_pages_range, which triggers the PageTail BUG_ON in
truncate_inode_page, as follows.
[40326.247034] page:000000009e420ff2 refcount:1 mapcount:0 mapping:0000000000000000 index:0x7ff pfn:0x50c3ff
[40326.247041] head:0000000075ff816d order:9 compound_mapcount:0 compound_pincount:0
[40326.247046] flags: 0x37fffe0000010815(locked|uptodate|lru|arch_1|head)
[40326.247051] raw: 37fffe0000000000 fffffe0013108001 dead000000000122 dead000000000400
[40326.247053] raw: 0000000000000001 0000000000000000 00000000ffffffff 0000000000000000
[40326.247055] head: 37fffe0000010815 fffffe001066bd48 ffff000404183c20 0000000000000000
[40326.247057] head: 0000000000000600 0000000000000000 00000001ffffffff ffff000c0345a000
[40326.247058] page dumped because: VM_BUG_ON_PAGE(PageTail(page))
[40326.247077] ------------[ cut here ]------------
[40326.247080] kernel BUG at mm/truncate.c:213!
[40326.280581] Internal error: Oops - BUG: 0 [#1] SMP
[40326.281077] Modules linked in: xfs(E) libcrc32c(E) rfkill(E) aes_ce_blk(E) crypto_simd(E) cryptd(E) aes_ce_cipher(E) crct10dif_ce(E) ghash_ce(E) sha1_ce(E) uio_pdrv_genirq(E) uio(E) nfsd(E) vfat(E) fat(E) auth_rpcgss(E) nfs_acl(E) lockd(E) grace(E) sunrpc(E) sch_fq_codel(E) ip_tables(E) ext4(E) mbcache(E) jbd2(E) virtio_net(E) net_failover(E) virtio_blk(E) failover(E) sha2_ce(E) sha256_arm64(E) virtio_mmio(E) virtio_pci(E) virtio_ring(E) virtio(E)
[40326.285130] CPU: 14 PID: 11394 Comm: check_madvise_d Kdump: loaded Tainted: G W E 5.10.46-hugetext+ #55
[40326.286202] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 0.0.0 02/06/2015
[40326.286968] pstate: 60400005 (nZCv daif +PAN -UAO -TCO BTYPE=--)
[40326.287584] pc : truncate_inode_page+0x64/0x70
[40326.288040] lr : truncate_inode_page+0x64/0x70
[40326.288498] sp : ffff80001b60b900
[40326.288837] x29: ffff80001b60b900 x28: 00000000000007ff
[40326.289377] x27: ffff80001b60b9a0 x26: 0000000000000000
[40326.289943] x25: 000000000000000f x24: ffff80001b60b9a0
[40326.290485] x23: ffff80001b60ba18 x22: ffff0001e0999ea8
[40326.291027] x21: ffff0000c21db300 x20: ffffffffffffffff
[40326.291566] x19: fffffe001310ffc0 x18: 0000000000000020
[40326.292106] x17: 0000000000000000 x16: 0000000000000000
[40326.292655] x15: ffff0000c21db960 x14: 3030306666666620
[40326.293197] x13: 6666666666666666 x12: 3130303030303030
[40326.293746] x11: ffff8000117b69b8 x10: 00000000ffff8000
[40326.294313] x9 : ffff80001012690c x8 : 0000000000000000
[40326.294851] x7 : ffff8000114f69b8 x6 : 0000000000017ffd
[40326.295392] x5 : ffff0007fffbcbc8 x4 : ffff80001b60b5c0
[40326.295942] x3 : 0000000000000001 x2 : 0000000000000000
[40326.296497] x1 : 0000000000000000 x0 : 0000000000000000
[40326.297047] Call trace:
[40326.297304] truncate_inode_page+0x64/0x70
[40326.297724] truncate_inode_pages_range+0x550/0x7e4
[40326.298251] truncate_pagecache+0x58/0x80
[40326.298662] do_dentry_open+0x1e4/0x3c0
[40326.299052] vfs_open+0x38/0x44
[40326.299377] do_open+0x1f0/0x310
[40326.299709] path_openat+0x114/0x1dc
[40326.300077] do_filp_open+0x84/0x134
[40326.300444] do_sys_openat2+0xbc/0x164
[40326.300825] __arm64_sys_openat+0x74/0xc0
[40326.301236] el0_svc_common.constprop.0+0x88/0x220
[40326.301723] do_el0_svc+0x30/0xa0
[40326.302089] el0_svc+0x20/0x30
[40326.302404] el0_sync_handler+0x1a4/0x1b0
[40326.302814] el0_sync+0x180/0x1c0
[40326.303157] Code: aa0103e0 900061e1 910ec021 9400d300 (d4210000)
[40326.303775] ---[ end trace f70cdb42cb7c2d42 ]---
[40326.304244] Kernel panic - not syncing: Oops - BUG: Fatal exception
Fix this by checking the page mapping in truncate_inode_pages_range and
retrying when a subpage of a file THP is found.
Fixes: eb6ecbed0aa2 ("mm, thp: relax the VM_DENYWRITE constraint on file-backed THPs")
Signed-off-by: Xu Yu <[email protected]>
Signed-off-by: Rongwei Wang <[email protected]>
---
mm/filemap.c | 7 ++++++-
mm/truncate.c | 17 ++++++++++++++++-
2 files changed, 22 insertions(+), 2 deletions(-)
diff --git a/mm/filemap.c b/mm/filemap.c
index dae481293..a3af2ec 100644
--- a/mm/filemap.c
+++ b/mm/filemap.c
@@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
if (!xa_is_value(page)) {
if (page->index < start)
goto put;
- VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
if (page->index + thp_nr_pages(page) - 1 > end)
goto put;
if (!trylock_page(page))
@@ -2102,6 +2101,12 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
goto unlock;
VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
page);
+ /*
+ * We can find and get the head page of a file THP with
+ * a non-head index. The head page should have already
+ * been truncated with page->mapping reset to NULL.
+ */
+ VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
}
indices[pvec->nr] = xas.xa_index;
if (!pagevec_add(pvec, page))
diff --git a/mm/truncate.c b/mm/truncate.c
index 714eaf1..8c59c00 100644
--- a/mm/truncate.c
+++ b/mm/truncate.c
@@ -319,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
index = start;
while (index < end && find_lock_entries(mapping, index, end - 1,
&pvec, indices)) {
- index = indices[pagevec_count(&pvec) - 1] + 1;
+ index = indices[pagevec_count(&pvec) - 1] +
+ thp_nr_pages(pvec.pages[pagevec_count(&pvec) - 1]);
truncate_exceptional_pvec_entries(mapping, &pvec, indices);
for (i = 0; i < pagevec_count(&pvec); i++)
truncate_cleanup_page(pvec.pages[i]);
@@ -391,6 +392,20 @@ void truncate_inode_pages_range(struct address_space *mapping,
if (xa_is_value(page))
continue;
+ /*
+ * Already truncated? We can find and get subpage
+ * of file THP, of which the head page is truncated.
+ *
+ * In addition, another race will be avoided, where
+ * collapse_file rolls back when writer truncates the
+ * page cache.
+ */
+ if (page_mapping(page) != mapping) {
+ /* Restart to make sure all gone */
+ index = start - 1;
+ continue;
+ }
+
lock_page(page);
WARN_ON(page_to_index(page) != index);
wait_on_page_writeback(page);
--
1.8.3.1
On 9/6/21 8:11 PM, Rongwei Wang wrote:
> diff --git a/mm/filemap.c b/mm/filemap.c
> index dae481293..a3af2ec 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> if (!xa_is_value(page)) {
> if (page->index < start)
> goto put;
> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> if (page->index + thp_nr_pages(page) - 1 > end)
> goto put;
> if (!trylock_page(page))
> @@ -2102,6 +2101,12 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> goto unlock;
> VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
> page);
> + /*
> + * We can find and get head page of file THP with
> + * non-head index. The head page should have already
> + * be truncated with page->mapping reset to NULL.
> + */
> + VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> }
> indices[pvec->nr] = xas.xa_index;
> if (!pagevec_add(pvec, page))
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 714eaf1..8c59c00 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -319,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
> index = start;
> while (index < end && find_lock_entries(mapping, index, end - 1,
> &pvec, indices)) {
> - index = indices[pagevec_count(&pvec) - 1] + 1;
> + index = indices[pagevec_count(&pvec) - 1] +
> + thp_nr_pages(pvec.pages[pagevec_count(&pvec) - 1]);
> truncate_exceptional_pvec_entries(mapping, &pvec, indices);
> for (i = 0; i < pagevec_count(&pvec); i++)
> truncate_cleanup_page(pvec.pages[i]);
> @@ -391,6 +392,20 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (xa_is_value(page))
> continue;
>
> + /*
> + * Already truncated? We can find and get subpage
> + * of file THP, of which the head page is truncated.
> + *
> + * In addition, another race will be avoided, where
> + * collapse_file rolls back when writer truncates the
> + * page cache.
> + */
> + if (page_mapping(page) != mapping) {
> + /* Restart to make sure all gone */
> + index = start - 1;
> + continue;
> + }
> +
> lock_page(page);
It would be better to check the page mapping while holding the page lock,
i.e., move the chunk above to after lock_page(page), and remember to
unlock the page before restarting.
lock_page(page);
+ /*
+ * Already truncated? We can find and get subpage
+ * of file THP, of which the head page is truncated.
+ *
+ * In addition, another race will be avoided, where
+ * collapse_file rolls back when writer truncates the
+ * page cache.
+ */
+ if (page_mapping(page) != mapping) {
+ unlock_page(page);
+ /* Restart to make sure all gone */
+ index = start - 1;
+ break;
+ }
+
We will wait for other reviews before sending out v2.
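
The lock-then-recheck pattern suggested above can be modelled in user
space. This is only a minimal, single-threaded sketch: struct toy_mapping,
struct toy_page and the toy_* helpers are hypothetical stand-ins, not
kernel types or APIs, and the "lock" is a plain flag rather than a real
page lock.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct address_space and struct page. */
struct toy_mapping {
	int id;
};

struct toy_page {
	struct toy_mapping *mapping;	/* NULL once the page is truncated */
	bool locked;			/* toy flag, not a real lock */
};

static void toy_lock_page(struct toy_page *page)
{
	page->locked = true;
}

static void toy_unlock_page(struct toy_page *page)
{
	page->locked = false;
}

/*
 * Take the lock first, then check ownership.
 *
 * Returns true with the page left locked if it still belongs to @mapping.
 * Returns false with the page unlocked if it was truncated (or now belongs
 * to another mapping), so the caller can restart its scan.
 */
static bool toy_lock_and_recheck(struct toy_page *page,
				 struct toy_mapping *mapping)
{
	toy_lock_page(page);
	if (page->mapping != mapping) {
		toy_unlock_page(page);
		return false;
	}
	return true;
}
```

The point of the ordering is that the ownership check cannot go stale
between the check and the lock, and that the failure path never leaves
the page locked.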
> WARN_ON(page_to_index(page) != index);
> wait_on_page_writeback(page);
>
--
Thanks,
Yu
On Mon, Sep 6, 2021 at 5:12 AM Rongwei Wang
<[email protected]> wrote:
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index dae481293..a3af2ec 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> if (!xa_is_value(page)) {
> if (page->index < start)
> goto put;
> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> if (page->index + thp_nr_pages(page) - 1 > end)
> goto put;
> if (!trylock_page(page))
> @@ -2102,6 +2101,12 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> goto unlock;
> VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
> page);
> + /*
> + * We can find and get head page of file THP with
> + * non-head index. The head page should have already
> + * be truncated with page->mapping reset to NULL.
> + */
> + VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> }
> indices[pvec->nr] = xas.xa_index;
> if (!pagevec_add(pvec, page))
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 714eaf1..8c59c00 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -319,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
> index = start;
> while (index < end && find_lock_entries(mapping, index, end - 1,
> &pvec, indices)) {
> - index = indices[pagevec_count(&pvec) - 1] + 1;
> + index = indices[pagevec_count(&pvec) - 1] +
> + thp_nr_pages(pvec.pages[pagevec_count(&pvec) - 1]);
I don't quite get what this change is for. IIUC
find_lock_entries() already handles index advance correctly. If the
truncate range covers only part of a THP, it will be handled in the
second pass.
AFAICT, the problem is why the THP is not split if it will get
partially truncated. Did I miss something?
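
To make the arithmetic in question concrete, here is a small user-space
sketch of the two ways of advancing index after a batch. The helper names
and the pgoff typedef are hypothetical, and the example values assume an
order-9 THP (512 subpages) whose head sits at index 0x600.

```c
/* Hypothetical stand-in for the kernel's pgoff_t. */
typedef unsigned long pgoff;

/* Before the patch: step past the last returned index by one page. */
static pgoff next_index_old(pgoff last_index)
{
	return last_index + 1;
}

/* With the patch: step past every subpage of the last returned entry. */
static pgoff next_index_new(pgoff last_index, unsigned long nr_pages)
{
	return last_index + nr_pages;
}
```

Under those assumptions the old form resumes at 0x601, inside the THP
that was just handled, while the new form resumes at 0x800, the first
index after it. For an order-0 page (nr_pages == 1) the two are
identical.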
> truncate_exceptional_pvec_entries(mapping, &pvec, indices);
> for (i = 0; i < pagevec_count(&pvec); i++)
> truncate_cleanup_page(pvec.pages[i]);
> @@ -391,6 +392,20 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (xa_is_value(page))
> continue;
>
> + /*
> + * Already truncated? We can find and get subpage
> + * of file THP, of which the head page is truncated.
> + *
> + * In addition, another race will be avoided, where
> + * collapse_file rolls back when writer truncates the
> + * page cache.
> + */
> + if (page_mapping(page) != mapping) {
> + /* Restart to make sure all gone */
> + index = start - 1;
> + continue;
> + }
> +
> lock_page(page);
> WARN_ON(page_to_index(page) != index);
> wait_on_page_writeback(page);
> --
> 1.8.3.1
>
>
On Tue, Sep 7, 2021 at 7:41 PM Rongwei Wang
<[email protected]> wrote:
>
>
>
> On Sep 8, 2021, at 2:08 AM, Yang Shi <[email protected]> wrote:
>
> On Mon, Sep 6, 2021 at 5:12 AM Rongwei Wang
> <[email protected]> wrote:
>
>
> diff --git a/mm/filemap.c b/mm/filemap.c
> index dae481293..a3af2ec 100644
> --- a/mm/filemap.c
> +++ b/mm/filemap.c
> @@ -2093,7 +2093,6 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> if (!xa_is_value(page)) {
> if (page->index < start)
> goto put;
> - VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> if (page->index + thp_nr_pages(page) - 1 > end)
> goto put;
> if (!trylock_page(page))
> @@ -2102,6 +2101,12 @@ unsigned find_lock_entries(struct address_space *mapping, pgoff_t start,
> goto unlock;
> VM_BUG_ON_PAGE(!thp_contains(page, xas.xa_index),
> page);
> + /*
> + * We can find and get head page of file THP with
> + * non-head index. The head page should have already
> + * be truncated with page->mapping reset to NULL.
> + */
> + VM_BUG_ON_PAGE(page->index != xas.xa_index, page);
> }
> indices[pvec->nr] = xas.xa_index;
> if (!pagevec_add(pvec, page))
> diff --git a/mm/truncate.c b/mm/truncate.c
> index 714eaf1..8c59c00 100644
> --- a/mm/truncate.c
> +++ b/mm/truncate.c
> @@ -319,7 +319,8 @@ void truncate_inode_pages_range(struct address_space *mapping,
> index = start;
> while (index < end && find_lock_entries(mapping, index, end - 1,
> &pvec, indices)) {
> - index = indices[pagevec_count(&pvec) - 1] + 1;
> + index = indices[pagevec_count(&pvec) - 1] +
> + thp_nr_pages(pvec.pages[pagevec_count(&pvec) - 1]);
>
>
> I don't quite get what this change is doing for. IIUC
> find_lock_entries() already handles index advance correctly. If
> truncate range is partial THP, it will be handled in the second pass.
>
> Yes, agree.
>
> AFAICT, the problem is why the THP is not split if it will get
> partially truncated. Did I miss something?
>
> Yes, agree.
>
> This change is not necessary, but nice to have. Because find_lock_entries()
> returns only the head page, if any, it is better to advance the index by
> thp_nr_pages() instead of 1.
>
> If you think it introduces unnecessary complexity, I don't mind discarding this change.
IIUC this change may reduce some runtime overhead (-1 call to
find_lock_entries()), but I'd suggest you wait for the comments from
Matthew.
>
>
>
>
> truncate_exceptional_pvec_entries(mapping, &pvec, indices);
> for (i = 0; i < pagevec_count(&pvec); i++)
> truncate_cleanup_page(pvec.pages[i]);
> @@ -391,6 +392,20 @@ void truncate_inode_pages_range(struct address_space *mapping,
> if (xa_is_value(page))
> continue;
>
> + /*
> + * Already truncated? We can find and get subpage
> + * of file THP, of which the head page is truncated.
> + *
> + * In addition, another race will be avoided, where
> + * collapse_file rolls back when writer truncates the
> + * page cache.
> + */
> + if (page_mapping(page) != mapping) {
> + /* Restart to make sure all gone */
> + index = start - 1;
> + continue;
> + }
> +
> lock_page(page);
> WARN_ON(page_to_index(page) != index);
> wait_on_page_writeback(page);
> --
> 1.8.3.1
>
> Thanks,
> Rongwei Wang
>
Greetings,
FYI, we noticed the following commit (built with gcc-9):
commit: 20753096b67c9e841862c4f6f984aaac7dbe7183 ("[PATCH 1/2] mm, thp: check page mapping when truncating page cache")
url: https://github.com/0day-ci/linux/commits/Rongwei-Wang/mm-thp-fix-file-backed-THP-race-in-collapse_file/20210906-201318
base: https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git 27151f177827d478508e756c7657273261aaf8a9
in testcase: nvml
version: nvml-x86_64-ff6f0f125-1_20210908
with following parameters:
test: pmem
group: pmem
nr_pmem: 1
fs: ext4
mount_option: dax
bp_memmap: 32G!4G
ucode: 0x7000019
on test machine: 16 threads 1 sockets Intel(R) Xeon(R) CPU D-1541 @ 2.10GHz with 48G memory
caused the changes below (please refer to the attached dmesg/kmsg for the entire log/backtrace):
If you fix the issue, kindly add the following tag
Reported-by: kernel test robot <[email protected]>
[ 491.999010][T20052] BUG: unable to handle page fault for address: 00000000131ac00d
[ 492.019156][T20052] #PF: supervisor read access in kernel mode
[ 492.024977][T20052] #PF: error_code(0x0000) - not-present page
[ 492.030802][T20052] PGD 0 P4D 0
[ 492.034026][T20052] Oops: 0000 [#1] SMP PTI
[ 492.038204][T20052] CPU: 11 PID: 20052 Comm: rm Not tainted 5.14.0-09688-g20753096b67c #1
[ 492.046370][T20052] Hardware name: Supermicro SYS-5018D-FN4T/X10SDV-8C-TLN4F, BIOS 1.1 03/02/2016
[ 492.055225][T20052] RIP: 0010:truncate_inode_pages_range+0xd3/0x7c0
[ 492.061490][T20052] Code: 89 fa 48 89 ee 4c 89 e7 e8 0a 05 ff ff 85 c0 0f 84 de 00 00 00 0f b6 84 24 90 00 00 00 8d 48 ff 89 c2 48 8b 84 cc 98 00 00 00 <48> 8b 00 48 c1 e8 10 83 e0 01 3c 01 48 19 ed 48 81 e5 01 fe ff ff
[ 492.080932][T20052] RSP: 0018:ffffc900014e7d10 EFLAGS: 00010202
[ 492.086850][T20052] RAX: 00000000131ac00d RBX: ffffc900014e7da0 RCX: 0000000000000000
[ 492.094668][T20052] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900014e7ca0
[ 492.102485][T20052] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffffffffffffe
[ 492.110301][T20052] R10: 0000000000001000 R11: ffff888977af07f0 R12: ffff88891aa77010
[ 492.118119][T20052] R13: ffffc900014e7d28 R14: ffff88891aa76e98 R15: fffffffffffffffe
[ 492.125937][T20052] FS: 00007f546799d580(0000) GS:ffff888c4aac0000(0000) knlGS:0000000000000000
[ 492.134715][T20052] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 492.141147][T20052] CR2: 00000000131ac00d CR3: 000000098d508002 CR4: 00000000003706e0
[ 492.148962][T20052] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 492.156781][T20052] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 492.164597][T20052] Call Trace:
[ 492.167737][T20052] ? __wake_up_common_lock+0x8a/0xc0
[ 492.172880][T20052] ? jbd2_journal_stop+0x14e/0x300
[ 492.177840][T20052] ext4_evict_inode+0x113/0x6c0
[ 492.182537][T20052] evict+0xd8/0x180
[ 492.186194][T20052] do_unlinkat+0x1d8/0x300
[ 492.190459][T20052] __x64_sys_unlinkat+0x34/0x80
[ 492.195155][T20052] do_syscall_64+0x3b/0xc0
[ 492.199421][T20052] entry_SYSCALL_64_after_hwframe+0x44/0xae
[ 492.205166][T20052] RIP: 0033:0x7f54678beff7
[ 492.209430][T20052] Code: 73 01 c3 48 8b 0d 99 ee 0c 00 f7 d8 64 89 01 48 83 c8 ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 b8 07 01 00 00 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 69 ee 0c 00 f7 d8 64 89 01 48
[ 492.228885][T20052] RSP: 002b:00007ffc1b088f18 EFLAGS: 00000206 ORIG_RAX: 0000000000000107
[ 492.237145][T20052] RAX: ffffffffffffffda RBX: 00005638f43dd6d0 RCX: 00007f54678beff7
[ 492.244964][T20052] RDX: 0000000000000000 RSI: 00005638f43dd7d8 RDI: 0000000000000005
[ 492.252779][T20052] RBP: 00005638f43dc2b0 R08: 0000000000000003 R09: 0000000000000000
[ 492.260597][T20052] R10: fffffffffffffbd8 R11: 0000000000000206 R12: 00007ffc1b089100
[ 492.268413][T20052] R13: 0000000000000000 R14: 00005638f43dd6d0 R15: 0000000000000000
[ 492.276233][T20052] Modules linked in: dm_mod xfs intel_rapl_msr intel_rapl_common btrfs blake2b_generic xor zstd_compress raid6_pq libcrc32c sb_edac x86_pkg_temp_thermal intel_powerclamp sd_mod t10_pi sg coretemp ast drm_vram_helper kvm_intel drm_ttm_helper ttm kvm drm_kms_helper irqbypass ipmi_ssif crct10dif_pclmul dax_pmem_compat crc32_pclmul crc32c_intel syscopyarea ghash_clmulni_intel sysfillrect device_dax sysimgblt rapl nd_pmem ahci dax_pmem_core nd_btt fb_sys_fops libahci intel_cstate acpi_ipmi mxm_wmi nd_e820 drm mei_me libata ipmi_si gpio_ich intel_pch_thermal intel_uncore libnvdimm ioatdma mei joydev ipmi_devintf dca ipmi_msghandler wmi acpi_pad ip_tables
[ 492.335150][T20052] CR2: 00000000131ac00d
[ 492.339162][T20052] ---[ end trace 0052004592872eb3 ]---
[ 492.360394][T20052] RIP: 0010:truncate_inode_pages_range+0xd3/0x7c0
[ 492.366659][T20052] Code: 89 fa 48 89 ee 4c 89 e7 e8 0a 05 ff ff 85 c0 0f 84 de 00 00 00 0f b6 84 24 90 00 00 00 8d 48 ff 89 c2 48 8b 84 cc 98 00 00 00 <48> 8b 00 48 c1 e8 10 83 e0 01 3c 01 48 19 ed 48 81 e5 01 fe ff ff
[ 492.386097][T20052] RSP: 0018:ffffc900014e7d10 EFLAGS: 00010202
[ 492.392008][T20052] RAX: 00000000131ac00d RBX: ffffc900014e7da0 RCX: 0000000000000000
[ 492.399827][T20052] RDX: 0000000000000001 RSI: 0000000000000000 RDI: ffffc900014e7ca0
[ 492.407651][T20052] RBP: 0000000000000000 R08: 0000000000000000 R09: fffffffffffffffe
[ 492.415468][T20052] R10: 0000000000001000 R11: ffff888977af07f0 R12: ffff88891aa77010
[ 492.423286][T20052] R13: ffffc900014e7d28 R14: ffff88891aa76e98 R15: fffffffffffffffe
[ 492.431102][T20052] FS: 00007f546799d580(0000) GS:ffff888c4aac0000(0000) knlGS:0000000000000000
[ 492.439879][T20052] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 492.446313][T20052] CR2: 00000000131ac00d CR3: 000000098d508002 CR4: 00000000003706e0
[ 492.454129][T20052] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 492.461947][T20052] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 492.469766][T20052] Kernel panic - not syncing: Fatal exception
[ 492.601630][T20052] Kernel Offset: disabled
To reproduce:
git clone https://github.com/intel/lkp-tests.git
cd lkp-tests
bin/lkp install job.yaml # job file is attached in this email
bin/lkp split-job --compatible job.yaml # generate the yaml file for lkp run
bin/lkp run generated-yaml-file
---
0DAY/LKP+ Test Infrastructure Open Source Technology Center
https://lists.01.org/hyperkitty/list/[email protected] Intel Corporation
Thanks,
Oliver Sang