2024-02-26 16:02:24

by Yisheng Xie

[permalink] [raw]
Subject: [PATCH] vfio/type1: unpin PageReserved page

We meet a warning as following:
WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209 try_grab_page.part.0+0xe8/0x1b0
CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 5.10.134-008.2.x86_64 #1
Hardware name: Foxconn AliServer-Thor-04-12U-v2/Thunder2, BIOS 1.0.PL.FC.P.031.00 05/18/2022
RIP: 0010:try_grab_page.part.0+0xe8/0x1b0
Code: b9 00 04 00 00 83 e6 01 74 ca 48 8b 32 b9 00 04 00 00 f7 c6 00 00 01 00 74 ba eb 91 8b 57 34 48 89 f8 85 d2 0f 8f 48 ff ff ff <0f> 0b 31 c0 c3 48 89 fa 48 8b 0a f7 c1 00 00 01 00 0f 85 5c ff ff
RSP: 0018:ffffc900b1a63b98 EFLAGS: 00010282
RAX: ffffea00000e4580 RBX: 0000000000052202 RCX: ffffea00000e4580
RDX: 0000000080000001 RSI: 0000000000052202 RDI: ffffea00000e4580
RBP: ffff88efa5d3d860 R08: 0000000000000000 R09: 0000000000000002
R10: 0000000000000008 R11: ffff89403fff7000 R12: ffff88f589165818
R13: 00007f1320600000 R14: ffffea0181296ca8 R15: ffffea00000e4580
FS: 00007f1324f93e00(0000) GS:ffff893ebfb80000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f1321694070 CR3: 0000006046014004 CR4: 00000000007726e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
follow_page_pte+0x64b/0x800
__get_user_pages+0x228/0x560
__gup_longterm_locked+0xa0/0x2f0
vaddr_get_pfns+0x67/0x100 [vfio_iommu_type1]
vfio_pin_pages_remote+0x30b/0x460 [vfio_iommu_type1]
vfio_pin_map_dma+0xd4/0x2e0 [vfio_iommu_type1]
vfio_dma_do_map+0x21e/0x340 [vfio_iommu_type1]
vfio_iommu_type1_ioctl+0xdd/0x170 [vfio_iommu_type1]
? __fget_files+0x79/0xb0
ksys_ioctl+0x7b/0xb0
? ksys_write+0xc4/0xe0
__x64_sys_ioctl+0x16/0x20
do_syscall_64+0x2d/0x40
entry_SYSCALL_64_after_hwframe+0x44/0xa9

After add dumppage, it shows that it is a PageReserved page(e.g. zero page),
whoes refcount is just overflow:
page:00000000b0504535 refcount:-2147483647 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3916
flags: 0xffffc000001002(referenced|reserved)
raw: 00ffffc000001002 ffffea00000e4588 ffffea00000e4588 0000000000000000
raw: 0000000000000000 0000000000000000 80000001ffffffff 0000000000000000

gup will _pin_ a page which is PageReserved, however, put_pfn in vfio will
skip unpin page which is PageReserved. So use pfn_valid in put_pfn
instead of !is_invalid_reserved_pfn to unpin PageReserved page.

Signed-off-by: Yisheng Xie <[email protected]>
---
drivers/vfio/vfio_iommu_type1.c | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
index b2854d7939ce..12775bab27ee 100644
--- a/drivers/vfio/vfio_iommu_type1.c
+++ b/drivers/vfio/vfio_iommu_type1.c
@@ -461,7 +461,7 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)

static int put_pfn(unsigned long pfn, int prot)
{
- if (!is_invalid_reserved_pfn(pfn)) {
+ if (pfn_valid(pfn)) {
struct page *page = pfn_to_page(pfn);

unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);
--
2.39.1



2024-02-26 16:15:51

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page

On Tue, 27 Feb 2024 00:01:06 +0800
Yisheng Xie <[email protected]> wrote:

> We meet a warning as following:
> WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209 try_grab_page.part.0+0xe8/0x1b0
> CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 5.10.134-008.2.x86_64 #1
^^^^^^^^

Does this issue reproduce on mainline? Thanks,

Alex

> Hardware name: Foxconn AliServer-Thor-04-12U-v2/Thunder2, BIOS 1.0.PL.FC.P.031.00 05/18/2022
> RIP: 0010:try_grab_page.part.0+0xe8/0x1b0
> Code: b9 00 04 00 00 83 e6 01 74 ca 48 8b 32 b9 00 04 00 00 f7 c6 00 00 01 00 74 ba eb 91 8b 57 34 48 89 f8 85 d2 0f 8f 48 ff ff ff <0f> 0b 31 c0 c3 48 89 fa 48 8b 0a f7 c1 00 00 01 00 0f 85 5c ff ff
> RSP: 0018:ffffc900b1a63b98 EFLAGS: 00010282
> RAX: ffffea00000e4580 RBX: 0000000000052202 RCX: ffffea00000e4580
> RDX: 0000000080000001 RSI: 0000000000052202 RDI: ffffea00000e4580
> RBP: ffff88efa5d3d860 R08: 0000000000000000 R09: 0000000000000002
> R10: 0000000000000008 R11: ffff89403fff7000 R12: ffff88f589165818
> R13: 00007f1320600000 R14: ffffea0181296ca8 R15: ffffea00000e4580
> FS: 00007f1324f93e00(0000) GS:ffff893ebfb80000(0000) knlGS:0000000000000000
> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> CR2: 00007f1321694070 CR3: 0000006046014004 CR4: 00000000007726e0
> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> PKRU: 55555554
> Call Trace:
> follow_page_pte+0x64b/0x800
> __get_user_pages+0x228/0x560
> __gup_longterm_locked+0xa0/0x2f0
> vaddr_get_pfns+0x67/0x100 [vfio_iommu_type1]
> vfio_pin_pages_remote+0x30b/0x460 [vfio_iommu_type1]
> vfio_pin_map_dma+0xd4/0x2e0 [vfio_iommu_type1]
> vfio_dma_do_map+0x21e/0x340 [vfio_iommu_type1]
> vfio_iommu_type1_ioctl+0xdd/0x170 [vfio_iommu_type1]
> ? __fget_files+0x79/0xb0
> ksys_ioctl+0x7b/0xb0
> ? ksys_write+0xc4/0xe0
> __x64_sys_ioctl+0x16/0x20
> do_syscall_64+0x2d/0x40
> entry_SYSCALL_64_after_hwframe+0x44/0xa9
>
> After add dumppage, it shows that it is a PageReserved page(e.g. zero page),
> whoes refcount is just overflow:
> page:00000000b0504535 refcount:-2147483647 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3916
> flags: 0xffffc000001002(referenced|reserved)
> raw: 00ffffc000001002 ffffea00000e4588 ffffea00000e4588 0000000000000000
> raw: 0000000000000000 0000000000000000 80000001ffffffff 0000000000000000
>
> gup will _pin_ a page which is PageReserved, however, put_pfn in vfio will
> skip unpin page which is PageReserved. So use pfn_valid in put_pfn
> instead of !is_invalid_reserved_pfn to unpin PageReserved page.
>
> Signed-off-by: Yisheng Xie <[email protected]>
> ---
> drivers/vfio/vfio_iommu_type1.c | 2 +-
> 1 file changed, 1 insertion(+), 1 deletion(-)
>
> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> index b2854d7939ce..12775bab27ee 100644
> --- a/drivers/vfio/vfio_iommu_type1.c
> +++ b/drivers/vfio/vfio_iommu_type1.c
> @@ -461,7 +461,7 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
>
> static int put_pfn(unsigned long pfn, int prot)
> {
> - if (!is_invalid_reserved_pfn(pfn)) {
> + if (pfn_valid(pfn)) {
> struct page *page = pfn_to_page(pfn);
>
> unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);


2024-02-26 17:33:15

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page

On Tue, 27 Feb 2024 01:14:54 +0800
Yisheng Xie <[email protected]> wrote:

> 在 2024/2/27 00:14, Alex Williamson 写道:
> > On Tue, 27 Feb 2024 00:01:06 +0800
> > Yisheng Xie<[email protected]> wrote:
> >
> >> We meet a warning as following:
> >> WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209 try_grab_page.part.0+0xe8/0x1b0
> >> CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 5.10134-008.2.x86_64 #1
> > ^^^^^^^^
> >
> > Does this issue reproduce on mainline? Thanks,
>
> I have check the code of mainline, the logical seems the same as my
> version.
>
> so I think it can reproduce if i understand correctly.

I obviously can't speak to what's in your 5.10.134-008.2 kernel, but I
do know there's a very similar issue resolved in v6.0 mainline and
included in v5.10.146 of the stable tree. Please test. Thanks,

Alex

> >> Hardware name: Foxconn AliServer-Thor-04-12U-v2/Thunder2, BIOS 1.0.PL.FC.P.031.00 05/18/2022
> >> RIP: 0010:try_grab_page.part.0+0xe8/0x1b0
> >> Code: b9 00 04 00 00 83 e6 01 74 ca 48 8b 32 b9 00 04 00 00 f7 c6 00 00 01 00 74 ba eb 91 8b 57 34 48 89 f8 85 d2 0f 8f 48 ff ff ff <0f> 0b 31 c0 c3 48 89 fa 48 8b 0a f7 c1 00 00 01 00 0f 85 5c ff ff
> >> RSP: 0018:ffffc900b1a63b98 EFLAGS: 00010282
> >> RAX: ffffea00000e4580 RBX: 0000000000052202 RCX: ffffea00000e4580
> >> RDX: 0000000080000001 RSI: 0000000000052202 RDI: ffffea00000e4580
> >> RBP: ffff88efa5d3d860 R08: 0000000000000000 R09: 0000000000000002
> >> R10: 0000000000000008 R11: ffff89403fff7000 R12: ffff88f589165818
> >> R13: 00007f1320600000 R14: ffffea0181296ca8 R15: ffffea00000e4580
> >> FS: 00007f1324f93e00(0000) GS:ffff893ebfb80000(0000) knlGS:0000000000000000
> >> CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> >> CR2: 00007f1321694070 CR3: 0000006046014004 CR4: 00000000007726e0
> >> DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
> >> DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
> >> PKRU: 55555554
> >> Call Trace:
> >> follow_page_pte+0x64b/0x800
> >> __get_user_pages+0x228/0x560
> >> __gup_longterm_locked+0xa0/0x2f0
> >> vaddr_get_pfns+0x67/0x100 [vfio_iommu_type1]
> >> vfio_pin_pages_remote+0x30b/0x460 [vfio_iommu_type1]
> >> vfio_pin_map_dma+0xd4/0x2e0 [vfio_iommu_type1]
> >> vfio_dma_do_map+0x21e/0x340 [vfio_iommu_type1]
> >> vfio_iommu_type1_ioctl+0xdd/0x170 [vfio_iommu_type1]
> >> ? __fget_files+0x79/0xb0
> >> ksys_ioctl+0x7b/0xb0
> >> ? ksys_write+0xc4/0xe0
> >> __x64_sys_ioctl+0x16/0x20
> >> do_syscall_64+0x2d/0x40
> >> entry_SYSCALL_64_after_hwframe+0x44/0xa9
> >>
> >> After add dumppage, it shows that it is a PageReserved page(e.g. zero page),
> >> whoes refcount is just overflow:
> >> page:00000000b0504535 refcount:-2147483647 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x3916
> >> flags: 0xffffc000001002(referenced|reserved)
> >> raw: 00ffffc000001002 ffffea00000e4588 ffffea00000e4588 0000000000000000
> >> raw: 0000000000000000 0000000000000000 80000001ffffffff 0000000000000000
> >>
> >> gup will _pin_ a page which is PageReserved, however, put_pfn in vfio will
> >> skip unpin page which is PageReserved. So use pfn_valid in put_pfn
> >> instead of !is_invalid_reserved_pfn to unpin PageReserved page.
> >>
> >> Signed-off-by: Yisheng Xie<[email protected]>
> >> ---
> >> drivers/vfio/vfio_iommu_type1.c | 2 +-
> >> 1 file changed, 1 insertion(+), 1 deletion(-)
> >>
> >> diff --git a/drivers/vfio/vfio_iommu_type1.c b/drivers/vfio/vfio_iommu_type1.c
> >> index b2854d7939ce..12775bab27ee 100644
> >> --- a/drivers/vfio/vfio_iommu_type1.c
> >> +++ b/drivers/vfio/vfio_iommu_type1.c
> >> @@ -461,7 +461,7 @@ static bool is_invalid_reserved_pfn(unsigned long pfn)
> >>
> >> static int put_pfn(unsigned long pfn, int prot)
> >> {
> >> - if (!is_invalid_reserved_pfn(pfn)) {
> >> + if (pfn_valid(pfn)) {
> >> struct page *page = pfn_to_page(pfn);
> >>
> >> unpin_user_pages_dirty_lock(&page, 1, prot & IOMMU_WRITE);


2024-02-27 10:29:24

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page

On 26.02.24 18:32, Alex Williamson wrote:
> On Tue, 27 Feb 2024 01:14:54 +0800
> Yisheng Xie <[email protected]> wrote:
>
>> 在 2024/2/27 00:14, Alex Williamson 写道:
>>> On Tue, 27 Feb 2024 00:01:06 +0800
>>> Yisheng Xie<[email protected]> wrote:
>>>
>>>> We meet a warning as following:
>>>> WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209 try_grab_page.part.0+0xe8/0x1b0
>>>> CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 5.10.134-008.2.x86_64 #1
>>> ^^^^^^^^
>>>
>>> Does this issue reproduce on mainline? Thanks,
>>
>> I have check the code of mainline, the logical seems the same as my
>> version.
>>
>> so I think it can reproduce if i understand correctly.
>
> I obviously can't speak to what's in your 5.10.134-008.2 kernel, but I
> do know there's a very similar issue resolved in v6.0 mainline and
> included in v5.10.146 of the stable tree. Please test. Thanks,

This commit, to be precise:

commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4
Author: Alex Williamson <[email protected]>
Date: Mon Aug 29 21:05:40 2022 -0600

vfio/type1: Unpin zero pages

There's currently a reference count leak on the zero page. We increment
the reference via pin_user_pages_remote(), but the page is later handled
as an invalid/reserved page, therefore it's not accounted against the
user and not unpinned by our put_pfn().

Introducing special zero page handling in put_pfn() would resolve the
leak, but without accounting of the zero page, a single user could
still create enough mappings to generate a reference count overflow.

The zero page is always resident, so for our purposes there's no reason
to keep it pinned. Therefore, add a loop to walk pages returned from
pin_user_pages_remote() and unpin any zero pages.


BUT

in the meantime, we also have

commit c8070b78751955e59b42457b974bea4a4fe00187
Author: David Howells <[email protected]>
Date: Fri May 26 22:41:40 2023 +0100

mm: Don't pin ZERO_PAGE in pin_user_pages()

Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
to it from the page tables and make unpin_user_page*() correspondingly
ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a
zero page's refcount as we're only allowed ~2 million pins on it -
something that userspace can conceivably trigger.

Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.


So the unpin_user_page_* won't do anything with the shared zeropage.

(likely, we could revert 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4)

--
Cheers,

David / dhildenb


2024-02-27 15:29:16

by Christoph Hellwig

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page

Why does it pin a reserved page? It should stay the hell away
from it.


2024-02-27 21:13:01

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page

On Tue, 27 Feb 2024 11:27:08 +0100
David Hildenbrand <[email protected]> wrote:

> On 26.02.24 18:32, Alex Williamson wrote:
> > On Tue, 27 Feb 2024 01:14:54 +0800
> > Yisheng Xie <[email protected]> wrote:
> >
> >> 在 2024/2/27 00:14, Alex Williamson 写道:
> >>> On Tue, 27 Feb 2024 00:01:06 +0800
> >>> Yisheng Xie<[email protected]> wrote:
> >>>
> >>>> We meet a warning as following:
> >>>> WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209 try_grab_page.part.0+0xe8/0x1b0
> >>>> CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 510.134-008.2.x86_64 #1
> >>> ^^^^^^^^
> >>>
> >>> Does this issue reproduce on mainline? Thanks,
> >>
> >> I have check the code of mainline, the logical seems the same as my
> >> version.
> >>
> >> so I think it can reproduce if i understand correctly.
> >
> > I obviously can't speak to what's in your 5.10.134-008.2 kernel, but I
> > do know there's a very similar issue resolved in v6.0 mainline and
> > included in v5.10.146 of the stable tree. Please test. Thanks,
>
> This commit, to be precise:
>
> commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4
> Author: Alex Williamson <[email protected]>
> Date: Mon Aug 29 21:05:40 2022 -0600
>
> vfio/type1: Unpin zero pages
>
> There's currently a reference count leak on the zero page. We increment
> the reference via pin_user_pages_remote(), but the page is later handled
> as an invalid/reserved page, therefore it's not accounted against the
> user and not unpinned by our put_pfn().
>
> Introducing special zero page handling in put_pfn() would resolve the
> leak, but without accounting of the zero page, a single user could
> still create enough mappings to generate a reference count overflow.
>
> The zero page is always resident, so for our purposes there's no reason
> to keep it pinned. Therefore, add a loop to walk pages returned from
> pin_user_pages_remote() and unpin any zero pages.
>
>
> BUT
>
> in the meantime, we also have
>
> commit c8070b78751955e59b42457b974bea4a4fe00187
> Author: David Howells <[email protected]>
> Date: Fri May 26 22:41:40 2023 +0100
>
> mm: Don't pin ZERO_PAGE in pin_user_pages()
>
> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> to it from the page tables and make unpin_user_page*() correspondingly
> ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a
> zero page's refcount as we're only allowed ~2 million pins on it -
> something that userspace can conceivably trigger.
>
> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
>
>
> So the unpin_user_page_* won't do anything with the shared zeropage.
>
> (likely, we could revert 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4)


Yes, according to the commit log it seems like the unpin is now just
wasted work since v6.5. Thanks!

Alex


2024-02-28 11:45:41

by Yisheng Xie

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page


在 2024/2/27 18:27, David Hildenbrand 写道:
> On 26.02.24 18:32, Alex Williamson wrote:
>> On Tue, 27 Feb 2024 01:14:54 +0800
>> Yisheng Xie <[email protected]> wrote:
>>
>>> 在 2024/2/27 00:14, Alex Williamson 写道:
>>>> On Tue, 27 Feb 2024 00:01:06 +0800
>>>> Yisheng Xie<[email protected] wrote:
>>>>> We meet a warning as following:
>>>>>    WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209
>>>>> try_grab_page.part.0+0xe8/0x1b0
>>>>>    CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 
>>>>> 5.10.134-008.2.x86_64 #1
>>>> ^^^^^^^^
>>>>
>>>> Does this issue reproduce on mainline?  Thanks,
>>>
>>> I have check the code of mainline, the logical seems the same as my
>>> version.
>>>
>>> so I think it can reproduce if i understand correctly.
>>
>> I obviously can't speak to what's in your 5.10.134-008.2 kernel, but I
>> do know there's a very similar issue resolved in v6.0 mainline and
>> included in v5.10.146 of the stable tree.  Please test.  Thanks,
>
> This commit, to be precise:
>
> commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4
> Author: Alex Williamson <[email protected]>
> Date:   Mon Aug 29 21:05:40 2022 -0600
>
>     vfio/type1: Unpin zero pages
>         There's currently a reference count leak on the zero page.  We
> increment
>     the reference via pin_user_pages_remote(), but the page is later
> handled
>     as an invalid/reserved page, therefore it's not accounted against the
>     user and not unpinned by our put_pfn().
>         Introducing special zero page handling in put_pfn() would
> resolve the
>     leak, but without accounting of the zero page, a single user could
>     still create enough mappings to generate a reference count overflow.
>         The zero page is always resident, so for our purposes there's
> no reason
>     to keep it pinned.  Therefore, add a loop to walk pages returned from
>     pin_user_pages_remote() and unpin any zero pages.
>
>
> BUT
>
> in the meantime, we also have
>
> commit c8070b78751955e59b42457b974bea4a4fe00187
> Author: David Howells <[email protected]>
> Date:   Fri May 26 22:41:40 2023 +0100
>
>     mm: Don't pin ZERO_PAGE in pin_user_pages()
>         Make pin_user_pages*() leave a ZERO_PAGE unpinned if it
> extracts a pointer
>     to it from the page tables and make unpin_user_page*()
> correspondingly
>     ignore a ZERO_PAGE when unpinning.  We don't want to risk
> overrunning a
>     zero page's refcount as we're only allowed ~2 million pins on it -
>     something that userspace can conceivably trigger.
>         Add a pair of functions to test whether a page or a folio is a
> ZERO_PAGE.
>
>
> So the unpin_user_page_* won't do anything with the shared zeropage.
>
> (likely, we could revert 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4)

Thanks for your detail info. BTW, do we need handle all of the
pagereserved page?


2024-02-29 22:04:26

by Alex Williamson

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page

On Tue, 27 Feb 2024 13:25:56 -0700
Alex Williamson <[email protected]> wrote:

> On Tue, 27 Feb 2024 11:27:08 +0100
> David Hildenbrand <[email protected]> wrote:
>
> > On 26.02.24 18:32, Alex Williamson wrote:
> > > On Tue, 27 Feb 2024 01:14:54 +0800
> > > Yisheng Xie <[email protected]> wrote:
> > >
> > >> 在 2024/2/27 00:14, Alex Williamson 写道:
> > >>> On Tue, 27 Feb 2024 00:01:06 +0800
> > >>> Yisheng Xie<[email protected]> wrote:
> > >>>
> > >>>> We meet a warning as following:
> > >>>> WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209 try_grab_page.part.0+0xe8/0x1b0
> > >>>> CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 5.10.134-008.2.x86_64 #1
> > >>> ^^^^^^^^
> > >>>
> > >>> Does this issue reproduce on mainline? Thanks,
> > >>
> > >> I have check the code of mainline, the logical seems the same as my
> > >> version.
> > >>
> > >> so I think it can reproduce if i understand correctly.
> > >
> > > I obviously can't speak to what's in your 5.10.134-008.2 kernel, but I
> > > do know there's a very similar issue resolved in v6.0 mainline and
> > > included in v5.10.146 of the stable tree. Please test. Thanks,
> >
> > This commit, to be precise:
> >
> > commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4
> > Author: Alex Williamson <[email protected]>
> > Date: Mon Aug 29 21:05:40 2022 -0600
> >
> > vfio/type1: Unpin zero pages
> >
> > There's currently a reference count leak on the zero page. We increment
> > the reference via pin_user_pages_remote(), but the page is later handled
> > as an invalid/reserved page, therefore it's not accounted against the
> > user and not unpinned by our put_pfn().
> >
> > Introducing special zero page handling in put_pfn() would resolve the
> > leak, but without accounting of the zero page, a single user could
> > still create enough mappings to generate a reference count overflow.
> >
> > The zero page is always resident, so for our purposes there's no reason
> > to keep it pinned. Therefore, add a loop to walk pages returned from
> > pin_user_pages_remote() and unpin any zero pages.
> >
> >
> > BUT
> >
> > in the meantime, we also have
> >
> > commit c8070b78751955e59b42457b974bea4a4fe00187
> > Author: David Howells <[email protected]>
> > Date: Fri May 26 22:41:40 2023 +0100
> >
> > mm: Don't pin ZERO_PAGE in pin_user_pages()
> >
> > Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
> > to it from the page tables and make unpin_user_page*() correspondingly
> > ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a
> > zero page's refcount as we're only allowed ~2 million pins on it -
> > something that userspace can conceivably trigger.
> >
> > Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
> >
> >
> > So the unpin_user_page_* won't do anything with the shared zeropage.
> >
> > (likely, we could revert 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4)
>
>
> Yes, according to the commit log it seems like the unpin is now just
> wasted work since v6.5. Thanks!

I dusted off an old unit test for mapping the zeropage through vfio and
started working on posting a revert for 873aefb376bb but I actually
found that this appears to be resolved even before c8070b787519. I
bisected it to:

commit 84209e87c6963f928194a890399e24e8ad299db1
Author: David Hildenbrand <[email protected]>
Date: Wed Nov 16 11:26:48 2022 +0100

mm/gup: reliable R/O long-term pinning in COW mappings

We already support reliable R/O pinning of anonymous memory.
However, assume we end up pinning (R/O long-term) a pagecache page
or the shared zeropage inside a writable private ("COW") mapping.
The next write access will trigger a write-fault and replace the
pinned page by an exclusive anonymous page in the process page
tables to break COW: the pinned page no longer corresponds to the
page mapped into the process' page table.
Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a
COW mapping, let's properly break COW first before R/O long-term
pinning something that's not an exclusive anon page inside a COW
mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive
anon page instead that can get pinned safely.

With this change, we can stop using FOLL_FORCE|FOLL_WRITE for
reliable R/O long-term pinning in COW mappings.

[...]

Note 3: For users that use FOLL_LONGTERM right now without
FOLL_WRITE, such as VFIO, we'd now no longer pin the shared
zeropage. Instead, we'd populate exclusive anon pages that we can
pin. There was a concern that this could affect the memlock limit
of existing setups.

For example, a VM running with VFIO could run into the memlock
limit and fail to run. However, we essentially had the same
behavior already in commit 17839856fd58 ("gup: document and work
around "COW can break either way" issue") which got merged into
some enterprise distros, and there were not any such complaints. So
most probably, we're fine.

IIUC, c8070b787519 fixes that we can no longer run up the refcount on
the zeropage but with the above we're no longer using the zeropage
anyway. I also confirm that after your commit above, my unit test that
previously triggered this with a read-only, anonymous mmap generates no
unpin hits in the vfio loop.

Anyway, all that to say that I think we're doubly covered that removing
the heinous workaround in vfio is overdue. Thanks,

Alex


2024-03-01 10:57:39

by David Hildenbrand

[permalink] [raw]
Subject: Re: [PATCH] vfio/type1: unpin PageReserved page

On 29.02.24 23:04, Alex Williamson wrote:
> On Tue, 27 Feb 2024 13:25:56 -0700
> Alex Williamson <[email protected]> wrote:
>
>> On Tue, 27 Feb 2024 11:27:08 +0100
>> David Hildenbrand <[email protected]> wrote:
>>
>>> On 26.02.24 18:32, Alex Williamson wrote:
>>>> On Tue, 27 Feb 2024 01:14:54 +0800
>>>> Yisheng Xie <[email protected]> wrote:
>>>>
>>>>> 在 2024/2/27 00:14, Alex Williamson 写道:
>>>>>> On Tue, 27 Feb 2024 00:01:06 +0800
>>>>>> Yisheng Xie<[email protected]> wrote:
>>>>>>
>>>>>>> We meet a warning as following:
>>>>>>> WARNING: CPU: 99 PID: 1766859 at mm/gup.c:209 try_grab_page.part.0+0xe8/0x1b0
>>>>>>> CPU: 99 PID: 1766859 Comm: qemu-kvm Kdump: loaded Tainted: GOE 5.10.134-008.2.x86_64 #1
>>>>>> ^^^^^^^^
>>>>>>
>>>>>> Does this issue reproduce on mainline? Thanks,
>>>>>
>>>>> I have check the code of mainline, the logical seems the same as my
>>>>> version.
>>>>>
>>>>> so I think it can reproduce if i understand correctly.
>>>>
>>>> I obviously can't speak to what's in your 5.10.134-008.2 kernel, but I
>>>> do know there's a very similar issue resolved in v6.0 mainline and
>>>> included in v5.10.146 of the stable tree. Please test. Thanks,
>>>
>>> This commit, to be precise:
>>>
>>> commit 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4
>>> Author: Alex Williamson <[email protected]>
>>> Date: Mon Aug 29 21:05:40 2022 -0600
>>>
>>> vfio/type1: Unpin zero pages
>>>
>>> There's currently a reference count leak on the zero page. We increment
>>> the reference via pin_user_pages_remote(), but the page is later handled
>>> as an invalid/reserved page, therefore it's not accounted against the
>>> user and not unpinned by our put_pfn().
>>>
>>> Introducing special zero page handling in put_pfn() would resolve the
>>> leak, but without accounting of the zero page, a single user could
>>> still create enough mappings to generate a reference count overflow.
>>>
>>> The zero page is always resident, so for our purposes there's no reason
>>> to keep it pinned. Therefore, add a loop to walk pages returned from
>>> pin_user_pages_remote() and unpin any zero pages.
>>>
>>>
>>> BUT
>>>
>>> in the meantime, we also have
>>>
>>> commit c8070b78751955e59b42457b974bea4a4fe00187
>>> Author: David Howells <[email protected]>
>>> Date: Fri May 26 22:41:40 2023 +0100
>>>
>>> mm: Don't pin ZERO_PAGE in pin_user_pages()
>>>
>>> Make pin_user_pages*() leave a ZERO_PAGE unpinned if it extracts a pointer
>>> to it from the page tables and make unpin_user_page*() correspondingly
>>> ignore a ZERO_PAGE when unpinning. We don't want to risk overrunning a
>>> zero page's refcount as we're only allowed ~2 million pins on it -
>>> something that userspace can conceivably trigger.
>>>
>>> Add a pair of functions to test whether a page or a folio is a ZERO_PAGE.
>>>
>>>
>>> So the unpin_user_page_* won't do anything with the shared zeropage.
>>>
>>> (likely, we could revert 873aefb376bbc0ed1dd2381ea1d6ec88106fdbd4)
>>
>>
>> Yes, according to the commit log it seems like the unpin is now just
>> wasted work since v6.5. Thanks!
>
> I dusted off an old unit test for mapping the zeropage through vfio and
> started working on posting a revert for 873aefb376bb but I actually
> found that this appears to be resolved even before c8070b787519. I
> bisected it to:
>
> commit 84209e87c6963f928194a890399e24e8ad299db1
> Author: David Hildenbrand <[email protected]>
> Date: Wed Nov 16 11:26:48 2022 +0100
>
> mm/gup: reliable R/O long-term pinning in COW mappings
>
> We already support reliable R/O pinning of anonymous memory.
> However, assume we end up pinning (R/O long-term) a pagecache page
> or the shared zeropage inside a writable private ("COW") mapping.
> The next write access will trigger a write-fault and replace the
> pinned page by an exclusive anonymous page in the process page
> tables to break COW: the pinned page no longer corresponds to the
> page mapped into the process' page table.
> Now that FAULT_FLAG_UNSHARE can break COW on anything mapped into a
> COW mapping, let's properly break COW first before R/O long-term
> pinning something that's not an exclusive anon page inside a COW
> mapping. FAULT_FLAG_UNSHARE will break COW and map an exclusive
> anon page instead that can get pinned safely.
>
> With this change, we can stop using FOLL_FORCE|FOLL_WRITE for
> reliable R/O long-term pinning in COW mappings.
>
> [...]
>
> Note 3: For users that use FOLL_LONGTERM right now without
> FOLL_WRITE, such as VFIO, we'd now no longer pin the shared
> zeropage. Instead, we'd populate exclusive anon pages that we can
> pin. There was a concern that this could affect the memlock limit
> of existing setups.
>
> For example, a VM running with VFIO could run into the memlock
> limit and fail to run. However, we essentially had the same
> behavior already in commit 17839856fd58 ("gup: document and work
> around "COW can break either way" issue") which got merged into
> some enterprise distros, and there were not any such complaints. So
> most probably, we're fine.
>

Oh, I almost forgot about that one :)

Indeed, 84209e87c696 was v6.2 and c8070b787519 was v6.5.

.. and c8070b787519 was primarily concerned about !FOLL_LONGTERM usage,
so that makes sense that they would still run into zeropages.

For vfio, 84209e87c696 did the trick.

--
Cheers,

David / dhildenb