2024-02-27 12:29:35

by mawupeng

[permalink] [raw]
Subject: [Question] CoW on VM_PFNMAP vma during write fault

We found that a warning is produced during our testing; the detailed log is
shown at the end.

The core problem behind this warning is that the first pfn of this pfnmap vma
is cleared during memory-failure handling. Digging into the source, we found
that the problem can be triggered as follows:

// mmap with MAP_PRIVATE and a specific fd whose driver hooks mmap
mmap(MAP_PRIVATE, fd)
  __mmap_region
    remap_pfn_range
    // the vma is marked pfnmap and the prot of the ptes is read only

// memset this memory to trigger a write fault
handle_mm_fault
  __handle_mm_fault
    handle_pte_fault
      // write fault and !pte_write(entry)
      do_wp_page
        wp_page_copy // this allocates a new page with a valid page struct
                     // for this pfnmap vma

// inject hwpoison into the first page of this vma
madvise_inject_error
  memory_failure
    hwpoison_user_mappings
      try_to_unmap_one
        // mark this pte as invalid (hwpoison)
        mmu_notifier_range_init(&range, MMU_NOTIFY_CLEAR, 0, vma, vma->vm_mm,
                                address, range.end);

// while unmapping the vma, the first pfn of this pfnmap vma is invalid
vm_mmap_pgoff
  do_mmap
    __do_mmap_mm
      __mmap_region
        __do_munmap
          unmap_region
            unmap_vmas
              unmap_single_vma
                untrack_pfn
                  follow_phys // pte is already invalidated, WARN_ON here

CoW producing a valid page for a pfnmap vma seems odd to us. Can
remap_pfn_range be used for a private (read-only) vma at all? Once CoW
happens on a pfnmap vma during a write fault, the new page is a normal page
(its page flags are valid) as far as most mm subsystems are concerned, such
as memory-failure in this case, and extra work would be needed to handle this
special page.

During unmap, if the vma is a pfnmap vma, the page should not be unmapped
through the normal path, since struct pages are not supposed to be touched
for a pfnmap vma.

But the root question is: can we insert a valid page into a pfnmap vma at all?

Any thoughts on how to solve this warning?
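
For reference, here is a minimal userspace sketch of the trigger sequence
above. The device path /dev/pfnmap_test is only a placeholder for any driver
whose ->mmap() handler calls remap_pfn_range(); the mapping size is arbitrary
and MADV_HWPOISON needs CAP_SYS_ADMIN:

/*
 * Hypothetical reproducer sketch -- device name and sizes are
 * placeholders, not taken from the real test module.
 */
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>
#include <unistd.h>

int main(void)
{
	size_t len = 8 * 4096;
	int fd = open("/dev/pfnmap_test", O_RDWR);
	char *p;

	if (fd < 0) {
		perror("open");
		return 1;
	}

	/* MAP_PRIVATE on a remap_pfn_range()-backed fd: VM_PFNMAP is set
	 * and the ptes are installed read-only, but the mapping is CoW-able. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE, MAP_PRIVATE, fd, 0);
	if (p == MAP_FAILED) {
		perror("mmap");
		return 1;
	}

	/* Write fault -> do_wp_page() -> wp_page_copy(): normal anon pages
	 * now back this VM_PFNMAP vma. */
	memset(p, 0x55, len);

	/* Poison the first page; try_to_unmap_one() clears its pte.
	 * Needs CAP_SYS_ADMIN. */
	if (madvise(p, 4096, MADV_HWPOISON))
		perror("madvise(MADV_HWPOISON)");

	/* On exit, exit_mmap() -> unmap_single_vma() -> untrack_pfn() ->
	 * follow_phys() finds the first pte cleared and fires the WARN_ON. */
	return 0;
}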

------------[ cut here ]------------
WARNING: CPU: 0 PID: 503 at arch/x86/mm/pat/memtype.c:1060 untrack_pfn+0xed/0x100
Modules linked in: remap_pfn(OE)
CPU: 0 PID: 503 Comm: remap_pfn Tainted: G OE 6.8.0-rc6-dirty #436
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
RIP: 0010:untrack_pfn+0xed/0x100
Code: cc cc cc cc 48 8b 43 10 8b a8 e8 00 00 00 3b 6b 28 74 ca 48 8b 7b 30 e8 81 de cf 00 89 6b 28 48 8b 7b 30 e8 05 cc b7 e8 ba c3 ce 00 66 2e 0f 1f 84 00 00 00 00 00 90 90 90
RSP: 0018:ffffb5f683eafc78 EFLAGS: 00010282
RAX: 00000000ffffffea RBX: ffff960b18537658 RCX: 0000000000000043
RDX: bfffffffdcb13e00 RSI: 0000000000000043 RDI: ffff960e45b7a140
RBP: 0000000000000000 R08: 00007f7df7a00000 R09: ffff960a00000fb8
R10: ffff960a00000000 R11: 000fffffffffffff R12: 00007f7df7a08000
R13: 0000000000000000 R14: ffffb5f683eafdc8 R15: 0000000000000000
FS: 0000000000000000(0000) GS:ffff960e2fc00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f7df7aeb110 CR3: 0000000118b66003 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x84/0x140
? untrack_pfn+0xed/0x100
? report_bug+0x1bd/0x1d0
? handle_bug+0x3c/0x70
? exc_invalid_op+0x18/0x70
? asm_exc_invalid_op+0x1a/0x20
? untrack_pfn+0xed/0x100
? untrack_pfn+0x5c/0x100
unmap_single_vma+0xa6/0xe0
unmap_vmas+0xb2/0x190
exit_mmap+0xee/0x3c0
mmput+0x68/0x120
do_exit+0x2ec/0xb80
do_group_exit+0x31/0x80
__x64_sys_exit_group+0x18/0x20
do_syscall_64+0x66/0x180
entry_SYSCALL_64_after_hwframe+0x6e/0x76
RIP: 0033:0x7f7df7aeb146
Code: Unable to access opcode bytes at 0x7f7df7aeb11c.
RSP: 002b:00007ffe571100a8 EFLAGS: 00000246 ORIG_RAX: 00000000000000e7
RAX: ffffffffffffffda RBX: 00007f7df7bf08a0 RCX: 00007f7df7aeb146
RDX: 0000000000000000 RSI: 000000000000003c RDI: 0000000000000000
RBP: 0000000000000000 R08: 00000000000000e7 R09: ffffffffffffff80
R10: 0000000000000002 R11: 0000000000000246 R12: 00007f7df7bf08a0
R13: 0000000000000001 R14: 00007f7df7bf92e8 R15: 0000000000000000
</TASK>
---[ end trace 0000000000000000 ]---



2024-03-04 09:04:56

by David Hildenbrand

[permalink] [raw]
Subject: Re: [Question] CoW on VM_PFNMAP vma during write fault

On 04.03.24 09:47, mawupeng wrote:
> Hi Maintainers, kindly ping...
>
> On 2024/2/28 9:55, mawupeng wrote:
>>
>>
>> On 2024/2/27 21:15, David Hildenbrand wrote:
>>> On 27.02.24 14:00, David Hildenbrand wrote:
>>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>>> We find that a warn will be produced during our test, the detail log is
>>>>> shown in the end.
>>>>>
>>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>>> cleared during memory-failure. Digging into the source we find that this
>>>>> problem can be triggered as following:
>>>>>
>>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>>> mmap(MAP_PRIVATE, fd)
>>>>>     __mmap_region
>>>>>       remap_pfn_range
>>>>>       // set vma with pfnmap and the prot of pte is read only
>>>>>
>>>>
>>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>>
>>>> What fd is that exactly? Often, we disallow private mappings in the
>>>> mmap() callback (for a good reason).
>
> We found this problem on 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr as ring") fixes this
> problem while adding vaddr support, by replacing the VM_PFNMAP mapping with VM_MIXEDMAP. But other
> modules which use remap_pfn_range may still have this problem.

I wrote a simple reproducer using MAP_PRIVATE of iouring queues on Friday.

>
> It does seem weird for private mappings. What is the good reason?

I'm sure there are some use cases that require MAP_PRIVATE of such
areas, and usually there is nothing wrong with that.

It's just that the PAT implementation is incompatible with it.
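
To spell out the incompatibility: on x86, a remap_pfn_range() mapping gets
VM_PAT, and when such a vma is torn down untrack_pfn() re-walks the first pte
to recover the physical range whose memtype reservation must be released.
Roughly paraphrased from the 6.8-era arch/x86/mm/pat/memtype.c (simplified,
arguments trimmed, not the exact upstream code):

	/* untrack_pfn(), whole-vma teardown case */
	paddr = (resource_size_t)pfn << PAGE_SHIFT;
	if (!paddr && !size) {
		/*
		 * Recover the base paddr that remap_pfn_range() established
		 * by looking at the pte at vma->vm_start. If that pte was
		 * cleared (here: by hwpoison), this lookup fails.
		 */
		if (follow_phys(vma, vma->vm_start, 0, &prot, &paddr)) {
			WARN_ON_ONCE(1);
			return;
		}
		size = vma->vm_end - vma->vm_start;
	}
	free_pfn_range(paddr, size);

With CoW, the page behind that first pte can become a normal anon page that
memory-failure is happy to unmap, after which the pte is gone and the lookup
above fails.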

I can submit a cleaned-up version of my patches.

--
Cheers,

David / dhildenb


2024-03-04 09:19:00

by David Hildenbrand

[permalink] [raw]
Subject: Re: [Question] CoW on VM_PFNMAP vma during write fault

On 04.03.24 10:04, mawupeng wrote:
>
>
> On 2024/3/4 16:57, David Hildenbrand wrote:
>> On 04.03.24 09:47, mawupeng wrote:
>>> Hi Maintainers, kindly ping...
>>>
>>> On 2024/2/28 9:55, mawupeng wrote:
>>>>
>>>>
>>>> On 2024/2/27 21:15, David Hildenbrand wrote:
>>>>> On 27.02.24 14:00, David Hildenbrand wrote:
>>>>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>>>>> We find that a warn will be produced during our test, the detail log is
>>>>>>> shown in the end.
>>>>>>>
>>>>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>>>>> cleared during memory-failure. Digging into the source we find that this
>>>>>>> problem can be triggered as following:
>>>>>>>
>>>>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>>>>> mmap(MAP_PRIVATE, fd)
>>>>>>>      __mmap_region
>>>>>>>        remap_pfn_range
>>>>>>>        // set vma with pfnmap and the prot of pte is read only
>>>>>>>
>>>>>>
>>>>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>>>>
>>>>>> What fd is that exactly? Often, we disallow private mappings in the
>>>>>> mmap() callback (for a good reason).
>>>
>>> We found this problem on 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr as ring") fixes this
>>> problem while adding vaddr support, by replacing the VM_PFNMAP mapping with VM_MIXEDMAP. But other
>>> modules which use remap_pfn_range may still have this problem.
>>
>> I wrote a simple reproducer using MAP_PRIVATE of iouring queues on Friday.
>>
>>>
>>> It does seem weird for private mappings. What is the good reason?
>>
>> I'm sure there are some use cases that require MAP_PRIVATE of such areas, and usually there is nothing wrong with that.
>
> So MAP_PRIVATE for a VM_PFNMAP area with write access is OK? What is the use case for this situation?

I recall that MAP_PRIVATE /dev/mem mappings were required for some use
cases. No details/ideas about other users, though.

Likely a sufficiently important use case that people really thought about
ways to get it working -- see vm_normal_page() :)
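
For context, the VM_PFNMAP handling there looks roughly like this (paraphrased
and simplified from mm/memory.c, the !pte_special() fallback path -- not the
exact upstream code): a pte that still points at the pfn remap_pfn_range()
installed is treated as special (no struct page), while a CoW'ed pte in such a
vma is allowed to carry a normal page:

	if (unlikely(vma->vm_flags & (VM_PFNMAP | VM_MIXEDMAP))) {
		if (vma->vm_flags & VM_MIXEDMAP) {
			if (!pfn_valid(pfn))
				return NULL;	/* no struct page */
		} else {
			unsigned long off;

			off = (addr - vma->vm_start) >> PAGE_SHIFT;
			/* still the originally remapped pfn: special */
			if (pfn == vma->vm_pgoff + off)
				return NULL;
			/* only CoW mappings may carry normal pages here */
			if (!is_cow_mapping(vma->vm_flags))
				return NULL;
		}
	}
	/* otherwise fall through and return pfn_to_page(pfn) */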

--
Cheers,

David / dhildenb


2024-03-04 09:04:53

by mawupeng

[permalink] [raw]
Subject: Re: [Question] CoW on VM_PFNMAP vma during write fault



On 2024/3/4 16:57, David Hildenbrand wrote:
> On 04.03.24 09:47, mawupeng wrote:
>> Hi Maintainers, kindly ping...
>>
>> On 2024/2/28 9:55, mawupeng wrote:
>>>
>>>
>>> On 2024/2/27 21:15, David Hildenbrand wrote:
>>>> On 27.02.24 14:00, David Hildenbrand wrote:
>>>>> On 27.02.24 13:28, Wupeng Ma wrote:
>>>>>> We find that a warn will be produced during our test, the detail log is
>>>>>> shown in the end.
>>>>>>
>>>>>> The core problem of this warn is that the first pfn of this pfnmap vma is
>>>>>> cleared during memory-failure. Digging into the source we find that this
>>>>>> problem can be triggered as following:
>>>>>>
>>>>>> // mmap with MAP_PRIVATE and specific fd which hook mmap
>>>>>> mmap(MAP_PRIVATE, fd)
>>>>>>      __mmap_region
>>>>>>        remap_pfn_range
>>>>>>        // set vma with pfnmap and the prot of pte is read only
>>>>>>     
>>>>>
>>>>> Okay, so we get a MAP_PRIVATE VM_PFNMAP I assume.
>>>>>
>>>>> What fd is that exactly? Often, we disallow private mappings in the
>>>>> mmap() callback (for a good reason).
>>
>> We found this problem on 5.10. Commit 9f78bf330a66 ("xsk: support use vaddr as ring") fixes this
>> problem while adding vaddr support, by replacing the VM_PFNMAP mapping with VM_MIXEDMAP. But other
>> modules which use remap_pfn_range may still have this problem.
>
> I wrote a simple reproducer using MAP_PRIVATE of iouring queues on Friday.
>
>>
>> It does seem weird for private mappings. What is the good reason?
>
> I'm sure there are some use cases that require MAP_PRIVATE of such areas, and usually there is nothing wrong with that.

So MAP_PRIVATE for a VM_PFNMAP area with write access is OK? What is the use case for this situation?

>
> It's just that the PAT implementation is incompatible with it.

PAT does have its problems.

>
> I can submit a cleaned-up version of my patches.

Thanks

>