Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20;
Message-ID: <1b5ce269-bb54-2f65-6919-8b2bb453c09a@redhat.com>
Date:   Tue, 27 Jun 2023 12:18:58 +0200
MIME-Version: 1.0
User-Agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101
 Thunderbird/102.12.0
Content-Language: en-US
To:     Lorenzo Stoakes <lstoakes@gmail.com>
Cc:     Vlastimil Babka <vbabka@suse.cz>, linux-mm@kvack.org,
        linux-kernel@vger.kernel.org,
        Andrew Morton <akpm@linux-foundation.org>,
        Mike Rapoport <rppt@kernel.org>,
        "Liam R . Howlett" <Liam.Howlett@oracle.com>
References: <20230626204612.106165-1-lstoakes@gmail.com>
 <074fc253-beb4-f7be-14a1-ee5f4745c15b@suse.cz>
 <1832a772-93b4-70ad-3a2c-d8da104ce8dd@redhat.com>
 <40cd965f-ba4f-44d8-8e7c-d6267b91a9b3@lucifer.local>
 <57c677d1-9809-966e-5254-f01f59eff7bc@redhat.com>
 <a6ae5c22-80f7-47a5-9ba6-2217f20e673e@lucifer.local>
From:   David Hildenbrand <david@redhat.com>
Organization: Red Hat
Subject: Re: [PATCH] mm/mprotect: allow unfaulted VMAs to be unaccounted on
 mprotect()
In-Reply-To: <a6ae5c22-80f7-47a5-9ba6-2217f20e673e@lucifer.local>
Content-Type: text/plain; charset=UTF-8; format=flowed
Content-Transfer-Encoding: 7bit
Precedence: bulk

[...]

>>
>> Yeah, and that needs time and you have to motivate me :)
>>
> 
> Beer? ;)

Oh, that always works :)

> 
>>> Well the motivator for the initial investigation was rppt playing with
>>> R[WO]X (this came from an #mm irc conversation), however in his case he
>>> will be mapping pages between the two.
>>
>> And that's the scenario I think we care about in practice (actually
>> accessing memory).

[...]

>>> In real-use scenarios, yes fuzzers are a thing, but what comes to mind more
>>> immediately is a process that maps a big chunk of virtual memory PROT_NONE
>>> and uses that as part of an internal allocator.
>>>
>>> If the process then allocates memory from this chunk (mprotect() ->
>>> PROT_READ | PROT_WRITE), which then gets freed without being used
>>> (mprotect() -> PROT_NONE) we hit the issue. For OVERCOMMIT_NEVER this could
>>> become quite an issue more so than the VMA fragmentation.
>>
>> Using mprotect() when allocating/freeing memory in an allocator is already
>> horribly harmful for performance (well, and the #VMAs), so I don't think
>> that scenario is relevant in practice.
> 
> Chrome for instance maintains vast memory ranges as PROT_NONE. I've not dug
> into what they're doing, but surely to make use of them they'd need to
> mprotect() or mmap()/mremap() (which maybe is what the intent is)

I suspect they are doing something similar than glibc (and some other 
allocators like jemalloc IIRC), because they want to minimze the #VMAs.

> 
> But fair point. However I can't imagine m[re]map'ing like this would be
> cheap either, as you're doing the same kind of expensive operations, so the
> general _approach_ seems like it's used in some way in practice.

Usually people access memory and not play mprotect() games for fun :)

> 
>>
>> What some allocators (iirc even glibc) do is reserve a bigger area with
>> PROT_NONE and grow the accessible part slowly on demand, discarding freed
>> memory using MADV_DONTNEED. So you essentially end up with two VMAs -- one
>> completely accessible, one completely inaccessible.
>>
>> They don't use mprotect() because:
>> (a) It's bad for performance
>> (b) It might increase the #VMAs
>>
>> There is efence, but I remember it simply does mmap()+munmap() and runs into
>> VMA limits easily just by relying on a lot of mappings.
>>
>>
>>>
>>> In addition, I think a user simply doing the artificial test above would
>>> find the split remaining quite confusing, and somebody debugging some code
>>> like this would equally wonder why it happened, so there is benefit in
>>> clarity too (they of course observing the VMA fragmentation from the
>>> perspective of /proc/$pid/[s]maps).
>>
>> My answer would have been "memory gets commited the first time we allow
>> write access, and that wasn't the case for all memory in that range".
>>
>>
>> Now, take your example above and touch the memory.
>>
>>
>> ptr = mmap(NULL, page_size * 3, PROT_READ, MAP_ANON | MAP_PRIVATE, -1, 0);
>> mprotect(ptr + page_size, page_size, PROT_READ | PROT_WRITE);
>> *(ptr + page_size) = 1;
>> mprotect(ptr + page_size, page_size, PROT_READ);
>>
>>
>> And we'll not merge the VMAs.
>>
>> Which, at least to me, makes existing handling more consistent.
> 
> Indeed, but I don't think it's currently consistent at all.
> 
> The 'correct' solution would be to:-
> 
> 1. account for the block when it becomes writable
> 2. unaccount for any pages not used when it becomes unwritable
> 

I've been messing with something related (but slightly different) for a 
while now in my mind, and I'm not at the point where I can talk about my 
work/idea yet.

But because I've been messing with it, I can comment on some existing 
oddities. Just imagine:

* userfaultfd() can place anon pages even in PROT_NONE areas
* ptrace can place anon pages in PROT_READ areas
* "fun" like the forbidden shared zeropage on s390x in some VMAs can
   place anon pages into PROT_READ areas.

It's all far from "correct" when talking about memory accounting. But it 
seems to get the job done for the most case for now.

> However since we can't go from vma -> folios for anon pages without some
> extreme effort this is not feasible.
> 
> Therefore the existing code hacks it and just keep things accountable.
> 
> The patch reduces the hacking so we get halfway to the correct approach.
> 
> So before: "if you ever make this read/write, we account it forever"
> After: "if you ever make this read/write and USE IT, we account it forever"
> 

"USE" is probably the wrong word. Maybe "MODIFIED", but there are other 
cases (MADV_POPULATE_WRITE)

> To me it is more consistent. Of course this is subjective...
> 
You made the conditional more complicated to make it consistent, won't 
argue with that :)

>>
>> And users could rightfully wonder "why isn't it getting merged". And the
>> answer would be the same: "memory gets commited the first time we allow
>> write access, and that wasn't the case for all memory in that range".
>>
> 
> Yes indeed, a bigger answer is that we don't have fine-grained accounting
> for pages for anon_vma.

Yes, VM_ACCOUNT is all-or nothing, which makes a lot of sense in many 
cases (not in all, though).

[...]

>>
>>>>> So in practice programs will likely do the PROT_WRITE in order to actually
>>>>> populate the area, so this won't trigger as I commented above. But it can
>>>>> still help in some cases and is cheap to do, so:
>>>>
>>>> IMHO we should much rather look into getting hugetlb ranges merged. Mt
>>>> recollection is that we'll never end up merging hugetlb VMAs once split.
>>>
>>> I'm not sure how that's relevant to fragmented non-hugetlb VMAs though?
>>
>> It's a VMA merging issue that can be hit in practice, so I raised it.
>>
>>
>> No strong opinion from my side, just my 2 cents reading the patch
>> description and wondering "why do we even invest time thinking about this
>> case" -- and eventually make handling less consistent IMHO (see above).
> 
> Hmm it seems ilke you have quite a strong opinion :P but this is why I cc-d
> you, as you are a great scrutiniser.

I might make it sound like a strong opinion (because I am challenging 
the motivation), but there is no nak :)

> 
> Yeah, the time investment was just by accident, the patch was originally a
> throwaway thing to prove the point :]
> 
> I very much appreciate your time though! And I owe you at least one beer now.
> 
> I would ask that while you might question the value, whether you think it
> so harmful as not to go in, so Andrew can know whether this debate = don't
> take?
> 
> An Ack-with-meh would be fine. But also if you want to nak, it's also
> fine. I will buy you the beer either way ;)

It's more a "no nak" --  I don't see the real benefit but I also don't 
see the harm (as long as VMA locking is not an issue). If others see the 
benefit, great, so I'll let these decide.

-- 
Cheers,

David / dhildenb