2021-10-21 19:54:09

by Nadav Amit

Subject: [PATCH v2 0/5] mm/mprotect: avoid unnecessary TLB flushes

From: Nadav Amit <[email protected]>

This patch-set is intended to remove unnecessary TLB flushes. It is
based on feedback from v1 and several bugs I found in v1 myself.

Basically, there are 3 optimizations in this patch-set:
1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
prevent the A/D bits from changing.
2. Use TLB batching infrastructure to batch flushes across VMAs and
do better/fewer flushes.
3. Avoid TLB flushes on permission demotion.

Andrea asked for the aforementioned (2) to come after (3), but this
is not simple (specifically since change_prot_numa() needs the number
of pages affected).

There are many changes from v1 to v2 so consider the change log as
partial.

v1->v2:
* Wrong detection of permission demotion [Andrea]
* Better comments [Andrea]
* Handle THP [Andrea]
* Batching across VMAs [Peter Xu]
* Avoid open-coding PTE analysis
* Fix wrong use of the mmu_gather()

Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]

Nadav Amit (5):
x86: Detection of Knights Landing A/D leak
mm: avoid unnecessary flush on change_huge_pmd()
x86/mm: check exec permissions on fault
mm/mprotect: use mmu_gather
mm/mprotect: do not flush on permission promotion

arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/include/asm/pgtable.h | 8 +++
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 80 +++++++++++++++++++++++
arch/x86/kernel/cpu/intel.c | 5 ++
arch/x86/mm/fault.c | 11 +++-
fs/exec.c | 6 +-
include/asm-generic/tlb.h | 14 +++++
include/linux/huge_mm.h | 5 +-
include/linux/mm.h | 5 +-
include/linux/pgtable.h | 5 ++
mm/huge_memory.c | 22 ++++---
mm/mprotect.c | 94 +++++++++++++++-------------
mm/pgtable-generic.c | 8 +++
mm/userfaultfd.c | 6 +-
15 files changed, 215 insertions(+), 57 deletions(-)

--
2.25.1


2021-10-21 19:54:15

by Nadav Amit

Subject: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()

From: Nadav Amit <[email protected]>

Calls to change_protection_range() on THP can trigger, at least on x86,
two TLB flushes for one page: one immediately, when pmdp_invalidate() is
called by change_huge_pmd(), and then another one later (that can be
batched) when change_protection_range() finishes.

The first TLB flush is only needed to prevent the dirty bit (and, with
lesser importance, the access bit) from changing while the PTE is
modified. However, on x86 this flush is unnecessary, as the CPUs set the
dirty bit atomically, with an additional check that the PTE is (still)
present. One caveat is Intel's Knights Landing, which has a bug and does
not do so.

Leverage this behavior to eliminate the unnecessary TLB flush in
change_huge_pmd(). Introduce a new arch-specific pmdp_invalidate_ad()
that invalidates the PMD while preventing further hardware updates of
the access and dirty bits.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/pgtable.h | 8 ++++++++
include/linux/pgtable.h | 5 +++++
mm/huge_memory.c | 7 ++++---
mm/pgtable-generic.c | 8 ++++++++
4 files changed, 25 insertions(+), 3 deletions(-)

diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
index 448cd01eb3ec..18c3366f8f4d 100644
--- a/arch/x86/include/asm/pgtable.h
+++ b/arch/x86/include/asm/pgtable.h
@@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
}
}
#endif
+
+#define __HAVE_ARCH_PMDP_INVALIDATE_AD
+static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp)
+{
+ return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+}
+
/*
* Page table pages are page-aligned. The lower half of the top
* level is used for userspace and the top half for the kernel.
diff --git a/include/linux/pgtable.h b/include/linux/pgtable.h
index e24d2c992b11..622efe0a9ef0 100644
--- a/include/linux/pgtable.h
+++ b/include/linux/pgtable.h
@@ -561,6 +561,11 @@ extern pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp);
#endif

+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+extern pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
+ unsigned long address, pmd_t *pmdp);
+#endif
+
#ifndef __HAVE_ARCH_PTE_SAME
static inline int pte_same(pte_t pte_a, pte_t pte_b)
{
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index e5ea5f775d5c..435da011b1a2 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
* The race makes MADV_DONTNEED miss the huge pmd and don't clear it
* which may break userspace.
*
- * pmdp_invalidate() is required to make sure we don't miss
- * dirty/young flags set by hardware.
+ * pmdp_invalidate_ad() is required to make sure we don't miss
+ * dirty/young flags (which are also known as access/dirty) cannot be
+ * further modifeid by the hardware.
*/
- entry = pmdp_invalidate(vma, addr, pmd);
+ entry = pmdp_invalidate_ad(vma, addr, pmd);

entry = pmd_modify(entry, newprot);
if (preserve_write)
diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
index 4e640baf9794..b0ce6c7391bf 100644
--- a/mm/pgtable-generic.c
+++ b/mm/pgtable-generic.c
@@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
}
#endif

+#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ return pmdp_invalidate(vma, address, pmdp);
+}
+#endif
+
#ifndef pmdp_collapse_flush
pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
pmd_t *pmdp)
--
2.25.1

2021-10-21 19:54:15

by Nadav Amit

Subject: [PATCH v2 1/5] x86: Detection of Knights Landing A/D leak

From: Nadav Amit <[email protected]>

Knights Landing has an issue in which a thread setting the A or D bits
may not do so atomically against checking the present bit. A thread
which is about to page fault may still set those bits, even though the
present bit was already atomically cleared.

This implies that after the kernel atomically clears the present bit,
the supposedly zero entry could later be corrupted with stray A or D
bits.

Since the PTE could already be used for storing a swap index, or a NUMA
migration index, this cannot be tolerated. Most of the time the kernel
detects the problem, but in some rare cases it may not.

This patch adds an interface to detect the bug, which will be used in
the following patch.

[ Based on a patch by Andi Kleen ]
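
[ For reference, a bug flag like this would typically be consumed as
  follows (a hypothetical usage sketch, not part of this patch; the
  variables are placeholders):

	if (boot_cpu_has_bug(X86_BUG_PTE_LEAK))
		/* flush so stray A/D bits cannot corrupt the cleared entry */
		flush_tlb_range(vma, start, end);
]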

Cc: Andi Kleen <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Link: https://lore.kernel.org/lkml/[email protected]/
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/include/asm/cpufeatures.h | 1 +
arch/x86/kernel/cpu/intel.c | 5 +++++
2 files changed, 6 insertions(+)

diff --git a/arch/x86/include/asm/cpufeatures.h b/arch/x86/include/asm/cpufeatures.h
index d0ce5cfd3ac1..32d0aabd788d 100644
--- a/arch/x86/include/asm/cpufeatures.h
+++ b/arch/x86/include/asm/cpufeatures.h
@@ -436,5 +436,6 @@
#define X86_BUG_TAA X86_BUG(22) /* CPU is affected by TSX Async Abort(TAA) */
#define X86_BUG_ITLB_MULTIHIT X86_BUG(23) /* CPU may incur MCE during certain page attribute changes */
#define X86_BUG_SRBDS X86_BUG(24) /* CPU may leak RNG bits if not mitigated */
+#define X86_BUG_PTE_LEAK X86_BUG(25) /* PTE may leak A/D bits after clear */

#endif /* _ASM_X86_CPUFEATURES_H */
diff --git a/arch/x86/kernel/cpu/intel.c b/arch/x86/kernel/cpu/intel.c
index 8321c43554a1..40bcba6e3641 100644
--- a/arch/x86/kernel/cpu/intel.c
+++ b/arch/x86/kernel/cpu/intel.c
@@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
}
}

+ if (c->x86_model == 87) {
+ pr_info_once("Enabling PTE leaking workaround\n");
+ set_cpu_bug(c, X86_BUG_PTE_LEAK);
+ }
+
/*
* Intel Quark Core DevMan_001.pdf section 6.4.11
* "The operating system also is required to invalidate (i.e., flush)
--
2.25.1

2021-10-21 19:54:38

by Nadav Amit

Subject: [PATCH v2 5/5] mm/mprotect: do not flush on permission promotion

From: Nadav Amit <[email protected]>

Currently, using mprotect() or userfaultfd to unprotect a memory region
causes a TLB flush. At least on x86, when protection is promoted, no TLB
flush is needed.

Add an arch-specific pte_may_need_flush() which tells whether a TLB
flush is needed based on the old PTE and the new one. Implement an x86
pte_may_need_flush().

For x86, besides the simple logic that PTE protection promotion or
changes of software bits do not require a flush, also add logic that
considers the dirty-bit. If the dirty-bit is clear and write-protect is
set, no TLB flush is needed, as x86 updates the dirty-bit atomically
on write, and if the bit is clear, the PTE is reread.
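
For illustration, the policy that falls out of the masks in
pte_flags_may_need_flush() below is roughly the following (not an
exhaustive list, and not part of the patch itself):

	RW 1->0 (write-protect):           flush
	RW 0->1 (make writable):           no flush
	NX 0->1 (remove exec):             flush
	NX 1->0 (grant exec):              no flush
	DIRTY 1->0:                        flush
	ACCESSED or software bits change:  no flush
	any other flag change:             flush (err on the safe side)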

Signed-off-by: Nadav Amit <[email protected]>
Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
---
arch/x86/include/asm/pgtable_types.h | 2 +
arch/x86/include/asm/tlbflush.h | 80 ++++++++++++++++++++++++++++
include/asm-generic/tlb.h | 14 +++++
mm/huge_memory.c | 9 ++--
mm/mprotect.c | 3 +-
5 files changed, 103 insertions(+), 5 deletions(-)

diff --git a/arch/x86/include/asm/pgtable_types.h b/arch/x86/include/asm/pgtable_types.h
index 40497a9020c6..8668bc661026 100644
--- a/arch/x86/include/asm/pgtable_types.h
+++ b/arch/x86/include/asm/pgtable_types.h
@@ -110,9 +110,11 @@
#if defined(CONFIG_X86_64) || defined(CONFIG_X86_PAE)
#define _PAGE_NX (_AT(pteval_t, 1) << _PAGE_BIT_NX)
#define _PAGE_DEVMAP (_AT(u64, 1) << _PAGE_BIT_DEVMAP)
+#define _PAGE_SOFTW4 (_AT(pteval_t, 1) << _PAGE_BIT_SOFTW4)
#else
#define _PAGE_NX (_AT(pteval_t, 0))
#define _PAGE_DEVMAP (_AT(pteval_t, 0))
+#define _PAGE_SOFTW4 (_AT(pteval_t, 0))
#endif

#define _PAGE_PROTNONE (_AT(pteval_t, 1) << _PAGE_BIT_PROTNONE)
diff --git a/arch/x86/include/asm/tlbflush.h b/arch/x86/include/asm/tlbflush.h
index b587a9ee9cb2..a782adde3d62 100644
--- a/arch/x86/include/asm/tlbflush.h
+++ b/arch/x86/include/asm/tlbflush.h
@@ -259,6 +259,86 @@ static inline void arch_tlbbatch_add_mm(struct arch_tlbflush_unmap_batch *batch,

extern void arch_tlbbatch_flush(struct arch_tlbflush_unmap_batch *batch);

+/*
+ * The enable_mask tells which bits, if they were set and got cleared,
+ * require a flush.
+ *
+ * The disable_mask tells which bits, if they were clear and got set,
+ * require a flush.
+ *
+ * All the other bits, except the ignored bits, require a flush no matter
+ * whether they got set or cleared.
+ *
+ * Note that we ignore the accessed bit, since the kernel does not flush
+ * after clearing it in other situations anyway. We also ignore the global
+ * bit, as it is used for protnone.
+ */
+static inline bool pte_flags_may_need_flush(unsigned long oldflags,
+ unsigned long newflags)
+{
+ const pteval_t ignore_mask = _PAGE_SOFTW1 | _PAGE_SOFTW2 |
+ _PAGE_SOFTW3 | _PAGE_SOFTW4 | _PAGE_ACCESSED | _PAGE_GLOBAL;
+ const pteval_t enable_mask = _PAGE_RW | _PAGE_DIRTY | _PAGE_PRESENT;
+ const pteval_t disable_mask = _PAGE_NX;
+ unsigned long diff = oldflags ^ newflags;
+
+ return diff & ((oldflags & enable_mask) |
+ (newflags & disable_mask) |
+ ~(enable_mask | disable_mask | ignore_mask));
+}
+
+/*
+ * pte_may_need_flush() checks whether permissions were demoted and require a
+ * flush. It should only be used for userspace PTEs.
+ */
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+ /* new is non-present: need only if old is present */
+ if (!pte_present(newpte))
+ return pte_present(oldpte);
+
+ /* old is not present: no need for flush */
+ if (!pte_present(oldpte))
+ return false;
+
+ /*
+ * Avoid open-coding to account for protnone_mask() and perform
+ * comparison of the PTEs.
+ */
+ if (pte_pfn(oldpte) != pte_pfn(newpte))
+ return true;
+
+ return pte_flags_may_need_flush(pte_flags(oldpte),
+ pte_flags(newpte));
+}
+#define pte_may_need_flush pte_may_need_flush
+
+/*
+ * huge_pmd_may_need_flush() checks whether permissions were demoted and
+ * require a flush. It should only be used for userspace huge PMDs.
+ */
+static inline bool huge_pmd_may_need_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+ /* new is non-present: need only if old is present */
+ if (!pmd_present(newpmd))
+ return pmd_present(oldpmd);
+
+ /* old is not present: no need for flush */
+ if (!pmd_present(oldpmd))
+ return false;
+
+ /*
+ * Avoid open-coding to account for protnone_mask() and perform
+ * comparison of the PTEs.
+ */
+ if (pmd_pfn(oldpmd) != pmd_pfn(newpmd))
+ return true;
+
+ return pte_flags_may_need_flush(pmd_flags(oldpmd),
+ pmd_flags(newpmd));
+}
+#define huge_pmd_may_need_flush huge_pmd_may_need_flush
+
#endif /* !MODULE */

#endif /* _ASM_X86_TLBFLUSH_H */
diff --git a/include/asm-generic/tlb.h b/include/asm-generic/tlb.h
index 2c68a545ffa7..2d3736c62602 100644
--- a/include/asm-generic/tlb.h
+++ b/include/asm-generic/tlb.h
@@ -654,6 +654,20 @@ static inline void tlb_flush_p4d_range(struct mmu_gather *tlb,
} while (0)
#endif

+#ifndef pte_may_need_flush
+static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
+{
+ return true;
+}
+#endif
+
+#ifndef huge_pmd_may_need_flush
+static inline bool huge_pmd_may_need_flush(pmd_t oldpmd, pmd_t newpmd)
+{
+ return true;
+}
+#endif
+
#endif /* CONFIG_MMU */

#endif /* _ASM_GENERIC__TLB_H */
diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index f5d0357a25ce..f80936324e6a 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1726,7 +1726,7 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
- pmd_t entry;
+ pmd_t oldpmd, entry;
bool preserve_write;
int ret;
bool prot_numa = cp_flags & MM_CP_PROT_NUMA;
@@ -1802,9 +1802,9 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
* dirty/young flags (which are also known as access/dirty) cannot be
* further modifeid by the hardware.
*/
- entry = pmdp_invalidate_ad(vma, addr, pmd);
+ oldpmd = pmdp_invalidate_ad(vma, addr, pmd);

- entry = pmd_modify(entry, newprot);
+ entry = pmd_modify(oldpmd, newprot);
if (preserve_write)
entry = pmd_mk_savedwrite(entry);
if (uffd_wp) {
@@ -1821,7 +1821,8 @@ int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);

- tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+ if (huge_pmd_may_need_flush(oldpmd, entry))
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);

BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 0f5c87af5c60..6179c82ea72d 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -141,7 +141,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
- tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
+ if (pte_may_need_flush(oldpte, ptent))
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
--
2.25.1

2021-10-21 19:54:49

by Nadav Amit

Subject: [PATCH v2 3/5] x86/mm: check exec permissions on fault

From: Nadav Amit <[email protected]>

access_error() currently does not check for execution permission
violation. As a result, spurious page-faults due to execution permission
violation cause SIGSEGV.

It appears not to be an issue so far, but the next patches avoid TLB
flushes on permission promotion, which can lead to this scenario. nodejs
for instance crashes when TLB flush is avoided on permission promotion.

Add a check to prevent access_error() from returning mistakenly that
page-faults due to instruction fetch are not allowed. Intel SDM does not
indicate whether "instruction fetch" and "write" in the hardware error
code are mutual exclusive, so check both before returning whether the
access is allowed.

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
arch/x86/mm/fault.c | 11 +++++++++--
1 file changed, 9 insertions(+), 2 deletions(-)

diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
index b2eefdefc108..e776130473ce 100644
--- a/arch/x86/mm/fault.c
+++ b/arch/x86/mm/fault.c
@@ -1100,10 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
(error_code & X86_PF_INSTR), foreign))
return 1;

- if (error_code & X86_PF_WRITE) {
+ if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
/* write, present and write, not present: */
- if (unlikely(!(vma->vm_flags & VM_WRITE)))
+ if ((error_code & X86_PF_WRITE) &&
+ unlikely(!(vma->vm_flags & VM_WRITE)))
return 1;
+
+ /* exec, present and exec, not present: */
+ if ((error_code & X86_PF_INSTR) &&
+ unlikely(!(vma->vm_flags & VM_EXEC)))
+ return 1;
+
return 0;
}

--
2.25.1

2021-10-21 19:55:56

by Nadav Amit

Subject: [PATCH v2 4/5] mm/mprotect: use mmu_gather

From: Nadav Amit <[email protected]>

change_pXX_range() currently does not use mmu_gather, but instead
implements its own deferred TLB flush scheme. This both complicates
the code, as developers need to be aware of different invalidation
schemes, and prevents opportunities to avoid TLB flushes or to perform
them at a finer granularity.

The use of mmu_gather for modified PTEs has benefits in various
scenarios even if pages are not released. For instance, if only a single
page needs to be flushed out of a range of many pages, only that page
would be flushed. If a THP page is flushed, on x86 a single TLB invlpg
instruction can be used instead of 512 instructions (or a full TLB
flush, which would Linux would actually use by default). mprotect() over
multiple VMAs requires a single flush.

Use mmu_gather in change_pXX_range(). As the pages are not released,
only record the flushed range using tlb_flush_pXX_range().

Handle THP similarly and get rid of flush_cache_range() which becomes
redundant since tlb_start_vma() calls it when needed.
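
The resulting calling convention is roughly the following (a sketch
derived from the hunks below, not additional code):

	struct mmu_gather tlb;

	tlb_gather_mmu(&tlb, mm);
	/*
	 * change_pte_range()/change_huge_pmd() record the modified ranges
	 * with tlb_flush_pte_range()/tlb_flush_pmd_range() instead of
	 * flushing immediately.
	 */
	change_protection(&tlb, vma, start, end, newprot, cp_flags);
	tlb_finish_mmu(&tlb);	/* single batched flush for the whole range */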

Cc: Andrea Arcangeli <[email protected]>
Cc: Andrew Cooper <[email protected]>
Cc: Andrew Morton <[email protected]>
Cc: Andy Lutomirski <[email protected]>
Cc: Dave Hansen <[email protected]>
Cc: Peter Xu <[email protected]>
Cc: Peter Zijlstra <[email protected]>
Cc: Thomas Gleixner <[email protected]>
Cc: Will Deacon <[email protected]>
Cc: Yu Zhao <[email protected]>
Cc: Nick Piggin <[email protected]>
Cc: [email protected]
Signed-off-by: Nadav Amit <[email protected]>
---
fs/exec.c | 6 ++-
include/linux/huge_mm.h | 5 ++-
include/linux/mm.h | 5 ++-
mm/huge_memory.c | 10 ++++-
mm/mprotect.c | 93 ++++++++++++++++++++++-------------------
mm/userfaultfd.c | 6 ++-
6 files changed, 75 insertions(+), 50 deletions(-)

diff --git a/fs/exec.c b/fs/exec.c
index 5a7a07dfdc81..7f8609bbc6b3 100644
--- a/fs/exec.c
+++ b/fs/exec.c
@@ -752,6 +752,7 @@ int setup_arg_pages(struct linux_binprm *bprm,
unsigned long stack_size;
unsigned long stack_expand;
unsigned long rlim_stack;
+ struct mmu_gather tlb;

#ifdef CONFIG_STACK_GROWSUP
/* Limit stack size */
@@ -806,8 +807,11 @@ int setup_arg_pages(struct linux_binprm *bprm,
vm_flags |= mm->def_flags;
vm_flags |= VM_STACK_INCOMPLETE_SETUP;

- ret = mprotect_fixup(vma, &prev, vma->vm_start, vma->vm_end,
+ tlb_gather_mmu(&tlb, mm);
+ ret = mprotect_fixup(&tlb, vma, &prev, vma->vm_start, vma->vm_end,
vm_flags);
+ tlb_finish_mmu(&tlb);
+
if (ret)
goto out_unlock;
BUG_ON(prev != vma);
diff --git a/include/linux/huge_mm.h b/include/linux/huge_mm.h
index f280f33ff223..a9b6e03e9c4c 100644
--- a/include/linux/huge_mm.h
+++ b/include/linux/huge_mm.h
@@ -36,8 +36,9 @@ int zap_huge_pud(struct mmu_gather *tlb, struct vm_area_struct *vma, pud_t *pud,
unsigned long addr);
bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
unsigned long new_addr, pmd_t *old_pmd, pmd_t *new_pmd);
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
- pgprot_t newprot, unsigned long cp_flags);
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags);
vm_fault_t vmf_insert_pfn_pmd_prot(struct vm_fault *vmf, pfn_t pfn,
pgprot_t pgprot, bool write);

diff --git a/include/linux/mm.h b/include/linux/mm.h
index 00bb2d938df4..f46bab158560 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -2001,10 +2001,11 @@ extern unsigned long move_page_tables(struct vm_area_struct *vma,
#define MM_CP_UFFD_WP_ALL (MM_CP_UFFD_WP | \
MM_CP_UFFD_WP_RESOLVE)

-extern unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+extern unsigned long change_protection(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
unsigned long cp_flags);
-extern int mprotect_fixup(struct vm_area_struct *vma,
+extern int mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
struct vm_area_struct **pprev, unsigned long start,
unsigned long end, unsigned long newflags);

diff --git a/mm/huge_memory.c b/mm/huge_memory.c
index 435da011b1a2..f5d0357a25ce 100644
--- a/mm/huge_memory.c
+++ b/mm/huge_memory.c
@@ -1720,8 +1720,9 @@ bool move_huge_pmd(struct vm_area_struct *vma, unsigned long old_addr,
* or if prot_numa but THP migration is not supported
* - HPAGE_PMD_NR if protections changed and TLB flush necessary
*/
-int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, pgprot_t newprot, unsigned long cp_flags)
+int change_huge_pmd(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ pmd_t *pmd, unsigned long addr, pgprot_t newprot,
+ unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
spinlock_t *ptl;
@@ -1732,6 +1733,8 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

+ tlb_change_page_size(tlb, HPAGE_PMD_SIZE);
+
if (prot_numa && !thp_migration_supported())
return 1;

@@ -1817,6 +1820,9 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
}
ret = HPAGE_PMD_NR;
set_pmd_at(mm, addr, pmd, entry);
+
+ tlb_flush_pmd_range(tlb, addr, HPAGE_PMD_SIZE);
+
BUG_ON(vma_is_anonymous(vma) && !preserve_write && pmd_write(entry));
unlock:
spin_unlock(ptl);
diff --git a/mm/mprotect.c b/mm/mprotect.c
index 883e2cc85cad..0f5c87af5c60 100644
--- a/mm/mprotect.c
+++ b/mm/mprotect.c
@@ -32,12 +32,13 @@
#include <asm/cacheflush.h>
#include <asm/mmu_context.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>

#include "internal.h"

-static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
- unsigned long addr, unsigned long end, pgprot_t newprot,
- unsigned long cp_flags)
+static unsigned long change_pte_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pmd_t *pmd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pte_t *pte, oldpte;
spinlock_t *ptl;
@@ -48,6 +49,8 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
bool uffd_wp = cp_flags & MM_CP_UFFD_WP;
bool uffd_wp_resolve = cp_flags & MM_CP_UFFD_WP_RESOLVE;

+ tlb_change_page_size(tlb, PAGE_SIZE);
+
/*
* Can be called with only the mmap_lock for reading by
* prot_numa so we must check the pmd isn't constantly
@@ -138,6 +141,7 @@ static unsigned long change_pte_range(struct vm_area_struct *vma, pmd_t *pmd,
ptent = pte_mkwrite(ptent);
}
ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
+ tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
pages++;
} else if (is_swap_pte(oldpte)) {
swp_entry_t entry = pte_to_swp_entry(oldpte);
@@ -219,9 +223,9 @@ static inline int pmd_none_or_clear_bad_unless_trans_huge(pmd_t *pmd)
return 0;
}

-static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
- pud_t *pud, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pmd_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pud_t *pud, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pmd_t *pmd;
unsigned long next;
@@ -261,8 +265,12 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
if (next - addr != HPAGE_PMD_SIZE) {
__split_huge_pmd(vma, pmd, addr, false, NULL);
} else {
- int nr_ptes = change_huge_pmd(vma, pmd, addr,
- newprot, cp_flags);
+ /*
+ * change_huge_pmd() does not defer TLB flushes,
+ * so no need to propagate the tlb argument.
+ */
+ int nr_ptes = change_huge_pmd(tlb, vma, pmd,
+ addr, newprot, cp_flags);

if (nr_ptes) {
if (nr_ptes == HPAGE_PMD_NR) {
@@ -276,8 +284,8 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
}
/* fall through, the trans huge pmd just split */
}
- this_pages = change_pte_range(vma, pmd, addr, next, newprot,
- cp_flags);
+ this_pages = change_pte_range(tlb, vma, pmd, addr, next,
+ newprot, cp_flags);
pages += this_pages;
next:
cond_resched();
@@ -291,9 +299,9 @@ static inline unsigned long change_pmd_range(struct vm_area_struct *vma,
return pages;
}

-static inline unsigned long change_pud_range(struct vm_area_struct *vma,
- p4d_t *p4d, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_pud_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, p4d_t *p4d, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
pud_t *pud;
unsigned long next;
@@ -304,16 +312,16 @@ static inline unsigned long change_pud_range(struct vm_area_struct *vma,
next = pud_addr_end(addr, end);
if (pud_none_or_clear_bad(pud))
continue;
- pages += change_pmd_range(vma, pud, addr, next, newprot,
+ pages += change_pmd_range(tlb, vma, pud, addr, next, newprot,
cp_flags);
} while (pud++, addr = next, addr != end);

return pages;
}

-static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
- pgd_t *pgd, unsigned long addr, unsigned long end,
- pgprot_t newprot, unsigned long cp_flags)
+static inline unsigned long change_p4d_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, pgd_t *pgd, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
p4d_t *p4d;
unsigned long next;
@@ -324,44 +332,40 @@ static inline unsigned long change_p4d_range(struct vm_area_struct *vma,
next = p4d_addr_end(addr, end);
if (p4d_none_or_clear_bad(p4d))
continue;
- pages += change_pud_range(vma, p4d, addr, next, newprot,
+ pages += change_pud_range(tlb, vma, p4d, addr, next, newprot,
cp_flags);
} while (p4d++, addr = next, addr != end);

return pages;
}

-static unsigned long change_protection_range(struct vm_area_struct *vma,
- unsigned long addr, unsigned long end, pgprot_t newprot,
- unsigned long cp_flags)
+static unsigned long change_protection_range(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long addr,
+ unsigned long end, pgprot_t newprot, unsigned long cp_flags)
{
struct mm_struct *mm = vma->vm_mm;
pgd_t *pgd;
unsigned long next;
- unsigned long start = addr;
unsigned long pages = 0;

BUG_ON(addr >= end);
pgd = pgd_offset(mm, addr);
- flush_cache_range(vma, addr, end);
- inc_tlb_flush_pending(mm);
+ tlb_start_vma(tlb, vma);
do {
next = pgd_addr_end(addr, end);
if (pgd_none_or_clear_bad(pgd))
continue;
- pages += change_p4d_range(vma, pgd, addr, next, newprot,
+ pages += change_p4d_range(tlb, vma, pgd, addr, next, newprot,
cp_flags);
} while (pgd++, addr = next, addr != end);

- /* Only flush the TLB if we actually modified any entries: */
- if (pages)
- flush_tlb_range(vma, start, end);
- dec_tlb_flush_pending(mm);
+ tlb_end_vma(tlb, vma);

return pages;
}

-unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
+unsigned long change_protection(struct mmu_gather *tlb,
+ struct vm_area_struct *vma, unsigned long start,
unsigned long end, pgprot_t newprot,
unsigned long cp_flags)
{
@@ -372,7 +376,7 @@ unsigned long change_protection(struct vm_area_struct *vma, unsigned long start,
if (is_vm_hugetlb_page(vma))
pages = hugetlb_change_protection(vma, start, end, newprot);
else
- pages = change_protection_range(vma, start, end, newprot,
+ pages = change_protection_range(tlb, vma, start, end, newprot,
cp_flags);

return pages;
@@ -406,8 +410,9 @@ static const struct mm_walk_ops prot_none_walk_ops = {
};

int
-mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
- unsigned long start, unsigned long end, unsigned long newflags)
+mprotect_fixup(struct mmu_gather *tlb, struct vm_area_struct *vma,
+ struct vm_area_struct **pprev, unsigned long start,
+ unsigned long end, unsigned long newflags)
{
struct mm_struct *mm = vma->vm_mm;
unsigned long oldflags = vma->vm_flags;
@@ -494,7 +499,7 @@ mprotect_fixup(struct vm_area_struct *vma, struct vm_area_struct **pprev,
dirty_accountable = vma_wants_writenotify(vma, vma->vm_page_prot);
vma_set_page_prot(vma);

- change_protection(vma, start, end, vma->vm_page_prot,
+ change_protection(tlb, vma, start, end, vma->vm_page_prot,
dirty_accountable ? MM_CP_DIRTY_ACCT : 0);

/*
@@ -528,6 +533,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
const int grows = prot & (PROT_GROWSDOWN|PROT_GROWSUP);
const bool rier = (current->personality & READ_IMPLIES_EXEC) &&
(prot & PROT_READ);
+ struct mmu_gather tlb;

start = untagged_addr(start);

@@ -584,6 +590,7 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if (start > vma->vm_start)
prev = vma;

+ tlb_gather_mmu(&tlb, current->mm);
for (nstart = start ; ; ) {
unsigned long mask_off_old_flags;
unsigned long newflags;
@@ -610,18 +617,18 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
/* newflags >> 4 shift VM_MAY% in place of VM_% */
if ((newflags & ~(newflags >> 4)) & VM_ACCESS_FLAGS) {
error = -EACCES;
- goto out;
+ goto out_tlb;
}

/* Allow architectures to sanity-check the new flags */
if (!arch_validate_flags(newflags)) {
error = -EINVAL;
- goto out;
+ goto out_tlb;
}

error = security_file_mprotect(vma, reqprot, prot);
if (error)
- goto out;
+ goto out_tlb;

tmp = vma->vm_end;
if (tmp > end)
@@ -630,27 +637,29 @@ static int do_mprotect_pkey(unsigned long start, size_t len,
if (vma->vm_ops && vma->vm_ops->mprotect) {
error = vma->vm_ops->mprotect(vma, nstart, tmp, newflags);
if (error)
- goto out;
+ goto out_tlb;
}

- error = mprotect_fixup(vma, &prev, nstart, tmp, newflags);
+ error = mprotect_fixup(&tlb, vma, &prev, nstart, tmp, newflags);
if (error)
- goto out;
+ goto out_tlb;

nstart = tmp;

if (nstart < prev->vm_end)
nstart = prev->vm_end;
if (nstart >= end)
- goto out;
+ goto out_tlb;

vma = prev->vm_next;
if (!vma || vma->vm_start != nstart) {
error = -ENOMEM;
- goto out;
+ goto out_tlb;
}
prot = reqprot;
}
+out_tlb:
+ tlb_finish_mmu(&tlb);
out:
mmap_write_unlock(current->mm);
return error;
diff --git a/mm/userfaultfd.c b/mm/userfaultfd.c
index ac6f036298cd..15a20bb35868 100644
--- a/mm/userfaultfd.c
+++ b/mm/userfaultfd.c
@@ -16,6 +16,7 @@
#include <linux/hugetlb.h>
#include <linux/shmem_fs.h>
#include <asm/tlbflush.h>
+#include <asm/tlb.h>
#include "internal.h"

static __always_inline
@@ -674,6 +675,7 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
atomic_t *mmap_changing)
{
struct vm_area_struct *dst_vma;
+ struct mmu_gather tlb;
pgprot_t newprot;
int err;

@@ -715,8 +717,10 @@ int mwriteprotect_range(struct mm_struct *dst_mm, unsigned long start,
else
newprot = vm_get_page_prot(dst_vma->vm_flags);

- change_protection(dst_vma, start, start + len, newprot,
+ tlb_gather_mmu(&tlb, dst_mm);
+ change_protection(&tlb, dst_vma, start, start + len, newprot,
enable_wp ? MM_CP_UFFD_WP : MM_CP_UFFD_WP_RESOLVE);
+ tlb_finish_mmu(&tlb);

err = 0;
out_unlock:
--
2.25.1

2021-10-22 03:07:01

by Andrew Morton

Subject: Re: [PATCH v2 0/5] mm/mprotect: avoid unnecessary TLB flushes

On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:

> This patch-set is intended to remove unnecessary TLB flushes. It is
> based on feedback from v1 and several bugs I found in v1 myself.
>
> Basically, there are 3 optimizations in this patch-set:
> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
> prevent the A/D bits from changing.
> 2. Use TLB batching infrastructure to batch flushes across VMAs and
> do better/fewer flushes.
> 3. Avoid TLB flushes on permission demotion.
>
> Andrea asked for the aforementioned (2) to come after (3), but this
> is not simple (specifically since change_prot_numa() needs the number
> of pages affected).

[1/5] appears to be a significant fix which should probably be
backported into -stable kernels. If you agree with this then I suggest
it be prepared as a standalone patch, separate from the other four
patches. With a cc:stable.

And the remaining patches are a performance optimization. Has any
attempt been made to quantify the benefits?

2021-10-22 22:00:22

by Nadav Amit

Subject: Re: [PATCH v2 0/5] mm/mprotect: avoid unnecessary TLB flushes



> On Oct 21, 2021, at 8:04 PM, Andrew Morton <[email protected]> wrote:
>
> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:
>
>> This patch-set is intended to remove unnecessary TLB flushes. It is
>> based on feedback from v1 and several bugs I found in v1 myself.
>>
>> Basically, there are 3 optimizations in this patch-set:
>> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
>> prevent the A/D bits from changing.
>> 2. Use TLB batching infrastructure to batch flushes across VMAs and
>> do better/fewer flushes.
>> 3. Avoid TLB flushes on permission demotion.
>>
>> Andrea asked for the aforementioned (2) to come after (3), but this
>> is not simple (specifically since change_prot_numa() needs the number
>> of pages affected).
>
> [1/5] appears to be a significant fix which should probably be
> backported into -stable kernels. If you agree with this then I suggest
> it be prepared as a standalone patch, separate from the other four
> patches. With a cc:stable.


There is no functionality bug in the kernel. The Knights Landing bug
was circumvented eventually by changing the swap entry structure so
the access/dirty bits would not overlap with the swap entry data.

>
> And the remaining patches are a performance optimization. Has any
> attempt been made to quantify the benefits?

I included some data before [1]. In general the cost that is saved
is the cost of a TLB flush/shootdown.

I will modify my benchmark to test huge-pages (which were not
included in the previous patch-set) and send results later. I would
also try nodejs to see if there is a significant enough benefit.
Nodejs crashed before (hence the 3rd patch added here), as it
exec-protects/unprotects pages - I will see if the benefit shows in
the benchmarks.

[ The motivation behind the patches is to later introduce userfaultfd
writeprotectv interface, and for my use-case, which is under
development, this proved to improve performance considerably. ]



[1] https://lore.kernel.org/linux-mm/[email protected]/

2021-10-25 11:34:23

by Peter Zijlstra

Subject: Re: [PATCH v2 0/5] mm/mprotect: avoid unnecessary TLB flushes

On Thu, Oct 21, 2021 at 08:04:50PM -0700, Andrew Morton wrote:
> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:
>
> > This patch-set is intended to remove unnecessary TLB flushes. It is
> > based on feedback from v1 and several bugs I found in v1 myself.
> >
> > Basically, there are 3 optimizations in this patch-set:
> > 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
> > prevent the A/D bits from changing.
> > 2. Use TLB batching infrastructure to batch flushes across VMAs and
> > do better/fewer flushes.
> > 3. Avoid TLB flushes on permission demotion.
> >
> > Andrea asked for the aforementioned (2) to come after (3), but this
> > is not simple (specifically since change_prot_numa() needs the number
> > of pages affected).
>
> [1/5] appears to be a significant fix which should probably be
> backported into -stable kernels. If you agree with this then I suggest
> it be prepared as a standalone patch, separate from the other four
> patches. With a cc:stable.

I am confused, 1/5 doesn't actually do *anything*. I also cannot find
any further usage of the introduced X86_BUG_PTE_LEAK.

I'm thinking patch #2 means to have something like:

if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);

In the newly minted: pmdp_invalidate_ad(), but alas, nothing there.

2021-10-25 11:34:34

by Peter Zijlstra

Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()

On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 448cd01eb3ec..18c3366f8f4d 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
> }
> }
> #endif
> +
> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
> + unsigned long address, pmd_t *pmdp)
> +{
> + return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));

Did this want to be something like:

pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
return old;

instead?

> +}
> +
> /*
> * Page table pages are page-aligned. The lower half of the top
> * level is used for userspace and the top half for the kernel.

> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
> index e5ea5f775d5c..435da011b1a2 100644
> --- a/mm/huge_memory.c
> +++ b/mm/huge_memory.c
> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
> * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
> * which may break userspace.
> *
> - * pmdp_invalidate() is required to make sure we don't miss
> - * dirty/young flags set by hardware.
> + * pmdp_invalidate_ad() is required to make sure we don't miss
> + * dirty/young flags (which are also known as access/dirty) cannot be
> + * further modifeid by the hardware.

"modified", I think is the more common spelling.

> */
> - entry = pmdp_invalidate(vma, addr, pmd);
> + entry = pmdp_invalidate_ad(vma, addr, pmd);
>
> entry = pmd_modify(entry, newprot);
> if (preserve_write)
> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
> index 4e640baf9794..b0ce6c7391bf 100644
> --- a/mm/pgtable-generic.c
> +++ b/mm/pgtable-generic.c
> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
> }
> #endif
>
> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD

/*
* Does this deserve a comment to explain the intended difference vs
* pmdp_invalidate() ?
*/

> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmdp)
> +{
> + return pmdp_invalidate(vma, address, pmdp);
> +}
> +#endif
> +
> #ifndef pmdp_collapse_flush
> pmd_t pmdp_collapse_flush(struct vm_area_struct *vma, unsigned long address,
> pmd_t *pmdp)
> --
> 2.25.1
>

2021-10-25 11:46:16

by Peter Zijlstra

Subject: Re: [PATCH v2 5/5] mm/mprotect: do not flush on permission promotion

On Thu, Oct 21, 2021 at 05:21:12AM -0700, Nadav Amit wrote:
> +/*
> + * pte_may_need_flush() checks whether permissions were demoted and require a
> + * flush. It should only be used for userspace PTEs.
> + */
> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
> +{
> + /* new is non-present: need only if old is present */
> + if (!pte_present(newpte))
> + return pte_present(oldpte);
> +
> + /* old is not present: no need for flush */
> + if (!pte_present(oldpte))
> + return false;

Would it not be clearer to write the above like:

/* !PRESENT -> * ; no need for flush */
if (!pte_present(oldpte))
return false;

/* PRESENT -> !PRESENT ; needs flush */
if (!pte_present(newpte))
return true;

?


> diff --git a/mm/mprotect.c b/mm/mprotect.c
> index 0f5c87af5c60..6179c82ea72d 100644
> --- a/mm/mprotect.c
> +++ b/mm/mprotect.c
> @@ -141,7 +141,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
> ptent = pte_mkwrite(ptent);
> }
> ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
> - tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> + if (pte_may_need_flush(oldpte, ptent))
> + tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
> pages++;
> } else if (is_swap_pte(oldpte)) {
> swp_entry_t entry = pte_to_swp_entry(oldpte);

One question on naming, "may_need" sounds a bit washy to me, either it
does or it does not. I suppose you're trying to convey the fact that we
ought to err towards too many TLBi rather than too few, but that's
always true.

That is, would "needs" not be a better name?

2021-10-25 14:34:50

by Dave Hansen

Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault

On 10/25/21 3:59 AM, Peter Zijlstra wrote:
>> Add a check to prevent access_error() from returning mistakenly that
>> page-faults due to instruction fetch are not allowed. Intel SDM does not
>> indicate whether "instruction fetch" and "write" in the hardware error
>> code are mutually exclusive, so check both before returning whether the
>> access is allowed.
> Dave, can we get that clarified? It seems a bit naf and leads to
> confusing code IMO.

We can, but there are quite a few implicit relationships in those bits.
PF_INSN and PF_PK can't ever be set together, for instance. It's
pretty clear as long as you have fetch==read in your head.

2021-10-25 14:35:52

by Dave Hansen

Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault

On 10/21/21 5:21 AM, Nadav Amit wrote:
> access_error() currently does not check for execution permission
> violation.
Ye

> As a result, spurious page-faults due to execution permission
> violation cause SIGSEGV.

While I could totally believe that something is goofy when VMAs are
being changed underneath a page fault, I'm having trouble figuring out
why the "if (error_code & X86_PF_WRITE)" code is being modified.

> It appears not to be an issue so far, but the next patches avoid TLB
> flushes on permission promotion, which can lead to this scenario. nodejs
> for instance crashes when TLB flush is avoided on permission promotion.

Just to be clear, "promotion" is going from something like:

W=0->W=1
or
NX=1->NX=0

right? I tend to call that "relaxing" permissions.

Currently, X86_PF_WRITE faults are considered an access error unless the
VMA to which the write occurred allows writes. Returning "no access
error" permits continuing and handling the copy-on-write.

It sounds like you want to expand that. You want to add a whole class
of new faults that can be ignored: not just that some COW handling might
be necessary, but that the PTE itself might be out of date. Just like
a "COW fault" may just result in setting the PTE.W=1 and moving on with
our day, an instruction fault might now just end up with setting
PTE.NX=0 and also moving on with our day.

I'm really confused why the "error_code & X86_PF_WRITE" case is getting
modified. I would have expected it to be something like just adding:

/* read, instruction fetch */
if (error_code & X86_PF_INSN) {
/* Avoid enforcing access error if spurious: */
if (unlikely(!(vma->vm_flags & VM_EXEC)))
return 1;
return 0;
}

I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common
other than both being able to (now) be generated spuriously.

2021-10-25 16:29:36

by Nadav Amit

Subject: Re: [PATCH v2 5/5] mm/mprotect: do not flush on permission promotion



> On Oct 25, 2021, at 4:12 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 21, 2021 at 05:21:12AM -0700, Nadav Amit wrote:
>> +/*
>> + * pte_may_need_flush() checks whether permissions were demoted and require a
>> + * flush. It should only be used for userspace PTEs.
>> + */
>> +static inline bool pte_may_need_flush(pte_t oldpte, pte_t newpte)
>> +{
>> + /* new is non-present: need only if old is present */
>> + if (!pte_present(newpte))
>> + return pte_present(oldpte);
>> +
>> + /* old is not present: no need for flush */
>> + if (!pte_present(oldpte))
>> + return false;
>
> Would it not be clearer to write the above like:
>
> /* !PRESENT -> * ; no need for flush */
> if (!pte_present(oldpte))
> return false;
>
> /* PRESENT -> !PRESENT ; needs flush */
> if (!pte_present(newpte))
> return true;
>
> ?

I will change the comment to yours. Thanks.

>
>
>> diff --git a/mm/mprotect.c b/mm/mprotect.c
>> index 0f5c87af5c60..6179c82ea72d 100644
>> --- a/mm/mprotect.c
>> +++ b/mm/mprotect.c
>> @@ -141,7 +141,8 @@ static unsigned long change_pte_range(struct mmu_gather *tlb,
>> ptent = pte_mkwrite(ptent);
>> }
>> ptep_modify_prot_commit(vma, addr, pte, oldpte, ptent);
>> - tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> + if (pte_may_need_flush(oldpte, ptent))
>> + tlb_flush_pte_range(tlb, addr, PAGE_SIZE);
>> pages++;
>> } else if (is_swap_pte(oldpte)) {
>> swp_entry_t entry = pte_to_swp_entry(oldpte);
>
> One question on naming, "may_need" sounds a bit washy to me, either it
> does or it does not. I suppose you're trying to convey the fact that we
> ought to err towards too many TLBi rather than too few, but that's
> always true.
>
> That is, would "needs" not be a better name?

The “may” is indeed intended to make clear that the function can err
towards too many TLB flushes (of any kind). For instance, in a change
from (!dirty|write)->(!write), no flush is needed in theory. I was too
chicken to add it, at least for now.

I can change the name and indicate in the comment instead though.

2021-10-25 16:30:56

by Nadav Amit

Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()



> On Oct 25, 2021, at 3:52 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 21, 2021 at 05:21:09AM -0700, Nadav Amit wrote:
>> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
>> index 448cd01eb3ec..18c3366f8f4d 100644
>> --- a/arch/x86/include/asm/pgtable.h
>> +++ b/arch/x86/include/asm/pgtable.h
>> @@ -1146,6 +1146,14 @@ static inline pmd_t pmdp_establish(struct vm_area_struct *vma,
>> }
>> }
>> #endif
>> +
>> +#define __HAVE_ARCH_PMDP_INVALIDATE_AD
>> +static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
>> + unsigned long address, pmd_t *pmdp)
>> +{
>> + return pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
>
> Did this want to be something like:
>
> pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> return old;
>
> instead?

Yes. Of course. Where did my code go to? :(

>
>> +}
>> +
>> /*
>> * Page table pages are page-aligned. The lower half of the top
>> * level is used for userspace and the top half for the kernel.
>
>> diff --git a/mm/huge_memory.c b/mm/huge_memory.c
>> index e5ea5f775d5c..435da011b1a2 100644
>> --- a/mm/huge_memory.c
>> +++ b/mm/huge_memory.c
>> @@ -1795,10 +1795,11 @@ int change_huge_pmd(struct vm_area_struct *vma, pmd_t *pmd,
>> * The race makes MADV_DONTNEED miss the huge pmd and don't clear it
>> * which may break userspace.
>> *
>> - * pmdp_invalidate() is required to make sure we don't miss
>> - * dirty/young flags set by hardware.
>> + * pmdp_invalidate_ad() is required to make sure we don't miss
>> + * dirty/young flags (which are also known as access/dirty) cannot be
>> + * further modifeid by the hardware.
>
> "modified", I think is the more common spelling.

I tried to start a new trend. I will fix it.

>
>> */
>> - entry = pmdp_invalidate(vma, addr, pmd);
>> + entry = pmdp_invalidate_ad(vma, addr, pmd);
>>
>> entry = pmd_modify(entry, newprot);
>> if (preserve_write)
>> diff --git a/mm/pgtable-generic.c b/mm/pgtable-generic.c
>> index 4e640baf9794..b0ce6c7391bf 100644
>> --- a/mm/pgtable-generic.c
>> +++ b/mm/pgtable-generic.c
>> @@ -200,6 +200,14 @@ pmd_t pmdp_invalidate(struct vm_area_struct *vma, unsigned long address,
>> }
>> #endif
>>
>> +#ifndef __HAVE_ARCH_PMDP_INVALIDATE_AD
>
> /*
> * Does this deserve a comment to explain the intended difference vs
> * pmdp_invalidate() ?
> */

I will add a comment.
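
[ Putting the two comments above together, the x86 helper would
  presumably end up looking something like this (a sketch based on this
  thread, not the posted code):

	/*
	 * pmdp_invalidate_ad() invalidates the PMD while ensuring the
	 * hardware can no longer update the access/dirty bits. Unlike
	 * pmdp_invalidate(), no TLB flush is needed on CPUs that set A/D
	 * atomically with the present-bit check; Knights Landing
	 * (X86_BUG_PTE_LEAK) still needs the flush.
	 */
	static inline pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma,
			unsigned long address, pmd_t *pmdp)
	{
		pmd_t old = pmdp_establish(vma, address, pmdp,
					   pmd_mkinvalid(*pmdp));

		if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
			flush_pmd_tlb_range(vma, address,
					    address + HPAGE_PMD_SIZE);
		return old;
	}
]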

2021-10-25 16:47:48

by Nadav Amit

Subject: Re: [PATCH v2 0/5] mm/mprotect: avoid unnecessary TLB flushes



> On Oct 25, 2021, at 3:50 AM, Peter Zijlstra <[email protected]> wrote:
>
> On Thu, Oct 21, 2021 at 08:04:50PM -0700, Andrew Morton wrote:
>> On Thu, 21 Oct 2021 05:21:07 -0700 Nadav Amit <[email protected]> wrote:
>>
>>> This patch-set is intended to remove unnecessary TLB flushes. It is
>>> based on feedback from v1 and several bugs I found in v1 myself.
>>>
>>> Basically, there are 3 optimizations in this patch-set:
>>> 1. Avoiding TLB flushes on change_huge_pmd() that are only needed to
>>> prevent the A/D bits from changing.
>>> 2. Use TLB batching infrastructure to batch flushes across VMAs and
>>> do better/fewer flushes.
>>> 3. Avoid TLB flushes on permission demotion.
>>>
>>> Andrea asked for the aforementioned (2) to come after (3), but this
>>> is not simple (specifically since change_prot_numa() needs the number
>>> of pages affected).
>>
>> [1/5] appears to be a significant fix which should probably be
>> backported into -stable kernels. If you agree with this then I suggest
>> it be prepared as a standalone patch, separate from the other four
>> patches. With a cc:stable.
>
> I am confused, 1/5 doesn't actually do *anything*. I also cannot find
> any further usage of the introduced X86_BUG_PTE_LEAK.
>
> I'm thinking patch #2 means to have something like:
>
> if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
>
> In the newly minted: pmdp_invalidate_ad(), but alas, nothing there.

This change was only intended for pmdp_invalidate_ad() but somehow
got lost. I will add it there.

I eventually did not add the optimization to avoid TLB flushes on
(!dirty|write)->!write so I did not use it for the first case that
you mentioned. I am too afraid, although I think this is correct.
Perhaps I will add it as a separate patch.

2021-10-25 17:47:42

by Dave Hansen

Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault

On 10/25/21 9:19 AM, Nadav Amit wrote:
> That was my first version, but I was concerned that perhaps there is
> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
> be set. That is the reason that Peter asked you whether this is
> something that might happen.
>
> If you confirm they cannot be both set, I would use the version you just
> mentioned.

I'm pretty sure they can't be set together on any sane hardware. A
bonkers hypervisor or CPU could do it of course, but they'd be crazy.

BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
That would be a nice place to talk about the assumption.

2021-10-25 17:54:19

by Nadav Amit

Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault



> On Oct 25, 2021, at 10:45 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/25/21 9:19 AM, Nadav Amit wrote:
>> That was my first version, but I was concerned that perhaps there is
>> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
>> be set. That is the reason that Peter asked you whether this is
>> something that might happen.
>>
>> If you confirm they cannot be both set, I would use the version you just
>> mentioned.
>
> I'm pretty sure they can't be set together on any sane hardware. A
> bonkers hypervisor or CPU could do it of course, but they'd be crazy.
>
> BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
> That would be a nice place to talk about the assumption.
>

I can do that. But be aware that if the assumption is broken, it might
lead to the application getting stuck in an infinite loop of
page-faults instead of receiving SIGSEGV.

2021-10-25 18:05:55

by Dave Hansen

Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault

On 10/25/21 10:51 AM, Nadav Amit wrote:
>> On Oct 25, 2021, at 10:45 AM, Dave Hansen <[email protected]> wrote:
>> On 10/25/21 9:19 AM, Nadav Amit wrote:
>>> That was my first version, but I was concerned that perhaps there is
>>> some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
>>> be set. That is the reason that Peter asked you whether this is
>>> something that might happen.
>>>
>>> If you confirm they cannot be both set, I would the version you just
>>> mentioned.
>> I'm pretty sure they can't be set together on any sane hardware. A
>> bonkers hypervisor or CPU could do it of course, but they'd be crazy.
>>
>> BTW, feel free to add a WARN_ON_ONCE() if WRITE and INSN are both set.
>> That would be a nice place to talk about the assumption.
>>
> I can do that. But be aware that if the assumption is broken, it might
> lead to the application getting stuck in an infinite loop of
> page-faults instead of receiving SIGSEGV.

If we have a bonkers hypervisor/CPU, I'm OK with a process that hangs
like that, especially if we can ^C it and see its stream of page faults
with tracing or whatever.

Couldn't we just also do:

if ((code & (X86_PF_WRITE|X86_PF_INSN)) ==
(X86_PF_WRITE|X86_PF_INSN)) {
WARN_ON_ONCE(1);
return 1;
}

That should give you the WARN_ON_ONCE() and also return an affirmative
access_error(), resulting in a SIGSEGV.

(I'm not sure I like the indentation as I wrote it here... just do what
looks best in the code)
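
[ Combining the two suggestions, access_error() might then look roughly
  like this (a sketch only, spelled with the X86_PF_INSTR macro the patch
  uses, not the posted code):

	/* WRITE and INSTR should never be set together on sane hardware: */
	if (WARN_ON_ONCE((error_code & (X86_PF_WRITE | X86_PF_INSTR)) ==
			 (X86_PF_WRITE | X86_PF_INSTR)))
		return 1;

	if (error_code & X86_PF_INSTR) {
		/* Avoid enforcing an access error if the fault may be spurious: */
		if (unlikely(!(vma->vm_flags & VM_EXEC)))
			return 1;
		return 0;
	}

	if (error_code & X86_PF_WRITE) {
		/* write, present and write, not present: */
		if (unlikely(!(vma->vm_flags & VM_WRITE)))
			return 1;
		return 0;
	}
]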

2021-10-25 18:07:02

by Peter Zijlstra

Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault

On Thu, Oct 21, 2021 at 05:21:10AM -0700, Nadav Amit wrote:
> From: Nadav Amit <[email protected]>

> Add a check to prevent access_error() from returning mistakenly that
> page-faults due to instruction fetch are not allowed. Intel SDM does not
> indicate whether "instruction fetch" and "write" in the hardware error
> code are mutually exclusive, so check both before returning whether the
> access is allowed.

Dave, can we get that clarified? It seems a bit naf and leads to
confusing code IMO.

Other than that, the change looks ok to me.

> diff --git a/arch/x86/mm/fault.c b/arch/x86/mm/fault.c
> index b2eefdefc108..e776130473ce 100644
> --- a/arch/x86/mm/fault.c
> +++ b/arch/x86/mm/fault.c
> @@ -1100,10 +1100,17 @@ access_error(unsigned long error_code, struct vm_area_struct *vma)
> (error_code & X86_PF_INSTR), foreign))
> return 1;
>
> - if (error_code & X86_PF_WRITE) {
> + if (error_code & (X86_PF_WRITE | X86_PF_INSTR)) {
> /* write, present and write, not present: */
> - if (unlikely(!(vma->vm_flags & VM_WRITE)))
> + if ((error_code & X86_PF_WRITE) &&
> + unlikely(!(vma->vm_flags & VM_WRITE)))
> return 1;
> +
> + /* exec, present and exec, not present: */
> + if ((error_code & X86_PF_INSTR) &&
> + unlikely(!(vma->vm_flags & VM_EXEC)))
> + return 1;
> +
> return 0;
> }



2021-10-25 19:12:33

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault



> On Oct 25, 2021, at 7:20 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> access_error() currently does not check for execution permission
>> violation.
> Yes.
>
>> As a result, spurious page-faults due to execution permission
>> violation cause SIGSEGV.
>
> While I could totally believe that something is goofy when VMAs are
> being changed underneath a page fault, I'm having trouble figuring out
> why the "if (error_code & X86_PF_WRITE)" code is being modified.

In the scenario I mentioned the VMAs are not changed underneath the
page-fault. They change *before* the page-fault, but there are
residues of the old PTE in the TLB.

>
>> It appears not to be an issue so far, but the next patches avoid TLB
>> flushes on permission promotion, which can lead to this scenario. nodejs
>> for instance crashes when TLB flush is avoided on permission promotion.
>
> Just to be clear, "promotion" is going from something like:
>
> W=0->W=1
> or
> NX=1->NX=0
>
> right? I tend to call that "relaxing" permissions.

I was specifically talking about NX=1->NX=0.

I can change the language to “relaxing”.

>
> Currently, X86_PF_WRITE faults are considered an access error unless the
> VMA to which the write occurred allows writes. Returning "no access
> error" permits continuing and handling the copy-on-write.
>
> It sounds like you want to expand that. You want to add a whole class
> of new faults that can be ignored: not just that some COW handling might
> be necessary, but that the PTE itself might be out of date. Just like
> a "COW fault" may just result in setting the PTE.W=1 and moving on with
> our day, an instruction fault might now just end up with setting
> PTE.NX=0 and also moving on with our day.

You raise an interesting idea (which can easily be implemented with uffd),
but no - I had none of that in mind.

My only purpose is to deal with actual spurious page-faults that I
encountered when I removed the TLB flush that happens after NX=1->NX=0.

I am actually surprised that the kernel makes such a strong assumption
that every change of NX=1->NX=0 would be followed by a TLB flush, and
that during these changes the mm is locked for write. But that is the
case. If you do not have this change and a PTE is changed from
NX=1->NX=0 and *later* you access the page, you can have a page-fault
due to a stale PTE, and get a SIGSEGV since access_error() wrongly
assumes that this is an invalid access.

I did not change that, and there are no changes to the VMA during the
page-fault. The page-fault handler would do pretty much nothing and
return to user-space, which would retry the instruction. [ The page-fault
itself triggers an implicit TLB flush of the offending PTE. ]

>
> I'm really confused why the "error_code & X86_PF_WRITE" case is getting
> modified. I would have expected it to be something like just adding:
>
> /* read, instruction fetch */
> if (error_code & X86_PF_INSN) {
> /* Avoid enforcing access error if spurious: */
> if (unlikely(!(vma->vm_flags & VM_EXEC)))
> return 1;
> return 0;
> }
>
> I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common
> other than both being able to (now) be generated spuriously.

That was my first version, but I was concerned that perhaps there is
some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can
be set. That is the reason that Peter asked you whether this is
something that might happen.

If you confirm they cannot both be set, I would use the version you just
mentioned.

2021-10-26 21:18:23

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] x86: Detection of Knights Landing A/D leak

On 10/21/21 5:21 AM, Nadav Amit wrote:
> --- a/arch/x86/kernel/cpu/intel.c
> +++ b/arch/x86/kernel/cpu/intel.c
> @@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
> }
> }
>
> + if (c->x86_model == 87) {
> + pr_info_once("Enabling PTE leaking workaround\n");
> + set_cpu_bug(c, X86_BUG_PTE_LEAK);
> + }

Please take a look at:

arch/x86/include/asm/intel-family.h

specifically:

#define INTEL_FAM6_XEON_PHI_KNL 0x57 /* Knights Landing */
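
i.e. something like the below, which is just your hunk with the magic
number replaced by the existing define (sketch only; whether you also
want an explicit family check is your call):

	if (c->x86_model == INTEL_FAM6_XEON_PHI_KNL) {
		pr_info_once("Enabling PTE leaking workaround\n");
		set_cpu_bug(c, X86_BUG_PTE_LEAK);
	}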

2021-10-26 21:23:00

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 1/5] x86: Detection of Knights Landing A/D leak



> On Oct 26, 2021, at 8:54 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> --- a/arch/x86/kernel/cpu/intel.c
>> +++ b/arch/x86/kernel/cpu/intel.c
>> @@ -296,6 +296,11 @@ static void early_init_intel(struct cpuinfo_x86 *c)
>> }
>> }
>>
>> + if (c->x86_model == 87) {
>> + pr_info_once("Enabling PTE leaking workaround\n");
>> + set_cpu_bug(c, X86_BUG_PTE_LEAK);
>> + }
>
> Please take a look at:
>
> arch/x86/include/asm/intel-family.h
>
> specifically:
>
> #define INTEL_FAM6_XEON_PHI_KNL 0x57 /* Knights Landing */

Thanks, I will fix it. I really just copy-pasted from Andi’s patch
(for better or worse).

2021-10-26 21:37:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()

On 10/21/21 5:21 AM, Nadav Amit wrote:
> The first TLB flush is only necessary to prevent the dirty bit (and with
> a lesser importance the access bit) from changing while the PTE is
> modified. However, this is not necessary as the x86 CPUs set the
> dirty-bit atomically with an additional check that the PTE is (still)
> present. One caveat is Intel's Knights Landing that has a bug and does
> not do so.

First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
see it anywhere.

> - * pmdp_invalidate() is required to make sure we don't miss
> - * dirty/young flags set by hardware.

This got me thinking... In here:

> https://lore.kernel.org/lkml/[email protected]/

I wrote:

> These bits are truly "stray". In the case of the Dirty bit, the
> thread associated with the stray set was *not* allowed to write to
> the page. This means that we do not have to launder the bit(s); we
> can simply ignore them.

Is the goal of your proposed patch here to ensure that the dirty bit is
not set at *all*? Or, is it to ensure that a dirty bit which we need to
*launder* is never set?

2021-10-26 22:19:15

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 0/5] mm/mprotect: avoid unnecessary TLB flushes

On 10/22/21 2:58 PM, Nadav Amit wrote:
>> [1/5] appears to be a significant fix which should probably be
>> backported into -stable kernels. If you agree with this then I suggest
>> it be prepared as a standalone patch, separate from the other four
>> patches. With a cc:stable.
>
> There is no functionality bug in the kernel. The Knights Landing bug
> was circumvented eventually by changing the swap entry structure so
> the access/dirty bits would not overlap with the swap entry data.

Yeah, it was a significant issue, but we fixed it in here:

> commit 00839ee3b299303c6a5e26a0a2485427a3afcbbf
> Author: Dave Hansen <[email protected]>
> Date: Thu Jul 7 17:19:11 2016 -0700
>
> x86/mm: Move swap offset/type up in PTE to work around erratum

2021-10-27 01:13:26

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()



> On Oct 26, 2021, at 9:06 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/21/21 5:21 AM, Nadav Amit wrote:
>> The first TLB flush is only necessary to prevent the dirty bit (and with
>> a lesser importance the access bit) from changing while the PTE is
>> modified. However, this is not necessary as the x86 CPUs set the
>> dirty-bit atomically with an additional check that the PTE is (still)
>> present. One caveat is Intel's Knights Landing that has a bug and does
>> not do so.
>
> First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
> see it anywhere.

No, it is me who missed it. It should have been in pmdp_invalidate_ad():

diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
index 3481b35cb4ec..f14f64cc17b5 100644
--- a/arch/x86/mm/pgtable.c
+++ b/arch/x86/mm/pgtable.c
@@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
return 0;
}

+/*
+ * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
+ * updated by the CPU.
+ *
+ * Returns the original PTE.
+ *
+ * During an access to a page, x86 CPUs set the dirty and access bit atomically
+ * with an additional check of the present-bit. Therefore, it is possible to
+ * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
+ *
+ * We do not make this optimization on certain CPUs that have a bug that violates
+ * this behavior (specifically Knights Landing).
+ */
+pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
+ pmd_t *pmdp)
+{
+ pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
+
+ if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
+ flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
+ return old;
+}
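
For context, the intent is that change_huge_pmd() calls this helper
instead of pmdp_invalidate(), which drops the unconditional TLB flush
that pmdp_invalidate() performs. Roughly (a simplified sketch, not the
actual patch hunk):

	oldpmd = pmdp_invalidate_ad(vma, addr, pmd);
	entry = pmd_modify(oldpmd, newprot);
	set_pmd_at(mm, addr, pmd, entry);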

>
>> - * pmdp_invalidate() is required to make sure we don't miss
>> - * dirty/young flags set by hardware.
>
> This got me thinking... In here:
>
>> https://lore.kernel.org/lkml/[email protected]/
>
> I wrote:
>
>> These bits are truly "stray". In the case of the Dirty bit, the
>> thread associated with the stray set was *not* allowed to write to
>> the page. This means that we do not have to launder the bit(s); we
>> can simply ignore them.
>
> Is the goal of your proposed patch here to ensure that the dirty bit is
> not set at *all*? Or, is it to ensure that a dirty bit which we need to
> *launder* is never set?

At *all*.

Err… I remembered from our previous discussions that the dirty bit cannot
be set once the R/W bit is cleared atomically. But going back to the SDM,
I see the (relatively new?) note:

"If software on one logical processor writes to a page while software on
another logical processor concurrently clears the R/W flag in the
paging-structure entry that maps the page, execution on some processors may
result in the entry’s dirty flag being set (due to the write on the first
logical processor) and the entry’s R/W flag being clear (due to the update
to the entry on the second logical processor). This will never occur on a
processor that supports control-flow enforcement technology (CET)”

So I guess that this optimization can only be enabled when CET is enabled.

:(


2021-10-27 04:02:08

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()



> On Oct 26, 2021, at 9:47 AM, Nadav Amit <[email protected]> wrote:
>
>
>
>> On Oct 26, 2021, at 9:06 AM, Dave Hansen <[email protected]> wrote:
>>
>> On 10/21/21 5:21 AM, Nadav Amit wrote:
>>> The first TLB flush is only necessary to prevent the dirty bit (and with
>>> a lesser importance the access bit) from changing while the PTE is
>>> modified. However, this is not necessary as the x86 CPUs set the
>>> dirty-bit atomically with an additional check that the PTE is (still)
>>> present. One caveat is Intel's Knights Landing that has a bug and does
>>> not do so.
>>
>> First, did I miss the check in this patch for X86_BUG_PTE_LEAK? I don't
>> see it anywhere.
>
> No, it is me who missed it. It should have been in pmdp_invalidate_ad():
>
> diff --git a/arch/x86/mm/pgtable.c b/arch/x86/mm/pgtable.c
> index 3481b35cb4ec..f14f64cc17b5 100644
> --- a/arch/x86/mm/pgtable.c
> +++ b/arch/x86/mm/pgtable.c
> @@ -780,6 +780,30 @@ int pmd_clear_huge(pmd_t *pmd)
> return 0;
> }
>
> +/*
> + * pmdp_invalidate_ad() - prevents the access and dirty bits from being further
> + * updated by the CPU.
> + *
> + * Returns the original PTE.
> + *
> + * During an access to a page, x86 CPUs set the dirty and access bit atomically
> + * with an additional check of the present-bit. Therefore, it is possible to
> + * avoid the TLB flush if we change the PTE atomically, as pmdp_establish does.
> + *
> + * We do not make this optimization on certain CPUs that have a bug that violates
> + * this behavior (specifically Knights Landing).
> + */
> +pmd_t pmdp_invalidate_ad(struct vm_area_struct *vma, unsigned long address,
> + pmd_t *pmdp)
> +{
> + pmd_t old = pmdp_establish(vma, address, pmdp, pmd_mkinvalid(*pmdp));
> +
> + if (cpu_feature_enabled(X86_BUG_PTE_LEAK))
> + flush_pmd_tlb_range(vma, address, address + HPAGE_PMD_SIZE);
> + return old;
> +}
>
>>
>>> - * pmdp_invalidate() is required to make sure we don't miss
>>> - * dirty/young flags set by hardware.
>>
>> This got me thinking... In here:
>>
>>> https://lore.kernel.org/lkml/[email protected]/
>>
>> I wrote:
>>
>>> These bits are truly "stray". In the case of the Dirty bit, the
>>> thread associated with the stray set was *not* allowed to write to
>>> the page. This means that we do not have to launder the bit(s); we
>>> can simply ignore them.
>>
>> Is the goal of your proposed patch here to ensure that the dirty bit is
>> not set at *all*? Or, is it to ensure that a dirty bit which we need to
>> *launder* is never set?
>
> At *all*.
>
> Err… I remembered from our previous discussions that the dirty bit cannot
> be set once the R/W bit is cleared atomically. But going back to the SDM,
> I see the (relatively new?) note:
>
> "If software on one logical processor writes to a page while software on
> another logical processor concurrently clears the R/W flag in the
> paging-structure entry that maps the page, execution on some processors may
> result in the entry’s dirty flag being set (due to the write on the first
> logical processor) and the entry’s R/W flag being clear (due to the update
> to the entry on the second logical processor). This will never occur on a
> processor that supports control-flow enforcement technology (CET)”
>
> So I guess that this optimization can only be enabled when CET is enabled.
>
> :(

I still wonder whether the SDM comment applies to present bit vs dirty
bit atomicity as well.

On AMD’s APM I find:

"The processor never sets the Accessed bit or the Dirty bit for a not
present page (P = 0). The ordering of Accessed and Dirty bit updates
with respect to surrounding loads and stores is discussed below.”

( The later comment regards ordering to WC memory ).

I don’t know if I read it too creatively...

2021-10-27 07:08:36

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()

On 10/26/21 10:44 AM, Nadav Amit wrote:
>> "If software on one logical processor writes to a page while software on
>> another logical processor concurrently clears the R/W flag in the
>> paging-structure entry that maps the page, execution on some processors may
>> result in the entry’s dirty flag being set (due to the write on the first
>> logical processor) and the entry’s R/W flag being clear (due to the update
>> to the entry on the second logical processor). This will never occur on a
>> processor that supports control-flow enforcement technology (CET)”
>>
>> So I guess that this optimization can only be enabled when CET is enabled.
>>
>> :(
> I still wonder whether the SDM comment applies to present bit vs dirty
> bit atomicity as well.

I think it's implicit. From "4.8 ACCESSED AND DIRTY FLAGS":

"Whenever there is a write to a linear address, the processor
sets the dirty flag (if it is not already set) in the paging-
structure entry"

There can't be a "write to a linear address" without a Present=1 PTE.
If it were a Dirty=1,Present=1 PTE, there's no race because there might
not be a write to the PTE at all.

There's also this from the "4.10.4.3 Optional Invalidation" section:

"no TLB entry or paging-structure cache entry is created with
information from a paging-structure entry in which the P flag
is 0."

That means that we don't have to worry about the TLB doing something
bonkers like caching a Dirty=1 bit from a Present=0 PTE.

Is that what you were worried about?

2021-10-27 07:32:11

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()



> On Oct 26, 2021, at 11:44 AM, Dave Hansen <[email protected]> wrote:
>
> On 10/26/21 10:44 AM, Nadav Amit wrote:
>>> "If software on one logical processor writes to a page while software on
>>> another logical processor concurrently clears the R/W flag in the
>>> paging-structure entry that maps the page, execution on some processors may
>>> result in the entry’s dirty flag being set (due to the write on the first
>>> logical processor) and the entry’s R/W flag being clear (due to the update
>>> to the entry on the second logical processor). This will never occur on a
>>> processor that supports control-flow enforcement technology (CET)”
>>>
>>> So I guess that this optimization can only be enabled when CET is enabled.
>>>
>>> :(
>> I still wonder whether the SDM comment applies to present bit vs dirty
>> bit atomicity as well.
>
> I think it's implicit. From "4.8 ACCESSED AND DIRTY FLAGS":
>
> "Whenever there is a write to a linear address, the processor
> sets the dirty flag (if it is not already set) in the paging-
> structure entry"
>
> There can't be a "write to a linear address" without a Present=1 PTE.
> If it were a Dirty=1,Present=1 PTE, there's no race because there might
> not be a write to the PTE at all.
>
> There's also this from the "4.10.4.3 Optional Invalidation" section:
>
> "no TLB entry or paging-structure cache entry is created with
> information from a paging-structure entry in which the P flag
> is 0."
>
> That means that we don't have to worry about the TLB doing something
> bonkers like caching a Dirty=1 bit from a Present=0 PTE.
>
> Is that what you were worried about?

Thanks Dave, but no - that is not my concern.

To make it very clear - consider the following scenario, in which
a volatile pointer p is mapped using a certain PTE, which is RW
(i.e., *p is writable):

CPU0					CPU1
----					----
x = *p
[ PTE cached in TLB;
  PTE is not dirty ]
					clear_pte(PTE)
*p = x
[ needs to set dirty ]

Note that there is no TLB flush in this scenario. The question
is whether the write access to *p would succeed, setting the
dirty bit on the clear, non-present entry.

I was under the impression that the hardware AD-assist would
recheck the PTE atomically as it sets the dirty bit. But, as I
said, I am not sure anymore whether this is defined architecturally
(or at least would work in practice on all CPUs modulo the
Knights Landing thingy).
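
To spell out what the software side relies on (a sketch; on x86-64 with
SMP, pmdp_establish() boils down to an atomic xchg of the PMD):

	old = xchg(pmdp, pmd_mkinvalid(*pmdp));	/* pmdp_establish() */

	/*
	 * Any dirty/access bit that the CPU set before the xchg is
	 * carried in 'old' and preserved by the caller. The open
	 * question is whether the CPU can still set the dirty bit
	 * after the xchg, i.e. in the now non-present entry, based
	 * on the stale TLB entry on CPU0.
	 */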

2021-10-27 09:52:40

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()

On 10/26/21 12:06 PM, Nadav Amit wrote:
>
> To make it very clear - consider the following scenario, in which
> a volatile pointer p is mapped using a certain PTE, which is RW
> (i.e., *p is writable):
>
> CPU0					CPU1
> ----					----
> x = *p
> [ PTE cached in TLB;
>   PTE is not dirty ]
> 					clear_pte(PTE)
> *p = x
> [ needs to set dirty ]
>
> Note that there is no TLB flush in this scenario. The question
> is whether the write access to *p would succeed, setting the
> dirty bit on the clear, non-present entry.
>
> I was under the impression that the hardware AD-assist would
> recheck the PTE atomically as it sets the dirty bit. But, as I
> said, I am not sure anymore whether this is defined architecturally
> (or at least would work in practice on all CPUs modulo the
> Knights Landing thingy).

Practically, at "x=*p", the thing that gets cached in the TLB will have
Dirty=0. At the "*p=x", the CPU will decide it needs to do a write,
find the Dirty=0 entry and will entirely discard it. In other words, it
*acts* roughly like this:

x = *p
INVLPG(p)
*p = x;

Where the INVLPG() and the "*p=x" are atomic. So, there's no
_practical_ problem with your scenario. This specific behavior isn't
architectural as far as I know, though.

Although it's pretty much just academic, as for the architecture, are
you getting hung up on the difference between the description of "Accessed":

Whenever the processor uses a paging-structure entry as part of
linear-address translation, it sets the accessed flag in that
entry

and "Dirty:"

Whenever there is a write to a linear address, the processor
sets the dirty flag (if it is not already set) in the paging-
structure entry...

Accessed says "as part of linear-address translation", which means that
the address must have a translation. But, the "Dirty" section doesn't
say that. It talks about "a write to a linear address" but not whether
there is a linear address *translation* involved.

If that's it, we could probably add a bit like:

In addition to setting the accessed flag, whenever there is a
write...

before the dirty rules in the SDM.

Or am I being dense and continuing to miss your point? :)

2021-10-27 10:15:45

by Nadav Amit

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()



> On Oct 26, 2021, at 12:40 PM, Dave Hansen <[email protected]> wrote:
>
> On 10/26/21 12:06 PM, Nadav Amit wrote:
>>
>> To make it very clear - consider the following scenario, in which
>> a volatile pointer p is mapped using a certain PTE, which is RW
>> (i.e., *p is writable):
>>
>> CPU0					CPU1
>> ----					----
>> x = *p
>> [ PTE cached in TLB;
>>   PTE is not dirty ]
>> 					clear_pte(PTE)
>> *p = x
>> [ needs to set dirty ]
>>
>> Note that there is no TLB flush in this scenario. The question
>> is whether the write access to *p would succeed, setting the
>> dirty bit on the clear, non-present entry.
>>
>> I was under the impression that the hardware AD-assist would
>> recheck the PTE atomically as it sets the dirty bit. But, as I
>> said, I am not sure anymore whether this is defined architecturally
>> (or at least would work in practice on all CPUs modulo the
>> Knights Landing thingy).
>
> Practically, at "x=*p", the thing that gets cached in the TLB will have
> Dirty=0. At the "*p=x", the CPU will decide it needs to do a write,
> find the Dirty=0 entry and will entirely discard it. In other words, it
> *acts* roughly like this:
>
> x = *p
> INVLPG(p)
> *p = x;
>
> Where the INVLPG() and the "*p=x" are atomic. So, there's no
> _practical_ problem with your scenario. This specific behavior isn't
> architectural as far as I know, though.
>
> Although it's pretty much just academic, as for the architecture, are
> you getting hung up on the difference between the description of "Accessed":
>
> Whenever the processor uses a paging-structure entry as part of
> linear-address translation, it sets the accessed flag in that
> entry
>
> and "Dirty:"
>
> Whenever there is a write to a linear address, the processor
> sets the dirty flag (if it is not already set) in the paging-
> structure entry...
>
> Accessed says "as part of linear-address translation", which means that
> the address must have a translation. But, the "Dirty" section doesn't
> say that. It talks about "a write to a linear address" but not whether
> there is a linear address *translation* involved.
>
> If that's it, we could probably add a bit like:
>
> In addition to setting the accessed flag, whenever there is a
> write...
>
> before the dirty rules in the SDM.
>
> Or am I being dense and continuing to miss your point? :)

I think this time you got my question right.

I was thrown off by the SDM comment on RW permissions vs dirty that I
mentioned before:

"If software on one logical processor writes to a page while software on
another logical processor concurrently clears the R/W flag in the
paging-structure entry that maps the page, execution on some processors may
result in the entry’s dirty flag being set (due to the write on the first
logical processor) and the entry’s R/W flag being clear (due to the update
to the entry on the second logical processor).”

I did not pay enough attention to these small differences that you mentioned
between access and dirty this time (although I did notice them before).

I do not think that the change that you offered to the SDM really clarifies
the situation. Setting the access flag is done as part of caching the PTE in
the TLB. The SDM change you propose does not clarify the atomicity of the
permission/PTE-validity check and dirty-bit setting, or the fact that the PTE is
invalidated if the dirty-bit needs to be set and is cached as clear [I do not
presume you would want the latter in the SDM, since it is an implementation
detail.]

I just wonder how come the R/W-clearing and the P-clearing cause concurrent
dirty bit setting to behave differently. I am not a hardware guy, but I would
imagine they would be the same...

2021-10-27 10:24:41

by Dave Hansen

[permalink] [raw]
Subject: Re: [PATCH v2 2/5] mm: avoid unnecessary flush on change_huge_pmd()

On 10/26/21 1:07 PM, Nadav Amit wrote:
> I just wonder how come the R/W-clearing and the P-clearing cause concurrent
> dirty bit setting to behave differently. I am not a hardware guy, but I would
> imagine they would be the same...

First of all, I think the non-atomic properties where a PTE can go:

W=1,D=0 // original
W=0,D=0 // software clears W
W=0,D=1 // hardware sets D

were a total implementation accident. It wasn't someone being clever,
and since the behavior was architecturally allowed and well-tolerated by
software, it was around for a while. I think I was the one that asked
that it get fixed for shadow stacks, and nobody pushed back on it too
hard as far as I remember. I don't think it was super hard to fix.

Why do the Present/Accessed and Write/Dirty pairs act differently? I
think it's a total implementation accident and wasn't by design.

The KNL erratum was an erratum and wasn't codified in the architecture
because it actually broke things. The pre-CET Write/Dirty behavior
didn't break software to a level where it was considered an erratum. It gets
to live on as allowed in the architecture.