ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH:
pte_t pte;
do {
pte.pte_low = ptep->pte_low;
smp_rmb();
pte.pte_high = ptep->pte_high;
smp_rmb();
} while (unlikely(pte.pte_low != ptep->pte_low));
It has a comment above it that argues that this is correct because:
1. A present PTE can't become non-present and then become a present
PTE pointing to another page without a TLB flush in between.
2. TLB flushes involve IPIs.
As far as I can tell, in particular on x86, _both_ of those
assumptions are false; perhaps on mips and sh only one of them is?
Number 2 is straightforward: X86 can run under hypervisors, and when
it runs under hypervisors, the MMU paravirtualization code (including
the KVM version) can implement remote TLB flushes without IPIs.
Number 1 is gnarlier, because breaking that assumption implies that
there can be a situation where different threads see different memory
at the same virtual address because their TLBs are incoherent. But as
far as I know, it can happen when MADV_DONTNEED races with an
anonymous page fault, because zap_pte_range() does not always flush
stale TLB entries before dropping the page table lock. I think that's
probably fine, since it's a "garbage in, garbage out" kind of
situation - but if a concurrent GUP-fast can then theoretically end up
returning a completely unrelated page, that's bad.
Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't
just change ptep_get_lockless() to use arch_cmpxchg_double() and be
done with it...
On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote:
> ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH:
>
> pte_t pte;
> do {
> pte.pte_low = ptep->pte_low;
> smp_rmb();
> pte.pte_high = ptep->pte_high;
> smp_rmb();
> } while (unlikely(pte.pte_low != ptep->pte_low));
>
> It has a comment above it that argues that this is correct because:
> 1. A present PTE can't become non-present and then become a present
> PTE pointing to another page without a TLB flush in between.
> 2. TLB flushes involve IPIs.
>
> As far as I can tell, in particular on x86, _both_ of those
> assumptions are false; perhaps on mips and sh only one of them is?
>
> Number 2 is straightforward: X86 can run under hypervisors, and when
> it runs under hypervisors, the MMU paravirtualization code (including
> the KVM version) can implement remote TLB flushes without IPIs.
>
> Number 1 is gnarlier, because breaking that assumption implies that
> there can be a situation where different threads see different memory
> at the same virtual address because their TLBs are incoherent. But as
> far as I know, it can happen when MADV_DONTNEED races with an
> anonymous page fault, because zap_pte_range() does not always flush
> stale TLB entries before dropping the page table lock. I think that's
> probably fine, since it's a "garbage in, garbage out" kind of
> situation - but if a concurrent GUP-fast can then theoretically end up
> returning a completely unrelated page, that's bad.
>
>
> Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't
> just change ptep_get_lockless() to use arch_cmpxchg_double() and be
> done with it...
I think the argument here has nothing to do with IPIs, but is more a
statement on memory ordering. What we want to get is a non-torn load
of low/high, under some restricted rules.
PTE writes should be ordered so that the present/not present bit is
properly:
Zapping a PTE:
write_low (not present)
wmb()
write_high (a)
wmb()
Reestablish a PTE:
write_high (b)
wmb()
write_low (present)
wmb()
This ordering is necessary to make the TLB's atomic 64 bit load work
properly, otherwise the TLB could read a present entry with a bogus
other half!
For ptep_get_lockless() we define non-torn as meaning the same as for the TLB:
pre-zap low / high (present)
restablish low / high (b) (present)
any low / any high (not present)
Other combinations are forbidden.
The read side has a corresponding list of reads:
read_low
rmb()
read_high
rmb()
read_low
So, it seems plausible this could be OK based only on atomics (I did
not check that the present bit is properly placed in the right
low/high). Do you see a way the atomics don't work out?
Jason
On Thu, Oct 6, 2022 at 5:44 PM Jason Gunthorpe <[email protected]> wrote:
> On Thu, Oct 06, 2022 at 05:23:59PM +0200, Jann Horn wrote:
> > ptep_get_lockless() does the following under CONFIG_GUP_GET_PTE_LOW_HIGH:
> >
> > pte_t pte;
> > do {
> > pte.pte_low = ptep->pte_low;
> > smp_rmb();
> > pte.pte_high = ptep->pte_high;
> > smp_rmb();
> > } while (unlikely(pte.pte_low != ptep->pte_low));
> >
> > It has a comment above it that argues that this is correct because:
> > 1. A present PTE can't become non-present and then become a present
> > PTE pointing to another page without a TLB flush in between.
> > 2. TLB flushes involve IPIs.
> >
> > As far as I can tell, in particular on x86, _both_ of those
> > assumptions are false; perhaps on mips and sh only one of them is?
> >
> > Number 2 is straightforward: X86 can run under hypervisors, and when
> > it runs under hypervisors, the MMU paravirtualization code (including
> > the KVM version) can implement remote TLB flushes without IPIs.
> >
> > Number 1 is gnarlier, because breaking that assumption implies that
> > there can be a situation where different threads see different memory
> > at the same virtual address because their TLBs are incoherent. But as
> > far as I know, it can happen when MADV_DONTNEED races with an
> > anonymous page fault, because zap_pte_range() does not always flush
> > stale TLB entries before dropping the page table lock. I think that's
> > probably fine, since it's a "garbage in, garbage out" kind of
> > situation - but if a concurrent GUP-fast can then theoretically end up
> > returning a completely unrelated page, that's bad.
> >
> >
> > Sadly, mips and sh don't define arch_cmpxchg_double(), so we can't
> > just change ptep_get_lockless() to use arch_cmpxchg_double() and be
> > done with it...
>
> I think the argument here has nothing to do with IPIs, but is more a
> statement on memory ordering.
The comment above the definition of ptep_get_lockless() claims: "it
will not switch to a completely different present page without a TLB
flush in between; something that we are blocking by holding interrupts
off."
> What we want to get is a non-torn load
> of low/high, under some restricted rules.
>
> PTE writes should be ordered so that the present/not present bit is
> properly:
>
> Zapping a PTE:
>
> write_low (not present)
> wmb()
> write_high (a)
> wmb()
>
> Reestablish a PTE:
>
> write_high (b)
> wmb()
> write_low (present)
> wmb()
>
> This ordering is necessary to make the TLB's atomic 64 bit load work
> properly, otherwise the TLB could read a present entry with a bogus
> other half!
>
> For ptep_get_lockless() we define non-torn as meaning the same as for the TLB:
>
> pre-zap low / high (present)
> restablish low / high (b) (present)
> any low / any high (not present)
>
> Other combinations are forbidden.
>
> The read side has a corresponding list of reads:
>
> read_low
> rmb()
> read_high
> rmb()
> read_low
>
> So, it seems plausible this could be OK based only on atomics (I did
> not check that the present bit is properly placed in the right
> low/high). Do you see a way the atomics don't work out?
The race would be something like this, where A is one thread doing
ptep_get_lockless() and B, C and D are other threads:
<PTE initially points to address 0x0001000100010000>
A: read ptep->pte_low, sees low address half 0x00010000
B: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
C: page fault installs a new PTE pointing to address 0x0001000200020000
A: read ptep->pte_high, sees high address half 0x00010002
C: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
D: page fault installs a new PTE pointing to address 0x0001000300010000
A: re-read ptep->pte_low, sees low address half 0x00010000 matching
the first one
A: returns physical address 0x000100020x00010000, which was never
actually in the PTE
So it's not a problem with the memory ordering, it's just that it's
not possible to atomically read a 64-bit PTE with 32-bit reads when
the PTE can completely change under you - and ptep_get_lockless() was
written under the assumption that this can't happen because of TLB
flush IPIs.
On Thu, Oct 06, 2022 at 05:55:54PM +0200, Jann Horn wrote:
> > I think the argument here has nothing to do with IPIs, but is more a
> > statement on memory ordering.
>
> The comment above the definition of ptep_get_lockless() claims: "it
> will not switch to a completely different present page without a TLB
> flush in between; something that we are blocking by holding interrupts
> off."
I was always skeptical of that argument..
> > So, it seems plausible this could be OK based only on atomics (I did
> > not check that the present bit is properly placed in the right
> > low/high). Do you see a way the atomics don't work out?
>
> The race would be something like this, where A is one thread doing
> ptep_get_lockless() and B, C and D are other threads:
>
> <PTE initially points to address 0x0001000100010000>
> A: read ptep->pte_low, sees low address half 0x00010000
> B: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
> C: page fault installs a new PTE pointing to address 0x0001000200020000
> A: read ptep->pte_high, sees high address half 0x00010002
> C: begins MADV_DONTNEED, removes the PTE but doesn't flush TLB yet
> D: page fault installs a new PTE pointing to address 0x0001000300010000
> A: re-read ptep->pte_low, sees low address half 0x00010000 matching
> the first one
> A: returns physical address 0x000100020x00010000, which was never
> actually in the PTE
Okay, that does seem plausible :(
> So it's not a problem with the memory ordering, it's just that it's
> not possible to atomically read a 64-bit PTE with 32-bit reads when
> the PTE can completely change under you - and ptep_get_lockless() was
> written under the assumption that this can't happen because of TLB
> flush IPIs.
It does seem broken then.
If the arch can't provide an atomci load, I suppose the "easy" way to
fix it would be to use a seqlock to protect it..
Though in practice the chances the PTE changes multiple times between
two loads might be sufficiently rare on already rare 32 bit high mem
systems, perhaps it isn't worth fixing..
Thanks,
Jason
On Thu, Oct 6, 2022 at 5:55 PM Jann Horn <[email protected]> wrote:
> A: returns physical address 0x000100020x00010000, which was never
> actually in the PTE
Sorry, typo, that was supposed to be 0x0001000200010000.