2007-10-05 01:43:51

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
affects the Xen pv-ops backend.

In Xen, pagetables are normally kept RO so that the hypervisor can
mediate all updates to them. If Xen sees a write to an active
(currently pointed to by cr3) or pinned (a currently inactive but
registered pagetable), it will trap the write fault and emulate the
instruction making the update; this means that most pagetable-modifying
code doesn't need to know or care that pagetables are RO.

When a pagetable is first created (either in execve or fork), the the
Xen paravirt backend pins the pagetable, and conversely, on exit it is
unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks.
Pinning is done in two phases: first the pagetable pages are marked RO,
and then the pagetable is registered with Xen; unpinning is the
opposite. This works assuming that the pagetable is not in use, and not
yet visible to other cpus.

The race on pagetable creation is this: in kernel/fork.c:dup_mmap(), it
copies the old pagetable into the new one, and registers each vma with
the rmap prio tree. Once everything is copied, it calls
arch_dup_mmap(), which ends up doing the Xen pagetable pin. However,
because the pagetable is visible to other cpus via the prio tree,
pagetable modifications (specifically, clearing the access bit) can race
with pinning. If it hits between making the pagetable pages RO but
before they're registered with Xen, modifications to the flags will
fault, and Xen won't know to do the fixup.

The converse is also true in exit_mmap(): arch_exit_mmap is called
before removing the vmas from the prio tree, so it can race with unpinning.

The specific oops I'm seeing is this:

BUG: unable to handle kernel paging request at virtual address c5b023e8
printing eip:
c016d3f2
*pdpt = 000000004bc1a001
Oops: 0003 [#1]
PREEMPT SMP
Modules linked in:
CPU: 1
EIP: 0061:[<c016d3f2>] Not tainted VLI
EFLAGS: 00010202 (2.6.23-rc9-paravirt #1656)
EIP is at page_referenced_one+0xb8/0x12a
eax: c0401b17 ebx: c5b023e8 ecx: c2398000 edx: c044ceca
esi: 0087d000 edi: c5660688 ebp: c2399af4 esp: c2399acc
ds: 007b es: 007b fs: 00d8 gs: 0033 ss: 0069
Process cc1 (pid: 31474, ti=c2398000 task=c2dc9000 task.ti=c2398000)
Stack: c04014a7 c040f47a 0000011e c03697fe c2399b1c c5eb4500 c113e87c c116b1c8
c5660688 c13aa890 c2399b2c c016d4d8 00000000 c7917340 00000008 00000000
00000000 c13aa8c4 00000000 00000000 00000005 c116b1c8 00000001 c0473940
Call Trace:
[<c010927e>] show_trace_log_lvl+0x1a/0x2f
[<c0109330>] show_stack_log_lvl+0x9d/0xa5
[<c010952f>] show_registers+0x1f7/0x336
[<c0109789>] die+0x11b/0x23b
[<c035e811>] do_page_fault+0x758/0x838
[<c035ccca>] error_code+0x72/0x78
[<c016d4d8>] page_referenced_file+0x74/0xa0
[<c016d732>] page_referenced+0xbd/0xd0
[<c01611b4>] shrink_active_list+0x170/0x3a3
[<c0161e56>] shrink_zone+0xb9/0xf8
[<c0162808>] try_to_free_pages+0x13c/0x208
[<c015e269>] __alloc_pages+0x197/0x290
[<c015fa95>] __do_page_cache_readahead+0xd4/0x1d7
[<c015fe86>] do_page_cache_readahead+0x4b/0x56
[<c015be8d>] filemap_fault+0x1b7/0x3de
[<c0164649>] __do_fault+0x79/0x407
[<c0167760>] handle_mm_fault+0x27e/0xca0
[<c035e44a>] do_page_fault+0x391/0x838
[<c035ccca>] error_code+0x72/0x78
=======================
Code: 0c fe 97 36 c0 c7 44 24 08 1e 01 00 00 c7 44 24 04 7a f4 40 c0 c7 04 24 a7 14 40 c0 e8 d4 e5 fb ff e8 29 c9 f9 ff f6 03 20 74 27 <f0> 0f ba 33 05 19 c0 85 c0 74 1c 8b 07 89 f2 89 d9 8d b6 00 00
EIP: [<c016d3f2>] page_referenced_one+0xb8/0x12a SS:ESP 0069:c2399acc


It all worked OK before David's change, because asm-generic/pgtable.h
uses set_pte_at(), which ends up making a hypercall to update the
pagetable, which always works regardless of the state of the pagetable
pages.


It seems to me that there are a few ways to fix this:

1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This
will clearly work, but is pretty blunt.
2. Make test_and_clear_pte_flags a new paravirt-op, which can be
implemented in Xen as a hypercall, and as a raw test_and_clear_bit
for everyone else. The downside is adding yet another pv-op.
3. Restructure the pagetable setup code so that the mm is not added
to the prio tree until after arch_dup_mmap has been called (and
the converse for exit_mmap). This is arguably cleaner, but I
haven't looked to see how much trouble this would be.

Thoughts anyone? Does making the pagetables visible "early" cause
problems for anyone else?

Thanks,
J


2007-10-05 01:53:19

by Rik van Riel

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On Thu, 04 Oct 2007 18:43:32 -0700
Jeremy Fitzhardinge <[email protected]> wrote:

> It seems to me that there are a few ways to fix this:
>
> 1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This
> will clearly work, but is pretty blunt.
> 2. Make test_and_clear_pte_flags a new paravirt-op, which can be
> implemented in Xen as a hypercall, and as a raw
> test_and_clear_bit for everyone else. The downside is adding yet
> another pv-op.

Either of these two would work. Another alternative could be to
let test_and_clear_pte_flags have an exception table entry, where
we jump right to the next instruction if the instruction clearing
the flag fails.

That is the essentially variant you need for Xen, except the fast
path is still exactly the same it is as when running on native
hardware.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-05 02:45:17

by Andrew Morton

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On Thu, 04 Oct 2007 18:43:32 -0700 Jeremy Fitzhardinge <[email protected]> wrote:

> David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
> ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
> affects the Xen pv-ops backend.

y'know, I think I think it's been several years since I saw a report of an
honest to goodness, genuine SMP race in core kernel. We used to be
infested by them, but the term has fallen into disuse. Interesting, but
OT.

> It seems to me that there are a few ways to fix this:
>
> 1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This
> will clearly work, but is pretty blunt.
> 2. Make test_and_clear_pte_flags a new paravirt-op, which can be
> implemented in Xen as a hypercall, and as a raw test_and_clear_bit
> for everyone else. The downside is adding yet another pv-op.
> 3. Restructure the pagetable setup code so that the mm is not added
> to the prio tree until after arch_dup_mmap has been called (and
> the converse for exit_mmap). This is arguably cleaner, but I
> haven't looked to see how much trouble this would be.
>
> Thoughts anyone? Does making the pagetables visible "early" cause
> problems for anyone else?

I expect that 2) has the maximum niceness*suitable-for-2.6.23 product.

That's if you actually care much about kernel.org major releases - do many
people run kernel.org kernels on Xen? If "not many" then we could perhaps
do something more elaborate for 2.6.23.1. But adding ever more pvops as
core kernel evolves was always expected.

2007-10-05 04:08:54

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Andrew Morton wrote:
> y'know, I think I think it's been several years since I saw a report of an
> honest to goodness, genuine SMP race in core kernel. We used to be
> infested by them, but the term has fallen into disuse. Interesting, but
> OT.
>

I was a bit surprised to find myself typing it too. I guess it could
also be a preempt race, which has been a bit more common. Anyway, its a
deliberately unlocked access to the pagetable structure, so not terribly
surprising.

>> It seems to me that there are a few ways to fix this:
>>
>> 1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This
>> will clearly work, but is pretty blunt.
>> 2. Make test_and_clear_pte_flags a new paravirt-op, which can be
>> implemented in Xen as a hypercall, and as a raw test_and_clear_bit
>> for everyone else. The downside is adding yet another pv-op.
>> 3. Restructure the pagetable setup code so that the mm is not added
>> to the prio tree until after arch_dup_mmap has been called (and
>> the converse for exit_mmap). This is arguably cleaner, but I
>> haven't looked to see how much trouble this would be.
>>
>> Thoughts anyone? Does making the pagetables visible "early" cause
>> problems for anyone else?
>>
>
> I expect that 2) has the maximum niceness*suitable-for-2.6.23 product.
>

OK, I'll whip a patch together.

> That's if you actually care much about kernel.org major releases - do many
> people run kernel.org kernels on Xen?

Well, given that there hasn't been a Xen-capable kernel.org release yet,
no... But we'll see what happens when .23 goes out the door.

> If "not many" then we could perhaps
> do something more elaborate for 2.6.23.1. But adding ever more pvops as
> core kernel evolves was always expected.
>

I think keep it simple for now; anything significant can wait for the
brave new world of unified x86.

J

2007-10-05 04:15:30

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Rik van Riel wrote:
> Either of these two would work. Another alternative could be to
> let test_and_clear_pte_flags have an exception table entry, where
> we jump right to the next instruction if the instruction clearing
> the flag fails.
>
> That is the essentially variant you need for Xen, except the fast
> path is still exactly the same it is as when running on native
> hardware.
>

Hm, that wouldn't end up clearing the bit. You'd need a Xen-specific
exception handler to do that, which would turn the whole thing into
Xen-specific code, and you're back at adding a pv-op.

J

2007-10-05 08:06:06

by Andi Kleen

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown


> Fortuitously, Jan Beulich has a patch to fix this. It's not going to be
> directly applicable to 2.6.23-rc series, but should be easily ported:
> <http://lists.xensource.com/archives/html/xen-devel/2007-03/msg01200.html>.

Do I misread that patch or does it really walk the complete address
space and try to take all possible locks? Isn't that very slow?

-Andi

2007-10-05 09:05:44

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Keir Fraser wrote:
>
> Hang on! How is the access unlocked? By my reading
> page_referenced_one()->page_check_address()->spin_lock(pte_lockptr()).
>

Ah, OK. I'd overlooked that.

> The problem here is most likely insufficient locking in the pin/unpin
> table-walking code, in light of the fact that you are probably running
> with
> per-page spinlocks (SPLIT_PTLOCK_CPUS). Because we nobble that option
> in our
> own kernel ports it suffices to take the page_table_lock when doing the
> walk-[un]pin-remap routine. This is *not* true with SPLIT_PTLOCK_CPUS.
>

Hm, I see.

> Fortuitously, Jan Beulich has a patch to fix this. It's not going to be
> directly applicable to 2.6.23-rc series, but should be easily ported:
> <http://lists.xensource.com/archives/html/xen-devel/2007-03/msg01200.html>.
>

OK, I can use that.


Andi says:
> Do I misread that patch or does it really walk the complete address
> space and try to take all possible locks? Isn't that very slow?
>

That's pretty much what it has to do. Pinning/unpinning walks the whole
pagetable anyway, so it shouldn't be much more expensive. And they're
relatively rare operations (fork, exec, exit).

J

2007-10-05 09:15:23

by Keir Fraser

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On 5/10/07 10:05, "Jeremy Fitzhardinge" <[email protected]> wrote:

> Andi says:
>> Do I misread that patch or does it really walk the complete address
>> space and try to take all possible locks? Isn't that very slow?
>>
>
> That's pretty much what it has to do. Pinning/unpinning walks the whole
> pagetable anyway, so it shouldn't be much more expensive. And they're
> relatively rare operations (fork, exec, exit).

It is a shame to do 3x walks per pin or unpin, rather than 1x, though.

One way to improve this, possibly, is to pin the pte tables individually as
you go, rather than doing one big pin/unpin just at the root pgd. Then you
can lock/unlock the pte's as you go. I'd suggest that as a possible post
2.6.23 improvement, however. Jan's patch has actually had some testing.

-- Keir

2007-10-05 11:38:21

by Hugh Dickins

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

I see the discussion has somehow moved on to pagetable locking -
mysterious because the word "lock" never once appears in your
otherwise very helpful elucidation below, for which many thanks.

Maybe what I have to add is now of historical interest only, or none,
but I was prevented from answering your original mail earlier...

On Thu, 4 Oct 2007, Jeremy Fitzhardinge wrote:

> David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
> ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
> affects the Xen pv-ops backend.

I think that race with Xen has been there all along, not introduced
by David's commit. Take a look at ptep_clear_flush_young() in the
2.6.21 include/asm-i386/pgtable.h, it looks equally a problem to me.

>
> In Xen, pagetables are normally kept RO so that the hypervisor can
> mediate all updates to them. If Xen sees a write to an active
> (currently pointed to by cr3) or pinned (a currently inactive but
> registered pagetable), it will trap the write fault and emulate the
> instruction making the update; this means that most pagetable-modifying
> code doesn't need to know or care that pagetables are RO.
>
> When a pagetable is first created (either in execve or fork), the the
> Xen paravirt backend pins the pagetable, and conversely, on exit it is
> unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks.
> Pinning is done in two phases: first the pagetable pages are marked RO,
> and then the pagetable is registered with Xen; unpinning is the

To my naive mind, your problem actually lies in those two stages:
whatever marks the pages RO should not be keeping Xen in ignorance.

> opposite. This works assuming that the pagetable is not in use, and not
> yet visible to other cpus.
>
> The race on pagetable creation is this: in kernel/fork.c:dup_mmap(), it
> copies the old pagetable into the new one, and registers each vma with
> the rmap prio tree. Once everything is copied, it calls
> arch_dup_mmap(), which ends up doing the Xen pagetable pin. However,
> because the pagetable is visible to other cpus via the prio tree,
> pagetable modifications (specifically, clearing the access bit) can race
> with pinning. If it hits between making the pagetable pages RO but
> before they're registered with Xen, modifications to the flags will
> fault, and Xen won't know to do the fixup.
>
> The converse is also true in exit_mmap(): arch_exit_mmap is called
> before removing the vmas from the prio tree, so it can race with unpinning.
>
[oops snipped]
>
> It all worked OK before David's change, because asm-generic/pgtable.h
> uses set_pte_at(), which ends up making a hypercall to update the
> pagetable, which always works regardless of the state of the pagetable
> pages.

Except ptep_clear_flush_young() didn't use set_pte_at().

>
> It seems to me that there are a few ways to fix this:
>
> 1. Use asm-generic/pgtable.h when CONFIG_PARAVIRT is enabled. This
> will clearly work, but is pretty blunt.

No. The asm-generic/pgtable.h version of ptep_test_and_clear_young()
does set_pte_at((__vma)->vm_mm, (__address), (__ptep), pte_mkold(__pte)),
which may be nice for Xen, but is dangerously racy on i386 SMP (I hope
it's not racy on the other architectures using it...): it's in danger
of losing the dirty bit if that gets set in userspace in between us
reading __pte and setting the pte_mkold version of __pte. There's
no issue if we were to lose the young bit while setting the dirty
bit, but losing the dirty in setting the young is a no no.

> 2. Make test_and_clear_pte_flags a new paravirt-op, which can be
> implemented in Xen as a hypercall, and as a raw test_and_clear_bit
> for everyone else. The downside is adding yet another pv-op.

Maybe (if this turns out to be more than just a muddle over pagetable
locking). But note that you have a problem, not just with the
ptep_clear_flush_young from page_referenced_one, but also with the
ptep_clear_flush (which ends up using native_ptep_get_and_clear on
i386 I think) from try_to_unmap_one: you just won't hit that one
as often as the page_referenced_one case.

Anything in include/asm-i386/pgtable.h which uses pte_update or
pte_update_defer ought to be considered; but I expect that you'll
find all the other instances are only used from safe contexts,
which cannot race with yours. The pte_update comes after the op
that concerns you: maybe there should be a pte_about_to_update
just before, to handle your case?

> 3. Restructure the pagetable setup code so that the mm is not added
> to the prio tree until after arch_dup_mmap has been called (and
> the converse for exit_mmap). This is arguably cleaner, but I
> haven't looked to see how much trouble this would be.

No. It is intentional that we make those ptes visible as early as
possible, so that concurrent pageout (and less importantly swapoff)
has the best chance of finding all references to a page (or swap ent).
If they only became visible at the final arch_dup_mmap stage, then
it might become impossible to fork a large well-populated mm, if it
contains those very pages which need to be freed to make space to
allocate pagetables for the child.

Hugh

2007-10-05 13:17:55

by Rik van Riel

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On Thu, 04 Oct 2007 21:15:18 -0700
Jeremy Fitzhardinge <[email protected]> wrote:

> Rik van Riel wrote:
> > Either of these two would work. Another alternative could be to
> > let test_and_clear_pte_flags have an exception table entry, where
> > we jump right to the next instruction if the instruction clearing
> > the flag fails.
> >
> > That is the essentially variant you need for Xen, except the fast
> > path is still exactly the same it is as when running on native
> > hardware.
> >
>
> Hm, that wouldn't end up clearing the bit.

Big deal. We don't care *that* much.

As long as the bit gets cleared 99.9% of the time, we're fine.
All that bit is for is determining when to swap out a page.

--
"Debugging is twice as hard as writing the code in the first place.
Therefore, if you write the code as cleverly as possible, you are,
by definition, not smart enough to debug it." - Brian W. Kernighan

2007-10-05 15:34:19

by Hugh Dickins

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On Fri, 5 Oct 2007, Keir Fraser wrote:
> On 5/10/07 10:05, "Jeremy Fitzhardinge" <[email protected]> wrote:
> > Andi says:
> >> Do I misread that patch or does it really walk the complete address
> >> space and try to take all possible locks? Isn't that very slow?
> >
> > That's pretty much what it has to do. Pinning/unpinning walks the whole
> > pagetable anyway, so it shouldn't be much more expensive. And they're
> > relatively rare operations (fork, exec, exit).
>
> It is a shame to do 3x walks per pin or unpin, rather than 1x, though.
>
> One way to improve this, possibly, is to pin the pte tables individually as
> you go, rather than doing one big pin/unpin just at the root pgd. Then you
> can lock/unlock the pte's as you go. I'd suggest that as a possible post
> 2.6.23 improvement, however. Jan's patch has actually had some testing.

A few points come to mind looking at Jan's patch:

The comment about nested pagetable locks is wrong: mm/mremap.c does
nest pagetable locks; but you wouldn't (as I understand it) be doing
this pinning/unpinning anywhere which could race with an mremap() on
the same mm, so that's a niggle about a comment, not a real issue.

Yes, it would be greatly preferable to take the locks one by one
as needed. As it stands, I think you're in danger of overflowing
the PREEMPT_BITS 8 area of preempt_count(), venturing into the
SOFTIRQ area: I don't know the real-life consequence of that.

I don't see any protection against hugetlb areas, where the pmd
entry may indicate a hugetlb page rather than a pagetable page.
I guess you'll be needing to test pte_huge(). I don't know if
you want to lock those or skip them: locking is usually just
with page_table_lock, but beware there's also sharing of huge
page pmds between mms - Ken Chen should be able to help on that.

If a 2.6.23 fix is needed, I suggest simply excluding split ptlocks
in the Xen case, as shown by the mm/Kconfig - line in Jan's patch.

Hugh

2007-10-05 15:46:29

by Keir Fraser

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On 5/10/07 16:33, "Hugh Dickins" <[email protected]> wrote:

> If a 2.6.23 fix is needed, I suggest simply excluding split ptlocks
> in the Xen case, as shown by the mm/Kconfig - line in Jan's patch.

I didn't think that nobbling config options for particular pv_ops
implementations was acceptable? I'm rather out of the loop though, and could
be wrong.

The PREEMPT_BITS limitation is a good argument for at least taking the pte
locks in small batches though (small batches is preferable to one-by-one
since we will want to batch the make-readonly-and-pin hypercall requests to
amortise the cost of the hypervisor trap).

-- Keir

2007-10-05 16:48:36

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Keir Fraser wrote:
> I didn't think that nobbling config options for particular pv_ops
> implementations was acceptable? I'm rather out of the loop though, and could
> be wrong.
>

As a workaround it would be OK. As a dependency, perhaps.

> The PREEMPT_BITS limitation is a good argument for at least taking the pte
> locks in small batches though (small batches is preferable to one-by-one
> since we will want to batch the make-readonly-and-pin hypercall requests to
> amortise the cost of the hypervisor trap).
>

Hm, that's a good point. The pagetable permissions changes are batched
more or less asynchronously from the actual loop structure; that will
complicate adding the locking.

J

2007-10-05 18:59:28

by Rik van Riel

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On Fri, 5 Oct 2007 12:36:33 +0100 (BST)
Hugh Dickins <[email protected]> wrote:

> To my naive mind, your problem actually lies in those two stages:
> whatever marks the pages RO should not be keeping Xen in ignorance.

It does. Telling Xen to pin the page as a page table page is
basically the first thing a Xen kernel does after marking the
page read-only.

This makes for a narrow race window, during which ptep_test_and_clear_young
cannot clear the referenced bit and may end up causing a crash. We do not
care about it not clearing the referenced bit during that window, since it
will be cleared during the next go-around and the race is very rare.

Hence, the only thing we need to fix is the crash.

We can do that by adding an entry for ptep_test_and_clear_young to the
exception table. This way we do not need to turn this into a new paravirt
ops hook (since the fast path is exactly the same as x86 native) and there
is no need for added complexity.

Also, Xen would not conflict with SPLIT_PTLOCK_CPUS.

--
All Rights Reversed

2007-10-05 19:40:03

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Hugh Dickins wrote:
> I see the discussion has somehow moved on to pagetable locking -
> mysterious because the word "lock" never once appears in your
> otherwise very helpful elucidation below, for which many thanks.
>

It has to be taken with a grain of salt. The reason "lock" isn't
mentioned is because I mis-analyzed the situation, and overlooked that
page_referenced_one does actually take the pte's lock.

> Maybe what I have to add is now of historical interest only, or none,
> but I was prevented from answering your original mail earlier...
>
> On Thu, 4 Oct 2007, Jeremy Fitzhardinge wrote:
>
>
>> David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
>> ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
>> affects the Xen pv-ops backend.
>>
>
> I think that race with Xen has been there all along, not introduced
> by David's commit. Take a look at ptep_clear_flush_young() in the
> 2.6.21 include/asm-i386/pgtable.h, it looks equally a problem to me.
>

Yes, but I don't think its a problem with correct locking. set_pte_at
is a red herring.

>> When a pagetable is first created (either in execve or fork), the the
>> Xen paravirt backend pins the pagetable, and conversely, on exit it is
>> unpinned; this is done via the arch_dup_mmap() and activate_mm() hooks.
>> Pinning is done in two phases: first the pagetable pages are marked RO,
>> and then the pagetable is registered with Xen; unpinning is the
>>
>
> To my naive mind, your problem actually lies in those two stages:
> whatever marks the pages RO should not be keeping Xen in ignorance.
>

No, its the Xen-specific kernel code which does the RO marking: it marks
the pagetables RO (using a hypercall to make the actual modifications),
and then tells the hypervisor to pin the whole pagetable. It can't be
done atomically, so there's always a window between the two phases.

>> It all worked OK before David's change, because asm-generic/pgtable.h
>> uses set_pte_at(), which ends up making a hypercall to update the
>> pagetable, which always works regardless of the state of the pagetable
>> pages.
>>
>
> Except ptep_clear_flush_young() didn't use set_pte_at().
>

Yes, my mistake.

>> 3. Restructure the pagetable setup code so that the mm is not added
>> to the prio tree until after arch_dup_mmap has been called (and
>> the converse for exit_mmap). This is arguably cleaner, but I
>> haven't looked to see how much trouble this would be.
>>
>
> No. It is intentional that we make those ptes visible as early as
> possible, so that concurrent pageout (and less importantly swapoff)
> has the best chance of finding all references to a page (or swap ent).
> If they only became visible at the final arch_dup_mmap stage, then
> it might become impossible to fork a large well-populated mm, if it
> contains those very pages which need to be freed to make space to
> allocate pagetables for the child.

Hm, OK.

J

2007-10-05 19:40:34

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Rik van Riel wrote:
> This makes for a narrow race window, during which ptep_test_and_clear_young
> cannot clear the referenced bit and may end up causing a crash. We do not
> care about it not clearing the referenced bit during that window, since it
> will be cleared during the next go-around and the race is very rare.
>

Is this the only possible case? It's just the one I found while testing
under high memory pressure.

> Hence, the only thing we need to fix is the crash.
>
> We can do that by adding an entry for ptep_test_and_clear_young to the
> exception table. This way we do not need to turn this into a new paravirt
> ops hook (since the fast path is exactly the same as x86 native) and there
> is no need for added complexity.
>
> Also, Xen would not conflict with SPLIT_PTLOCK_CPUS.
>

Well, isn't the correct fix to make Xen take all the pagetable locks
while pinning/unpinning? Adding exception handling to
test_and_clear_bit would solve this particular race, but are there
others (either now or potentially)? Seems fragile.

J

2007-10-05 19:57:08

by Rik van Riel

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On Fri, 05 Oct 2007 12:40:05 -0700
Jeremy Fitzhardinge <[email protected]> wrote:

> Well, isn't the correct fix to make Xen take all the pagetable locks
> while pinning/unpinning? Adding exception handling to
> test_and_clear_bit would solve this particular race, but are there
> others (either now or potentially)? Seems fragile.

You're right, holding the pagetable lock(s) across pinning/unpinning
operations would avoid the problem too.

While we do not care about the accessed bit, there are similar operations
going on with the dirty bit - which is a lot more important :)

--
All Rights Reversed

2007-10-05 20:35:50

by Jeremy Fitzhardinge

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

Keir Fraser wrote:
> The PREEMPT_BITS limitation is a good argument for at least taking the pte
> locks in small batches though (small batches is preferable to one-by-one
> since we will want to batch the make-readonly-and-pin hypercall requests to
> amortise the cost of the hypervisor trap).

Hm, I don't see how we can avoid holding locks for the whole pagetable
at once. We need to hold the locks from between marking a particular
pte page RO until the whole pagetable is pinned; if we don't, pte
updates can come in from other cpus, which will fault on the RO but
won't be fixed up by Xen.

Could we avoid it by doing a series of partial pins, locking and
unlocking each one in the process, and then pin the whole pagetable?

J

2007-10-08 02:24:39

by Nick Piggin

[permalink] [raw]
Subject: Re: race with page_referenced_one->ptep_test_and_clear_young and pagetable setup/pulldown

On Friday 05 October 2007 12:44, Andrew Morton wrote:
> On Thu, 04 Oct 2007 18:43:32 -0700 Jeremy Fitzhardinge <[email protected]>
wrote:
> > David's change 10a8d6ae4b3182d6588a5809a8366343bc295c20, "i386: add
> > ptep_test_and_clear_{dirty,young}" has introduced an SMP race which
> > affects the Xen pv-ops backend.
>
> y'know, I think I think it's been several years since I saw a report of an
> honest to goodness, genuine SMP race in core kernel. We used to be
> infested by them, but the term has fallen into disuse. Interesting, but
> OT.

Does that include data races on weakly ordered systems, or UP races
(ie. with sleeping locks rather than spinning ones)? ;) Because we had
and have a few of those...