while running fork() testcases with NPTL we found a number of futex
related failures that i tracked down to the following conceptual bug in
the futex code:
if a futex is placed into private anonymous memory (eg. on the stack, or
in malloc() space, which is the common case with NPTL), then a fork()
followed by a COW in the original threaded context breaks all pending
FUTEX_WAIT operations on that page. The COW fault in the original
(threaded) context allocates a new page and installs it into the pagetable
- while a pending futex operation is hashed on that physical page. Any
subsequent FUTEX_WAKE will get lost, and the pending FUTEX_WAITs will get
stuck indefinitely. Ie. futexes were not really designed with COW in mind
- they were designed to be used in non-COW shared memory. This is a very
bad limitation and renders NPTL essencially useless, which has ABI
compatibility and on-stack automatic use/unuse of POSIX locking primitives
as an important goal.
there are a number of approaches possible to solve this limitation, i
picked the patch below. It extends the concept of page pinning with the
'stickyness' bit - a sticky page will remain in the pinning VM's context,
no matter what. The futex code marks the pinned page as sticky, in a
pagefault-atomic way, and the fork() code detects this condition and does
the COW copy upfront.
another implementation would be an idea from Linus: the page's lru list
pointer can in theory be used for pinned pages (pinned pages do not have
much LRU meaning anyway), and this pointer could specify the 'owner' MM of
the physical page. The COW fault handler then checks the sticky page:
- if the faulting context is a non-owner (ie. the fork()-ed child), then
the normal COW path is taken - new page allocated and installed.
- if the faulting context is the owner, then the pte chain is walked, and
the new page is installed into every 'other' pte. This needs a
cross-CPU single-page TLB flush though. The TLB flush could be
optimized if we had a way to get to the mapping MM's of the individual
pte chain entries - is this possible?
anyway, the attached patch is definitely the simpler approach, and any
more complex solution can be implemented ontop of this.
the downside of the sync-copy variant are the extra copies that could
perhaps be avoided via lazy copying. Code that did fork()+exec() can avoid
all of this overhead by using posix_spawn() which (will hopefully soon)
use clone()+exec() in NPTL. But even for normal forks() the number of
pending futex operations should be realtively low most of the time - and
it can be at most one pinned page per thread, in the NPTL design. And
those pinned pages will be copied in the future anyway, they are active
futex pages.
with this patch applied (against BK-curr), all previously failing NPTL
fork testcases pass, and futexes are now full-fledged OS primitives that
can be used on any type of memory mapping concept - anonymous, shared,
cow-shared, named, etc.
Ingo
--- linux/include/linux/page-flags.h.orig Thu Sep 26 12:48:42 2002
+++ linux/include/linux/page-flags.h Thu Sep 26 13:03:21 2002
@@ -69,6 +69,8 @@
#define PG_direct 16 /* ->pte_chain points directly at pte */
+#define PG_sticky 17 /* sticks to the pinning VM */
+
/*
* Global page accounting. One instance per CPU.
*/
@@ -203,6 +205,10 @@
#define TestSetPageDirect(page) test_and_set_bit(PG_direct, &(page)->flags)
#define ClearPageDirect(page) clear_bit(PG_direct, &(page)->flags)
#define TestClearPageDirect(page) test_and_clear_bit(PG_direct, &(page)->flags)
+
+#define PageSticky(page) test_bit(PG_sticky, &(page)->flags)
+#define SetPageSticky(page) set_bit(PG_sticky, &(page)->flags)
+#define ClearPageSticky(page) clear_bit(PG_sticky, &(page)->flags)
/*
* The PageSwapCache predicate doesn't use a PG_flag at this time,
--- linux/kernel/futex.c.orig Fri Sep 20 17:20:21 2002
+++ linux/kernel/futex.c Thu Sep 26 13:03:21 2002
@@ -90,6 +90,7 @@
/* Avoid releasing the page which is on the LRU list. I don't
know if this is correct, but it stops the BUG() in
__free_pages_ok(). */
+ ClearPageSticky(page);
page_cache_release(page);
}
@@ -151,17 +152,23 @@
static struct page *pin_page(unsigned long page_start)
{
struct mm_struct *mm = current->mm;
- struct page *page;
+ struct page *page = NULL;
int err;
- down_read(&mm->mmap_sem);
+ down_write(&mm->mmap_sem);
err = get_user_pages(current, mm, page_start,
1 /* one page */,
- 0 /* writable not important */,
+ 1 /* writable */,
0 /* don't force */,
&page,
NULL /* don't return vmas */);
- up_read(&mm->mmap_sem);
+ /*
+ * We mark the page sticky, to make sure a COW does not
+ * change the page from under hashed waiters.
+ */
+ if (page)
+ SetPageSticky(page);
+ up_write(&mm->mmap_sem);
if (err < 0)
return ERR_PTR(err);
--- linux/mm/memory.c.orig Fri Sep 20 17:20:25 2002
+++ linux/mm/memory.c Thu Sep 26 13:03:21 2002
@@ -188,6 +188,75 @@
out:
return pte_offset_kernel(pmd, address);
}
+
+/*
+ * Establish a new mapping:
+ * - flush the old one
+ * - update the page tables
+ * - inform the TLB about the new one
+ *
+ * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
+ */
+static inline void establish_pte(struct vm_area_struct * vma, unsigned long address, pte_t *page_table, pte_t entry)
+{
+ set_pte(page_table, entry);
+ flush_tlb_page(vma, address);
+ update_mmu_cache(vma, address, entry);
+}
+
+/*
+ * Sticky page handling.
+ *
+ * This code ensures that a sticky pinned page is not COW-ed away
+ * from the 'owner' VM. Eg. futex hashing relies on this.
+ */
+static inline int handle_sticky_page(struct mm_struct *src,
+ struct mm_struct *dst, pmd_t *src_pmd, pmd_t *dst_pmd, pte_t **src_pte,
+ pte_t **dst_pte, struct page *old_page, pte_t *pte,
+ unsigned long address, struct vm_area_struct *vma)
+{
+ struct page *new_page;
+
+ page_cache_get(old_page);
+
+ pte_unmap_nested(*src_pte);
+ pte_unmap(*dst_pte);
+ spin_unlock(&src->page_table_lock);
+
+ new_page = alloc_page(GFP_HIGHUSER);
+ if (!new_page) {
+ page_cache_release(old_page);
+ /*
+ * Return with no locks held:
+ */
+ return -ENOMEM;
+ }
+ if (old_page == ZERO_PAGE(address))
+ clear_user_highpage(new_page, address);
+ else
+ copy_user_highpage(new_page, old_page, address);
+
+ page_cache_release(old_page);
+
+ spin_lock(&src->page_table_lock);
+ *dst_pte = pte_offset_map(dst_pmd, address);
+ *src_pte = pte_offset_map_nested(src_pmd, address);
+ /*
+ * If another place raced with us, just free the new page
+ * and return - where we'll re-check what to do with the pte.
+ */
+ if (!pte_same(**src_pte, *pte)) {
+ __free_page(new_page);
+ return 1;
+ }
+ establish_pte(vma, address, *dst_pte, pte_mkwrite(pte_mkdirty(mk_pte(new_page, vma->vm_page_prot))));
+ page_add_rmap(new_page, *dst_pte);
+ lru_cache_add(new_page);
+ dst->rss++;
+
+ return 0;
+}
+
#define PTE_TABLE_MASK ((PTRS_PER_PTE-1) * sizeof(pte_t))
#define PMD_TABLE_MASK ((PTRS_PER_PMD-1) * sizeof(pmd_t))
@@ -240,7 +309,7 @@
goto nomem;
do {
- pte_t * src_pte, * dst_pte;
+ pte_t *src_pte, *dst_pte;
/* copy_pte_range */
@@ -267,6 +336,7 @@
/* copy_one_pte */
+repeat_check:
if (pte_none(pte))
goto cont_copy_pte_range_noset;
/* pte contains position in swap, so copy. */
@@ -276,15 +346,26 @@
goto cont_copy_pte_range_noset;
}
ptepage = pte_page(pte);
+ if (PageReserved(ptepage))
+ goto cont_copy_pte_range;
pfn = pte_pfn(pte);
if (!pfn_valid(pfn))
goto cont_copy_pte_range;
- ptepage = pfn_to_page(pfn);
- if (PageReserved(ptepage))
- goto cont_copy_pte_range;
/* If it's a COW mapping, write protect it both in the parent and the child */
if (cow) {
+ if (unlikely(PageSticky(ptepage))) {
+ int ret = handle_sticky_page(src, dst, src_pmd, dst_pmd, &src_pte, &dst_pte, ptepage, &pte, address, vma);
+ if (ret < 0)
+ goto nomem;
+
+ /* Was there a race? */
+ if (ret) {
+ pte = *src_pte;
+ goto repeat_check;
+ }
+ goto cont_copy_pte_range_noset;
+ }
ptep_set_wrprotect(src_pte);
pte = *src_pte;
}
@@ -950,21 +1031,6 @@
flush_tlb_range(vma, beg, end);
spin_unlock(&mm->page_table_lock);
return error;
-}
-
-/*
- * Establish a new mapping:
- * - flush the old one
- * - update the page tables
- * - inform the TLB about the new one
- *
- * We hold the mm semaphore for reading and vma->vm_mm->page_table_lock
- */
-static inline void establish_pte(struct vm_area_struct * vma, unsigned long address, pte_t *page_table, pte_t entry)
-{
- set_pte(page_table, entry);
- flush_tlb_page(vma, address);
- update_mmu_cache(vma, address, entry);
}
/*
In article <[email protected]>,
Ingo Molnar <[email protected]> wrote:
>
>while running fork() testcases with NPTL we found a number of futex
>related failures that i tracked down to the following conceptual bug in
>the futex code:
This patch seems trivially broken by having two futexes on the same
page. When the first futex removes itself, it will clear the sticky
bit, even though the other futex is still pinning the same page.
Trust me, you'll have to use the page list approach.
Linus
On Thu, 26 Sep 2002, Linus Torvalds wrote:
> This patch seems trivially broken by having two futexes on the same
> page. When the first futex removes itself, it will clear the sticky
> bit, even though the other futex is still pinning the same page.
sigh. And we cannot even properly detect which unpin_page() was the last
unpinning of the page - there can be so many other reasons a page's count
is elevated. And keeping a page sticky forever is no solution either, the
number of sticky pages would increase significantly, causing real fork()
problems.
> Trust me, you'll have to use the page list approach.
yeah, will try that now. I'm a bit worried about the mandatory cross-CPU
TLB flushes though.
Ingo
On Thu, 2002-09-26 at 16:27, Ingo Molnar wrote:
> > Trust me, you'll have to use the page list approach.
>
> yeah, will try that now. I'm a bit worried about the mandatory cross-CPU
> TLB flushes though.
Wouldn't splitting the vm area also be cleaner - that would let you get
things like correct address space accounting too.
Ingo Molnar wrote:
>
> ...
> another implementation would be an idea from Linus: the page's lru list
> pointer can in theory be used for pinned pages (pinned pages do not have
> much LRU meaning anyway), and this pointer could specify the 'owner' MM of
> the physical page. The COW fault handler then checks the sticky page:
Overloading page->lru in this way is tricky. If we can guarantee
that the page is anonymous (anonymise it if it's file-backed, pull
it out of swapcache) then fine, ->mapping, ->list.next, ->list.prev,
->index and ->private are available.
Can we do that?
> - if the faulting context is a non-owner (ie. the fork()-ed child), then
> the normal COW path is taken - new page allocated and installed.
>
> - if the faulting context is the owner, then the pte chain is walked, and
> the new page is installed into every 'other' pte. This needs a
> cross-CPU single-page TLB flush though. The TLB flush could be
> optimized if we had a way to get to the mapping MM's of the individual
> pte chain entries - is this possible?
Going from a pte back up to the owning MM is possible, yes. See
mm/rmap.c:try_to_unmap_one(). The caller would have to hold the
page's rmap_lock.
On Thu, 26 Sep 2002, Andrew Morton wrote:
> Ingo Molnar wrote:
> >
> > ...
> > another implementation would be an idea from Linus: the page's lru list
> > pointer can in theory be used for pinned pages (pinned pages do not have
> > much LRU meaning anyway), and this pointer could specify the 'owner' MM of
> > the physical page. The COW fault handler then checks the sticky page:
>
> Overloading page->lru in this way is tricky. If we can guarantee
> that the page is anonymous (anonymise it if it's file-backed, pull
> it out of swapcache) then fine, ->mapping, ->list.next, ->list.prev,
> ->index and ->private are available.
>
> Can we do that?
Well, we have two situations:
- the page is shared. In which case we don't need to put it on any
pinning list, since the very sharedness of the page means that we won't
be COW'ing it in this address space.
- the page is private. In which case we can (and should) pre-COW it and
make it anonymous at futex time.
So yeah, it should be doable.
Not pretty.
Linus
Actually, thinking more on this..
Ingo Molnar wrote:
>
> - if the faulting context is a non-owner (ie. the fork()-ed child), then
> the normal COW path is taken - new page allocated and installed.
>
> - if the faulting context is the owner, then the pte chain is walked, and
> the new page is installed into every 'other' pte. This needs a
> cross-CPU single-page TLB flush though. The TLB flush could be
> optimized if we had a way to get to the mapping MM's of the individual
> pte chain entries - is this possible?
Actually, we don't have to do it this way. My preferred solution would be
to make the pinning data structure be a special one with a callback (which
also means that you do _not_ have to re-use the LRU list), and what we do
is that when we're getting called back the futex code just updates to the
new physical page instead.
So the data structures would look something like this:
struct page_change_struct {
unsigned long address;
struct mm_struct *vm;
struct list_head list;
void (*callback)(struct page_change_struct *data, struct page *new);
}
struct list_head page_change_struct_hash[HASHSIZE];
and then when we pin a page, we do
/* This is part of the
struct page_change_struct pinned_data;
pinned_data.address = virtual_address;
pinned_data.vm = current_mm;
pinned_data.callback = futex_cow_callback;
insert_pin_page(page, &pinned_data);
.. this does a hash on address, inserts it into the
page_change_struct_hash table, and is done..
unpinning does:
remove_pin_page(page, &pinned_data);
.. this just does a "list_del(&pinned_data); ...
and COW does:
.. hash the COW address, look up the page_change_struct_hash,
search if the page/vm tuple exists in the index, and if it
does, call the callback()..
and then the "callback" function just updates the page information in the
futex block directly - as if it was looked up anew.
This has the advantage that it works without any cross-CPU tlb stuff, and
that other users (not just futexes) can also register themselves for
getting callbacks if somebody COW's a page they had.
We could extend it to work for unmapping etc too if we wanted (ie anybody
who caches a virtual->physical translation for a specific page can always
ask for a "invalidate this particular page mapping" event.
I really like this approach.
[ Of course I do, since I thought it up. All my ideas are absolutely
brilliant, until somebody points out why they can't work. The locking
might be interesting, but the most obvious locking seems to be to have
some per-hash thing. ]
Linus
On Thu, 26 Sep 2002, Linus Torvalds wrote:
>
> and then when we pin a page, we do
>
> /* This is part of the
> struct page_change_struct pinned_data;
That "This is part of the " comment should continue with "struct futex_q",
but I went off and was supposed to check what the futex data structure was
called, and forgot about updating it.
Anyway, the point being that this needs no new allocations in _any_ path,
and only extends a structure that we already need for the slow case for
futexes. It does imply a new lock, though, and the COW path would have to
check that hash (which should scale pretty well, since we only have
entries here when somebody is blocked on a futex).
Oh, and we need a new hash table, since the native futex hash can't just
be re-used due to different indexing - the futex hash is based on physical
page and offset, while the page_change_struct hash is based on virtual
address and the mm.
(I initially thought we could make the page_change_struct hash be based on
physical page, but there can be multiple instances of the same physical
page being mapped into the same VM, so that wouldn't be a good thing -
we'd get callbacks for the wrong virtual address being COW'ed).
Linus
On Thu, 26 Sep 2002, Linus Torvalds wrote:
> Well, we have two situations:
> - the page is shared. In which case we don't need to put it on any
> pinning list, since the very sharedness of the page means that we won't
> be COW'ing it in this address space.
> - the page is private. In which case we can (and should) pre-COW it and
> make it anonymous at futex time.
>
> So yeah, it should be doable.
well, 4 fields: ->mapping, ->list.next, ->list.prev, ->index and ->private
is *alot* of space to do something smart in struct page itself.
(hm, dont we have named anonymous mappings? (ie. mmap()-ing of a file via
MAP_PRIVATE?) And if a futex is put there it can be COW-ed, right, while
it's also in the pagecache?)
assuming those 4 fields are available, it should be easy for the futex
code to detect such mappings - the ->mapping field should be a good test,
if it's NULL then it's a COW-able page, if it's non-NULL then not.
then eg. the ->private field could be used as a simple PG_sticky counter.
or, to implement real lazy COW, the ->private field could be used as an
'owner MM' pointer, and if the COW handler sees current->mm == owner_mm,
then the ->list has a queue of pending page_change_struct's, which would
trigger a rehashing of the futexes. Then PG_sticky is cleared.
this would work fine as long as the pin_page code guarantees to never put
a hash on an already COW-ed page. (which can be guaranteed by using the
'writable' flag to get_user_pages())
no additional hashes. Is there any danger in going into this direction?
Ingo
Ingo Molnar <[email protected]> wrote:
> sigh. And we cannot even properly detect which unpin_page() was the last
> unpinning of the page - there can be so many other reasons a page's count
> is elevated. And keeping a page sticky forever is no solution either, the
> number of sticky pages would increase significantly, causing real fork()
> problems.
Maybe you can resurrect your approach by using a sticky counter instead of a flag.
If there are really that many unused fields in struct page for the case considered
here this should be possible.
But another point: what happens if get_user_pages (and the sticky-setting)
is called after the fork completed? If there was no write access to the
page between the fork and the futex call you may get the same race.
More general this seems not be a futex problem, but a general inconsistancy
between COW and page pinning by get_user_pages. You may get similar races if
you pin pages to do zero copy DMA on COWed pages. The bus-master device then
maybe transfers data to a child's page instead of a parent page (if someone write
touched the page between the call to get_user_pages and DMA completion).
One solution would be to force get_user_pages and copy_page_range
to replace a page marked for COW before handling it (of couse only on demand).
Martin
On Fri, 27 Sep 2002, Martin Wirth wrote:
> Maybe you can resurrect your approach by using a sticky counter instead
> of a flag. If there are really that many unused fields in struct page
> for the case considered here this should be possible.
well obviously it can be solved by putting more stuff into struct page,
but this whole exercise centers around trying to avoid that.
> But another point: what happens if get_user_pages (and the
> sticky-setting) is called after the fork completed? If there was no
> write access to the page between the fork and the futex call you may get
> the same race.
this is not a problem, the patch solves this, by using the 'writable' flag
to get_user_page(), which un-COWs any potential COW page.
Ingo
On Thu, 26 Sep 2002, Linus Torvalds wrote:
> and then the "callback" function just updates the page information in
> the futex block directly - as if it was looked up anew.
yes. And it would also have to rehash the futex queue (which is hashed
along (page,offset), because a FUTEX_WAKE has to find the proper queue -
but it's still very cheap.
this also means that FUTEX_WAIT does not have to make the futex page
writable - just making it present and hashing it along the physical page.
Ie. more robust and less intrusive futexes.
Ingo
Ingo Molnar wrote:
> futexes were not really designed with COW in mind - they were designed
> to be used in non-COW shared memory. This is a very bad limitation
I thought that futex-based locks were only reliable with PROT_SEM
memory, for architectures that define PROT_SEM (e.g. PPC) -- because of
the need for locking primitives to work in a cache coherent manner.
Is this not so?
-- Jamie
On Fri, 4 Oct 2002, Jamie Lokier wrote:
>
> I thought that futex-based locks were only reliable with PROT_SEM
> memory, for architectures that define PROT_SEM (e.g. PPC) -- because of
> the need for locking primitives to work in a cache coherent manner.
That was my initial thought, to make it portable.
It turns out that all the real uses of it want to just be able to use it
in a less portable manner, and put futexes anywhere, and in particular in
perfectly regular anonymous memory.
And since that works fine on all sane hardware, the point of PROT_SEM and
special casing just wasn't really there.
Does that mean that such common futex usage won't be portable to broken
hardware? Yup. The alternative was to make things cumbersome for
_everybody_, and since 99.9% of all existing hardware is sane (and the
remaining insane hw base tends to be going away anyway), it looks like it
was the right decision.
So now glibc can (and does) put semaphores on the stack etc random places,
without having to worry about it.
Linus