2006-05-03 15:43:31

by Dave McCracken

Subject: [PATCH 0/2][RFC] New version of shared page tables


I've done some cleanup and some bugfixing. Hugh, please review
this version instead of the old one. I like my locking mechanism
for unsharing on this one a lot better. It works on an address
range instead of depending on a vma, which more closely reflects
the way it's used.

The first patch just standardizes the pxd_page/pxd_page_kernel macros
for all architectures.

The second patch is the heart of shared page tables.

This version of the patches is against 2.6.17-rc3.

Dave McCracken




2006-05-03 15:56:17

by Hugh Dickins

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

On Wed, 3 May 2006, Dave McCracken wrote:
>
> I've done some cleanup and some bugfixing. Hugh, please review
> this version instead of the old one.

Grrr, just as I'm writing up my notes on the last revision!
I need a new go-faster brain. Okay, I'll switch over now.

Sisyphughs

2006-05-03 16:06:18

by Dave McCracken

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables


--On Wednesday, May 03, 2006 16:56:12 +0100 Hugh Dickins <[email protected]>
wrote:

>> I've done some cleanup and some bugfixing. Hugh, please review
>> this version instead of the old one.
>
> Grrr, just as I'm writing up my notes on the last revision!
> I need a new go-faster brain. Okay, I'll switch over now.

Sorry.

The changes should be relatively minor. Just a tweak to the unshare
locking and some extra code to handle hugepage copy_page_range, mostly.

Dave

2006-05-05 19:25:47

by Brian Twichell

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

Hi,

We reevaluated shared pagetables with recent patches from Dave. As with
our previous evaluation, a database transaction-processing workload was
used. This time our evaluation focused on a 4-way x86-64 configuration
with 8 GB of memory.

When the bufferpools were in small pages, shared pagetables
provided a 27% improvement in transaction throughput. The performance
increase is attributable to multiple factors. First, pagetable memory
consumption was reduced from 1.65 GB to 51 MB, freeing up 20% of the
system's memory. This memory was devoted to enlarging the database
bufferpools, which allowed more database data to be cached in memory.
The effect of this was to reduce the number of disk I/Os per
transaction by 23%, which contributed to a similar reduction in the
context switch rate. A second major component of the performance
improvement is reduced TLB and cache miss rates, due to the smaller
pagetable footprint. To try to isolate this benefit, we performed an
experiment where pagetables were shared, but the database bufferpools
were not enlarged. In this configuration, shared pagetables provided a
9% increase in database transaction throughput. Analysis of processor
performance counters revealed the following benefits from pagetable sharing:

- ITLB and DTLB page walks were reduced by 27% and 26%, respectively.
- L1 and L2 cache misses were reduced by 5%. This is due to fewer
pagetable entries crowding the caches.
- Front-side bus traffic was reduced by approximately 10%.

When the bufferpools were in hugepages, shared pagetables provided a 3%
increase in database transaction throughput. Some of the underlying
benefits of pagetable sharing were as follows:

- Pagetable memory consumption was reduced from 53 MB to 37 MB.
- ITLB and DTLB page walks were reduced by 28% and 10%, respectively.
- L1 and L2 cache misses were reduced by 2% and 6.5%, respectively.
- Front-side bus traffic was reduced by approximately 4%.

The database transaction throughput achieved using small pages with
shared pagetables (with bufferpools enlarged) was within 3% of the
transaction throughput achieved using hugepages without shared
pagetables. Thus shared pagetables provided nearly all the benefit of
hugepages, without having to deal with the limitations of hugepages.
We believe this would be a significant benefit to customers running
these types of workloads.

We also measured the benefit of shared pagetables on our larger setups.
On our 4-way x86-64 setup with 64 GB memory, using small pages for the
bufferpools, shared pagetables provided a 33% increase in transaction
throughput. Using hugepages for the bufferpools, shared pagetables
provided a 3% increase. Performance with small pages and shared
pagetables was within 4% of the performance using hugepages without
shared pagetables.

On our ppc64 setups we used both Oracle and DB2 to evaluate the benefit
of shared pagetables. When database bufferpools were in small pages,
shared pagetables provided an increase in database transaction
throughput in the range of 60-65%, while in the hugepage case the
improvement was up to 2.4%.

We thank Kshitij Doshi and Ken Chen from Intel for their assistance in
analyzing the x86-64 data.

Cheers,
Brian


2006-05-06 03:37:24

by Chen, Kenneth W

Subject: RE: [PATCH 0/2][RFC] New version of shared page tables

Brian Twichell wrote on Friday, May 05, 2006 12:26 PM
> We also measured the benefit of shared pagetables on our larger setups.
> On our 4-way x86-64 setup with 64 GB memory, using small pages for the
> bufferpools, shared pagetables provided a 33% increase in transaction
> throughput. Using hugepages for the bufferpools, shared pagetables
> provided a 3% increase. Performance with small pages and shared
> pagetables was within 4% of the performance using hugepages without
> shared pagetables.
>
> On our ppc64 setups we used both Oracle and DB2 to evaluate the benefit
> of shared pagetables. When database bufferpools were in small pages,
> shared pagetables provided an increase in database transaction
> throughput in the range of 60-65%, while in the hugepage case the
> improvement was up to 2.4%.


I would also like to add that I have run this set of patches on ia64 and
observed a similar performance upside. We now have multiple data points
showing that this feature benefits several architectures, and I'm
advocating its inclusion upstream.

- Ken

2006-05-06 15:25:20

by Hugh Dickins

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

On Wed, 3 May 2006, Dave McCracken wrote:
>
> The changes should be relatively minor. Just a tweak to the unshare
> locking and some extra code to handle hugepage copy_page_range, mostly.

I didn't pay any attention to 1/2, the pxd_page/pxd_page_kernel patch.
It was well worth separating that one out, really helps to reduce the
scale and worry of the main patch.

Notice recent mails suggesting it's the answer to a pud_page anomaly:
I hadn't realized the type of pud_page varies from arch to arch, that's
horrid: if you're sorting it out, then please make that clear in the
comment and push it forward.

Though I notice only a couple of instances of pxd_page_kernel outside of
include/asm-* in 2.6.17-rc: I now think you'd do much better not to
propagate that obscure _kernel suffix further, but go for pxd_page_vaddr
(or suchlike) throughout instead: more change, but much clearer.

I do agree with Christoph that you'd do well to separate out the hugetlb
part of the main patch. Not just for the locking, more because that's
become such a specialist and fast-moving area recently. I didn't pay
attention to that part of 2/2 either, but got the impression that your
patch has not kept up with the changes there.

Let me say (while perhaps others are still reading) that I'm seriously
wondering whether you should actually restrict your shared pagetable work
to the hugetlb case. I realize that would be a disappointing limitation
to you, and would remove the 25%/50% improvement cases, leaving only the
3%/4% last-ounce-of-performance cases.

But it's worrying me a lot that these complications to core mm code will
_almost_ never apply to the majority of users, will get little testing
outside of specialist setups. I'd feel safer to remove that "almost",
and consign shared pagetables to the hugetlb ghetto, if that would
indeed remove their handling from the common code paths. (Whereas,
if we didn't have hugetlb, I would be arguing strongly for shared pts.)

Patch 2/2 does look cleaner than before, and dropping PTSHARE_PUD has
helped to make it all simpler.

But you've not yet added the rss accounting: that's going to make it
quite a lot nastier. An argument for sticking just to the hugetlb case:
although hugetlb accounts rss at present, I think we could justify not
doing so (though hugetlb rss is more relevant now it's not prefaulted).
However, I'll continue commenting on your non-hugetlb modifications.

A lot of page migration work has come into the tree (in mm/mempolicy.c
and mm/migrate.c) since 2.6.15: an optimistic guess would be that all
you need do there is skip shared pagetables unless MPOL_MF_MOVE_ALL
(as page_mapcount > 1 pages would be skipped). You don't have to
worry about the pagetable becoming shared after you've tested: it's
already accepted that what's initially tested may change later (and
your rmap.c changes should cover the wider TLB flushing needed).

Naming: pt_check_unshare_pxd, I think better call them pt_unshare_pxd;
ah, you've already got pt_unshare_pxd, which is similar but different.
That's confusing - forced upon you by pagetable folding peculiarities?
I'd rather have just one copy of that central locking code.

You've generally been helpful with your underscores: one exception is
pt_vmashared: pt_sharing_vma? Ah, it's supposed to be gone from the
latest version, but one trace remains - please delete that. ptshare.h
contains more declarations/definitions never used, pt_increment_pte,
pt_decrement_pte, pt_increment_pmd, pt_decrement_pmd at least: please
check and delete what you're not actually using.

Compiler warns "value computed is not used" on the "*shared++;" lines
in page_check_address: should be "(*shared)++;". So at present the
shared path in rmap.c has not really been tested.

page_referenced_one needs to skip decrementing *mapcount when shared:
the vma prio_tree search will bring it back again and again to the
same shared pagetable, though that pte is only counted once in mapcount:
so it's currently liable to break out too early, missing other entries.

try_to_unmap_one will now be flushing TLB on all cpus for each shared
pagetable entry it unmaps, where often (seeing inactive mm) it wouldn't
have needed to flush TLB at all; but that might work out on balance,
it won't be finding so many entries to unmap.

The prio_tree_iter_inits in pt_share_pte and pt_share_pmd should limit
their scope to the range of the pagetable involved, not the whole vma.
next_shareable_vma likewise? I thought so at first, but perhaps its
check for a similar vma often avoids immediate unsharing.
Optimizations only, you've probably had them in mind for later.

I mentioned the off-by-one in pt_shareable_pte and pt_shareable_pmd
before: ought to say "vma->vm_end - 1 >= end"; but must admit that's
nitpicking, since vm_end is PAGE_SIZE aligned anyway, so no real
issue can arise - fix it to help stop others worrying later?

Whereas pt_share_pte and pt_share_pmd have the complementary issue:
"end = base + PXD_SIZE" may wrap to 0, so you need to -1 somewhere
(but you won't need base and end if pt_trans_start/end go away).
pt_share_pte and pt_share_pmd: preferable to swap around their pxd
and address arguments, so they resemble what they're replacing.

pt_shareable now has the same off-by-one too: or would have, but
"end = base + (mask-1)" is quite wrong, isn't it? base + ~mask?
Move those calculations lower down, after the common tests? or
do compiler and processor nowadays optimize such orderings well?
And there's a leftover "vmas in transition" comment on vm_flags.

pt_shareable is still not rejecting if vma->anon_vma is set: it's
quite possible for a vm_file vma to be private and writable, gather
some COW pages, and then be mprotected to readonly, so passing the
vm_flags test - but its pagetables must not be shared.

VM_PTSHARE came and went, good, you never had the mmap_sem needed
to set it. VM_TRANSITION came and went, you've replaced it by the
mm->pt_trans_start, pt_trans_end. At first I thought that a big
improvement, now I'm not so sure. If they stay, those added fields
should be under #ifdef so as not to enlarge the basic mm_struct.

I'd prefer something other than "lock" in pt_unshare_lock_range
and pt_unlock_range, but I think I'm going to suggest you go back
to using pt_unshare_range alone: let's look at the three callsites.

sys_remap_file_pages: doesn't really need the locked range, you could
just call pt_unshare_range a little lower down, once i_mmap_lock taken.

mprotect_fixup: that does need some protection, yes, because the pte
protections are out-of-synch with the vm_flags for a while (in a way
that's okay for the owning mm, but not for "parasites" wanting to share).
Please move the pt_unshare_lock_range (or whatever) down above vma_merge,
so you can remove the pt_unlock_range from the -ENOMEM case above it.

mremap move_vma: not good enough, you're unsharing and locking the old
range, but you also need to lock the new range before copy_vma, to hold
it unshared too; which could be done, though not with the interfaces
you've provided. (The VM_TRANSITION version was insufficient too, and
cleared the flag at a point where "vma" might already have been freed.)

Are those the only places which need this range locking? I was worried
at first that there might be more, then I came around to thinking you'd
identified the right places, now suddenly I see pt_check_unshare_pxd in
zap_pxd_range as vulnerable: the vma remains in the prio_tree, so it
might immediately get shared again; what the zapping does is not wholly
wrong, but its TLB flushing would be inadequate if the table has become
shared in the meantime. Or am I mistaken?

Unless you have firm performance evidence to the contrary, on a workload
that you're seriously trying to address, I suggest you drop the range
locking transitional stuff, and down_read_trylock(&svma->vm_mm->mmap_sem)
in (or near calls to) next_shareable_vma instead. That will fail more
often than your transitional checks, but give much stronger assurance
that nothing funny is going on in the vma found.

But do you then need to add down_write(&mm->mmap_sem) in exit_mmap, if
pagetable sharing is enabled? currently I believe so. But then, I think,
you can remove the pt_check_unshare_pxd from free_pyd_range: odd how it
was doing those once in the unmap_vmas path, then again in the
free_pgtables path - what was your thinking there? Yes, the pagetable
may have gotten reshared in between, but the TLB flushing would already
be inadequate if so.

I admire the simplicity of the way you just unshare when you have to,
letting faults fill back in lazily; but does that have a problem in the
case of a VM_LOCKED vma, losing the guarantees mlock should be giving?

I read through a lot of old mails while reviewing, going back to Daniel's
first implementation in 2002. The most interesting remark I found (and
have lost again) was one from wli, questioning the locking required when
changing *pmd. Hmm, let's look at your pt_check_unshare_pte (similar
remarks apply to pt_check_unshare_pmd, pt_unshare_pte, pt_unshare_pmd),
there's a lot to question in that locking.

Well, the locking that you do have, it's unclear why you're using the
spinlock in the pagetable struct page there: doesn't it amount to?
	page = pmd_page(*pmd);
	if (atomic_add_unless(&page->_mapcount, -1, -1))
		return 0;
	pmd_clear_flush(mm, address, pmd);
	return 1;
Ah, probably atomic_add_unless wasn't available when you wrote it.

But then what of the "Oops, already mapped" pt_decrement_share in
pt_share_pte? That's under different locking (rightly, the level
above, since the question there is whether a racing thread has set
*pmd): what happens if that decrement brings the share count down to,
umm, something awkward - hard to be specific, partly because of how
_mapcount starts from -1, partly because you've gone for a share count
rather than a reference count - I understand that you were avoiding the
overhead of maintaining another reference count on the common path, but
it leaves me deeply suspicious, I fear it's hiding bugs.

I'd agree it'll be rare (usually the racing thread will have found the
same pagetable to share as we have, and so raised its share count), but
I do believe that pt_decrement_share can go wrong: the process that had
that pagetable may be exiting, find it shared in its pt_check_unshare_pte
so skip zapping, then we're left with that pagetable to free - but we do
nothing other than decrement the count one too far. I think that will
get fixed by pt_share_pte holding i_mmap_lock and its down_read_trylock
of svma->vm_mm->mmap_sem across the lower block: then it only needs to
pt_increment_share when all's well at the end, the decrement case gone.
pte_lock nests within i_mmap_lock, should be fine for pmd_lock also.

Now, back to the question of the pmd_clear_flush: currently, we may add
a valid entry *pmd at any time, but we only clear it in free_pgtables,
after all occupying vmas have been removed from anon_vma and prio_tree.
You're relaxing that; most paths are protected by holding mmap_sem, but
file truncation and rmap lookup are not. The easiest protection against
races here is to hold i_mmap_lock, since both unmap_mapping_range and
page_check_address do (but slightly messy since the unmap_mapping_range
path must then avoid retaking it in the pt_check_unshares). If you're
taking i_mmap_lock in pt_check_unshare_pte etc, you could then skip the
atomic_add_unless I was suggesting above, revert to your existing
structure but using i_mmap_lock instead of pmd_page ptl.

Except, you must not drop the lock until after your pmd_clear_flush
(which only needs flush_tlb_mm, doesn't it, rather than flush_tlb_all?).
Because once you drop the lock, the process you were sharing with could
unmap and free all the pages, and not knowing it had been sharing, only
flush for its own mm - other threads of your process might be able to
access those pages after they were freed by the other.

I don't suppose extending the use of i_mmap_lock as I suggest will be
popular, it's liable to reduce your scalability: I'm more pointing to
an obvious way to fix some problems than necessarily the end solution.

Please don't interpret these detailed comments as meaning that I think
your patch is almost ready: I'm afraid that the longer I spend looking
at it, the more I find to worry about - not a good sign. (And let me
repeat, I've not looked at the hugetlb end of it at all.)

And though it's easy to find performance advocates in favour of your
patch, it's hard to find kernel hackers who care for maintainability
wanting it in. And I worry that it will tie our hands, repeatedly
posing a difficulty for other future developments (rather as
sys_remap_file_pages did, or I feared Christoph's pte xchging would).

How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
25 Jan resolved? We didn't hear the end of that.

Hugh

2006-05-08 19:33:12

by Ray Bryant

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
<snip>

> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
> 25 Jan resolved? We didn't hear the end of that.
>

I never heard anything back from Dave, either.

> Hugh
> -

<snip>

--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)

2006-05-08 19:49:37

by Brian Twichell

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

Hugh Dickins wrote:

>Let me say (while perhaps others are still reading) that I'm seriously
>wondering whether you should actually restrict your shared pagetable work
>to the hugetlb case. I realize that would be a disappointing limitation
>to you, and would remove the 25%/50% improvement cases, leaving only the
>3%/4% last-ounce-of-performance cases.
>
>But it's worrying me a lot that these complications to core mm code will
>_almost_ never apply to the majority of users, will get little testing
>outside of specialist setups. I'd feel safer to remove that "almost",
>and consign shared pagetables to the hugetlb ghetto, if that would
>indeed remove their handling from the common code paths. (Whereas,
>if we didn't have hugetlb, I would be arguing strongly for shared pts.)
>
Hi,

In the case of x86-64, if pagetable sharing for small pages was
eliminated, we'd lose more than the 27-33% throughput improvement
observed when the bufferpools are in small pages. We'd also lose a
significant chunk of the 3% improvement observed when the bufferpools
are in hugepages. This occurs because there is still small page
pagetable sharing being achieved, minimally for database text, when the
bufferpools are in hugepages. The performance counters indicated that
ITLB and DTLB page walks were reduced by 28% and 10%, respectively, in
the x86-64/hugepage case.

To be clear, all measurements discussed in my post were performed with
kernels config'ed to share pagetables for both small pages and hugepages.

If we had to choose between pagetable sharing for small pages and
hugepages, we would be in favor of retaining pagetable sharing for small
pages. That is where the discernable benefit is for customers that run
with "out-of-the-box" settings. Also, there is still some benefit there
on x86-64 for customers that use hugepages for the bufferpools.

Cheers,
Brian

2006-05-09 03:46:22

by Nick Piggin

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

Brian Twichell wrote:

> Hugh Dickins wrote:
>
>> Let me say (while perhaps others are still reading) that I'm seriously
>> wondering whether you should actually restrict your shared pagetable
>> work
>> to the hugetlb case. I realize that would be a disappointing limitation
>> to you, and would remove the 25%/50% improvement cases, leaving only the
>> 3%/4% last-ounce-of-performance cases.
>>
>> But it's worrying me a lot that these complications to core mm code will
>> _almost_ never apply to the majority of users, will get little testing
>> outside of specialist setups. I'd feel safer to remove that "almost",
>> and consign shared pagetables to the hugetlb ghetto, if that would
>> indeed remove their handling from the common code paths. (Whereas,
>> if we didn't have hugetlb, I would be arguing strongly for shared pts.)
>>
> Hi,
>
> In the case of x86-64, if pagetable sharing for small pages was
> eliminated, we'd lose more than the 27-33% throughput improvement
> observed when the bufferpools are in small pages. We'd also lose a
> significant chunk of the 3% improvement observed when the bufferpools
> are in hugepages. This occurs because there is still small page
> pagetable sharing being achieved, minimally for database text, when
> the bufferpools are in hugepages. The performance counters indicated
> that ITLB and DTLB page walks were reduced by 28% and 10%,
> respectively, in the x86-64/hugepage case.


Aside, can you just enlighten me as to how TLB misses are improved on
x86-64? As far as I knew, it doesn't have ASIDs so I wouldn't have
thought it could share TLBs anyway... But I'm not up to scratch with
modern implementations.

>
> To be clear, all measurements discussed in my post were performed with
> kernels config'ed to share pagetables for both small pages and hugepages.
>
> If we had to choose between pagetable sharing for small pages and
> hugepages, we would be in favor of retaining pagetable sharing for
> small pages. That is where the discernable benefit is for customers
> that run with "out-of-the-box" settings. Also, there is still some
> benefit there on x86-64 for customers that use hugepages for the
> bufferpools.


Of course if it was free performance then we'd want it. The downsides
are that it is a significant complexity for a pretty small (3%)
performance gain for your apparent target workload, which is pretty
uncommon among all Linux users.

Ignoring the complexity, it is still not free. Sharing data across
processes adds to synchronisation overhead and hurts scalability. Some
of these page fault scalability scenarios have shown to be important
enough that we have introduced complexity _there_.

And it seems customers running "out-of-the-box" settings really want to
start using hugepages if they're interested in getting the most
performance possible, no?


2006-05-10 02:07:45

by Chen, Kenneth W

Subject: RE: [PATCH 0/2][RFC] New version of shared page tables

Nick Piggin wrote on Monday, May 08, 2006 8:42 PM
> Brian Twichell wrote:
> > In the case of x86-64, if pagetable sharing for small pages was
> > eliminated, we'd lose more than the 27-33% throughput improvement
> > observed when the bufferpools are in small pages. We'd also lose a
> > significant chunk of the 3% improvement observed when the bufferpools
> > are in hugepages. This occurs because there is still small page
> > pagetable sharing being achieved, minimally for database text, when
> > the bufferpools are in hugepages. The performance counters indicated
> > that ITLB and DTLB page walks were reduced by 28% and 10%,
> > respectively, in the x86-64/hugepage case.
>
>
> Aside, can you just enlighten me as to how TLB misses are improved on
> x86-64? As far as I knew, it doesn't have ASIDs so I wouldn't have thought
> it could share TLBs anyway...
> But I'm not up to scratch with modern implementations.


Allow me to jump in if I may: the number of TLB misses did not change
much (on both the i-side and the d-side, as expected). What changed is
the penalty of a TLB miss: the number of page table walks performed by
the hardware is reduced. This is due to specialized buffering of paging
information that reduces the need to perform full page walks. With page
table sharing, the overall size of the page tables is reduced; in turn,
the hardware gets a better hit rate on the buffered entries, which
helps to mitigate the cost of page walks upon a TLB miss.

- Ken

2006-05-10 02:15:50

by Hugh Dickins

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

On Mon, 8 May 2006, Brian Twichell wrote:
>
> If we had to choose between pagetable sharing for small pages and hugepages,
> we would be in favor of retaining pagetable sharing for small pages. That is
> where the discernable benefit is for customers that run with "out-of-the-box"
> settings. Also, there is still some benefit there on x86-64 for customers
> that use hugepages for the bufferpools.

Thanks for the further info, Brian. Okay, the hugepage end of it does
add a different kind of complexity, in an area already complex from the
different arch implementations. If you've found that a significant part
of the hugepage test improvement is actually due to the smallpage changes,
let's turn around what I said, and suggest Dave concentrate on getting the
smallpage changes right, putting the hugepage part of it on the backburner
at least for now (or if he's particularly keen still to present it, as 3/3).

Hugh

2006-05-10 19:45:31

by Brian Twichell

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

Nick Piggin wrote:

> Brian Twichell wrote:
>
>>
>> If we had to choose between pagetable sharing for small pages and
>> hugepages, we would be in favor of retaining pagetable sharing for
>> small pages. That is where the discernable benefit is for customers
>> that run with "out-of-the-box" settings. Also, there is still some
>> benefit there on x86-64 for customers that use hugepages for the
>> bufferpools.
>
>
> Of course if it was free performance then we'd want it. The downsides
> are that it
> is a significant complexity for a pretty small (3%) performance gain
> for your apparent
> target workload, which is pretty uncommon among all Linux users.

Our performance data demonstrated that the potential gain for the
non-hugepage case is much higher than 3%.

>
> Ignoring the complexity, it is still not free. Sharing data across
> processes adds to
> synchronisation overhead and hurts scalability. Some of these page
> fault scalability
> scenarios have shown to be important enough that we have introduced
> complexity _there_.

True, but this needs to be balanced against the fact that pagetable
sharing will reduce the number of page faults when it is achieved.
Let's say you have N processes which touch all the pages in an M page
shared memory region. Without shared pagetables this requires N*M page
faults; if pagetable sharing is achieved, only M page faults are required.

>
> And it seems customers running "out-of-the-box" settings really want
> to start using
> hugepages if they're interested in getting the most performance
> possible, no?

My perspective is that, once the customer is required to invoke "echo
XXX > /proc/sys/vm/nr_hugepages" they've left the "out-of-the-box"
domain, and entered the domain of hoping that the number of hugepages is
sufficient, because if it's not, they'll probably need to reboot, which
can be pretty inconvenient for a production transaction-processing
application.

Cheers,
Brian



2006-05-12 05:17:42

by Nick Piggin

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

Brian Twichell wrote:
> Nick Piggin wrote:

>> Of course if it was free performance then we'd want it. The downsides
>> are that it
>> is a significant complexity for a pretty small (3%) performance gain
>> for your apparent
>> target workload, which is pretty uncommon among all Linux users.
>
>
> Our performance data demonstrated that the potential gain for the
> non-hugepage case is much higher than 3%.

The point is, there are hugepages. They were a significant additional
complexity but the concession was made because they did provide a
large speedup for databases.

>
>>
>> Ignoring the complexity, it is still not free. Sharing data across
>> processes adds to
>> synchronisation overhead and hurts scalability. Some of these page
>> fault scalability
>> scenarios have shown to be important enough that we have introduced
>> complexity _there_.
>
>
> True, but this needs to be balanced against the fact that pagetable
> sharing will reduce the number of page faults when it is achieved.
> Let's say you have N processes which touch all the pages in an M page
> shared memory region. Without shared pagetables this requires N*M page
> faults; if pagetable sharing is achieved, only M pagefaults are required.
>
>>
>> And it seems customers running "out-of-the-box" settings really want
>> to start using
>> hugepages if they're interested in getting the most performance
>> possible, no?
>
>
> My perspective is that, once the customer is required to invoke "echo
> XXX > /proc/sys/vm/nr_hugepages" they've left the "out-of-the-box"
> domain, and entered the domain of hoping that the number of hugepages is
> sufficient, because if it's not, they'll probably need to reboot, which
> can be pretty inconvenient for a production transaction-processing
> application.

I think it is pretty easy to reserve hugepages at bootup. This is what
a production transaction processing system will be doing, won't it?
Especially if they're performance constrained and hugepages gives them
a 30% performance boost.
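For reference, a config sketch of both reservation mechanisms being discussed; the count (4096) is purely illustrative:

```shell
# Boot-time reservation, done before memory fragments, by appending to
# the bootloader's kernel command line:
#     hugepages=4096

# Runtime reservation; may fall short once memory is fragmented:
echo 4096 > /proc/sys/vm/nr_hugepages
grep -i huge /proc/meminfo    # verify HugePages_Total
```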

--
SUSE Labs, Novell Inc.

2006-05-16 21:09:22

by Dave McCracken

Subject: Re: [PATCH 0/2][RFC] New version of shared page tables


--On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
<[email protected]> wrote:
> On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
> <snip>
>> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
>> 25 Jan resolved? We didn't hear the end of that.
>>
>
> I never heard anything back from Dave, either.

My apologies. As I recall your problem looked to be a race in an area
where I was redoing the concurrency control. I intended to ask you to
retest when my new version came out. Unfortunately the new version took
awhile, and by the time I sent it out I forgot to ask you about it.

I believe your problem should be fixed in recent versions. If not, I'll
make another pass at it.

Dave McCracken

2006-05-19 16:55:42

by Ray Bryant

[permalink] [raw]
Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

On Tuesday 16 May 2006 16:09, Dave McCracken wrote:
> --On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
>
> <[email protected]> wrote:
> > On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
> > <snip>
> >
> >> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
> >> 25 Jan resolved? We didn't hear the end of that.
> >
> > I never heard anything back from Dave, either.
>
> My apologies. As I recall your problem looked to be a race in an area
> where I was redoing the concurrency control. I intended to ask you to
> retest when my new version came out. Unfortunately the new version took
> awhile, and by the time I sent it out I forgot to ask you about it.
>
> I believe your problem should be fixed in recent versions. If not, I'll
> make another pass at it.
>
> Dave McCracken
>

Let me build up a kernel with the latest patches and give it a try.
(Sorry for delay, didn't see this note until today.)

> -
> To unsubscribe from this list: send the line "unsubscribe linux-kernel" in
> the body of a message to [email protected]
> More majordomo info at http://vger.kernel.org/majordomo-info.html
> Please read the FAQ at http://www.tux.org/lkml/

--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)

2006-05-22 18:01:06

by Ray Bryant

[permalink] [raw]
Subject: Re: [PATCH 0/2][RFC] New version of shared page tables

On Tuesday 16 May 2006 16:09, Dave McCracken wrote:
> --On Monday, May 08, 2006 14:32:39 -0500 Ray Bryant
>
> <[email protected]> wrote:
> > On Saturday 06 May 2006 10:25, Hugh Dickins wrote:
> > <snip>
> >
> >> How was Ray Bryant's shared,anonymous,fork,munmap,private bug of
> >> 25 Jan resolved? We didn't hear the end of that.
> >
> > I never heard anything back from Dave, either.
>
> My apologies. As I recall your problem looked to be a race in an area
> where I was redoing the concurrency control. I intended to ask you to
> retest when my new version came out. Unfortunately the new version took
> awhile, and by the time I sent it out I forgot to ask you about it.
>
> I believe your problem should be fixed in recent versions. If not, I'll
> make another pass at it.
>
> Dave McCracken
>

Dave,

I'm sending you a test case and a small kernel patch (see attachments). The
patch applies to 2.6.17-rc1, on top of your patches from 4/10/2006 (I'm
assuming those are the most recent ones).

What the patch does is to add a system call that will return the pfn and ptep
for a given virtual address. What the test program does (I think :-) ) is
to create a mmap'd shared region, then fork off a child. The child then
re-mmaps() private a portion of the region. Call it without arguments for
now, that should map 512 pte's and share them between the parent and 1 child.
[Later on we can try more pages and more children. (e. g.
./shpt_test1 128 64).]

At this point, what I expect to have happened is that at the shared region
address in the child, there will be a number of pages that are still shared
with the parent, and hence have the same pfn and ptep as before, followed
by a set of pages in the re-mmap()ed region where the pfn's and ptep's are
different, because that set of pages is no longer shared.

What I find is that in the re-mmap()ed region, the pfn's are different, but
the ptep's have not changed. Hence, we've modified the parent's address
space rather than getting our own copy of that part of the address space.

Now I'm not positive as to what the semantics SHOULD be here, so that may be
the error involved, but it seems to me that if I mmap() the region private in
the child, I should get a nice new private copy, and the pte's should no
longer be shared with the parent. Is that the way you guys understand the
semantics of this?

Anyway, take a look at my test case and see if it makes any sense to you.
If it turns out my test case is in error, then mea culpa, and I'll fix the
problems and try again.

Best Regards,

Ray

--
Ray Bryant
AMD Performance Labs Austin, Tx
512-602-0038 (o) 512-507-7807 (c)


Attachments:
(No filename) (2.88 kB)
add-sys_get_vminfo.patch (2.35 kB)
shpt_test1.c (9.96 kB)
Download all attachments