On 09/29/2012 06:48 AM, Andrea Arcangeli wrote:
>
> There would be a small cache benefit here... but even then some first
> level caches are virtually indexed IIRC (always physically tagged so
> the software doesn't notice), and virtually indexed ones won't get any
> benefit.
>
Not quite. The virtual indexing is limited to a few bits (e.g. three
bits on K8); the right way to deal with that is to color the zero page,
both the regular one and the virtual one (the virtual one would cycle
through all the colors repeatedly).
The cache difference, therefore, is *huge*.
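For illustration, the coloring idea could look roughly like this; a
hypothetical sketch with made-up names (colored_zero_page(),
ZERO_PAGE_COLORS), not code from any posted patch:

#include <linux/mm.h>

/*
 * Keep one zeroed 4k page per cache color and hand out the one whose
 * color matches the faulting virtual address, so the virtual zero page
 * cycles through all the colors.
 */
#define ZERO_PAGE_COLORS	8	/* e.g. 2^3 for three bits of virtual index */

static struct page *zero_pages[ZERO_PAGE_COLORS];

static struct page *colored_zero_page(unsigned long vaddr)
{
	unsigned int color = (vaddr >> PAGE_SHIFT) & (ZERO_PAGE_COLORS - 1);

	return zero_pages[color];
}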
> I guess it won't make a whole lot of difference but my preference is
> for the previous implementation that always guaranteed huge TLB
> entries whenever possible. That said, I'm fine either way, so if
> somebody has strong reasons for wanting this one, I'd like to hear
> about it.
It's a performance tradeoff, and it can, and should, be measured.
-hpa
On Mon, Oct 01, 2012 at 08:34:28AM -0700, H. Peter Anvin wrote:
> Not quite. The virtual indexing is limited to a few bits (e.g. three
> bits on K8); the right way to deal with that is to color the zero page,
> both the regular one and the virtual one (the virtual one would cycle
> through all the colors repeatedly).
>
> The cache difference, therefore, is *huge*.
Kirill measured the cache benefit and it provided a 6% gain, not very
huge but certainly significant.
> It's a performance tradeoff, and it can, and should, be measured.
I now measured the other side of the trade, by touching only one
character in every 4k page of the range to simulate a very seek-heavy
load, and doing so the physical huge zero page wins by a 600% margin:
if the cache benefit is huge for the virtual zero page, the TLB
benefit is massive for the physical zero page.
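(For reference, a userspace sketch of that kind of access pattern; an
illustration of the method, not the actual test program:)

#include <stdio.h>
#include <sys/mman.h>
#include <time.h>

#define SIZE	(1UL << 30)	/* 1 GB of never-written anonymous memory */
#define STEP	4096UL		/* touch one byte per 4k page */

int main(void)
{
	/* Read faults on anonymous memory get backed by the zero page. */
	volatile char *p = mmap(NULL, SIZE, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	unsigned long i, sum = 0;
	struct timespec t0, t1;

	if (p == MAP_FAILED)
		return 1;
	clock_gettime(CLOCK_MONOTONIC, &t0);
	for (i = 0; i < SIZE; i += STEP)
		sum += p[i];	/* one access per 4k page: TLB-bound */
	clock_gettime(CLOCK_MONOTONIC, &t1);
	printf("sum=%lu time=%.3fs\n", sum,
	       (t1.tv_sec - t0.tv_sec) + (t1.tv_nsec - t0.tv_nsec) / 1e9);
	return 0;
}

With a physical huge zero page each 2M of the range costs a single TLB
entry; with a virtual (4k) zero page every touch needs its own TLB
entry, which is where the 600% margin comes from.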
Overall I think picking the solution that risks regressing the least
(also compared to the current status of no zero page) is the safest.
Thanks!
Andrea
On 10/01/2012 09:31 AM, Andrea Arcangeli wrote:
> On Mon, Oct 01, 2012 at 08:34:28AM -0700, H. Peter Anvin wrote:
>> The cache difference, therefore, is *huge*.
>
> Kirill measured the cache benefit and it provided a 6% gain, not very
> huge but certainly significant.
>
>> It's a performance tradeoff, and it can, and should, be measured.
>
> I now measured the other side of the trade, by touching only one
> character in every 4k page of the range to simulate a very seek-heavy
> load, and doing so the physical huge zero page wins by a 600% margin:
> if the cache benefit is huge for the virtual zero page, the TLB
> benefit is massive for the physical zero page.
>
> Overall I think picking the solution that risks regressing the least
> (also compared to the current status of no zero page) is the safest.
>
Something isn't quite right about that. If you look at your numbers:
1,049,134,961 LLC-loads
6,222 LLC-load-misses
This is another way of saying that in your benchmark the huge zero page
is parked in your LLC, using up 2 MB of it, typically a significant
portion of said cache. In a real-life application that will squeeze out
real data, but in your benchmark the system is artificially quiescent.
It is well known that microbenchmarks can be horribly misleading. What
led Kirill to investigate the huge zero page in the first place was the
fact that some applications/macrobenchmarks benefit, and I think those
are the right thing to look at.
-hpa
On Mon, Oct 01, 2012 at 10:03:53AM -0700, H. Peter Anvin wrote:
> Something isn't quite right about that. If you look at your numbers:
>
> 1,049,134,961 LLC-loads
> 6,222 LLC-load-misses
>
> This is another way of saying that in your benchmark the huge zero page
> is parked in your LLC, using up 2 MB of it, typically a significant
> portion of said cache. In a real-life application that will squeeze out
> real data, but in your benchmark the system is artificially quiescent.
>
> It is well known that microbenchmarks can be horribly misleading. What
> led Kirill to investigate the huge zero page in the first place was the
> fact that some applications/macrobenchmarks benefit, and I think those
> are the right thing to look at.
I think performance is not the first thing we should look at. We need to
choose which implementation is easier to support.
Applications which benefit from the zero page are quite rare. We need to
provide a huge zero page to avoid huge memory consumption with THP.
That's it. Performance optimization for that rare case is overkill.
--
Kirill A. Shutemov
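(To make that failure mode concrete: with THP enabled and no huge zero
page, a read fault on untouched anonymous memory instantiates a real
zeroed 2M page. A hypothetical demonstration, not a test case from the
thread:)

#include <stdio.h>
#include <sys/mman.h>

int main(void)
{
	size_t size = 4UL << 30;	/* 4 GB, never written */
	volatile char *p = mmap(NULL, size, PROT_READ,
				MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	size_t off;
	unsigned long sum = 0;

	if (p == MAP_FAILED)
		return 1;
	/* One read per 2M region; without a huge zero page this can
	 * allocate a real huge page per touch and inflate RSS by ~4 GB. */
	for (off = 0; off < size; off += 2UL << 20)
		sum += p[off];
	printf("sum=%lu, now check VmRSS in /proc/self/status\n", sum);
	return 0;
}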
On Mon, Oct 01, 2012 at 10:03:53AM -0700, H. Peter Anvin wrote:
> Something isn't quite right about that. If you look at your numbers:
>
> 1,049,134,961 LLC-loads
> 6,222 LLC-load-misses
>
> This is another way of saying that in your benchmark the huge zero page
> is parked in your LLC, using up 2 MB of it, typically a significant
> portion of said cache. In a real-life application that will squeeze out
> real data, but in your benchmark the system is artificially quiescent.
Agreed. And that argument applies to the cache benefit of the virtual
zero page too: squeeze the cache a bit more aggressively so those 4k
pages fall out of the cache as well, and that 6% improvement will
disappear. The TLB benefit of the physical zero page, on the other
hand, is guaranteed and always present no matter the workload: even if
the TLB misses at the same frequency, each miss is refilled with one
less cacheline access, because the walk stops at the PMD level instead
of descending to a PTE.
> It is well known that microbenchmarks can be horribly misleading. What
> led Kirill to investigate the huge zero page in the first place was the
> fact that some applications/macrobenchmarks benefit, and I think those
> are the right thing to look at.
The whole point of the two microbenchmarks was to measure the worst
case for each scenario, and I think that was useful. Real-life
workloads using zero pages are going to fall somewhere in that range.
On 10/01/2012 10:26 AM, Andrea Arcangeli wrote:
>
>> It is well known that microbenchmarks can be horribly misleading. What
>> led Kirill to investigate the huge zero page in the first place was the
>> fact that some applications/macrobenchmarks benefit, and I think those
>> are the right thing to look at.
>
> The whole point of the two microbenchmarks was to measure the worst
> case for each scenario, and I think that was useful. Real-life
> workloads using zero pages are going to fall somewhere in that range.
>
... and I think it would be worthwhile to know which effect dominates
(or neither, in which case it doesn't matter).
Overall, I'm okay with either as long as we don't lock down 2 MB when
there isn't a huge zero page in use.
-hpa
On Mon, Oct 01, 2012 at 10:33:12AM -0700, H. Peter Anvin wrote:
> Overall, I'm okay with either as long as we don't lock down 2 MB when
> there isn't a huge zero page in use.
Is a shrinker-reclaimable huge zero page okay for you?
--
Kirill A. Shutemov
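(For context, a rough sketch of what "shrinker-reclaimable" could look
like with the shrinker API of that time; simplified and hypothetical,
not the submitted patch:)

#include <linux/huge_mm.h>
#include <linux/mm.h>
#include <linux/shrinker.h>

static struct page *huge_zero_page;
static atomic_t huge_zero_refcount;

/* Report the huge zero page as reclaimable when only our own reference
 * remains, and free it under memory pressure. Users would take and
 * drop references via get/put helpers not shown here. */
static int shrink_huge_zero_page(struct shrinker *shrink,
				 struct shrink_control *sc)
{
	if (!sc->nr_to_scan)	/* query phase: how much could we free? */
		return atomic_read(&huge_zero_refcount) == 1 ? HPAGE_PMD_NR : 0;

	if (atomic_cmpxchg(&huge_zero_refcount, 1, 0) == 1) {
		struct page *page = xchg(&huge_zero_page, NULL);
		__free_pages(page, HPAGE_PMD_ORDER);
	}
	return 0;
}

static struct shrinker huge_zero_page_shrinker = {
	.shrink = shrink_huge_zero_page,
	.seeks = DEFAULT_SEEKS,
};
/* registered once at init time with register_shrinker() */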
On 10/01/2012 10:36 AM, Kirill A. Shutemov wrote:
> On Mon, Oct 01, 2012 at 10:33:12AM -0700, H. Peter Anvin wrote:
>> Overall, I'm okay with either as long as we don't lock down 2 MB when
>> there isn't a huge zero page in use.
>
> Is a shrinker-reclaimable huge zero page okay for you?
>
Yes, I'm fine with that. However, I'm curious about the relative
benefit versus virtual hzp from a performance perspective, on an
application where hzp actually matters.
One can otherwise argue that if hzp doesn't matter except in a small
number of cases, we shouldn't use it at all.
-hpa
On Mon, Oct 01, 2012 at 10:37:23AM -0700, H. Peter Anvin wrote:
> One can otherwise argue that if hzp doesn't matter except in a small
> number of cases, we shouldn't use it at all.
That small number of cases can easily trigger OOM if THP is enabled. :)
--
Kirill A. Shutemov
On 10/01/2012 10:44 AM, Kirill A. Shutemov wrote:
> On Mon, Oct 01, 2012 at 10:37:23AM -0700, H. Peter Anvin wrote:
>> One can otherwise argue that if hzp doesn't matter except in a small
>> number of cases, we shouldn't use it at all.
>
> That small number of cases can easily trigger OOM if THP is enabled. :)
>
And that doesn't happen in any conditions that *aren't* helped by hzp?
-hpa
On Mon, Oct 01, 2012 at 10:33:12AM -0700, H. Peter Anvin wrote:
> ... and I think it would be worthwhile to know which effect dominates
> (or neither, in which case it doesn't matter).
>
> Overall, I'm okay with either as long as we don't lock down 2 MB when
> there isn't a huge zero page in use.
Same here.
I agree the cmpxchg idea to free the 2M zero page was a very nice
addition to the physical zero page patchset.
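(The cmpxchg scheme, roughly: allocate the 2M zero page on first use
and let racing allocators back off; a simplified sketch, not the patch
itself:)

#include <linux/gfp.h>
#include <linux/huge_mm.h>
#include <linux/mm.h>

static struct page *huge_zero_page;

static struct page *get_huge_zero_page(void)
{
	struct page *zero_page, *old;

	if (huge_zero_page)
		return huge_zero_page;

	zero_page = alloc_pages(GFP_TRANSHUGE | __GFP_ZERO, HPAGE_PMD_ORDER);
	if (!zero_page)
		return NULL;

	/* Whoever wins the cmpxchg installs the page; the loser frees
	 * its copy and reuses the winner's. Pairing this with a
	 * refcount is what lets a shrinker free the page again later. */
	old = cmpxchg(&huge_zero_page, NULL, zero_page);
	if (old) {
		__free_pages(zero_page, HPAGE_PMD_ORDER);
		return old;
	}
	return zero_page;
}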
On Mon, Oct 01, 2012 at 08:15:19PM +0300, Kirill A. Shutemov wrote:
> I think performance is not the first thing we should look at. We need to
> choose which implementation is easier to support.
Having to introduce a special pmd bitflag requiring architectural
support actually makes it less self-contained. The zero page support
is made optional of course, but the physical zero page would have
worked without the arch noticing.
> Applications which benefit from the zero page are quite rare. We need to
> provide a huge zero page to avoid huge memory consumption with THP.
> That's it. Performance optimization for that rare case is overkill.
I still don't like the idea of some rare app potentially running
significantly slower (and we may not be notified because it's not a
breakage; if they're simulations it's hard to tell whether they're
slower because of different input or because of the zero page being
introduced). If we knew for sure that zero page accesses were always
rare I wouldn't care, of course. But rare app != rare access.
The physical zero page patchset is certainly bigger, but it was mostly
localized in huge_memory.c, so I don't see it as very intrusive even
if bigger.
Anyway, if others see the virtual zero page as easier to maintain, I'm
fine either way.
On Mon, Oct 01, 2012 at 10:52:06AM -0700, H. Peter Anvin wrote:
> On 10/01/2012 10:44 AM, Kirill A. Shutemov wrote:
> > On Mon, Oct 01, 2012 at 10:37:23AM -0700, H. Peter Anvin wrote:
> >> One can otherwise argue that if hzp doesn't matter except in a small
> >> number of cases, we shouldn't use it at all.
> >
> > That small number of cases can easily trigger OOM if THP is enabled. :)
> >
>
> And that doesn't happen in any conditions that *aren't* helped by hzp?
Sure, OOM can still happen.
But if we can eliminate a class of problem, why not do so?
--
Kirill A. Shutemov