LinuxLists.cc - Re: [benchmark] 1% performance overhead of paravirt

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Fri, 5 Jun 2009, Rusty Russell wrote:
>
> Distributions don't ship UP kernels any more; this shows what that costs if
> you're actually on a UP box. If we really don't care, perhaps we should make
> CONFIG_SMP=n an option under EMBEDDED for x86. And we can rip out the complex
> patching SMP patching stuff too.

The complex SMP patching is what makes it _possible_ to not ship UP
kernels any more.

The SMP overhead exists, but it would be even higher if we didn't patch
things to remove the "lock" prefix.

Linus

2009-06-06 19:00:35

by Anders K. Pedersen

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

Dave McCracken wrote:
> What I see as the message of his benchmark is if you care about performance
> you should be customizing your kernel anyway. Distro kernels are slow. An
> option that makes the distro kernel a bit slower is no big deal since anyone
> who wants speed should already be rebuilding their kernel.

And Oracle of course supports customers doing that?

Not in my experience, and the same goes for most other commercial
enterprise software on Linux as well, so customers have to stick to
distro kernels, if they want support.

Regards,
Anders K. Pedersen

2009-06-07 00:53:20

by Rusty Russell

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Sat, 6 Jun 2009 12:24:43 am Linus Torvalds wrote:
> On Fri, 5 Jun 2009, Rusty Russell wrote:
> > Distributions don't ship UP kernels any more; this shows what that costs
> > if you're actually on a UP box. If we really don't care, perhaps we
> > should make CONFIG_SMP=n an option under EMBEDDED for x86. And we can
> > rip out the complex patching SMP patching stuff too.
>
> The complex SMP patching is what makes it _possible_ to not ship UP
> kernels any more.

"possible"? You mean "acceptable". Gray, not black and white.

1) Where's the line?
2) Where are we? Does patching claw back 5% of the loss? 50%? 90%?

No point benchmarking on my (SMP) laptop for this one. Gerd cc'd, maybe he
has benchmarks from when he did the work originally?

Thanks,
Rusty.

2009-06-08 14:55:10

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Sun, 7 Jun 2009, Rusty Russell wrote:
>
> "possible"? You mean "acceptable". Gray, not black and white.

I don't think we can possibly claim to support UP configurations if we
don't patch.

> 1) Where's the line?

"As good as we can make it". There is no line. There's "your code sucks so
badly that it needs to get fixed, or we'll rip it out or disable it".

> 2) Where are we? Does patching claw back 5% of the loss? 50%? 90%?

On some things, especially on P4, the lock overhead was tens of percent.
Just a single locked instruction takes closer to two hundred instructions.

Of course, on those P4's, just kernel entry/exit is pretty high too (even
with sysenter/exit), so I doubt you'll ever see something be 90% just
because of that, unless it causes extra IO or other non-CPU issues.

Linus

2009-06-09 09:39:32

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Thu, Jun 04, 2009 at 08:02:14AM -0700, Linus Torvalds wrote:
>
>
> On Thu, 4 Jun 2009, Rusty Russell wrote:
> > >
> > > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > > can't compare it to a no-highmem case).
> >
> > Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
> > unreasonable for a distro tho, so I turned that on instead.
>
> Well, I agree that HIGHMEM4G is a _reasonable_ thing to turn on.
>
> The thing I disagree with is that it's at all valid to then compare to
> some all-software feature thing. HIGHMEM doesn't expand any esoteric
> capability that some people might use - it's about regular RAM for regular
> users.
>
> And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> hated having to merge it, and I still hate it. It's a stupid, ugly, and
> very invasive config option. It's just that it's there to support a
> stupid, ugly and very annoying fundamental hardware problem.

I was looking forward to be able to get rid of it... unfortunately
other 32-bit architectures are starting to use it again :(

I guess it is not incredibly intrusive for generic mm code. A bit
of kmap sprinkled around which is actually quite a useful delimiter
of where pagecache is addressed via its kernel mapping.

Do you hate more the x86 code? Maybe that can be removed?

2009-06-09 11:18:18

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

* Nick Piggin <[email protected]> wrote:

> On Thu, Jun 04, 2009 at 08:02:14AM -0700, Linus Torvalds wrote:
> >
> >
> > On Thu, 4 Jun 2009, Rusty Russell wrote:
> > > >
> > > > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > > > can't compare it to a no-highmem case).
> > >
> > > Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
> > > unreasonable for a distro tho, so I turned that on instead.
> >
> > Well, I agree that HIGHMEM4G is a _reasonable_ thing to turn on.
> >
> > The thing I disagree with is that it's at all valid to then compare to
> > some all-software feature thing. HIGHMEM doesn't expand any esoteric
> > capability that some people might use - it's about regular RAM for regular
> > users.
> >
> > And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> > hated having to merge it, and I still hate it. It's a stupid, ugly, and
> > very invasive config option. It's just that it's there to support a
> > stupid, ugly and very annoying fundamental hardware problem.
>
> I was looking forward to be able to get rid of it... unfortunately
> other 32-bit architectures are starting to use it again :(
>
> I guess it is not incredibly intrusive for generic mm code. A bit
> of kmap sprinkled around which is actually quite a useful
> delimiter of where pagecache is addressed via its kernel mapping.
>
> Do you hate more the x86 code? Maybe that can be removed?

IMHO what hurts most about highmem isnt even its direct source code
overhead, but three factors:

- The buddy allocator allocates top down, with highmem pages first.
So a lot of critical apps (the first ones started) will have
highmem footprint, and that shows up every time they use it for
file IO or other ops. kmap() overhead and more.

- Highmem is not really a 'solvable' problem in terms of good VM
balancing. It gives conflicting constraints and there's no single
'good VM' that can really work - just a handful of bad solutions
that differ in their level and area of suckiness.

- The kmap() cache itself can be depleted, and using atomic kmaps
is fragile and error-prone. I think we still have a FIXME of a
possibly triggerable deadlock somewhere in the core MM code ...

OTOH, highmem is clearly a useful hardware enablement feature with a
slowly receding upside and a constant downside. The outcome is
clear: when a critical threshold is reached distros will stop
enabling it. (or more likely, there will be pure 64-bit x86 distros)

Highmem simply enables a sucky piece of hardware so the code itself
has an intrinsic level of suckage, so to speak. There's not much to
be done about it but it's not a _big_ problem either: this type of
hw is moving fast out of the distro attention span.

( What scares/worries me much more than sucky hardware is sucky
_software_ ABIs. Those have a half-life measured not in years but
in decades and they get put into new products stubbornly, again
and again. There's no Moore's Law getting rid of sucky software
really and unlike the present set of sucky highmem hardware
there's no influx of cosmic particles chipping away on their
installed base either. )

Ingo

2009-06-09 12:11:14

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, Jun 09, 2009 at 01:17:19PM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > On Thu, Jun 04, 2009 at 08:02:14AM -0700, Linus Torvalds wrote:
> > >
> > >
> > > On Thu, 4 Jun 2009, Rusty Russell wrote:
> > > > >
> > > > > Turn off HIGHMEM64G, please (and HIGHMEM4G too, for that matter - you
> > > > > can't compare it to a no-highmem case).
> > > >
> > > > Thanks, your point is demonstrated below. I don't think HIGHMEM4G is
> > > > unreasonable for a distro tho, so I turned that on instead.
> > >
> > > Well, I agree that HIGHMEM4G is a _reasonable_ thing to turn on.
> > >
> > > The thing I disagree with is that it's at all valid to then compare to
> > > some all-software feature thing. HIGHMEM doesn't expand any esoteric
> > > capability that some people might use - it's about regular RAM for regular
> > > users.
> > >
> > > And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> > > hated having to merge it, and I still hate it. It's a stupid, ugly, and
> > > very invasive config option. It's just that it's there to support a
> > > stupid, ugly and very annoying fundamental hardware problem.
> >
> > I was looking forward to be able to get rid of it... unfortunately
> > other 32-bit architectures are starting to use it again :(
> >
> > I guess it is not incredibly intrusive for generic mm code. A bit
> > of kmap sprinkled around which is actually quite a useful
> > delimiter of where pagecache is addressed via its kernel mapping.
> >
> > Do you hate more the x86 code? Maybe that can be removed?
>
> IMHO what hurts most about highmem isnt even its direct source code
> overhead, but three factors:
>
> - The buddy allocator allocates top down, with highmem pages first.
> So a lot of critical apps (the first ones started) will have
> highmem footprint, and that shows up every time they use it for
> file IO or other ops. kmap() overhead and more.

Yeah this really sucks about it. OTOH, we have basically the same
thing today with NUMA allocations and task placement.

> - Highmem is not really a 'solvable' problem in terms of good VM
> balancing. It gives conflicting constraints and there's no single
> 'good VM' that can really work - just a handful of bad solutions
> that differ in their level and area of suckiness.

But we have other zones too. And you also run into similar (and
in some senses harder) choices with NUMA as well.

> - The kmap() cache itself can be depleted,

Yeah, the rule is not allowed to do 2 nested ones.

> and using atomic kmaps
> is fragile and error-prone. I think we still have a FIXME of a
> possibly triggerable deadlock somewhere in the core MM code ...

Not that I know of. I fixed the last long standing known one
with the write_begin/write_end changes a year or two ago. It
wasn't exactly related to kmap of the pagecache (but page fault
of the user address in copy_from_user).

> OTOH, highmem is clearly a useful hardware enablement feature with a
> slowly receding upside and a constant downside. The outcome is
> clear: when a critical threshold is reached distros will stop
> enabling it. (or more likely, there will be pure 64-bit x86 distros)

Well now lots of embedded type archs are enabling it... So the
upside is slowly increasing again I think.

> Highmem simply enables a sucky piece of hardware so the code itself
> has an intrinsic level of suckage, so to speak. There's not much to
> be done about it but it's not a _big_ problem either: this type of
> hw is moving fast out of the distro attention span.

Yes but Linus really hated the code. I wonder whether it is
generic code or x86 specific. OTOH with x86 you'd probably
still have to support different page table formats, at least,
so you couldn't rip it all out.

2009-06-09 12:26:20

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

* Nick Piggin <[email protected]> wrote:

> > and using atomic kmaps
> > is fragile and error-prone. I think we still have a FIXME of a
> > possibly triggerable deadlock somewhere in the core MM code ...
>
> Not that I know of. I fixed the last long standing known one with
> the write_begin/write_end changes a year or two ago. It wasn't
> exactly related to kmap of the pagecache (but page fault of the
> user address in copy_from_user).

> > OTOH, highmem is clearly a useful hardware enablement feature
> > with a slowly receding upside and a constant downside. The
> > outcome is clear: when a critical threshold is reached distros
> > will stop enabling it. (or more likely, there will be pure
> > 64-bit x86 distros)
>
> Well now lots of embedded type archs are enabling it... So the
> upside is slowly increasing again I think.

Sure - but the question is always how often does it show up on lkml?
Less and less. There might be a lot of embedded Linux products sold,
but their users are not reporting bugs to us and are not sending
patches to us in the proportion of their apparent usage.

And on lkml there's a clear downtick in highmem relevance.

> > Highmem simply enables a sucky piece of hardware so the code
> > itself has an intrinsic level of suckage, so to speak. There's
> > not much to be done about it but it's not a _big_ problem
> > either: this type of hw is moving fast out of the distro
> > attention span.
>
> Yes but Linus really hated the code. I wonder whether it is
> generic code or x86 specific. OTOH with x86 you'd probably still
> have to support different page table formats, at least, so you
> couldn't rip it all out.

In practice the pte format hurts the VM more than just highmem. (the
two are inseparably connected of course)

I did this fork overhead measurement some time ago, using
perfcounters and 'perf':

Performance counter stats for './fork':

32-bit 32-bit-PAE 64-bit
--------- ---------- ---------
27.367537 30.660090 31.542003 task clock ticks (msecs)

5785 5810 5751 pagefaults (events)
389 388 388 context switches (events)
4 4 4 CPU migrations (events)
--------- ---------- ---------
+12.0% +15.2% overhead

So PAE is 12.0% slower (the overhead of double the pte size and
three page table levels), and 64-bit is 15.2% slower (the extra
overhead of having four page table levels added to the overhead of
double the pte size). [the pagefault count noise is well below the
systematic performance difference.]

Fork is pretty much the worst-case measurement for larger pte
overhead, as it has to copy around a lot of pagetables.

Larger ptes do not come for free and the 64-bit instructions do not
mitigate the cachemiss overhead and memory bandwidth cost.

Ingo

2009-06-09 12:42:14

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, Jun 09, 2009 at 02:25:29PM +0200, Ingo Molnar wrote:
>
> * Nick Piggin <[email protected]> wrote:
>
> > > and using atomic kmaps
> > > is fragile and error-prone. I think we still have a FIXME of a
> > > possibly triggerable deadlock somewhere in the core MM code ...
> >
> > Not that I know of. I fixed the last long standing known one with
> > the write_begin/write_end changes a year or two ago. It wasn't
> > exactly related to kmap of the pagecache (but page fault of the
> > user address in copy_from_user).
>
> > > OTOH, highmem is clearly a useful hardware enablement feature
> > > with a slowly receding upside and a constant downside. The
> > > outcome is clear: when a critical threshold is reached distros
> > > will stop enabling it. (or more likely, there will be pure
> > > 64-bit x86 distros)
> >
> > Well now lots of embedded type archs are enabling it... So the
> > upside is slowly increasing again I think.
>
> Sure - but the question is always how often does it show up on lkml?
> Less and less. There might be a lot of embedded Linux products sold,
> but their users are not reporting bugs to us and are not sending
> patches to us in the proportion of their apparent usage.
>
> And on lkml there's a clear downtick in highmem relevance.

Definitely. Probably it works *reasonably* well enough in the
end that embedded systems with reasonable highmem:lowmem ratio
probably will work OK. Sadly for them in a year or two they
probably get the full burden of carrying the crap ;)

> > > Highmem simply enables a sucky piece of hardware so the code
> > > itself has an intrinsic level of suckage, so to speak. There's
> > > not much to be done about it but it's not a _big_ problem
> > > either: this type of hw is moving fast out of the distro
> > > attention span.
> >
> > Yes but Linus really hated the code. I wonder whether it is
> > generic code or x86 specific. OTOH with x86 you'd probably still
> > have to support different page table formats, at least, so you
> > couldn't rip it all out.
>
> In practice the pte format hurts the VM more than just highmem. (the
> two are inseparably connected of course)
>
> I did this fork overhead measurement some time ago, using
> perfcounters and 'perf':
>
> Performance counter stats for './fork':
>
> 32-bit 32-bit-PAE 64-bit
> --------- ---------- ---------
> 27.367537 30.660090 31.542003 task clock ticks (msecs)
>
> 5785 5810 5751 pagefaults (events)
> 389 388 388 context switches (events)
> 4 4 4 CPU migrations (events)
> --------- ---------- ---------
> +12.0% +15.2% overhead
>
> So PAE is 12.0% slower (the overhead of double the pte size and
> three page table levels), and 64-bit is 15.2% slower (the extra
> overhead of having four page table levels added to the overhead of
> double the pte size). [the pagefault count noise is well below the
> systematic performance difference.]
>
> Fork is pretty much the worst-case measurement for larger pte
> overhead, as it has to copy around a lot of pagetables.
>
> Larger ptes do not come for free and the 64-bit instructions do not
> mitigate the cachemiss overhead and memory bandwidth cost.

No question about that... but you probably can't get rid of that
because somebody will cry about NX bit, won't they?

2009-06-09 12:57:32

by Avi Kivity

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

Ingo Molnar wrote:
> Fork is pretty much the worst-case measurement for larger pte
> overhead, as it has to copy around a lot of pagetables.
>

We could eliminate that if we use the R/W bit on pgd entries. fork()
would be 256 clear_bit()s (1536 and 768 on i386 pae and nonpae).

copy_one_pte() disagrees though:

if (unlikely(!pte_present(pte))) {
if (!pte_file(pte)) {
swp_entry_t entry = pte_to_swp_entry(pte);

swap_duplicate(entry);
/* make sure dst_mm is on swapoff's mmlist. */
if (unlikely(list_empty(&dst_mm->mmlist))) {
spin_lock(&mmlist_lock);
if (list_empty(&dst_mm->mmlist))
list_add(&dst_mm->mmlist,
&src_mm->mmlist);
spin_unlock(&mmlist_lock);
}
if (is_write_migration_entry(entry) &&
is_cow_mapping(vm_flags)) {
/*
* COW mappings require pages in both parent
* and child to be set to read.
*/
make_migration_entry_read(&entry);
pte = swp_entry_to_pte(entry);
set_pte_at(src_mm, addr, src_pte, pte);
}
}
goto out_set_pte;
}

Not sure how we can enlaze this thing.

--
error compiling committee.c: too many arguments to function

2009-06-09 14:55:26

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> > hated having to merge it, and I still hate it. It's a stupid, ugly, and
> > very invasive config option. It's just that it's there to support a
> > stupid, ugly and very annoying fundamental hardware problem.
>
> I was looking forward to be able to get rid of it... unfortunately
> other 32-bit architectures are starting to use it again :(

.. and 32-bit x86 is still not dead, and there are still people who use it
with more than 1G of RAM (ie it's not like it's just purely a "small
embedded cell-phones with Atom" kind of thing that Intel seems to be
pushing for eventually).

> I guess it is not incredibly intrusive for generic mm code. A bit
> of kmap sprinkled around which is actually quite a useful delimiter
> of where pagecache is addressed via its kernel mapping.
>
> Do you hate more the x86 code? Maybe that can be removed?

No, we can't remove the x86 code, and quite frankly, I don't even mind
that. The part I mind is actually the sprinkling of kmap all over. Do a
"git grep kmap fs", and you'll see that there are four times as many
kmap's in filesystem code than there are in mm/.

I was benchmarking btrfs on my little EeePC. There, kmap overhead was 25%
of file access time. Part of it is that people have been taught to use
"kmap_atomic()", which is usable under spinlocks and people have been told
that it's "fast". It's not fast. The whole TLB thing is slow as hell.

Oh well. It's sad. But we can't get rid of it.

Linus

2009-06-09 14:58:17

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

* Linus Torvalds <[email protected]> wrote:

> I was benchmarking btrfs on my little EeePC. There, kmap overhead
> was 25% of file access time. Part of it is that people have been
> taught to use "kmap_atomic()", which is usable under spinlocks and
> people have been told that it's "fast". It's not fast. The whole
> TLB thing is slow as hell.

yeah. I noticed it some time ago that INVLPG is unreasonably slow.

My theory is that in the CPU it's perhaps a loop (in microcode?)
over _all_ TLBs - so as TLB caches get larger, INVLPG gets slower
and slower ...

Ingo

2009-06-09 15:08:28

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, Nick Piggin wrote:

> On Tue, Jun 09, 2009 at 01:17:19PM +0200, Ingo Molnar wrote:
> >
> > - The buddy allocator allocates top down, with highmem pages first.
> > So a lot of critical apps (the first ones started) will have
> > highmem footprint, and that shows up every time they use it for
> > file IO or other ops. kmap() overhead and more.
>
> Yeah this really sucks about it. OTOH, we have basically the same
> thing today with NUMA allocations and task placement.

It's not the buddy allocator. Each zone has it's own buddy list.

It's that we do the zones in order, and always start with the HIGHMEM
zone.

Which is quite reasonablefor most loads (if the page is only used as a
user mapping, we won't kmap it all that often), but it's bad for things
where we will actually want to touch it over and over again. Notably
filesystem caches that aren't just for user mappings.

> > Highmem simply enables a sucky piece of hardware so the code itself
> > has an intrinsic level of suckage, so to speak. There's not much to
> > be done about it but it's not a _big_ problem either: this type of
> > hw is moving fast out of the distro attention span.
>
> Yes but Linus really hated the code. I wonder whether it is
> generic code or x86 specific. OTOH with x86 you'd probably
> still have to support different page table formats, at least,
> so you couldn't rip it all out.

The arch-specific code really isn't that nasty. We have some silly
workarouds for doing 8-byte-at-a-time operations on x86-32 with cmpxchg8b
etc, but those are just odd small details.

If highmem was just a matter of arch details, I wouldn't mind it at all.

It's the generic code pollution I find annoying. It really does pollute a
lot of crap. Not just fs/ and mm/, but even drivers.

Linus

2009-06-09 15:20:16

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, Ingo Molnar wrote:
>
> In practice the pte format hurts the VM more than just highmem. (the
> two are inseparably connected of course)

I think PAE is a separate issue (ie I think HIGHMEM4G and HIGHMEM64G are
about different issues).

I do think we could probably drop PAE some day - very few 32-bit x86's
have more than 4GB of memory, and the ones that did buy lots of memory
back when it was a big deal for them have hopefully upgraded long since.

Of course, PAE also adds the NX flag etc, so there are probably other
reasons to have it. And qutie frankly, PAE is just a small x86-specific
detail that doesn't hurt anybody else.

So I have no reason to really dislike PAE per se - the real dislike is for
HIGHMEM itself, and that gets enabled already for HIGHMEM4G without any
PAE.

Of course, I'd also not ever enable it on any machine I have. PAE does add
overhead, and the NX bit isn't _that_ important to me.

Linus

2009-06-09 15:38:55

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, Jun 09, 2009 at 07:54:00AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> > >
> > > And don't get me wrong - I don't like HIGHMEM. I detest the damn thing. I
> > > hated having to merge it, and I still hate it. It's a stupid, ugly, and
> > > very invasive config option. It's just that it's there to support a
> > > stupid, ugly and very annoying fundamental hardware problem.
> >
> > I was looking forward to be able to get rid of it... unfortunately
> > other 32-bit architectures are starting to use it again :(
>
> .. and 32-bit x86 is still not dead, and there are still people who use it
> with more than 1G of RAM (ie it's not like it's just purely a "small
> embedded cell-phones with Atom" kind of thing that Intel seems to be
> pushing for eventually).
>
> > I guess it is not incredibly intrusive for generic mm code. A bit
> > of kmap sprinkled around which is actually quite a useful delimiter
> > of where pagecache is addressed via its kernel mapping.
> >
> > Do you hate more the x86 code? Maybe that can be removed?
>
> No, we can't remove the x86 code, and quite frankly, I don't even mind
> that. The part I mind is actually the sprinkling of kmap all over. Do a
> "git grep kmap fs", and you'll see that there are four times as many
> kmap's in filesystem code than there are in mm/.

Yeah, I guess I just don't see it as such a bad thing. As I said,
it's nice to have something to grep for and not have pointers into
pagecache stored around the place (although filesystems do that
with buffercache).

If code has to jump through particular nasty hoops to use atomic
kmaps, that's not such a good thing...

> I was benchmarking btrfs on my little EeePC. There, kmap overhead was 25%
> of file access time. Part of it is that people have been taught to use
> "kmap_atomic()", which is usable under spinlocks and people have been told
> that it's "fast". It's not fast. The whole TLB thing is slow as hell.
>
> Oh well. It's sad. But we can't get rid of it.

If it's such a problem, it could be made a lot faster without too
much problem. You could just introduce a FIFO of ptes behind it
and flush them all in one go. 4K worth of ptes per CPU might
hopefully bring your overhead down to < 1%.

2009-06-09 15:56:49

by Avi Kivity

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

Ingo Molnar wrote:
> * Linus Torvalds <[email protected]> wrote:
>
>
>> I was benchmarking btrfs on my little EeePC. There, kmap overhead
>> was 25% of file access time. Part of it is that people have been
>> taught to use "kmap_atomic()", which is usable under spinlocks and
>> people have been told that it's "fast". It's not fast. The whole
>> TLB thing is slow as hell.
>>
>
> yeah. I noticed it some time ago that INVLPG is unreasonably slow.
>
> My theory is that in the CPU it's perhaps a loop (in microcode?)
> over _all_ TLBs - so as TLB caches get larger, INVLPG gets slower
> and slower ...
>

The tlb already content-addresses entries when looking up translations,
so it shouldn't be that bad.

invlpg does have to invalidate all the intermediate entries
("paging-structure caches"), and it does (obviously) force a tlb reload.

I seem to recall 50 cycles for invlpg, what do you characterize as
unreasonably slow?

--
error compiling committee.c: too many arguments to function

2009-06-09 16:02:21

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, Nick Piggin wrote:
>
> If it's such a problem, it could be made a lot faster without too
> much problem. You could just introduce a FIFO of ptes behind it
> and flush them all in one go. 4K worth of ptes per CPU might
> hopefully bring your overhead down to < 1%.

We already have that. The regular kmap() does that. It's just not usable
in atomic context.

We'd need to fix the locking: right now kmap_high() uses non-irq-safe
locks, and it does that whole cross-cpu flushing thing (which is why
those locks _have_ to be non-irq-safe.

The way to fix that, though, would be to never do any cross-cpu calls, and
instead just have a cpumask saying "you need to flush before you do
anything with kmap". So you'd just set that cpumask inside the lock, and
if/when some other CPU does a kmap, they'd flush their local TLB at _that_
point instead of having to have an IPI call.

If we can get rid of kmap_atomic(), I'd already like HIGHMEM more. Right
now I absolutely _hate_ all the different "levels" of kmap_atomic() and
having to be careful about crazy nesting rules etc.

Linus

2009-06-09 16:21:32

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, Jun 09, 2009 at 09:00:08AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > If it's such a problem, it could be made a lot faster without too
> > much problem. You could just introduce a FIFO of ptes behind it
> > and flush them all in one go. 4K worth of ptes per CPU might
> > hopefully bring your overhead down to < 1%.
>
> We already have that. The regular kmap() does that. It's just not usable
> in atomic context.

Well this would be more like the kmap cache idea rather than the
kmap_atomic FIFO (which would remain per-cpu and look much like
the existing kmap_atomic).

> We'd need to fix the locking: right now kmap_high() uses non-irq-safe
> locks, and it does that whole cross-cpu flushing thing (which is why
> those locks _have_ to be non-irq-safe.
>
> The way to fix that, though, would be to never do any cross-cpu calls, and
> instead just have a cpumask saying "you need to flush before you do
> anything with kmap". So you'd just set that cpumask inside the lock, and
> if/when some other CPU does a kmap, they'd flush their local TLB at _that_
> point instead of having to have an IPI call.

The idea seems nice but isn't the problem that kmap gives back a
basically 1st class kernel virtual memory? (ie. it can then be used
by any other CPU at any point without it having to use kmap?).

2009-06-09 16:27:51

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, Nick Piggin wrote:
>
> The idea seems nice but isn't the problem that kmap gives back a
> basically 1st class kernel virtual memory? (ie. it can then be used
> by any other CPU at any point without it having to use kmap?).

No, everybody has to use kmap()/kunmap().

The "problem" is that you could in theory run out of kmap frames, since if
everybody does a kmap() in an interruptible context and you have lots and
lots of threads doing different pages, you'd run out. But that has nothing
to do with kmap_atomic(), which is basically limited to just the number of
CPU's and a (very small) level of nesting.

Linus

2009-06-09 16:45:28

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, Jun 09, 2009 at 09:26:47AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > The idea seems nice but isn't the problem that kmap gives back a
> > basically 1st class kernel virtual memory? (ie. it can then be used
> > by any other CPU at any point without it having to use kmap?).
>
> No, everybody has to use kmap()/kunmap().

So it is strictly a bug to expose a pointer returned by kmap to
another CPU? That would make it easier, although it would need
to remove the global bit I think so when one task migrates CPUs
then the entry will be flushed and reloaded properly.

> The "problem" is that you could in theory run out of kmap frames, since if
> everybody does a kmap() in an interruptible context and you have lots and
> lots of threads doing different pages, you'd run out. But that has nothing
> to do with kmap_atomic(), which is basically limited to just the number of
> CPU's and a (very small) level of nesting.

This could be avoided with an anti-deadlock pool. If a task
attempts a nested kmap and already holds a kmap, then give it
exclusive access to this pool until it releases its last
nested kmap.

2009-06-09 17:10:19

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, Nick Piggin wrote:

> On Tue, Jun 09, 2009 at 09:26:47AM -0700, Linus Torvalds wrote:
> >
> >
> > On Tue, 9 Jun 2009, Nick Piggin wrote:
> > >
> > > The idea seems nice but isn't the problem that kmap gives back a
> > > basically 1st class kernel virtual memory? (ie. it can then be used
> > > by any other CPU at any point without it having to use kmap?).
> >
> > No, everybody has to use kmap()/kunmap().
>
> So it is strictly a bug to expose a pointer returned by kmap to
> another CPU?

No, not at all. The pointers are all global. They have to be, since the
original kmap() user may well be scheduled away.

> > The "problem" is that you could in theory run out of kmap frames, since if
> > everybody does a kmap() in an interruptible context and you have lots and
> > lots of threads doing different pages, you'd run out. But that has nothing
> > to do with kmap_atomic(), which is basically limited to just the number of
> > CPU's and a (very small) level of nesting.
>
> This could be avoided with an anti-deadlock pool. If a task
> attempts a nested kmap and already holds a kmap, then give it
> exclusive access to this pool until it releases its last
> nested kmap.

We just sleep, waiting for somebody to release their. Again, that
obviously won't work in atomic context, but it's easy enough to just have
a "we need to have a few entries free" for the atomic case, and make it
busy-loop if it runs out (which is not going to happen in practice
anyway).

Linus

2009-06-09 17:59:55

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

Ingo Molnar wrote:
>
> OTOH, highmem is clearly a useful hardware enablement feature with a
> slowly receding upside and a constant downside. The outcome is
> clear: when a critical threshold is reached distros will stop
> enabling it. (or more likely, there will be pure 64-bit x86 distros)
>

A major problem is that distros don't seem to be willing to push 64-bit
kernels for 32-bit distros. There are a number of good (and
not-so-good) reasons why users may want to run a 32-bit userspace, but
not running a 64-bit kernel on capable hardware is just problematic.

-hpa

--
H. Peter Anvin, Intel Open Source Technology Center
I work for Intel. I don't speak on their behalf.

2009-06-09 18:07:22

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, H. Peter Anvin wrote:
>
> A major problem is that distros don't seem to be willing to push 64-bit
> kernels for 32-bit distros. There are a number of good (and
> not-so-good) reasons why users may want to run a 32-bit userspace, but
> not running a 64-bit kernel on capable hardware is just problematic.

Yeah, that's just stupid. A 64-bit kernel should work well with 32-bit
tools, and while we've occasionally had compat issues (the intel gfx
people used to claim that they needed to work with a 32-bit kernel because
they cared about 32-bit tools), they aren't unfixable or even all _that_
common.

And they'd be even less common if the whole "64-bit kernel even if you do
a 32-bit distro" was more common.

The nice thing about a 64-bit kernel is that you should be able to build
one even if you don't in general have all the 64-bit libraries. So you
don't need a full 64-bit development environment, you just need a compiler
that can generate code for both (and that should be the default on x86
these days).

Linus

2009-06-09 18:08:55

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 9 Jun 2009, Linus Torvalds wrote:
>
> And they'd be even less common if the whole "64-bit kernel even if you do
> a 32-bit distro" was more common.

Side note: intel is to blame too. I think several Atom versions were
shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
artifically crippled to just 32-bit mode.

Linus

2009-06-09 22:50:06

by Matthew Garrett

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, Jun 09, 2009 at 11:07:41AM -0700, Linus Torvalds wrote:

> Side note: intel is to blame too. I think several Atom versions were
> shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
> artifically crippled to just 32-bit mode.

And some people still want to run dosemu so they can drive their
godforsaken 80s era PIO driven data analyzer. It'd be nice to think that
nobody used vm86, but they always seem to pop out of the woodwork
whenever someone suggests 64-bit kernels by default.

--
Matthew Garrett | [email protected]

2009-06-09 23:02:18

by H. Peter Anvin

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

Matthew Garrett wrote:
> On Tue, Jun 09, 2009 at 11:07:41AM -0700, Linus Torvalds wrote:
>
>> Side note: intel is to blame too. I think several Atom versions were
>> shipped with 64-bit mode disabled. So even "modern" CPU's are sometimes
>> artifically crippled to just 32-bit mode.
>
> And some people still want to run dosemu so they can drive their
> godforsaken 80s era PIO driven data analyzer. It'd be nice to think that
> nobody used vm86, but they always seem to pop out of the woodwork
> whenever someone suggests 64-bit kernels by default.
>

There is both KVM and Qemu as alternatives, though. The godforsaken
80s-era PIO driven data analyzer will run fine in Qemu even on
non-HVM-capable hardware if it's 64-bit capable. Most of the time it'll
spend sitting in PIO no matter what you do.

-hpa

2009-06-10 00:04:14

by Paul Mackerras

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

Ingo Molnar writes:

> I did this fork overhead measurement some time ago, using
> perfcounters and 'perf':

Could you post the program? I'd like to try it on some systems here.

Paul.

2009-06-10 01:26:48

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

* Paul Mackerras <[email protected]> wrote:

> Ingo Molnar writes:
>
> > I did this fork overhead measurement some time ago, using
> > perfcounters and 'perf':
>
> Could you post the program? I'd like to try it on some systems
> here.

I still have it, it was something really, really simple and silly:

int main(void)
{
int i;

for (i = 0; i < 8; i++)
if (!fork())
wait(0);
}

Ingo

2009-06-10 05:53:27

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, Jun 09, 2009 at 10:08:53AM -0700, Linus Torvalds wrote:
>
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
>
> > On Tue, Jun 09, 2009 at 09:26:47AM -0700, Linus Torvalds wrote:
> > >
> > >
> > > On Tue, 9 Jun 2009, Nick Piggin wrote:
> > > >
> > > > The idea seems nice but isn't the problem that kmap gives back a
> > > > basically 1st class kernel virtual memory? (ie. it can then be used
> > > > by any other CPU at any point without it having to use kmap?).
> > >
> > > No, everybody has to use kmap()/kunmap().
> >
> > So it is strictly a bug to expose a pointer returned by kmap to
> > another CPU?
>
> No, not at all. The pointers are all global. They have to be, since the
> original kmap() user may well be scheduled away.

Sorry, I meant another task.

> > > The "problem" is that you could in theory run out of kmap frames, since if
> > > everybody does a kmap() in an interruptible context and you have lots and
> > > lots of threads doing different pages, you'd run out. But that has nothing
> > > to do with kmap_atomic(), which is basically limited to just the number of
> > > CPU's and a (very small) level of nesting.
> >
> > This could be avoided with an anti-deadlock pool. If a task
> > attempts a nested kmap and already holds a kmap, then give it
> > exclusive access to this pool until it releases its last
> > nested kmap.
>
> We just sleep, waiting for somebody to release their. Again, that
> obviously won't work in atomic context, but it's easy enough to just have
> a "we need to have a few entries free" for the atomic case, and make it
> busy-loop if it runs out (which is not going to happen in practice
> anyway).

The really theoretical one (which Andrew likes complaining about) is
when *everybody* is holding a kmap and asking for another one ;)
But I think it isn't too hard to make a pool for that. And yes we'd
also need a pool for atomic kmaps as you point out.

2009-06-10 06:29:28

by Peter Zijlstra

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

On Tue, 2009-06-09 at 09:26 -0700, Linus Torvalds wrote:
>
> On Tue, 9 Jun 2009, Nick Piggin wrote:
> >
> > The idea seems nice but isn't the problem that kmap gives back a
> > basically 1st class kernel virtual memory? (ie. it can then be used
> > by any other CPU at any point without it having to use kmap?).
>
> No, everybody has to use kmap()/kunmap().
>
> The "problem" is that you could in theory run out of kmap frames, since if
> everybody does a kmap() in an interruptible context and you have lots and
> lots of threads doing different pages, you'd run out. But that has nothing
> to do with kmap_atomic(), which is basically limited to just the number of
> CPU's and a (very small) level of nesting.

One of the things I did for -rt back when I rewrote mm/highmem.c for it
was to reserve multiple slots per kmap() user so that if you did 1 you
could always do another.

With everything in task context like rt does 2 seemed enough, but you
cuold ways extend that scheme and reserve enough for the worst case
nesting depth and be done with it.

2009-06-17 09:45:56

by Pavel Machek

[permalink] [raw]

Subject: Re: [benchmark] 1% performance overhead of paravirt_ops on native kernels

Hi!

> > > > The "problem" is that you could in theory run out of kmap frames, since if
> > > > everybody does a kmap() in an interruptible context and you have lots and
> > > > lots of threads doing different pages, you'd run out. But that has nothing
> > > > to do with kmap_atomic(), which is basically limited to just the number of
> > > > CPU's and a (very small) level of nesting.
> > >
> > > This could be avoided with an anti-deadlock pool. If a task
> > > attempts a nested kmap and already holds a kmap, then give it
> > > exclusive access to this pool until it releases its last
> > > nested kmap.
> >
> > We just sleep, waiting for somebody to release their. Again, that
> > obviously won't work in atomic context, but it's easy enough to just have
> > a "we need to have a few entries free" for the atomic case, and make it
> > busy-loop if it runs out (which is not going to happen in practice
> > anyway).
>
> The really theoretical one (which Andrew likes complaining about) is
> when *everybody* is holding a kmap and asking for another one ;)
> But I think it isn't too hard to make a pool for that. And yes we'd

Does one pool help?

Now you can have '*everyone* is holding the kmaps and is asking for
another one'.

You could add as many pools as maximum nesting level... Is there
maximum nesting level?

--
(english) http://www.livejournal.com/~pavelmachek
(cesky, pictures) http://atrey.karlin.mff.cuni.cz/~pavel/picture/horses/blog.html

2009-06-17 09:56:39