2008-10-09 17:15:17

by Dave Jones

Subject: Update cacheline size on X86_GENERIC

I just noticed that configuring a kernel to use CONFIG_X86_GENERIC
(as is typical for a distro kernel) configures it to use a 128 byte cacheline size.
This made sense when that was commonplace (P4 era) but current
Intel, AMD and VIA cpus use 64 byte cachelines.

Signed-off-by: Dave Jones <[email protected]>

--- linux-2.6.26.noarch/arch/x86/Kconfig.cpu~ 2008-10-09 12:59:56.000000000 -0400
+++ linux-2.6.26.noarch/arch/x86/Kconfig.cpu 2008-10-09 13:11:32.000000000 -0400
@@ -301,8 +301,8 @@ config X86_CPU
# Define implied options from the CPU selection here
config X86_L1_CACHE_BYTES
int
- default "128" if GENERIC_CPU || MPSC
- default "64" if MK8 || MCORE2
+ default "128" if MPENTIUM4 || MPSC
+ default "64" if MK8 || MCORE2 || GENERIC_CPU
depends on X86_64

config X86_INTERNODE_CACHE_BYTES
@@ -316,10 +316,10 @@ config X86_CMPXCHG

config X86_L1_CACHE_SHIFT
int
- default "7" if MPENTIUM4 || X86_GENERIC || GENERIC_CPU || MPSC
+ default "7" if MPENTIUM4 || MPSC
default "4" if X86_ELAN || M486 || M386 || MGEODEGX1
default "5" if MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
- default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7
+ default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7 || X86_GENERIC || GENERIC_CPU

config X86_XADD
def_bool y

--
http://www.codemonkey.org.uk


2008-10-10 03:28:39

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Friday 10 October 2008 04:14, Dave Jones wrote:
> I just noticed that configuring a kernel to use CONFIG_X86_GENERIC
> (as is typical for a distro kernel) configures it to use a 128 byte
> cacheline size. This made sense when that was commonplace (P4 era) but
> current Intel, AMD and VIA cpus use 64 byte cachelines.

I think P4 technically did have 64 byte cachelines, but had some adjacent
line prefetching. And AFAIK core2 CPUs can do similar prefetching (but
maybe it's smarter and doesn't cause so much bouncing?).

Anyway, GENERIC kernel should run well on all architectures, and while
going too big causes slightly increased structures sometimes, going too
small could result in horrible bouncing.

Lastly, I think x86 will go to 128 byte lines in the next year or two, so
maybe at this point we can just keep 128 byte alignment?

/random thoughts

>
> Signed-off-by: Dave Jones <[email protected]>
>
> --- linux-2.6.26.noarch/arch/x86/Kconfig.cpu~ 2008-10-09 12:59:56.000000000 -0400
> +++ linux-2.6.26.noarch/arch/x86/Kconfig.cpu 2008-10-09 13:11:32.000000000 -0400
> @@ -301,8 +301,8 @@ config X86_CPU
> # Define implied options from the CPU selection here
> config X86_L1_CACHE_BYTES
> int
> - default "128" if GENERIC_CPU || MPSC
> - default "64" if MK8 || MCORE2
> + default "128" if MPENTIUM4 || MPSC
> + default "64" if MK8 || MCORE2 || GENERIC_CPU
> depends on X86_64
>
> config X86_INTERNODE_CACHE_BYTES
> @@ -316,10 +316,10 @@ config X86_CMPXCHG
>
> config X86_L1_CACHE_SHIFT
> int
> - default "7" if MPENTIUM4 || X86_GENERIC || GENERIC_CPU || MPSC
> + default "7" if MPENTIUM4 || MPSC
> default "4" if X86_ELAN || M486 || M386 || MGEODEGX1
> default "5" if MWINCHIP3D || MWINCHIP2 || MWINCHIPC6 || MCRUSOE || MEFFICEON || MCYRIXIII || MK6 || MPENTIUMIII || MPENTIUMII || M686 || M586MMX || M586TSC || M586 || MVIAC3_2 || MGEODE_LX
> - default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7
> + default "6" if MK7 || MK8 || MPENTIUMM || MCORE2 || MVIAC7 || X86_GENERIC || GENERIC_CPU
>
> config X86_XADD
> def_bool y

2008-10-10 07:46:34

by Andi Kleen

Subject: Re: Update cacheline size on X86_GENERIC

Nick Piggin <[email protected]> writes:

> On Friday 10 October 2008 04:14, Dave Jones wrote:
>> I just noticed that configuring a kernel to use CONFIG_X86_GENERIC
>> (as is typical for a distro kernel) configures it to use a 128 byte
>> cacheline size. This made sense when that was commonplace (P4 era) but
>> current Intel, AMD and VIA cpus use 64 byte cachelines.
>
> I think P4 technically did have 64 byte cachelines, but had some adjacent
> line prefetching.

The "coherency unit" on P4, which is what matters for SMP alignment
purposes to avoid false sharing, is 128 bytes.

> And AFAIK core2 CPUs can do similar prefetching (but
> maybe it's smarter and doesn't cause so much bouncing?).

On Core2 the coherency unit is 64 bytes.

> Anyway, GENERIC kernel should run well on all architectures, and while
> going too big causes slightly increased structures sometimes, going too
> small could result in horrible bouncing.

Exactly.

That is, it costs one percent or so on TPC, but I think the fix
for that is just to analyze where the problem is and size those
data structures based on the runtime cache size. Some subsystems
like slab do this already.
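
A minimal sketch of what that runtime sizing can look like (the struct and
cache names here are invented for illustration; cache_line_size() and
kmem_cache_create() are the existing kernel interfaces):

#include <linux/cache.h>
#include <linux/errno.h>
#include <linux/init.h>
#include <linux/slab.h>

/* Hypothetical hot object, padded and aligned to the *runtime* cache
 * line size instead of the compile-time L1_CACHE_BYTES, so one kernel
 * image gets 128-byte spacing on P4 and 64-byte spacing on Core2/K8. */
struct hot_counter {
	unsigned long hits;
	unsigned long misses;
};

static struct kmem_cache *hot_counter_cache;

static int __init hot_counter_cache_init(void)
{
	hot_counter_cache = kmem_cache_create("hot_counter",
					      sizeof(struct hot_counter),
					      cache_line_size(), /* runtime alignment */
					      0, NULL);
	return hot_counter_cache ? 0 : -ENOMEM;
}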

TPC is a bit of an extreme case because it is so extremely cache bound.

Overall the memory impact of the cache padding is getting less over
time because more and more data is moving into the per CPU data areas.

-Andi

--
[email protected]

2008-10-10 08:45:49

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Friday 10 October 2008 18:46, Andi Kleen wrote:
> Nick Piggin <[email protected]> writes:
> > On Friday 10 October 2008 04:14, Dave Jones wrote:
> >> I just noticed that configuring a kernel to use CONFIG_X86_GENERIC
> >> (as is typical for a distro kernel) configures it to use a 128 byte
> >> cacheline size. This made sense when that was commonplace (P4 era) but
> >> current Intel, AMD and VIA cpus use 64 byte cachelines.
> >
> > I think P4 technically did have 64 byte cachelines, but had some adjacent
> > line prefetching.
>
> The "coherency unit" on P4, which is what matters for SMP alignment
> purposes to avoid false sharing, is 128 bytes.
>
> > And AFAIK core2 CPUs can do similar prefetching (but
> > maybe it's smarter and doesn't cause so much bouncing?).
>
> On Core2 the coherency unit is 64 bytes.

OK.


> > Anyway, GENERIC kernel should run well on all architectures, and while
> > going too big causes slightly increased structures sometimes, going too
> > small could result in horrible bouncing.
>
> Exactly.
>
> That is, it costs one percent or so on TPC, but I think the fix
> for that is just to analyze where the problem is and size those
> data structures based on the runtime cache size. Some subsystems
> like slab do this already.

Costs 1% on TPC? Is that 128 byte aligning data structures on
Core2, or 64 byte aligning them on P4 that costs the performance?


> TPC is a bit of an extreme case because it is so extremely cache bound.

Still, it is a good canary.


> Overall the memory impact of the cache padding is getting less over
> time because more and more data is moving into the per CPU data areas.

Right.

2008-10-10 10:22:43

by Andi Kleen

Subject: Re: Update cacheline size on X86_GENERIC

Nick Piggin wrote:

>
>>> Anyway, GENERIC kernel should run well on all architectures, and while
>>> going too big causes slightly increased structures sometimes, going too
>>> small could result in horrible bouncing.
>> Exactly.
>>
>> That is, it costs one percent or so on TPC, but I think the fix
>> for that is just to analyze where the problem is and size those
>> data structures based on the runtime cache size. Some subsystems
>> like slab do this already.
>
> Costs 1% on TPC? Is that 128 byte aligning data structures on
> Core2, or 64 byte aligning them on P4 that costs the performance?

The first. BTW it was a rough number from memory, in that ballpark.
Also the experiment was on older kernels, might be different now.

The second would undoubtedly be much worse.

-Andi

2008-10-10 18:27:27

by H. Peter Anvin

Subject: Re: Update cacheline size on X86_GENERIC

Nick Piggin wrote:
> I think P4 technically did have 64 byte cachelines, but had some adjacent
> line prefetching. And AFAIK core2 CPUs can do similar prefetching (but
> maybe it's smarter and doesn't cause so much bouncing?).
>
> Anyway, GENERIC kernel should run well on all architectures, and while
> going too big causes slightly increased structures sometimes, going too
> small could result in horrible bouncing.

Well, GENERIC really is targeted toward the commercial mainstream at
the time, with the additional caveat that it shouldn't totally suck on
anything that isn't so obscure it's irrelevant. It is thus a moving
target. 1% on TPC doesn't count as "totally suck", especially since by
now anyone who is running workloads like TPC either will have phased out
their P4s or they don't care about performance at all.

> Lastly, I think x86 will go to 128 byte lines in the next year or two, so
> maybe at this point we can just keep 128 byte alignment?

"x86" doesn't have a cache line size; a specific implementation will.
Which particular implementation do you believe is going to 128-byte L1
cachelines?

-hpa

2008-10-11 03:56:30

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Saturday 11 October 2008 05:26, H. Peter Anvin wrote:
> Nick Piggin wrote:
> > I think P4 technically did have 64 byte cachelines, but had some adjacent
> > line prefetching. And AFAIK core2 CPUs can do similar prefetching (but
> > maybe it's smarter and doesn't cause so much bouncing?).
> >
> > Anyway, GENERIC kernel should run well on all architectures, and while
> > going too big causes slightly increased structures sometimes, going too
> > small could result in horrible bouncing.
>
> Well, GENERIC really is targeted toward the commercial mainstream at
> the time, with the additional caveat that it shouldn't totally suck on
> anything that isn't so obscure it's irrelevant. It is thus a moving
> target. 1% on TPC doesn't count as "totally suck", especially since by
> now anyone who is running workloads like TPC either will have phased out
> their P4s or they don't care about performance at all.

tpc shouldn't have too much false sharing these days, AFAICT (slab is rather
important there, but it finds cacheline sizes at runtime). Actually I
thought Andi was referring to the slowdown on 64-byte cacheline systems.
But other workloads could be hurt much worse than tpc-c from false
sharing, I think.


> > Lastly, I think x86 will go to 128 byte lines in the next year or two, so
> > maybe at this point we can just keep 128 byte alignment?
>
> "x86" doesn't have a cache line size; a specific implementation will.
> Which particular implementation do you believe is going to 128-byte L1
> cachelines?

Right. I thought a future implementation would. But I'm probably wrong
about that, and anyway it wasn't such a good argument for kernel.org
kernels, I suppose.

2008-10-11 04:00:20

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Friday 10 October 2008 21:22, Andi Kleen wrote:
> Nick Piggin wrote:
> >>> Anyway, GENERIC kernel should run well on all architectures, and while
> >>> going too big causes slightly increased structures sometimes, going too
> >>> small could result in horrible bouncing.
> >>
> >> Exactly.
> >>
> >> That is, it costs one percent or so on TPC, but I think the fix
> >> for that is just to analyze where the problem is and size those
> >> data structures based on the runtime cache size. Some subsystems
> >> like slab do this already.
> >
> > Costs 1% on TPC? Is that 128 byte aligning data structures on
> > Core2, or 64 byte aligning them on P4 that costs the performance?
>
> The first. BTW it was a rough number from memory, in that ballpark.
> Also the experiment was on older kernels, might be different now.
>
> The second would undoubtedly be much worse.

OK. Well I don't have a really strong opinion on what to do...

I guess there is a reasonable argument to not care about P4 so
much in today's GENERIC kernel. If it is worth around 1% on tpc
on a more modern architecture, that is a pretty big motivation
to change it too...

2008-10-11 08:01:57

by Andi Kleen

Subject: Re: Update cacheline size on X86_GENERIC

> I guess there is a reasonable argument to not care about P4 so

I don't think it is. Ignoring old systems would be a mistake
and the wrong signal. One of Linux's fortes over the competition
was always to run reasonably on older systems too.

There are millions and millions of P4s around.
And they're not that old, they're still shipping in fact.
And the point of GENERIC was to be a reasonable default on all
systems.

If you want to optimize for a specific CPU you're always free
to compile the kernel for that. But GENERIC should be really
GENERIC.

> much in today's GENERIC kernel. If it is worth around 1% on tpc
> on a more modern architecture, that is a pretty big motivation
> to change it too...

TPC is an extreme case; it is extremely cache bound.

Besides, I suspect the TPC issue could be fixed with minimal
tweaks without breaking other systems.

-Andi
--
[email protected]

2008-10-11 08:29:42

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Saturday 11 October 2008 19:08, Andi Kleen wrote:
> > I guess there is a reasonable argument to not care about P4 so
>
> I don't think it is. Ignoring old systems would be a mistake
> and the wrong signal. One of Linux's fortes over the competition
> was always to run reasonably on older systems too.

I think there is a reasonable argument: most multiprocessor P4 systems
in production and using a GENERIC (i.e. probably not custom but probably
vendor compiled) kernel are not likely to be upgraded to a 2.6.28+ based
GENERIC kernel.

I also think there are reasonable arguments the other way, and I
personally also think it might be better to leave it 128 (even
if it is unlikely, introducing a regression is not good).


> There are millions and millions of P4s around.
> And they're not that old, they're still shipping in fact.

Still shipping in anything aside from 1s systems?


> And the point of GENERIC was to be a reasonable default on all
> systems.
>
> If you want to optimize for a specific CPU you're always free
> to compile the kernel for that. But GENERIC should be really
> GENERIC.
>
> > much in today's GENERIC kernel. If it is worth around 1% on tpc
> > on a more modern architecture, that is a pretty big motivation
> > to change it too...
>
> TPC is an extreme case; it is extremely cache bound.

Still, 1% there is a large increase.


> Besides, I suspect the TPC issue could be fixed with minimal
> tweaks without breaking other systems.

That would be nice. It would be interesting to know what is causing
the slowdown.

2008-10-11 11:16:02

by Andi Kleen

Subject: Re: Update cacheline size on X86_GENERIC

On Sat, Oct 11, 2008 at 07:29:19PM +1100, Nick Piggin wrote:
> I also think there are reasonable arguments the other way, and I
> personally also think it might be better to leave it 128 (even
> if it is unlikely, introducing a regression is not good).

The issue is also that the regression will likely be large.
False sharing can really hurt when it hits as you know, because
the penalties are so large.

> > There are millions and millions of P4s around.
> > And they're not that old, they're still shipping in fact.
>
> Still shipping in anything aside from 1s systems?

Remember the first Core2 based 4S (Tigerton) Xeon was only introduced last year
and that market is quite conservative. For 2S it's a bit longer, but
it wouldn't surprise me if new systems are still shipping there.

Also to be honest I doubt the theory that older systems
are never upgraded to a newer OS.

> That would be nice. It would be interesting to know what is causing
> the slowdown.

At least that test is extremely cache footprint sensitive. A lot of the
cache misses are surprisingly in hd_struct, because it runs
with hundreds of disks and each needs hd_struct references in the fast path.
The recent introduction of fine-grained per-partition statistics
caused a large slowdown. But I don't think kernel workloads
are normally that extremely cache sensitive.

-Andi
--
[email protected]

2008-10-11 11:23:20

by Rafael J. Wysocki

Subject: Re: Update cacheline size on X86_GENERIC

On Saturday, 11 of October 2008, Andi Kleen wrote:
> On Sat, Oct 11, 2008 at 07:29:19PM +1100, Nick Piggin wrote:
> > I also think there are reasonable arguments the other way, and I
> > personally also think it might be better to leave it 128 (even
> > if it is unlikely, introducing a regression is not good).
>
> The issue is also that the regression will likely be large.
> False sharing can really hurt when it hits as you know, because
> the penalties are so large.
>
> > > There are millions and millions of P4s around.
> > > And they're not that old, they're still shipping in fact.
> >
> > Still shipping in anything aside from 1s systems?
>
> Remember the first Core2 based 4S (Tigerton) Xeon was only introduced last year
> and that market is quite conservative. For 2S it's a bit longer, but
> it wouldn't surprise me if new systems are still shipping there.
>
> Also to be honest I doubt the theory that older systems
> are never upgraded to a newer OS.

Actually, I have examples to the contrary. :-)

Thanks,
Rafael

2008-10-11 11:42:50

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Saturday 11 October 2008 22:22, Andi Kleen wrote:
> On Sat, Oct 11, 2008 at 07:29:19PM +1100, Nick Piggin wrote:
> > I also think there are reasonable arguments the other way, and I
> > personally also think it might be better to leave it 128 (even
> > if it is unlikely, introducing a regression is not good).
>
> The issue is also that the regression will likely be large.

Yeah, that is what I'm worried about. If it were a simple case of
1% loss on P4 for 1% gain on Core2, it would be a good change.
But it might mean huge losses on P4s.


> False sharing can really hurt when it hits as you know, because
> the penalties are so large.
>
> > > There are millions and millions of P4s around.
> > > And they're not that old, they're still shipping in fact.
> >
> > Still shipping in anything aside from 1s systems?
>
> Remember the first Core2 based 4S (Tigerton) Xeon was only introduced last
> year and that market is quite conservative. For 2S it's a bit longer, but
> it wouldn't surprise me if new systems are still shipping there.
>
> Also to be honest I doubt the theory that older systems
> are never upgraded to a newer OS.

Yeah, fair enough.


> > That would be nice. It would be interesting to know what is causing
> > the slowdown.
>
> At least that test is extremely cache footprint sensitive. A lot of the
> cache misses are surprisingly in hd_struct, because it runs
> with hundreds of disks and each needs hd_struct references in the fast path.
> The recent introduction of fine-grained per-partition statistics
> caused a large slowdown. But I don't think kernel workloads
> are normally that extremely cache sensitive.

That's interesting. struct device is pretty big. I wonder if fields
couldn't be rearranged to minimise the fastpath cacheline footprint?
I guess that's already been looked at?

2008-10-11 13:05:00

by Andi Kleen

Subject: Re: Update cacheline size on X86_GENERIC

> > > That would be nice. It would be interesting to know what is causing
> > > the slowdown.
> >
> > At least that test is extremely cache footprint sensitive. A lot of the
> > cache misses are surprisingly in hd_struct, because it runs
> > with hundreds of disks and each needs hd_struct references in the fast path.
> > The recent introduction of fine-grained per-partition statistics
> > caused a large slowdown. But I don't think kernel workloads
> > are normally that extremely cache sensitive.
>
> That's interesting. struct device is pretty big. I wonder if fields

Yes it is (it actually can be easily shrunk -- see willy's recent
patch to remove the struct completion from knodes), but that won't help
because it will always be larger than a cache line, and it's in the
middle, so the accesses to the first part of it and the last part of it
will be separate.

> couldn't be rearranged to minimise the fastpath cacheline footprint?
> I guess that's already been looked at?

Yes, but not very intensively. So far I was looking for more
detailed profiling data to see the exact accesses.

Of course if you have any immediate ideas that could be tried too.

-Andi

--
[email protected]

2008-10-11 13:48:52

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Sunday 12 October 2008 00:11, Andi Kleen wrote:
> > > > That would be nice. It would be interesting to know what is causing
> > > > the slowdown.
> > >
> > > At least that test is extremely cache footprint sensitive. A lot of the
> > > cache misses are surprisingly in hd_struct, because it runs
> > > with hundreds of disks and each needs hd_struct references in the fast
> > > path. The recent introduction of fine-grained per-partition statistics
> > > caused a large slowdown. But I don't think kernel workloads
> > > are normally that extremely cache sensitive.
> >
> > That's interesting. struct device is pretty big. I wonder if fields
>
> Yes it is (it actually can be easily shrunk -- see willy's recent
> patch to remove the struct completion from knodes), but that won't help
> because it will always be larger than a cache line, and it's in the
> middle, so the accesses to the first part of it and the last part of it
> will be separate.
>
> > couldn't be rearranged to minimise the fastpath cacheline footprint?
> > I guess that's already been looked at?
>
> Yes, but not very intensively. So far I was looking for more
> detailed profiling data to see the exact accesses.
>
> Of course if you have any immediate ideas that could be tried too.

No immediate ideas. Jens probably is a good person to cc. With direct IO
workloads, hd_struct should mostly only be touched in partition remapping
and IO accounting.

start_sect, nr_sects would be read for partition remapping.

*dkstats will be read to do accounting (on UP, dkstats itself is written,
but false sharing doesn't matter on UP), as will partno.

These could all go together at the top of the struct perhaps.

struct device->parent gets read as well. This might go at the top of
struct device, which could come next.

stamp and in_flight are tricky, as they get both read and written often
:(

Still, you might just be able to fit them into the same 64-byte cacheline
as well as all the above fields.

At this point, you would want to cacheline align hd_struct. So if you
want to do that dynamically, you would need to change the disk_part_tbl
scheme (but at least you could test with static annotations first).
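
Roughly, the reordering being suggested might look like the layout sketch
below. It is only a sketch against the 2.6.27-era hd_struct, with the
member list abbreviated, and it is not meant to compile as-is:

/*
 * Sketch only: group the fast-path fields in the first cache line and
 * align the whole struct.  The alignment here is the static annotation
 * suggested for testing; doing it dynamically would mean changing the
 * disk_part_tbl scheme.
 */
struct hd_struct {
	/* read in the fast path for partition remapping */
	sector_t		start_sect;
	sector_t		nr_sects;
	int			partno;

	/* IO accounting */
	struct disk_stats	*dkstats;	/* read-mostly pointer on SMP */
	unsigned long		stamp;		/* read and written often */
	int			in_flight;	/* read and written often */

	/* ->parent is read per request, so the embedded device comes next */
	struct device		__dev;

	/* cold fields (policy, holder_dir, ...) would follow */
} ____cacheline_aligned_in_smp;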

The other thing I notice is that the block layer has some functions whose
error paths declare BDEVNAME_SIZE arrays on the stack, which gcc may not
do well at. Probably those paths should go out to noinline functions.
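
For the BDEVNAME_SIZE point, the usual shape of the fix is something like
the sketch below (the function names are made up; bdevname() and
BDEVNAME_SIZE are the real interfaces). The large array and the printk
only exist in the out-of-line cold helper, so they stop inflating the hot
caller's stack frame and inlined code:

#include <linux/bio.h>
#include <linux/errno.h>
#include <linux/fs.h>
#include <linux/kernel.h>

/* cold path, kept out of line: the 32-byte name buffer lives here */
static noinline void complain_about_bio(struct bio *bio)
{
	char b[BDEVNAME_SIZE];

	printk(KERN_ERR "bio beyond end of device %s\n",
	       bdevname(bio->bi_bdev, b));
}

/* hot path stays small */
static int check_bio(struct bio *bio, sector_t maxsector)
{
	if (unlikely(bio->bi_sector + bio_sectors(bio) > maxsector)) {
		complain_about_bio(bio);
		return -EIO;
	}
	return 0;
}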

2008-10-11 13:55:00

by Andi Kleen

Subject: Re: Update cacheline size on X86_GENERIC

> No immediate ideas. Jens probably is a good person to cc. With direct IO
> workloads, hd_struct should mostly only be touched in partition remapping
> and IO accounting.

I find it doubtful that grouping the rw and ro members together was a good
idea.

> At this point, you would want to cacheline align hd_struct. So if you

The problem is probably not false sharing, but simply cache misses because it's
so big. I think.

-Andi
--
[email protected]

2008-10-12 05:56:28

by Nick Piggin

Subject: Re: Update cacheline size on X86_GENERIC

On Sunday 12 October 2008 01:01, Andi Kleen wrote:
> > No immediate ideas. Jens probably is a good person to cc. With direct IO
> > workloads, hd_struct should mostly only be touched in partition remapping
> > and IO accounting.
>
> I find it doubtful that grouping the rw and ro members together was a good
> idea.

If all members touched in the fastpath fit into one cacheline, then it
definitely is. Because if you put the rw members in another cacheline,
that line is still going to bounce just the same, but then you are just
going to take up one more cacheline with the ro members.
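
To make that concrete with invented numbers: on a 64-byte line, suppose
the hot rw fields take 40 bytes and the hot ro fields 16. Together they
fit in one line, which bounces anyway because of the rw fields. Split
them apart and the rw line bounces exactly as before, but every reader
now also pulls in a second line for the ro fields.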


> > At this point, you would want to cacheline align hd_struct. So if you
>
> The problem is probably not false sharing, but simply cache misses because
> it's so big. I think.

If you line up all the commonly touched members along cacheline boundaries,
then presumably you have to assume the start of the struct has, e.g.,
cacheline alignment. There are situations where you can cut the cacheline
footprint of a given data structure almost in half by aligning the items
(e.g. if we align struct page to 64 bytes, then random access to the mem_map
array will have almost half the cacheline footprint). I've always been
interested in whether "oltp" benefits from that (e.g. define
WANT_PAGE_VIRTUAL for x86-64).