2004-09-23 15:44:27

by Marcelo Tosatti

Subject: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]


Forgot to CC linux-kernel, just in case someone else
can have useful information on this matter.

Andi says any additional overhead will be in the noise
compared to cacheline saving benefit.

***********

Jun,

We need some assistance here - you can probably help us.

Within the Linux kernel we can benefit from changing some fields
of commonly accessed data structures to 16 bit instead of 32 bits,
given that the values for these fields never reach 2 ^ 16.

Arjan warned me, however, that the prefix (in this case "data16") will
cause an additional extra cycle in instruction decoding, per message above.

Can you confirm that please? We can't seem to be able to find
it in Intel's documentation.

By shrinking two fields of "per_cpu_pages" structure we can fit it
in one 32-byte cacheline (<= Pentium III and probably several other
embedded/whatnot architectures will benefit from such a change).

And we just shrank two fields of "struct pagevec" in a similar way
in Andrew's -mm tree.
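
A rough C sketch of the size argument follows. The two layouts are hypothetical stand-ins, not the kernel's actual "per_cpu_pages" definition, and the field names and the 36-byte starting size are assumptions chosen only to make the arithmetic visible: shrinking two 32-bit fields to 16 bits saves 4 bytes, which here is the difference between spilling into a second 32-byte line and fitting in one.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define CACHELINE_BYTES 32	/* line size assumed for Pentium III and older */

/* Hypothetical layout (NOT the kernel struct): every field is 32 bits. */
struct hot_wide {
	uint32_t count;
	uint32_t batch;
	uint32_t other[7];	/* stand-in for the remaining fields */
};

/* Same fields, with the two small counters shrunk to 16 bits. */
struct hot_narrow {
	uint16_t count;
	uint16_t batch;
	uint32_t other[7];
};

/* Number of cachelines a line-aligned object of `size` bytes touches. */
static size_t lines_spanned(size_t size)
{
	return (size + CACHELINE_BYTES - 1) / CACHELINE_BYTES;
}

/* Compile-time checks of the sizes the comments claim. */
static_assert(sizeof(struct hot_wide) == 36, "wide layout needs two lines");
static_assert(sizeof(struct hot_narrow) == 32, "narrow layout fits one line");
```

Calling lines_spanned(sizeof(struct hot_wide)) gives 2 and lines_spanned(sizeof(struct hot_narrow)) gives 1, assuming the object starts on a line boundary.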

I'm adding linux-kernel just in case someone else can have
useful comments.

Thanks.

----- Forwarded message from Arjan van de Ven <[email protected]> -----

From: Arjan van de Ven <[email protected]>
Date: Tue, 14 Sep 2004 13:13:29 +0200
To: Marcelo Tosatti <[email protected]>
Cc: [email protected], "Martin J. Bligh" <[email protected]>,
[email protected]
In-Reply-To: <[email protected]>
Subject: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline
Original-Recipient: rfc822;[email protected]
X-Loop: [email protected]
X-MIMETrack: Itemize by SMTP Server on USMail/Cyclades(Release 6.5.1|January 21, 2004) at
09/14/2004 03:13:02

On Tue, Sep 14, 2004 at 06:34:07AM -0300, Marcelo Tosatti wrote:
> How come short access can cost 1 extra cycle? Because you need two "read bytes" ?

on an x86, a word (2byte) access will cause a prefix byte to the
instruction, that particular prefix byte will take an extra cycle during execution
of the instruction and potentially reduces the parallel decodability of
instructions....



----- End forwarded message -----

----- End forwarded message -----


2004-09-23 16:03:51

by Giuliano Pochini

Subject: Re: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]



On Thu, 23 Sep 2004, Marcelo Tosatti wrote:

> Forgot to CC linux-kernel, just in case someone else
> can have useful information on this matter.
>
> Andi says any additional overhead will be in the noise
> compared to cacheline saving benefit.
>
> ***********
>
> Within the Linux kernel we can benefit from changing some fields
> of commonly accessed data structures to 16 bit instead of 32 bits,
> given that the values for these fields never reach 2 ^ 16.
>
> Arjan warned me, however, that the prefix (in this case "data16") will
> cause an additional extra cycle in instruction decoding, per message above.
>
> Can you confirm that please? We can't seem to be able to find
> it in Intel's documentation.
>
> By shrinking two fields of "per_cpu_pages" structure we can fit it
> in one 32-byte cacheline (<= Pentium III and probably several other
> embedded/whatnot architectures will benefit from such a change).

One cycle is a small overhead compared to the cost of a fetch from L2
cache or, even worse, a cache miss. Memory is terribly slow. I think that
nowadays we should design things trying to keep memory accesses as few as
possible.


--
Giuliano.

2004-09-24 00:08:54

by Marcelo Tosatti

Subject: Re: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]

On Thu, Sep 23, 2004 at 01:24:49PM -0700, Nakajima, Jun wrote:
> >From: Marcelo Tosatti [mailto:[email protected]]
> >Sent: Thursday, September 23, 2004 7:12 AM
> >To: [email protected]
> >Cc: Nakajima, Jun; [email protected]; [email protected]; [email protected]
> >Subject: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]
> >
> >
> >Forgot to CC linux-kernel, just in case someone else
> >can have useful information on this matter.
> >
> >Andi says any additional overhead will be in the noise
> >compared to cacheline saving benefit.
> >
> >***********
> >
> >Jun,
> >
> >We need some assistance here - you can probably help us.
> >
> >Within the Linux kernel we can benefit from changing some fields
> >of commonly accessed data structures to 16 bit instead of 32 bits,
> >given that the values for these fields never reach 2 ^ 16.
> >
> >Arjan warned me, however, that the prefix (in this case "data16") will
> >cause an additional extra cycle in instruction decoding, per message above.
>
> On the Pentium4 core, this is not a big deal because it runs out of the
> trace cache (i.e. decoded in advance). However, on the Pentium III/M
> (aka P6) core (i.e. Pentium III, Banias, Dothan, Yonah, etc.),
> especially when an operand size prefix (0x66) changes the # of bytes in
> an instruction (usually by impacting the size of an immediate in the
> instruction), the P6 core pays unnegligible penalty, slowing down
> decoding.

Jun,

What do you mean by "unnegligible penalty"?

Do you mean it's a very small penalty (unconsiderable), or a considerable penalty?

We use one less cacheline for a very commonly used structure.

Thanks and sorry for poor english :)

2004-09-24 00:59:49

by Nakajima, Jun

Subject: RE: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]

>From: Marcelo Tosatti [mailto:[email protected]]
>Sent: Thursday, September 23, 2004 3:32 PM
>To: Nakajima, Jun
>Cc: [email protected]; [email protected]; [email protected];
>[email protected]; Saxena, Sunil; Mallick, Asit K
>Subject: Re: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]
>

<snip>

>> >***********
>> >
>> >Jun,
>> >
>> >We need some assistance here - you can probably help us.
>> >
>> >Within the Linux kernel we can benefit from changing some fields
>> >of commonly accessed data structures to 16 bit instead of 32 bits,
>> >given that the values for these fields never reach 2 ^ 16.
>> >
>> >Arjan warned me, however, that the prefix (in this case "data16") will
>> >cause an additional extra cycle in instruction decoding, per message above.
>>
>> On the Pentium4 core, this is not a big deal because it runs out of the
>> trace cache (i.e. decoded in advance). However, on the Pentium III/M
>> (aka P6) core (i.e. Pentium III, Banias, Dothan, Yonah, etc.),
>> especially when an operand size prefix (0x66) changes the # of bytes in
>> an instruction (usually by impacting the size of an immediate in the
>> instruction), the P6 core pays unnegligible penalty, slowing down
>> decoding.
>
>Jun,
>
>What do you mean by "unnegligible penalty"?
>
>Do you mean it's a very small penalty (unconsiderable), or a considerable
>penalty?

I mean it's considerable. Did you look at what kinds of instructions are
used for accessing such data structures? Does the operand size prefix
change the # of bytes in those instructions (as described above) for
most cases? If it does, we don't recommend such codes.

Jun

>
>We use one less cacheline for a very commonly used structure.
>

2004-09-24 02:10:01

by Nakajima, Jun

Subject: RE: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]

>From: Marcelo Tosatti [mailto:[email protected]]
>Sent: Thursday, September 23, 2004 7:12 AM
>To: [email protected]
>Cc: Nakajima, Jun; [email protected]; [email protected]; [email protected]
>Subject: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]
>
>
>Forgot to CC linux-kernel, just in case someone else
>can have useful information on this matter.
>
>Andi says any additional overhead will be in the noise
>compared to cacheline saving benefit.
>
>***********
>
>Jun,
>
>We need some assistance here - you can probably help us.
>
>Within the Linux kernel we can benefit from changing some fields
>of commonly accessed data structures to 16 bit instead of 32 bits,
>given that the values for these fields never reach 2 ^ 16.
>
>Arjan warned me, however, that the prefix (in this case "data16") will
>cause an additional extra cycle in instruction decoding, per message above.

On the Pentium4 core, this is not a big deal because it runs out of the
trace cache (i.e. decoded in advance). However, on the Pentium III/M
(aka P6) core (i.e. Pentium III, Banias, Dothan, Yonah, etc.),
especially when an operand size prefix (0x66) changes the # of bytes in
an instruction (usually by impacting the size of an immediate in the
instruction), the P6 core pays unnegligible penalty, slowing down
decoding.

Jun
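
To make the distinction concrete, here is a small C sketch comparing 32-bit-mode encodings with and without the 0x66 operand-size prefix. The load encoding "8b 42 08" is the one that appears in this thread; the immediate-move encodings and the helper names are illustrative additions, not taken from the patch. With an immediate operand, the prefix shrinks the immediate from 4 bytes to 2, so the bytes after the prefix change length; for a plain load or store, the prefix only adds one byte in front of an otherwise identical instruction.

```c
#include <stddef.h>

/* mov $0x12345678,%eax : opcode b8 followed by a 4-byte immediate */
static const unsigned char mov_imm32[] = { 0xb8, 0x78, 0x56, 0x34, 0x12 };
/* mov $0x1234,%ax : 0x66 prefix, same opcode, but a 2-byte immediate */
static const unsigned char mov_imm16[] = { 0x66, 0xb8, 0x34, 0x12 };

/* mov 0x8(%edx),%eax : opcode, ModRM, 8-bit displacement */
static const unsigned char load32[] = { 0x8b, 0x42, 0x08 };
/* mov 0x8(%edx),%ax : 0x66 prefix, remaining bytes unchanged */
static const unsigned char load16[] = { 0x66, 0x8b, 0x42, 0x08 };

/* Length of the instruction once a leading 0x66 prefix is stripped. */
static size_t len_after_prefix(const unsigned char *insn, size_t len)
{
	return (len > 0 && insn[0] == 0x66) ? len - 1 : len;
}

/* Nonzero if adding the prefix changed the length of the rest of the
 * instruction, i.e. the case described above as costly on P6 decoders. */
static int prefix_changes_length(const unsigned char *plain, size_t plain_len,
				 const unsigned char *prefixed,
				 size_t prefixed_len)
{
	return len_after_prefix(prefixed, prefixed_len) !=
	       len_after_prefix(plain, plain_len);
}
```

For the immediate pair the helper returns nonzero; for the load pair it returns zero. This only classifies encodings by length and says nothing about actual cycle counts.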

>
>Can you confirm that please? We can't seem to be able to find
>it in Intel's documentation.
>
>By shrinking two fields of "per_cpu_pages" structure we can fit it
>in one 32-byte cacheline (<= Pentium III and probably several other
>embedded/whatnot architectures will benefit from such a change).
>
>And we just shrank two fields of "struct pagevec" in a similar way
>in Andrew's -mm tree.
>
>I'm adding linux-kernel just in case someone else can have
>useful comments.
>
>Thanks.
>
>----- Forwarded message from Arjan van de Ven <[email protected]> -----
>
>From: Arjan van de Ven <[email protected]>
>Date: Tue, 14 Sep 2004 13:13:29 +0200
>To: Marcelo Tosatti <[email protected]>
>Cc: [email protected], "Martin J. Bligh" <[email protected]>,
> [email protected]
>In-Reply-To: <[email protected]>
>Subject: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline
>Original-Recipient: rfc822;[email protected]
>X-Loop: [email protected]
>X-MIMETrack: Itemize by SMTP Server on USMail/Cyclades(Release
>6.5.1|January 21, 2004) at
> 09/14/2004 03:13:02
>
>On Tue, Sep 14, 2004 at 06:34:07AM -0300, Marcelo Tosatti wrote:
>> How come short access can cost 1 extra cycle? Because you need two "read bytes" ?
>
>on an x86, a word (2byte) access will cause a prefix byte to the
>instruction, that particular prefix byte will take an extra cycle during execution
>of the instruction and potentially reduces the parallel decodability of
>instructions....
>
>
>
>----- End forwarded message -----
>
>----- End forwarded message -----

2004-09-27 15:12:00

by Marcelo Tosatti

Subject: Re: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]

On Thu, Sep 23, 2004 at 05:48:18PM -0700, Nakajima, Jun wrote:
> >From: Marcelo Tosatti [mailto:[email protected]]
> >Sent: Thursday, September 23, 2004 3:32 PM
> >To: Nakajima, Jun
> >Cc: [email protected]; [email protected]; [email protected];
> >[email protected]; Saxena, Sunil; Mallick, Asit K
> >Subject: Re: [[email protected]: Re: [PATCH] shrink per_cpu_pages to fit 32byte cacheline]
> >
>
> <snip>
>
> >> >***********
> >> >
> >> >Jun,
> >> >
> >> >We need some assistance here - you can probably help us.
> >> >
> >> >Within the Linux kernel we can benefit from changing some fields
> >> >of commonly accessed data structures to 16 bit instead of 32 bits,
> >> >given that the values for these fields never reach 2 ^ 16.
> >> >
> >> >Arjan warned me, however, that the prefix (in this case "data16") will
> >> >cause an additional extra cycle in instruction decoding, per message above.
> >>
> >> On the Pentium4 core, this is not a big deal because it runs out of the
> >> trace cache (i.e. decoded in advance). However, on the Pentium III/M
> >> (aka P6) core (i.e. Pentium III, Banias, Dothan, Yonah, etc.),
> >> especially when an operand size prefix (0x66) changes the # of bytes in
> >> an instruction (usually by impacting the size of an immediate in the
> >> instruction), the P6 core pays unnegligible penalty, slowing down
> >> decoding.
> >
> >Jun,
> >
> >What do you mean by "unnegligible penalty"?
> >
> >Do you mean it's a very small penalty (unconsiderable), or a considerable
> >penalty?
>
> I mean it's considerable. Did you look at what kinds of instructions are
> used for accessing such data structures? Does the operand size prefix
> change the # of bytes in those instructions (as described above) for
> most cases? If it does, we don't recommend such codes.

Yep, it does change the operand size.

It's mostly moving from the memory position into a register for
comparison, and moving back to the memory position.

The hottest path (free_hot_cold_page) changes from


105d:   8b 42 08               mov    0x8(%edx),%eax
1086:   ff 83 dc 00 00 00      incl   0xdc(%ebx)

10a6:   8b 42 0c               mov    0xc(%edx),%eax


to

1087:   0f b7 83 dc 00 00 00   movzwl 0xdc(%ebx),%eax
108e:   40                     inc    %eax
108f:   66 89 83 dc 00 00 00   mov    %ax,0xdc(%ebx)

10c4:   0f b7 93 dc 00 00 00   movzwl 0xdc(%ebx),%edx
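
For reference, the source-level pattern behind that listing can be sketched in C as below. The struct and function names are hypothetical stand-ins rather than the kernel's definitions; the byte counts in the enum are simply tallied from the objdump lines above (6 bytes for the single "incl", versus 7 + 1 + 7 for the movzwl / inc / mov %ax sequence). Other compilers or options may well generate different code.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical stand-ins for the counter field before and after the patch. */
struct counter32 { uint32_t count; };
struct counter16 { uint16_t count; };

/* Before: in the listing this was one read-modify-write, "incl 0xdc(%ebx)". */
static void bump32(struct counter32 *c)
{
	c->count++;
}

/* After: in the listing this became a zero-extending load, a register
 * increment, and a 16-bit store carrying the 0x66 prefix. */
static void bump16(struct counter16 *c)
{
	c->count++;	/* wraps to 0 after 65535, hence the 2^16 constraint */
}

/* Encoded sizes of the increment, counted from the objdump excerpt. */
enum {
	BYTES_BEFORE = 6,		/* ff 83 dc 00 00 00 */
	BYTES_AFTER  = 7 + 1 + 7,	/* movzwl + inc + mov %ax */
};

static_assert(BYTES_AFTER - BYTES_BEFORE == 9, "nine extra code bytes");
```

Note that none of the instructions in the listing carries an immediate, so the 0x66 prefix on the store adds one byte without changing the length of the rest of that instruction.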