Cache aligning the secret[] buffer makes copying from it infinitesimally
more efficient.
---
drivers/char/random.c | 2 +-
1 files changed, 1 insertions(+), 1 deletions(-)
diff --git a/drivers/char/random.c b/drivers/char/random.c
index 72a4fcb..4bcc4f2 100644
--- a/drivers/char/random.c
+++ b/drivers/char/random.c
@@ -1417,8 +1417,8 @@ static __u32 twothirdsMD4Transform(__u32 const buf[4], __u32 const in[12])
 #define HASH_MASK ((1 << HASH_BITS) - 1)
 
 static struct keydata {
-	__u32 count; /* already shifted to the final position */
 	__u32 secret[12];
+	__u32 count; /* already shifted to the final position */
 } ____cacheline_aligned ip_keydata[2];
 
 static unsigned int ip_cnt;
--
1.7.4.1
On Sun, 2011-03-13 at 20:20 -0400, George Spelvin wrote:
> Cache aligning the secret[] buffer makes copying from it infinitesimally
> more efficient.
Acked-by: Matt Mackall <[email protected]>
> ---
> drivers/char/random.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 72a4fcb..4bcc4f2 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1417,8 +1417,8 @@ static __u32 twothirdsMD4Transform(__u32 const buf[4], __u32 const in[12])
> #define HASH_MASK ((1 << HASH_BITS) - 1)
>
> static struct keydata {
> - __u32 count; /* already shifted to the final position */
> __u32 secret[12];
> + __u32 count; /* already shifted to the final position */
> } ____cacheline_aligned ip_keydata[2];
>
> static unsigned int ip_cnt;
--
Mathematics is the supreme nostalgia of our time.
On Wed, Mar 16, 2011 at 4:54 AM, Matt Mackall <[email protected]> wrote:
> On Sun, 2011-03-13 at 20:20 -0400, George Spelvin wrote:
>> Cache aligning the secret[] buffer makes copying from it infinitesimally
>> more efficient.
>
> Acked-by: Matt Mackall <[email protected]>
Acked-by: Pekka Enberg <[email protected]>
On Sun, 13 Mar 2011, George Spelvin wrote:
> Cache aligning the secret[] buffer makes copying from it infinitesimally
> more efficient.
> ---
> drivers/char/random.c | 2 +-
> 1 files changed, 1 insertions(+), 1 deletions(-)
>
> diff --git a/drivers/char/random.c b/drivers/char/random.c
> index 72a4fcb..4bcc4f2 100644
> --- a/drivers/char/random.c
> +++ b/drivers/char/random.c
> @@ -1417,8 +1417,8 @@ static __u32 twothirdsMD4Transform(__u32 const buf[4], __u32 const in[12])
> #define HASH_MASK ((1 << HASH_BITS) - 1)
>
> static struct keydata {
> - __u32 count; /* already shifted to the final position */
> __u32 secret[12];
> + __u32 count; /* already shifted to the final position */
> } ____cacheline_aligned ip_keydata[2];
>
> static unsigned int ip_cnt;
I'm intrigued: please educate me. On what architectures does cache-
aligning a 48-byte buffer (previously offset by 4 bytes) speed up
copying from it, and why? Does the copying involve 8-byte or 16-byte
instructions that benefit from that alignment, rather than cacheline
alignment?
Thanks,
Hugh
> I'm intrigued: please educate me. On what architectures does cache-
> aligning a 48-byte buffer (previously offset by 4 bytes) speed up
> copying from it, and why? Does the copying involve 8-byte or 16-byte
> instructions that benefit from that alignment, rather than cacheline
> alignment?
I had two thoughts in my head when I wrote that:
1) A smart compiler could note the alignment and issue wider copy
instructions. (Especially on alignment-required architectures.)
2) The cacheline fetch would get more data faster. The data would
be transferred in the first 6 beats of the load from RAM (assuming a
64-bit data bus) rather than waiting for 7, so you'd finish the copy
1 ns sooner or so. Similar 1-cycle win on a 128-bit Ln->L(n-1) cache
transfer.
As I said, "infinitesimal". The main reason that I bothered to
generate a patch was that it appealed to my sense of neatness to
keep the 3x16-byte buffer 16-byte aligned.
On Wed, 16 Mar 2011, George Spelvin wrote:
> > I'm intrigued: please educate me. On what architectures does cache-
> > aligning a 48-byte buffer (previously offset by 4 bytes) speed up
> > copying from it, and why? Does the copying involve 8-byte or 16-byte
> > instructions that benefit from that alignment, rather than cacheline
> > alignment?
>
> I had two thoughts in my head when I wrote that:
> 1) A smart compiler could note the alignment and issue wider copy
> instructions. (Especially on alignment-required architectures.)
Right, that part of it would benefit from stronger alignment,
but does not generally need cacheline alignment.
> 2) The cacheline fetch would get more data faster. The data would
> be transferred in the first 6 beats of the load from RAM (assuming a
> 64-bit data bus) rather than waiting for 7, so you'd finish the copy
> 1 ns sooner or so. Similar 1-cycle win on a 128-bit Ln->L(n-1) cache
> transfer.
That argument worries me. I don't know enough to say whether you are
correct or not. But if you are correct, then it worries me that your
patch will be the first of a trickle growing to a stream to an avalanche
of patches where people align and reorder structures so that the most
commonly accessed fields are at the beginning of the cacheline, so that

those can then be accessed minutely faster.
Aargh, and now I am setting off the avalanche with that remark.
Please, someone, save us by discrediting George's argument.
>
> As I said, "infinitesimal". The main reason that I bothered to
> generate a patch was that it appealed to my sense of neatness to
> keep the 3x16-byte buffer 16-byte aligned.
Ah, now you come clean! Yes, it does feel neater to me too;
but I doubt that would be sufficient justification by itself.
Hugh
On Wed, 2011-03-16 at 10:17 -0700, Hugh Dickins wrote:
> On Sun, 13 Mar 2011, George Spelvin wrote:
>
> > Cache aligning the secret[] buffer makes copying from it infinitesimally
> > more efficient.
> > ---
> > drivers/char/random.c | 2 +-
> > 1 files changed, 1 insertions(+), 1 deletions(-)
> >
> > diff --git a/drivers/char/random.c b/drivers/char/random.c
> > index 72a4fcb..4bcc4f2 100644
> > --- a/drivers/char/random.c
> > +++ b/drivers/char/random.c
> > @@ -1417,8 +1417,8 @@ static __u32 twothirdsMD4Transform(__u32 const buf[4], __u32 const in[12])
> > #define HASH_MASK ((1 << HASH_BITS) - 1)
> >
> > static struct keydata {
> > - __u32 count; /* already shifted to the final position */
> > __u32 secret[12];
> > + __u32 count; /* already shifted to the final position */
> > } ____cacheline_aligned ip_keydata[2];
> >
> > static unsigned int ip_cnt;
>
> I'm intrigued: please educate me. On what architectures does cache-
> aligning a 48-byte buffer (previously offset by 4 bytes) speed up
> copying from it, and why? Does the copying involve 8-byte or 16-byte
> instructions that benefit from that alignment, rather than cacheline
> alignment?
I think this alignment exists to minimize the number of cacheline
bounces on SMP as this can be a pretty hot structure in the network
stack. It could probably benefit from a per-cpu treatment.
--
Mathematics is the supreme nostalgia of our time.
> I think this alignment exists to minimize the number of cacheline
> bounces on SMP as this can be a pretty hot structure in the network
> stack. It could probably benefit from a per-cpu treatment.
>
Well, this is a mostly read area of memory, dirtied every 5 minutes.
Compare this to 'jiffies' for example ;)
What could be done is to embed 'ip_cnt' inside ip_keydata[0] for
example, to avoid wasting a cache line for one bit ;)
c1606c40 b ip_cnt
<hole>
c1606c80 b ip_keydata
>> 1) A smart compiler could note the alignment and issue wider copy
>> instructions. (Especially on alignment-required architectures.)
> Right, that part of it would benefit from stronger alignment,
> but does not generally need cacheline alignment.
Agreed. The only reason the structure is cacheline aligned is to keep
it all in a single cache line, and swapping the order of the elements
made the buffer more aligned without hurting the counter.
>> 2) The cacheline fetch would get more data faster. The data would
>> be transferred in the first 6 beats of the load from RAM (assuming a
>> 64-bit data bus) rather than waiting for 7, so you'd finish the copy
>> 1 ns sooner or so. Similar 1-cycle win on a 128-bit Ln->L(n-1) cache
>> transfer.
> That argument worries me. I don't know enough to say whether you are
> correct or not. But if you are correct, then it worries me that your
> patch will be the first of a trickle growing to a stream to an avalanche
> of patches where people align and reorder structures so that the most
> commonly accessed fields are at the beginning of the cacheline, so that
> those can then be accessed minutely faster.
>
> Aargh, and now I am setting off the avalanche with that remark.
> Please, someone, save us by discrediting George's argument.
It was mostly #1 and #3. The *important* thing is to minimize the number
of cache lines touched by common operations, which has already been the
subject of a lot of kernel patches.
Remember, most hardware does have critical-word-first loads. So alignment
to the width of the data bus is enough. "Keep it naturally aligned" is
all that's necessary, and most kernel data structures already obey that.
I was just extending it, because I wanted to make it *possible* to use
wider loads.
>> As I said, "infinitesimal". The main reason that I bothered to
>> generate a patch was that it appealed to my sense of neatness to
>> keep the 3x16-byte buffer 16-byte aligned.
> Ah, now you come clean! Yes, it does feel neater to me too;
> but I doubt that would be sufficient justification by itself.
It took both factors to make it worth it to me. The real reason was:
1) Neater
2) Definitely not slower
3) Maybe a tiny bit faster
Conclusion: do it.
Sorry to alarm you.
On Wed, 2011-03-16 at 20:26 +0100, Eric Dumazet wrote:
> > I think this alignment exists to minimize the number of cacheline
> > bounces on SMP as this can be a pretty hot structure in the network
> > stack. It could probably benefit from a per-cpu treatment.
> >
>
> Well, this is a mostly read area of memory, dirtied every 5 minutes.
>
> Compare this to 'jiffies' for example ;)
>
> What could be done is to embed 'ip_cnt' inside ip_keydata[0] for
> example, to avoid wasting a cache line for one bit ;)
>
>
> c1606c40 b ip_cnt
> <hole>
> c1606c80 b ip_keydata
Yeah. I actually think we're due for rethinking this entire process. It
dates back to when we introduced syncookies in the 90s and it shows. The
fact that we've started abusing it for other things doesn't help.
--
Mathematics is the supreme nostalgia of our time.
On Wed, Mar 16, 2011 at 03:45:42PM -0400, George Spelvin wrote:
>
> > Ah, now you come clean! Yes, it does feel neater to me too;
> > but I doubt that would be sufficient justification by itself.
>
> It took both factors to make it worth it to me. The real reason was:
> 1) Neater
> 2) Definitely not slower
> 3) Maybe a tiny bit faster
> Conclusion: do it.
I'm with Hugh on this, the justification seems pretty weak.
Cheers,
--
Email: Herbert Xu <[email protected]>
Home Page: http://gondor.apana.org.au/~herbert/
PGP Key: http://gondor.apana.org.au/~herbert/pubkey.txt