2004-03-25 22:50:29

by Matt Mackall

Subject: Re: Fw: potential /dev/urandom scalability improvement

On Thu, Mar 25, 2004 at 02:19:23PM -0800, Andrew Morton wrote:
>
> Matt, could you please review David's changes?
>
> Begin forwarded message:
>
> Date: Wed, 24 Mar 2004 22:06:57 -0800
> From: David Mosberger <[email protected]>
> To: [email protected]
> Cc: [email protected], [email protected]
> Subject: potential /dev/urandom scalability improvement
>
>
> Hi Andrew,
>
> I'm addressing this patch to you because you seem to have been the
> person who most recently made some performance improvements to the
> random driver.

That was probably me, actually.

> Oh, and somebody who actually understands this code may want to
> double-check the patch for correctness.

Seems perfectly sane.

However, I've got a few pending patches that touch the same areas and
do some more critical cleanup that I've been sitting on since the
2.6.0 freeze. So perhaps I should start pushing those again and we can
queue this behind them. David, if you get a chance, grab the latest
copy of my linux-tiny tree from

http://www.selenic.com/tiny/2.6.5-rc2-tiny1-broken-out.tar.bz2
http://www.selenic.com/tiny/2.6.5-rc2-tiny1.patch.bz2

and see how I've tweaked the pool structure and the locking and how
your bits fit with it.

> +#ifdef ARCH_HAS_PREFETCH
> +	for (cp = (char *) r->pool; cp <= (char *) (r->pool + wordmask); cp += PREFETCH_STRIDE)
> +		prefetch(cp);
> +#endif

Can we avoid adding this ifdef in some fashion? What does the compiler
generate here when prefetch is a no-op? This seems to call for a
prefetch_range(start, len) function/macro in any case.
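
(For context: on architectures that define no prefetch instruction, include/linux/prefetch.h falls back to a no-op inline along these lines; this is a paraphrase, not the exact file contents.)

#ifndef ARCH_HAS_PREFETCH
/* No-op fallback. Callers' loops still compile; whether the optimizer
 * then discards the surrounding pointer arithmetic entirely is exactly
 * the question. */
static inline void prefetch(const void *x) { ; }
#endif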

--
Matt Mackall : http://www.selenic.com : Linux development and consulting


2004-03-26 01:45:57

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Thu, 25 Mar 2004 16:47:27 -0600, Matt Mackall <[email protected]> said:

>> I'm addressing this patch to you because you seem to have been
>> the person who most recently made some performance improvements
>> to the random driver.

Matt> That was probably me, actually.

Sorry, that's what I get for trusting the BK log.

Matt> However, I've got a few pending patches that touch the same
Matt> areas and do some more critical cleanup that I've been sitting
Matt> on since the 2.6.0 freeze. So perhaps I should start pushing
Matt> those again and we can queue this behind them. David, if you
Matt> get a chance, grab the latest copy of my linux-tiny tree from

Matt>
Matt> http://www.selenic.com/tiny/2.6.5-rc2-tiny1-broken-out.tar.bz2
Matt> http://www.selenic.com/tiny/2.6.5-rc2-tiny1.patch.bz2

Matt> and see how I've tweaked the pool structure and the locking
Matt> and how your bits fit with it.

Not much left of the original bits!

>> +#ifdef ARCH_HAS_PREFETCH

Matt> Can we avoid adding this ifdef in some fashion? What does the
Matt> compiler generate here when prefetch is a no-op? This seems to
Matt> call for a prefetch_range(start, len) function/macro in any
Matt> case.

Sounds reasonable, but I would prefer to do this in separate steps.

I tried your changes and performance was virtually unchanged. The
patch below is updated to go on top of your patch and gives about the
same performance as I reported yesterday. For now, I defined an
inline prefetch_range(). If and when all architectures get updated to
define this directly, we can simply remove prefetch_range() from the
driver.

Thanks

--david

--
--- drivers/char/random.c 2004-03-25 17:41:54.432358997 -0800
+++ drivers/char/random.c-davidm 2004-03-25 17:34:59.040063215 -0800
@@ -421,14 +421,20 @@
**********************************************************************/

struct entropy_store {
+ /* mostly-read data: */
const char *name;
+ struct poolinfo *poolinfo;
+ __u32 *pool;
+ /*
+ * read-write data (colocate with lock such that when we get
+ * the lock, we get the other data for "free"; may cause some
+ * extra bus-transactions, though):
+ */
+ spinlock_t lock ____cacheline_aligned;
unsigned add_ptr;
int entropy_count;
int input_rotate;
int reserved;
- struct poolinfo *poolinfo;
- spinlock_t lock;
- __u32 *pool;
};

static __u32 input_pool_data[INPUT_POOL_WORDS];
@@ -456,6 +462,16 @@
.pool = nonblocking_pool_data
};

+static inline void
+prefetch_range (void *addr, size_t len)
+{
+#ifdef ARCH_HAS_PREFETCH
+	char *cp, *end = (char *) addr + len;
+	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
+		prefetch(cp);
+#endif
+}
+
/*
* This function adds words into the entropy "pool". It does not
* update the entropy estimate. The caller should call
@@ -472,17 +488,30 @@
static __u32 const twist_table[8] = {
0, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
- unsigned i;
- int new_rotate;
- int wordmask = r->poolinfo->poolwords - 1;
- __u32 w;
- unsigned long flags;
+ unsigned long i, add_ptr, tap1, tap2, tap3, tap4, tap5;
+ int new_rotate, input_rotate;
+ unsigned long flags, wordmask = r->poolinfo->poolwords - 1;
+ __u32 w, next_w, *pool = r->pool;
+
+ /* Taps are constant, so we can load them without holding r->lock. */
+ tap1 = r->poolinfo->tap1;
+ tap2 = r->poolinfo->tap2;
+ tap3 = r->poolinfo->tap3;
+ tap4 = r->poolinfo->tap4;
+ tap5 = r->poolinfo->tap5;
+ next_w = *in++;

spin_lock_irqsave(&r->lock, flags);

+ prefetch_range(pool, 4 * (wordmask + 1));
+ input_rotate = r->input_rotate;
+ add_ptr = r->add_ptr;
+
while (nwords--) {
- w = rol32(*in++, r->input_rotate);
- i = r->add_ptr = (r->add_ptr - 1) & wordmask;
+ w = rol32(next_w, input_rotate);
+ if (nwords > 0)
+ next_w = *in++;
+ i = add_ptr = (add_ptr - 1) & wordmask;

/*
* Normally, we add 7 bits of rotation to the pool.
@@ -490,21 +519,24 @@
* rotation, so that successive passes spread the
* input bits across the pool evenly.
*/
- new_rotate = r->input_rotate + 14;
+ new_rotate = input_rotate + 14;
if (i)
- new_rotate = r->input_rotate + 7;
- r->input_rotate = new_rotate & 31;
+ new_rotate = input_rotate + 7;
+ input_rotate = new_rotate & 31;

/* XOR in the various taps */
- w ^= r->pool[(i + r->poolinfo->tap1) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap2) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap3) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap4) & wordmask];
- w ^= r->pool[(i + r->poolinfo->tap5) & wordmask];
- w ^= r->pool[i];
- r->pool[i] = (w >> 3) ^ twist_table[w & 7];
+ w ^= pool[(i + tap1) & wordmask];
+ w ^= pool[(i + tap2) & wordmask];
+ w ^= pool[(i + tap3) & wordmask];
+ w ^= pool[(i + tap4) & wordmask];
+ w ^= pool[(i + tap5) & wordmask];
+ w ^= pool[i];
+ pool[i] = (w >> 3) ^ twist_table[w & 7];
}

+ r->input_rotate = input_rotate;
+ r->add_ptr = add_ptr;
+
spin_unlock_irqrestore(&r->lock, flags);
}

2004-03-26 02:01:15

by Andrew Morton

Subject: Re: Fw: potential /dev/urandom scalability improvement

David Mosberger <[email protected]> wrote:
>
> The
> patch below is updated to go on top of your patch and gives about the
> same performance as I reported yesterday. For now, I defined an
> inline prefetch_range(). If and when all architectures get updated to
> define this directly, we can simply remove prefetch_range() from the
> driver.

We may as well stick prefetch_range() in prefetch.h.

And Matt's patch series is not a thing I want to take on board at present,
so let's stick with the straight scalability patch for now.

I moved the prefetch_range() call to outside the spinlock. Does that make
sense?

25-akpm/drivers/char/random.c | 51 ++++++++++++++++++++++++++-------------
25-akpm/include/linux/prefetch.h | 11 ++++++++
2 files changed, 46 insertions(+), 16 deletions(-)

diff -puN drivers/char/random.c~urandom-scalability-fix drivers/char/random.c
--- 25/drivers/char/random.c~urandom-scalability-fix 2004-03-25 17:53:57.498675480 -0800
+++ 25-akpm/drivers/char/random.c 2004-03-25 17:57:39.795881168 -0800
@@ -490,12 +490,15 @@ static inline __u32 int_ln_12bits(__u32
**********************************************************************/

struct entropy_store {
+ /* mostly-read data: */
+ struct poolinfo poolinfo;
+ __u32 *pool;
+
+ /* read-write data: */
+ spinlock_t lock ____cacheline_aligned;
unsigned add_ptr;
int entropy_count;
int input_rotate;
- struct poolinfo poolinfo;
- __u32 *pool;
- spinlock_t lock;
};

/*
@@ -571,38 +574,54 @@ static void add_entropy_words(struct ent
static __u32 const twist_table[8] = {
0, 0x3b6e20c8, 0x76dc4190, 0x4db26158,
0xedb88320, 0xd6d6a3e8, 0x9b64c2b0, 0xa00ae278 };
- unsigned i;
- int new_rotate;
+ unsigned long i, add_ptr, tap1, tap2, tap3, tap4, tap5;
+ int new_rotate, input_rotate;
int wordmask = r->poolinfo.poolwords - 1;
- __u32 w;
+ __u32 w, next_w;
unsigned long flags;

+ /* Taps are constant, so we can load them without holding r->lock. */
+ tap1 = r->poolinfo.tap1;
+ tap2 = r->poolinfo.tap2;
+ tap3 = r->poolinfo.tap3;
+ tap4 = r->poolinfo.tap4;
+ tap5 = r->poolinfo.tap5;
+ next_w = *in++;
+
+ prefetch_range(r->pool, wordmask);
spin_lock_irqsave(&r->lock, flags);
+ input_rotate = r->input_rotate;
+ add_ptr = r->add_ptr;

while (nwords--) {
- w = rotate_left(r->input_rotate, *in++);
- i = r->add_ptr = (r->add_ptr - 1) & wordmask;
+ w = rotate_left(input_rotate, next_w);
+ if (nwords > 0)
+ next_w = *in++;
+ i = add_ptr = (add_ptr - 1) & wordmask;
/*
* Normally, we add 7 bits of rotation to the pool.
* At the beginning of the pool, add an extra 7 bits
* rotation, so that successive passes spread the
* input bits across the pool evenly.
*/
- new_rotate = r->input_rotate + 14;
+ new_rotate = input_rotate + 14;
if (i)
- new_rotate = r->input_rotate + 7;
- r->input_rotate = new_rotate & 31;
+ new_rotate = input_rotate + 7;
+ input_rotate = new_rotate & 31;

/* XOR in the various taps */
- w ^= r->pool[(i + r->poolinfo.tap1) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap2) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap3) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap4) & wordmask];
- w ^= r->pool[(i + r->poolinfo.tap5) & wordmask];
+ w ^= r->pool[(i + tap1) & wordmask];
+ w ^= r->pool[(i + tap2) & wordmask];
+ w ^= r->pool[(i + tap3) & wordmask];
+ w ^= r->pool[(i + tap4) & wordmask];
+ w ^= r->pool[(i + tap5) & wordmask];
w ^= r->pool[i];
r->pool[i] = (w >> 3) ^ twist_table[w & 7];
}

+ r->input_rotate = input_rotate;
+ r->add_ptr = add_ptr;
+
spin_unlock_irqrestore(&r->lock, flags);
}

diff -puN include/linux/prefetch.h~urandom-scalability-fix include/linux/prefetch.h
--- 25/include/linux/prefetch.h~urandom-scalability-fix 2004-03-25 17:54:27.279148160 -0800
+++ 25-akpm/include/linux/prefetch.h 2004-03-25 17:55:34.456935584 -0800
@@ -54,4 +54,15 @@ static inline void prefetchw(const void
#define PREFETCH_STRIDE (4*L1_CACHE_BYTES)
#endif

+static inline void prefetch_range(void *addr, size_t len)
+{
+#ifdef ARCH_HAS_PREFETCH
+	char *cp;
+	char *end = addr + len;
+
+	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
+		prefetch(cp);
+#endif
+}
+
#endif

_

2004-03-26 02:10:50

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Thu, 25 Mar 2004 18:00:14 -0800, Andrew Morton <[email protected]> said:

Andrew> We may as well stick prefetch_range() in prefetch.h.

Fair enough.

Andrew> And Matt's patch series is not a thing I want to take on
Andrew> board at present, so let's stick with the straight
Andrew> scalability patch for now.

Andrew> I moved the prefetch_range() call to outside the spinlock.
Andrew> Does that make sense?

The other CPUs will dirty that data, so prefetching it before you hold
the lock is almost certainly bad for performance. (Well, to be
precise, it really depends on the number of lines dirtied.)

--david

2004-03-26 04:07:36

by Matt Mackall

Subject: Re: Fw: potential /dev/urandom scalability improvement

On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
> David Mosberger <[email protected]> wrote:
> >
> > The
> > patch below is updated to go on top of your patch and gives about the
> > same performance as I reported yesterday. For now, I defined an
> > inline prefetch_range(). If and when all architectures get updated to
> > define this directly, we can simply remove prefetch_range() from the
> > driver.
>
> We may as well stick prefetch_range() in prefetch.h.
>
> And Matt's patch series is not a thing I want to take on board at present,
> so let's stick with the straight scalability patch for now.

Sigh, I'll trim it back to some bits I think are critical.

> I moved the prefetch_range() call to outside the spinlock. Does that make
> sense?

I don't think that's actually a win. If there's contention, threads
racing to the lock will grab the same cache lines and all but one
thread's cache will end up invalidated by the time the lock is
released.

--
Matt Mackall : http://www.selenic.com : Linux development and consulting

2004-03-26 04:19:39

by Matt Mackall

Subject: Re: Fw: potential /dev/urandom scalability improvement

On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
>
> 25-akpm/drivers/char/random.c | 51 ++++++++++++++++++++++++++-------------
> 25-akpm/include/linux/prefetch.h | 11 ++++++++
> 2 files changed, 46 insertions(+), 16 deletions(-)
>
> diff -puN drivers/char/random.c~urandom-scalability-fix drivers/char/random.c
> +++ 25-akpm/drivers/char/random.c 2004-03-25 17:57:39.795881168 -0800
> @@ -490,12 +490,15 @@ static inline __u32 int_ln_12bits(__u32
> **********************************************************************/
>
> struct entropy_store {
> + /* mostly-read data: */
> + struct poolinfo poolinfo;
> + __u32 *pool;
> +
> + /* read-write data: */
> + spinlock_t lock ____cacheline_aligned;
> unsigned add_ptr;
> int entropy_count;
> int input_rotate;
> - struct poolinfo poolinfo;
> - __u32 *pool;
> - spinlock_t lock;
> };


Also, I think in general we'd prefer to stick the aligned bit at the
front of the structure rather than at the middle, as we'll avoid extra
padding. The size of cachelines is getting rather obscene on some
modern processors.
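
To illustrate the padding, a minimal userspace sketch; the 128-byte line size, the stand-in spinlock_t and the struct names are all invented for illustration:

#include <stdio.h>

#define CACHELINE 128	/* assumed line size, for illustration */
#define ____cacheline_aligned __attribute__((__aligned__(CACHELINE)))

typedef struct { int slock; } spinlock_t;	/* stand-in */

struct mid {					/* aligned member in the middle */
	int mostly_read;
	spinlock_t lock ____cacheline_aligned;	/* padding inserted before this */
	int read_write;
};						/* sizeof rounds up to 256 */

struct front {					/* aligned member first */
	spinlock_t lock ____cacheline_aligned;
	int read_write;
	int mostly_read;			/* but now shares the written-to line */
};						/* sizeof rounds up to 128 */

int main(void)
{
	printf("mid: %zu  front: %zu\n", sizeof(struct mid), sizeof(struct front));
	return 0;
}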

--
Matt Mackall : http://www.selenic.com : Linux development and consulting

2004-03-26 04:51:59

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Thu, 25 Mar 2004 22:19:26 -0600, Matt Mackall <[email protected]> said:

> struct entropy_store {
> + /* mostly-read data: */
> + struct poolinfo poolinfo;
> + __u32 *pool;
> +
> + /* read-write data: */
> + spinlock_t lock ____cacheline_aligned;
> unsigned add_ptr;
> int entropy_count;
> int input_rotate;
> - struct poolinfo poolinfo;
> - __u32 *pool;
> - spinlock_t lock;
> };

Matt> Also, I think in general we'd prefer to stick the aligned bit at the
Matt> front of the structure rather than at the middle, as we'll avoid extra
Matt> padding. The size of cachelines is getting rather obscene on some
Matt> modern processors.

Not sharing the cacheline between the mostly-read data and the
read-write data is the _point_ of this change. If you reverse the
order, the "poolinfo" and "pool" members will also get invalidated
whenever someone updates the write-intensive data.

--david

2004-03-26 05:15:58

by Matt Mackall

Subject: Re: Fw: potential /dev/urandom scalability improvement

On Thu, Mar 25, 2004 at 08:51:54PM -0800, David Mosberger wrote:
> >>>>> On Thu, 25 Mar 2004 22:19:26 -0600, Matt Mackall <[email protected]> said:
>
> > struct entropy_store {
> > + /* mostly-read data: */
> > + struct poolinfo poolinfo;
> > + __u32 *pool;
> > +
> > + /* read-write data: */
> > + spinlock_t lock ____cacheline_aligned;
> > unsigned add_ptr;
> > int entropy_count;
> > int input_rotate;
> > - struct poolinfo poolinfo;
> > - __u32 *pool;
> > - spinlock_t lock;
> > };
>
> Matt> Also, I think in general we'd prefer to stick the aligned bit at the
> Matt> front of the structure rather than at the middle, as we'll avoid extra
> Matt> padding. The size of cachelines is getting rather obscene on some
> Matt> modern processors.
>
> Not sharing the cacheline between the mostly-read data and the
> read-write data is the _point_ of this change. If you reverse the
> order, the "poolinfo" and "pool" members will also get invalidated
> whenever someone updates the write-intensive data.

Ok, previous observation made no sense; I should really be taking a
nap right now. Hopefully this next one will make more sense: it ought
to be ____cacheline_aligned_in_smp as the zero-byte spinlock struct
still forces alignment.
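
(The relevant definitions, paraphrased from <linux/cache.h>: the _in_smp variant compiles away on uniprocessor builds, where the empty spinlock_t would otherwise still drag a full cacheline of padding into the struct.)

#ifdef CONFIG_SMP
#define ____cacheline_aligned_in_smp	____cacheline_aligned
#else
#define ____cacheline_aligned_in_smp
#endif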

--
Matt Mackall : http://www.selenic.com : Linux development and consulting

2004-03-26 05:24:08

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Thu, 25 Mar 2004 23:15:32 -0600, Matt Mackall <[email protected]> said:

Matt> Ok, previous observation made no sense; I should really be taking a
Matt> nap right now. Hopefully this next one will make more sense: it ought
Matt> to be ____cacheline_aligned_in_smp as the zero-byte spinlock struct
Matt> still forces alignment.

Makes tons of sense, agreed.

--david

2004-03-26 11:13:32

by Dave Jones

Subject: Re: Fw: potential /dev/urandom scalability improvement

On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:

> +static inline void prefetch_range(void *addr, size_t len)
> +{
> +#ifdef ARCH_HAS_PREFETCH
> + char *cp;
> + char *end = addr + len;
> +
> + for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
> + prefetch(cp);
> +#endif
> +}
> +
> #endif

I think this may be dangerous on some CPUs: it may prefetch past
the end of the buffer. I.e., if PREFETCH_STRIDE were 32 and len
were 65, we'd end up prefetching 65->97. As well as being
wasteful of cachelines, this can crash if, for example, there's
nothing mapped after the next page boundary.

Dave

2004-03-26 18:08:46

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Fri, 26 Mar 2004 11:06:19 +0000, Dave Jones <[email protected]> said:

Dave> On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
>> +static inline void prefetch_range(void *addr, size_t len)
>> +{
>> +#ifdef ARCH_HAS_PREFETCH
>> +	char *cp;
>> +	char *end = addr + len;
>> +
>> +	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
>> +		prefetch(cp);
>> +#endif
>> +}
>> +
>> #endif

Dave> I think this may be dangerous on some CPUs: it may prefetch
Dave> past the end of the buffer. I.e., if PREFETCH_STRIDE were 32 and
Dave> len were 65, we'd end up prefetching 65->97. As well as being
Dave> wasteful of cachelines, this can crash if, for example, there's
Dave> nothing mapped after the next page boundary.

Huh? It only ever prefetches addresses that are _within_ the
specified buffer. Of course it will prefetch entire cachelines, but I
hope you're not worried about cachelines crossing page boundaries! ;-))

--david

2004-03-26 18:37:59

by Dave Jones

Subject: Re: Fw: potential /dev/urandom scalability improvement

On Fri, Mar 26, 2004 at 10:08:39AM -0800, David Mosberger wrote:

> > I think this may be dangerous on some CPUs: it may prefetch
> > past the end of the buffer. I.e., if PREFETCH_STRIDE were 32 and
> > len were 65, we'd end up prefetching 65->97. As well as being
> > wasteful of cachelines, this can crash if, for example, there's
> > nothing mapped after the next page boundary.
>
> Huh? It only ever prefetches addresses that are _within_ the
> specified buffer. Of course it will prefetch entire cachelines, but I
> hope you're not worried about cachelines crossing page boundaries! ;-))

The only proposed user of this may take care of this requirement, but
I'm more concerned about someone unaware of the requirement using this
helper routine. At the least it deserves a comment, IMO.

Dave

2004-03-26 18:50:09

by Andrew Morton

Subject: Re: Fw: potential /dev/urandom scalability improvement

David Mosberger <[email protected]> wrote:
>
> >>>>> On Fri, 26 Mar 2004 11:06:19 +0000, Dave Jones <[email protected]> said:
>
> Dave> On Thu, Mar 25, 2004 at 06:00:14PM -0800, Andrew Morton wrote:
> >> +static inline void prefetch_range(void *addr, size_t len)
> >> +{
> >> +#ifdef ARCH_HAS_PREFETCH
> >> +	char *cp;
> >> +	char *end = addr + len;
> >> +
> >> +	for (cp = addr; cp < end; cp += PREFETCH_STRIDE)
> >> +		prefetch(cp);
> >> +#endif
> >> +}
> >> +
> >> #endif
>
> Dave> I think this may be dangerous on some CPUs: it may prefetch
> Dave> past the end of the buffer. I.e., if PREFETCH_STRIDE were 32 and
> Dave> len were 65, we'd end up prefetching 65->97. As well as being
> Dave> wasteful of cachelines, this can crash if, for example, there's
> Dave> nothing mapped after the next page boundary.
>
> Huh? It only ever prefetches addresses that are _within_ the
> specified buffer. Of course it will prefetch entire cachelines, but I
> hope you're not worried about cachelines crossing page boundaries! ;-))
>

But the start address which is fed into prefetch_range() may not be
cacheline-aligned. So if appropriately abused, a prefetch_range() could
wander off the end of the user's buffer and into a new page.

I think this gets it right, but I probably screwed something up.

static inline void prefetch_range(void *addr, size_t len)
{
#ifdef ARCH_HAS_PREFETCH
	char *cp;
	unsigned long end;

	end = ((unsigned long)addr + len + PREFETCH_STRIDE - 1);
	end &= ~(PREFETCH_STRIDE - 1);

	for (cp = addr; cp < (char *)end; cp += PREFETCH_STRIDE)
		prefetch(cp);
#endif
}

2004-03-26 20:25:12

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Fri, 26 Mar 2004 10:49:04 -0800, Andrew Morton <[email protected]> said:

Andrew> But the start address which is fed into prefetch_range() may
Andrew> not be cacheline-aligned. So if appropriately abused, a
Andrew> prefetch_range() could wander off the end of the user's
Andrew> buffer and into a new page.

Andrew> I think this gets it right, but I probably screwed something
Andrew> up.

Please, let's not make this more complicated than it is. The
cacheline alignment doesn't matter at all. Provided prefetch_range()
is given a range of guaranteed-to-be-valid memory, it will be
fine. It never touches anything outside the specified range.
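
A worked trace of the loop bounds, with PREFETCH_STRIDE assumed to be 32 for the example:

#include <stdio.h>

#define PREFETCH_STRIDE 32	/* assumed, for the example */

int main(void)
{
	char buf[65], *cp, *end = buf + sizeof(buf);

	/* Same bounds as prefetch_range(buf, 65): */
	for (cp = buf; cp < end; cp += PREFETCH_STRIDE)
		printf("prefetch at offset %d\n", (int)(cp - buf));
	/* Prints 0, 32, 64. Every prefetched address lies strictly
	 * inside [buf, buf+65); the CPU rounds each one to its
	 * enclosing cacheline, and a cacheline never crosses a page
	 * boundary. */
	return 0;
}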

--david

2004-03-26 20:34:04

by Andrew Morton

Subject: Re: Fw: potential /dev/urandom scalability improvement

David Mosberger <[email protected]> wrote:
>
> >>>>> On Fri, 26 Mar 2004 10:49:04 -0800, Andrew Morton <[email protected]> said:
>
> Andrew> But the start address which is fed into prefetch_range() may
> Andrew> not be cacheline-aligned. So if appropriately abused, a
> Andrew> prefetch_range() could wander off the end of the user's
> Andrew> buffer and into a new page.
>
> Andrew> I think this gets it right, but I probably screwed something
> Andrew> up.
>
> Please, let's not make this more complicated than it is. The
> cacheline alignment doesn't matter at all. Provided prefetch_range()
> is given a range of guaranteed-to-be-valid memory, it will be
> fine. It never touches anything outside the specified range.

If someone does, say,

prefetch_range(some_pointer, sizeof(*some_pointer));

then it is possible that prefetch_range() could

a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
and, as a consequence,

b) prefetch data from the next page, outside the range of the user's
(addr,len).

2004-03-26 20:49:56

by Arjan van de Ven

Subject: Re: Fw: potential /dev/urandom scalability improvement


> a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
> and, as a consequence,
>
> b) prefetch data from the next page, outside the range of the user's
> (addr,len).

Well, if you assume that cachelines (and the prefetch stride) are proper
divisors of PAGE_SIZE, is that still true?



2004-03-26 21:12:38

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Fri, 26 Mar 2004 12:33:03 -0800, Andrew Morton <[email protected]> said:

Andrew> If someone does, say,

Andrew> prefetch_range(some_pointer, sizeof(*some_pointer));

Andrew> then it is possible that prefetch_range() could

Andrew> a) execute a prefetch at addresses which are not
Andrew> PREFETCH_STRIDE-aligned and, as a consequence,

Andrew> b) prefetch data from the next page, outside the range of
Andrew> the user's (addr,len).

This is getting silly. Cache-lines _never_ cross page-boundaries.

--david

2004-03-26 21:19:36

by Andrew Morton

Subject: Re: Fw: potential /dev/urandom scalability improvement

Arjan van de Ven <[email protected]> wrote:
>
> > a) execute a prefetch at addresses which are not PREFETCH_STRIDE-aligned
> > and, as a consequence,
> >
> > b) prefetch data from the next page, outside the range of the user's
> > (addr,len).
>
> Well, if you assume that cachelines (and the prefetch stride) are proper
> divisors of PAGE_SIZE, is that still true?

Probably not ;)

If someone does

prefetch_range(4090, 20);

on 4k pagesize, what should we do?

Issuing a single

prefetch 4090

sounds reasonable.

In that case my version arranges for it to perform

prefetch (4096 - 32)

instead, which seems neater.

2004-03-26 21:33:41

by David Mosberger

Subject: Re: Fw: potential /dev/urandom scalability improvement

>>>>> On Fri, 26 Mar 2004 18:23:26 +0000, Dave Jones <[email protected]> said:

Dave> The only proposed user of this may take care of this
Dave> requirement, but I'm more concerned about someone unaware of the
Dave> requirement using this helper routine. At the least it
Dave> deserves a comment, IMO.

A comment sounds like a fine idea. But while we're at it, the equivalent
comment for prefetch() and prefetchw() seems just as appropriate and
perhaps even more important.
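
For instance, something like this (hypothetical wording):

/*
 * prefetch(), prefetchw() and prefetch_range() are only hints, but
 * every address passed to them must lie within mapped memory: on some
 * CPUs a prefetch of an unmapped address can fault.
 */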

--david

2004-03-27 07:49:17

by Arjan van de Ven

Subject: Re: Fw: potential /dev/urandom scalability improvement

On Fri, Mar 26, 2004 at 01:17:36PM -0800, Andrew Morton wrote:
> If someone does
>
> prefetch_range(4090, 20);
>
> on 4k pagesize, what should we do?

Well, if the data structure you prefetch crosses a page boundary, you
know both pages are safe to prefetch.

> Issuing a single
>
> prefetch 4090
>
> sounds reasonable.
>
> In that case my version arranges for it to perform
>
> prefetch (4096 - 32)
>
> instead, which seems neater.

Well, prefetch is defined as "get the cacheline the address is located in";
i.e., the CPU does the rounding to cacheline size for you.

(BTW, the prefetch stride isn't the cacheline size! It's how far ahead
you're supposed to prefetch to get any useful gain, so for aligning and
such it's useless.)
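
In other words, the intended use is a prefetch-ahead loop, roughly like
this sketch (it assumes the prefetch()/PREFETCH_STRIDE definitions from
linux/prefetch.h; copy_words() itself is invented):

static void copy_words(__u32 *dst, const __u32 *src, int nwords)
{
	const char *limit = (const char *) (src + nwords);
	const char *ahead;
	int i;

	for (i = 0; i < nwords; i++) {
		/* Ask for data PREFETCH_STRIDE bytes further on, so it
		 * arrives by the time the loop catches up; stop at the
		 * end of the buffer to stay within valid memory. */
		ahead = (const char *) &src[i] + PREFETCH_STRIDE;
		if (ahead < limit)
			prefetch(ahead);
		dst[i] = src[i];
	}
}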

