2002-11-20 07:40:58

by Andrew Morton

Subject: the random driver


I'm seeing a context switch rate of 150/sec just writing a
big file to disk at 20 megabytes/sec.

It is coming out of add_disk_randomness()'s invocation of
batch_entropy_store().

That function is setting up deferred punt-to-process-context
for every disk request, and has the potential to cause 1000
context switches per second. This is clearly excessive.

There is a 256 slot buffer in the random driver for this,
and we are not using it at all effectively. I do intend
to submit the below patch which will cause one context switch
per 128 requests.

But this is a minimal fix. The batch_entropy_pool handling
in there needs work.

a) It's racy. The head and tail pointers have no SMP protection
and a race will cause it to dump 128 already-processed items
back into the entropy pool.

b) It's weird. What's up with this?

batch_entropy_pool[2*batch_head] = a;
batch_entropy_pool[(2*batch_head) + 1] = b;

It should be an array of 2-element structures.
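
Something along these lines (just a sketch, the struct and field names
are made up) would keep each sample in a single slot:

	struct batch_sample {
		u32 data[2];	/* the (a, b) pair passed in */
		int credit;	/* entropy credit for this sample */
	};

	static struct batch_sample batch_entropy_pool[256];

	/* in batch_entropy_store(): */
	batch_entropy_pool[batch_head].data[0] = a;
	batch_entropy_pool[batch_head].data[1] = b;
	batch_entropy_pool[batch_head].credit = num;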

c) The ring-buffer handling is awkward. It shouldn't be masking
the head and tail pointers to always remain within bounds.

A better technique is to allow these indices to wrap at
0xffffffff and only mask their values when you actually use
them as a subscript. This allows you to distinguish the
completely-full case from the completely-empty one. See
LOG_BUF* in kernel/printk.c.
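
In terms of the existing batch_entropy_* variables that would look
something like this (sketch only; assumes batch_max stays a power of
two and the indices are unsigned):

	static unsigned int batch_head, batch_tail;	/* free-running, never masked */

	/* store: the buffer is full when the indices are batch_max apart */
	if (batch_head - batch_tail < batch_max) {
		int j = batch_head & (batch_max - 1);	/* mask only for the subscript */
		batch_entropy_pool[2*j] = a;
		batch_entropy_pool[(2*j) + 1] = b;
		batch_entropy_credit[j] = num;
		batch_head++;
	}

	/* drain: the buffer is empty when the indices are equal */
	while (batch_tail != batch_head) {
		int j = batch_tail & (batch_max - 1);
		/* ... mix pool[2*j] and pool[(2*j)+1] into the entropy pool ... */
		batch_tail++;
	}

With masked indices, head == tail is ambiguous between empty and full;
with free-running indices the difference is the exact fill level.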

d) It's punting work up to process context which could be performed
right there in interrupt context.

My suggestion, if anyone cares, is to convert the entropy pool
into smaller per-cpu buffers, protected by local_irq_save() only.
This way the global lock (which isn't there yet) only needs to
be taken when a CPU is actually dumping its buffer.
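
Roughly like this (a sketch only, not against any particular tree; the
structure and constant names are invented):

	#define PER_CPU_BATCH 64		/* must be a power of two */

	struct cpu_batch {
		unsigned int head, tail;
		struct { u32 a, b; int credit; } sample[PER_CPU_BATCH];
	} ____cacheline_aligned;

	static struct cpu_batch cpu_batch[NR_CPUS];

	void batch_entropy_store(u32 a, u32 b, int num)
	{
		struct cpu_batch *cb;
		unsigned long flags;
		unsigned int j;

		local_irq_save(flags);		/* local CPU only - no global lock */
		cb = &cpu_batch[smp_processor_id()];
		j = cb->head & (PER_CPU_BATCH - 1);
		cb->sample[j].a = a;
		cb->sample[j].b = b;
		cb->sample[j].credit = num;
		cb->head++;
		if (cb->head - cb->tail >= PER_CPU_BATCH / 2) {
			/* only now take the (global) pool lock and drain
			 * this CPU's buffer into the entropy pool */
		}
		local_irq_restore(flags);
	}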



--- 25/drivers/char/random.c~reduce-random-context-switch-rate	Tue Nov 19 23:17:12 2002
+++ 25-akpm/drivers/char/random.c	Tue Nov 19 23:18:11 2002
@@ -663,7 +663,7 @@ void batch_entropy_store(u32 a, u32 b, i
 	batch_entropy_credit[batch_head] = num;
 
 	new = (batch_head+1) & (batch_max-1);
-	if (new != batch_tail) {
+	if ((new - batch_tail) >= batch_max / 2) {
 		/*
 		 * Schedule it for the next timer tick:
 		 */

_


2002-11-20 08:06:06

by Aaron Lehmann

Subject: Re: the random driver

Speaking of the random driver, did the entropy calculation fixes ever
get merged? Those seemed important.

2002-11-20 15:28:24

by Ingo Oeser

Subject: Re: the random driver

Hi Andrew,

On Tue, Nov 19, 2002 at 11:46:53PM -0800, Andrew Morton wrote:
> c) The ring-buffer handling is awkward. It shouldn't be masking
> the head and tail pointers to always remain within bounds.
>
> A better technique is to allow these indices to wrap at
> 0xffffffff and only mask their values when you actually use
> them as a subscript. This allows you to distinguish the
> completely-full case from the completely-empty one. See
> LOG_BUF* in kernel/printk.c.

Care to implement a generic ringbuffer in a header file, to avoid
this kind of error? I'm sure there are more (and even overflows
on some implementations).

The gory implementation details of basic computer science
algorithms seem to cause problems for many driver writers.

So the trend of abstracting these away in the kernel is right, as long
as it costs no time and doesn't lead to ugly code.
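
Something like this in a header, maybe (only a sketch, the names are
invented), using the free-running-index trick Andrew describes:

	/* linux/ringbuf.h (hypothetical) - size must be a power of two */
	struct ringbuf {
		unsigned int head, tail;	/* free-running indices */
		unsigned int size;
		void **slot;
	};

	static inline int ringbuf_full(struct ringbuf *r)
	{
		return r->head - r->tail == r->size;
	}

	static inline int ringbuf_empty(struct ringbuf *r)
	{
		return r->head == r->tail;
	}

	static inline void ringbuf_put(struct ringbuf *r, void *p)
	{
		r->slot[r->head & (r->size - 1)] = p;
		r->head++;
	}

	static inline void *ringbuf_get(struct ringbuf *r)
	{
		void *p = r->slot[r->tail & (r->size - 1)];
		r->tail++;
		return p;
	}

Callers would check ringbuf_full()/ringbuf_empty() before put/get, and
provide their own locking.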

Regards

Ingo Oeser
--
Science is what we can tell a computer. Art is everything else. --- D.E.Knuth

2002-11-20 16:21:06

by Theodore Ts'o

Subject: Re: the random driver

On Tue, Nov 19, 2002 at 11:46:53PM -0800, Andrew Morton wrote:
> a) It's racy. The head and tail pointers have no SMP protection
> and a race will cause it to dump 128 already-processed items
> back into the entropy pool.

Yeah, that's a real problem. The random driver was never adequately
locked for the SMP case. We also have a problem on the output side;
two processes that read from /dev/random at the same time can get the
exact same value. This is **bad**, especially if it is being used for
UUID generation or for session key generation.

The output side SMP locking is on my todo queue, but this week I'm in
Atlanta dealing with IPsec at the IETF meeting.... when I get
back in Boston next week, I'll look at fixing this, but if someone
wants to beat me to it, feel free....

> b) It's weird. What's up with this?
>
> batch_entropy_pool[2*batch_head] = a;
> batch_entropy_pool[(2*batch_head) + 1] = b;
>
> It should be an array of 2-element structures.

The entropy returned by the drivers is essentially just an arbitrary
64 bit value. It's treated as two 32 bit values so that we don't lose
horribly given GCC's pathetic 64-bit code generator for the ia32
platform.

> d) It's punting work up to process context which could be performed
> right there in interrupt context.

The idea was to try to pacify the soft realtime nazis who are
stressing out over every single microsecond of interrupt latency.
Realistically, it's about a dozen memory cache misses, so it's
not *that* bad. Originally, though, the batched work was being done in
a bottom-half handler, so there wasn't a process context switch
overhead. So perhaps we should rethink the design decision of
deferring the work in the interests of reducing interrupt latency.

> My suggestion, if anyone cares, is to convert the entropy pool
> into smaller per-cpu buffers, protected by local_irq_save() only.
> This way the global lock (which isn't there yet) only needs to
> be taken when a CPU is actually dumping its buffer.

Yeah, that's probably what we should do.

- Ted

2002-11-20 19:29:08

by Andrew Morton

Subject: Re: the random driver

Theodore Ts'o wrote:
>
> On Tue, Nov 19, 2002 at 11:46:53PM -0800, Andrew Morton wrote:
> > a) It's racy. The head and tail pointers have no SMP protection
> > and a race will cause it to dump 128 already-processed items
> > back into the entropy pool.
>
> Yeah, that's a real problem. The random driver was never adequately
> locked for the SMP case. We also have a problem on the output side;
> two processes that read from /dev/random at the same time can get the
> exact same value. This is **bad**, especially if it is being used for
> UUID generation or for session key generation.

It was pointed out (alleged?) to me that the lack of input-side locking is
a feature - if the SMP race hits, it adds unpredictability.

> ...
> > b) It's weird. What's up with this?
> >
> > batch_entropy_pool[2*batch_head] = a;
> > batch_entropy_pool[(2*batch_head) + 1] = b;
> >
> > It should be an array of 2-element structures.
>
> The entropy returned by the drivers is essentially just an arbitrary
> 64 bit value. It's treated as two 32 bit values so that we don't lose
> horribly given GCC's pathetic 64-bit code generator for the ia32
> platform.

heh, I see. Presumably u64 loads and stores would be OK though?

> > d) It's punting work up to process context which could be performed
> > right there in interrupt context.
>
> The idea was to try to pacify the soft realtime nazis who are
> stressing out over every single microsecond of interrupt latency.
> Realistically, it's about a dozen memory cache misses, so it's
> not *that* bad. Originally, though, the batched work was being done in
> a bottom-half handler, so there wasn't a process context switch
> overhead. So perhaps we should rethink the design decision of
> deferring the work in the interests of reducing interrupt latency.

That would suit. If you go this way, the batching is probably
detrimental - it would increase peak latencies. Could do the
work direct in the interrupt handler or schedule a softirq.
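
For instance (a sketch, assuming the existing batch-processing
function can be made callable from softirq context):

	static void batch_entropy_tasklet(unsigned long data);
	static DECLARE_TASKLET(batch_tasklet, batch_entropy_tasklet, 0);

	/* in batch_entropy_store(), instead of queueing work to a kernel thread: */
	tasklet_schedule(&batch_tasklet);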

I think what bit us in 2.5 was the HZ=1000 change - with HZ=100
the context switch rate would be lower. But yes, using a workqueue
here seems inappropriate.

The whole idea of scheduling the work on the calling CPU is a
little inappropriate in this case. I have one CPU working hard
and three idle. Yet the deferred work and all the context
switching is being performed on the busy CPU. hmm.

2002-11-20 20:35:34

by Oliver Xymoron

Subject: Re: the random driver

On Tue, Nov 19, 2002 at 11:46:53PM -0800, Andrew Morton wrote:
>
> I'm seeing a context switch rate of 150/sec just writing a
> big file to disk at 20 megabytes/sec.
>
> It is coming out of add_disk_randomness()'s invocation of
> batch_entropy_store().

First off, there's very little unobservable randomness in those disk
ops, especially in a file/webserver context, so the value of
add_disk_randomness is questionable.

> That function is setting up deferred punt-to-process-context
> for every disk request, and has the potential to cause 1000
> context switches per second. This is clearly excessive.
>
> There is a 256 slot buffer in the random driver for this,
> and we are not using it at all effectively. I do intend
> to submit the below patch which will cause one context switch
> per 128 requests.

Done that.

> But this is a minimal fix. The batch_entropy_pool handling
> in there needs work.
>
> a) It's racy. The head and tail pointers have no SMP protection
> and a race will cause it to dump 128 already-processed items
> back into the entropy pool.

I have a rewrite that fails safe and is lockless. But this is nothing
compared to the completely broken entropy accounting in
xfer_secondary_pool. Try this: cat /dev/random, wait for it to block,
and then tap your mouse.

> d) It's punting work up to process context which could be performed
> right there in interrupt context.

Disagree. Did you look at the mixing function? It'll dirty a large
chunk of cache.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."

2002-11-20 20:38:01

by Oliver Xymoron

Subject: Re: the random driver

On Wed, Nov 20, 2002 at 12:13:05AM -0800, Aaron Lehmann wrote:
> Speaking of the random driver, did the entropy calculation fixes ever
> get merged? Those seemed important.

Nope. I don't have a working 2.5 at the moment, will resubmit them
when the feature freeze firms up a bit.

--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."