2008-07-01 12:46:27

by Benjamin Herrenschmidt

Subject: Delayed interrupt work, thread pools

Here's something that's been running in the back of my mind for some
time that could be a good topic of discussion at KS.

In various areas (I'll come up with some examples later), kernel code
such as drivers want to defer some processing to "task level", for
various reasons such as locking (taking mutexes), memory allocation,
interrupt latency, or simply doing things that may take more time than
is reasonable to do at interrupt time or do things that may block.

Currently, the main mechanism we provide to do that is workqueues. They
somewhat solve the problem, but at the same time, somewhat can make it
worse.

The problem is that delaying a potentially long/sleeping task to a work
queue will have the effect of delaying everything else waiting on that
work queue.

The ability to have per-cpu work queues helps in areas where the problem
scope is mostly per-cpu, but doesn't necessarily cover the case where
the problem scope depends on the driver's activity, not necessarily tied
to one CPU.

Let's take some examples: the main one (which triggers my email) is
spufs, ie, the management of the SPU "co-processors" on the Cell
processor, though the same thing mostly applies to any similar
co-processor architecture that needs to service page faults to access
user memory.

In this case, various contexts running on the device may want to service
long operations (ie. handle_mm_fault in this case), but using the main
work queue or even a dedicated per-cpu one will let one context
potentially hold up other contexts, or other drivers trying to do the
same, while the first one is blocked in the page fault code waiting for IOs...

The basic interface that such drivers want is still about the same as
workqueues, though: "call that function at task level as soon as possible".

Thus the idea of turning workqueues into some kind of pool of threads.

At a given point in time, if none are available (idle) and work stacks
up, the kernel can allocate a new bunch and dispatch more work. Of
course, we would have to fine-tune the actual algorithm that decides
whether to allocate new threads or just wait / throttle until the
current delayed work completes. But I believe the basic premise still
stands.

So how about we allocate a "pool" of task structs, initially blocked,
ready to service jobs dispatched from interrupt time, with some
mechanism, possibly based on the existing base work queue, that can
allocate more if too much work stacks up or (via some scheduler
feedback) too many of the current ones are blocked (ie. waiting for IOs
for example).

For the specific SPU management issue we've been thinking about, we
could just implement an ad-hoc mechanism locally, but it occurs to me
that maybe this is a more generic problem and thus some kind of
extension to workqueues would be a good idea here.

Any comments ?

Cheers,
Ben.


2008-07-01 12:55:15

by Matthew Wilcox

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Tue, Jul 01, 2008 at 10:45:35PM +1000, Benjamin Herrenschmidt wrote:
> In various areas (I'll come up with some examples later), kernel code
> such as drivers want to defer some processing to "task level", for
> various reasons such as locking (taking mutexes), memory allocation,
> interrupt latency, or simply doing things that may take more time than
> is reasonable to do at interrupt time or do things that may block.
>
> Currently, the main mechanism we provide to do that is workqueues. They
> somewhat solve the problem, but at the same time, somewhat can make it
> worse.

Why not just use a dedicated thread? The API to start / stop threads is
now pretty easy to use.

--
Intel are signing my paycheques ... these opinions are still mine
"Bill, look, we understand that you're interested in selling us this
operating system, but compare it to ours. We can't possibly take such
a retrograde step."

2008-07-01 13:04:05

by Robin Holt

Subject: Re: Delayed interrupt work, thread pools

Adding Dean Nelson to this discussion. I don't think he actively
follows lkml. We do something similar to this in xpc by managing our
own pool of threads. I know he has talked about this type of thing in the
past.

Thanks,
Robin


On Tue, Jul 01, 2008 at 10:45:35PM +1000, Benjamin Herrenschmidt wrote:
> Here's something that's been running in the back of my mind for some
> time that could be a good topic of discussion at KS.
>
> In various areas (I'll come up with some examples later), kernel code
> such as drivers want to defer some processing to "task level", for
> various reasons such as locking (taking mutexes), memory allocation,
> interrupt latency, or simply doing things that may take more time than
> is reasonable to do at interrupt time or do things that may block.
>
> Currently, the main mechanism we provide to do that is workqueues. They
> somewhat solve the problem, but at the same time, somewhat can make it
> worse.
>
> The problem is that delaying a potentially long/sleeping task to a work
> queue will have the effect of delaying everything else waiting on that
> work queue.
>
> The ability to have per-cpu work queues helps in areas where the problem
> scope is mostly per-cpu, but doesn't necessarily cover the case where
> the problem scope depends on the driver's activity, not necessarily tied
> to one CPU.
>
> Let's take some examples: The main one (which triggers my email) is
> spufs, ie, the management of the SPU "co-processors" on the cell
> processor, though the same thing mostly applies to any similar
> co-processor architecture that would require the need to service page
> faults to access user memory.
>
> In this case, various contexts running on the device may want to service
> long operations (ie. handle_mm_fault in this case), but using the main
> work queue or even a dedicated per-cpu one will cause a context to
> potentially hog other contexts or other drivers trying to do the same
> while the first one is blocked in the page fault code waiting for IOs...
>
> The basic interface that such drivers want is still about the same as
> workqueues, though: "call that function at task level as soon as possible".
>
> Thus the idea of turning workqueues into some kind of pool of threads.
>
> At a given point in time, if none are available (idle) and work stacks
> up, the kernel can allocate a new bunch and dispatch more work. Of
> course, we would have to fine-tune the actual algorithm that decides
> whether to allocate new threads or just wait / throttle until the
> current delayed work completes. But I believe the basic premise still
> stands.
>
> So how about we allocate a "pool" of task structs, initially blocked,
> ready to service jobs dispatched from interrupt time, with some
> mechanism, possibly based on the existing base work queue, that can
> allocate more if too much work stacks up or (via some scheduler
> feedback) too many of the current ones are blocked (ie. waiting for IOs
> for example).
>
> For the specific SPU management issue we've been thinking about, we
> could just implement an ad-hoc mechanism locally, but it occurs to me
> that maybe this is a more generic problem and thus some kind of
> extension to workqueues would be a good idea here.
>
> Any comments ?
>
> Cheers,
> Ben.
>
>

2008-07-01 13:38:49

by Benjamin Herrenschmidt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Tue, 2008-07-01 at 06:53 -0600, Matthew Wilcox wrote:
> On Tue, Jul 01, 2008 at 10:45:35PM +1000, Benjamin Herrenschmidt wrote:
> > In various areas (I'll come up with some examples later), kernel code
> > such as drivers want to defer some processing to "task level", for
> > various reasons such as locking (taking mutexes), memory allocation,
> > interrupt latency, or simply doing things that may take more time than
> > is reasonable to do at interrupt time or do things that may block.
> >
> > Currently, the main mechanism we provide to do that is workqueues. They
> > somewhat solve the problem, but at the same time, somewhat can make it
> > worse.
>
> Why not just use a dedicated thread? The API to start / stop threads is
> now pretty easy to use.

A dedicated thread isn't far from a dedicated workqueue. The thread can
be blocked servicing a page fault and that will delay any further work.

In the case of spufs, we could solve that by having a dedicated thread
per context. That's probably what we'll do for our proof-of-concept
implementation of our new ideas. But that sounds like overkill; there
shouldn't be -that- many page faults. The same goes for gfx cards with
MMUs, etc... we'd end up with a shitload of dedicated threads mostly
sitting there sleeping and wasting kernel resources.

Another option I thought about would be something akin to some of the
threadlet discussions (or whatever we call those nowadays), ie, have the
workqueue fork when it blocks, basically. That would require some API
changes though, as current drivers may rely on the fact that all
workqueue tasks are serialized.

Cheers,
Ben.

2008-07-02 01:39:37

by Dean Nelson

Subject: Re: Delayed interrupt work, thread pools

On Tue, Jul 01, 2008 at 08:02:40AM -0500, Robin Holt wrote:
> Adding Dean Nelson to this discussion. I don't think he actively
> follows lkml. We do something similar to this in xpc by managing our
> own pool of threads. I know he has talked about this type thing in the
> past.
>
> Thanks,
> Robin
>
>
> On Tue, Jul 01, 2008 at 10:45:35PM +1000, Benjamin Herrenschmidt wrote:
> >
> > For the specific SPU management issue we've been thinking about, we
> > could just implement an ad-hoc mechanism locally, but it occurs to me
> > that maybe this is a more generic problem and thus some kind of
> > extension to workqueues would be a good idea here.
> >
> > Any comments ?

As Robin mentioned, XPC manages a pool of kthreads that can (for performance
reasons) be quickly awakened by an interrupt handler and that are able to
block for indefinite periods of time.

In drivers/misc/sgi-xp/xpc_main.c you'll find a rather simplistic attempt
at maintaining this pool of kthreads.

The kthreads are activated by calling xpc_activate_kthreads(). Either idle
kthreads are awakened or new kthreads are created if a sufficient number of
idle kthreads are not available.

Once finished with current 'work' a kthread waits for new work by calling
wait_event_interruptible_exclusive(). (The call is found in
xpc_kthread_waitmsgs().)

The number of idle kthreads is limited as is the total number of kthreads
allowed to exist concurrently.

It's certainly not optimal in the way it maintains the number of kthreads
in the pool over time, but I've not had the time to spare to make it better.

I'd love it if a general mechanism were provided so that XPC could get out
of maintaining its own pool.

Thanks,
Dean

2008-07-02 02:39:14

by Benjamin Herrenschmidt

Subject: Re: Delayed interrupt work, thread pools

On Tue, 2008-07-01 at 20:39 -0500, Dean Nelson wrote:
> As Robin mentioned, XPC manages a pool of kthreads that can (for performance
> reasons) be quickly awakened by an interrupt handler and that are able to
> block for indefinite periods of time.
>
> In drivers/misc/sgi-xp/xpc_main.c you'll find a rather simplistic attempt
> at maintaining this pool of kthreads.
>
> The kthreads are activated by calling xpc_activate_kthreads(). Either idle
> kthreads are awakened or new kthreads are created if a sufficient number of
> idle kthreads are not available.
>
> Once finished with current 'work' a kthread waits for new work by calling
> wait_event_interruptible_exclusive(). (The call is found in
> xpc_kthread_waitmsgs().)
>
> The number of idle kthreads is limited as is the total number of kthreads
> allowed to exist concurrently.
>
> It's certainly not optimal in the way it maintains the number of kthreads
> in the pool over time, but I've not had the time to spare to make it better.
>
> I'd love it if a general mechanism were provided so that XPC could get out
> of maintaining its own pool.

Thanks. That makes one existing in-tree user and one likely WIP user,
probably enough to move forward :-)

I'll look at your implementation and discuss internally to see what our
specific needs in terms of number of threads etc. look like.

I might come up with something simple first (ie, generalizing your
current implementation for example) and then look at some smarter
management of the thread pools.

Cheers,
Ben.

2008-07-02 02:48:18

by Dave Chinner

Subject: Re: Delayed interrupt work, thread pools

On Wed, Jul 02, 2008 at 12:38:52PM +1000, Benjamin Herrenschmidt wrote:
> On Tue, 2008-07-01 at 20:39 -0500, Dean Nelson wrote:
> > As Robin mentioned, XPC manages a pool of kthreads that can (for performance
> > reasons) be quickly awakened by an interrupt handler and that are able to
> > block for indefinite periods of time.
> >
> > In drivers/misc/sgi-xp/xpc_main.c you'll find a rather simplistic attempt
> > at maintaining this pool of kthreads.
.....
> > I'd love it if a general mechanism were provided so that XPC could get out
> > of maintaining its own pool.
>
> Thanks. That makes one existing in-tree user and one likely WIP user,
> probably enough to move forward :-)

FWIW, the NFS server has a fairly sophisticated thread pool
implementation that allows interesting control of pool
affinity. Look up struct svc_pool in your local tree ;)

Cheers,

Dave.
--
Dave Chinner
[email protected]

2008-07-02 04:23:39

by Arjan van de Ven

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

Benjamin Herrenschmidt wrote:
> Here's something that's been running in the back of my mind for some
> time that could be a good topic of discussion at KS.
>
> In various areas (I'll come up with some examples later), kernel code
> such as drivers want to defer some processing to "task level", for
> various reasons such as locking (taking mutexes), memory allocation,
> interrupt latency, or simply doing things that may take more time than
> is reasonable to do at interrupt time or do things that may block.
>
> Currently, the main mechanism we provide to do that is workqueues. They
> somewhat solve the problem, but at the same time, somewhat can make it
> worse.
>
> The problem is that delaying a potentially long/sleeping task to a work
> queue will have the effect of delaying everything else waiting on that
> work queue.
>

how much of this would be obsoleted if we had irqthreads ?

2008-07-02 05:45:05

by Benjamin Herrenschmidt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools


> how much of this would be obsoleted if we had irqthreads ?

I'm not sure irqthreads is what I want...

First, can they call handle_mm_fault ? (ie, I'm not sure precisely what
kind of context those operate in).

But even if that's ok, it doesn't quite satisfy my primary needs unless
we can fire off an irqthread per interrupt -occurrence- rather than
having an irqthread per source.

There are two aspects to the problem. The less important one is that I
need to be able to service other interrupts from that source after
firing off the "job".

For example, the GFX chip or the SPU in my case takes a page fault when
accessing the user mm context it's attached to, I fire off a thread to
handle it (which I attach/detach from the mm, catch signals, etc...),
but that doesn't stop execution. Transfers to/from main memory on the
SPU (and to some extend on graphic chips) are asynchronous and thus the
SPU can still run and emit other interrupts representing different
conditions (though not other page faults).

The second aspect which is more important in the SPU case is that they
context switch. While an SPU context causes a page fault, and I fire off
that thread to service it, I want to be able to context switch some
other context on the SPU which will itself emit interrupts etc... on
that same source.

I could get away with simply allocating a kernel thread per SPU context,
and that's what we're going to do in our proof-of-concept
implementation, but I was hoping to avoid it with the thread pools in
the long run, thus saving a few resources left and right and loading the
main scheduler lists less with huge amounts of mostly idle threads.

Now regarding the other usage scenarios mentioned here (XPC and the NFS
server) that already have thread pools: how many of those would also be
replaced by irqthreads ? I don't think many, offhand, but I can't say for
sure until I have a look ... Again, that may be me just not
understanding what irqthreads are, but it looks to me that they are one
thread per IRQ source or so, without the ability for a single IRQ source
to fire off multiple threads. Maybe if irqthreads could fork() that
would be an option...

In any case, Dave's message implies we have at least two existing
in-tree thread pool implementations for two users, and possibly spufs
being a 3rd one (I'm keeping graphics at bay for now as I see that being
a more long-term scenario). Probably worth looking at some consolidation.

Anyway, time for me to go look at the XPC and NFS code and see if there
is anything worth putting in common there. Might take me a little
while; there is nothing urgent (which is why I was thinking about a KS
chat, but the list is fine too), and we are doing a proof-of-concept
implementation using per-context threads in the meantime anyway.

Cheers,
Ben.

2008-07-02 11:03:27

by Andi Kleen

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

Benjamin Herrenschmidt <[email protected]> writes:

>> how much of this would be obsoleted if we had irqthreads ?
>
> I'm not sure irqthreads is what I want...
>
> First, can they call handle_mm_fault ? (ie, I'm not sure precisely what
> kind of context those operate into).

Interrupt threads would be kernel threads, and kernel threads
run with a lazy (= random) mm; calling handle_mm_fault on that
wouldn't be very useful because you would affect a random mm.

Ok, you could force them to run with a specific mm, but that would
first cause lifetime issues with the original mm (how could you
ever free it?) and also increase the interrupt handling latency,
because the interrupt would then be a nearly full-blown VM context
switch.

I also think interrupt threads are a bad idea in many cases because
their whole "advantage" over classical interrupts is that they can
block. Now blocking can take an unbounded, potentially long
time.

What do you do when there are more interrupts in that unbounded time?

Create more interrupt threads? At some point you'll have hundreds
of threads doing nothing when you're unlucky.

Keep the interrupt event blocked? Then you'll not be handling
any events for a long time.

The usual Linux design, where you make the interrupt handler fast
and bounded in time, seems far more sane to me.

RT Linux has interrupt threads, and they work around this problem
by assuming that the blocking events are only short (only
what would normally be spinlocks). If someone held a lock there
long enough then the bad things described above would happen.

Given that a non-RT kernel is usually also not too happy when a
spinlock is held for too long, it's usually ensured that this is
not the case.

But for your case these guarantees (only short lock regions
block) would not hold (handle_mm_fault can block for
a long time in some cases) and then all hell could break loose.

So in short, I don't think interrupt threads are a solution to your
problem, or that they even solve any problem in a non-RT kernel.

-Andi

2008-07-02 11:20:20

by Leon Woestenberg

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

Hello,

(including linux-rt-users in the CC:, irqthreads are on-topic there)

On Wed, Jul 2, 2008 at 1:02 PM, Andi Kleen <[email protected]> wrote:
> Benjamin Herrenschmidt <[email protected]> writes:
>
>>> how much of this would be obsoleted if we had irqthreads ?
>>
>> I'm not sure irqthreads is what I want...
>>
> I also think interrupts threads are a bad idea in many cases because
> their whole "advantage" over classical interrupts is that they can
> block. Now blocking can be usually take a unbounded potentially long
> time.
>
> What do you do when there are more interrupts in that unbounded time?
>
If by irqthreads the -rt implementation is meant, isn't this what happens:

irq kernel handler masks the source interrupt
irq handler wakes the matching irqthread (they are always present)
irqthread is scheduled, does the work and returns
irq kernel handler unmasks the source interrupt

> Create more interrupt threads? At some point you'll have hundreds
> of threads doing nothing when you're unlucky.
>
Each irqthread handles one irq,
so no new irq thread would spawn for any interrupt.

Regards,
--
Leon

2008-07-02 11:25:17

by Andi Kleen

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

Leon Woestenberg wrote:
> Hello,
>
> (including linux-rt-users in the CC:, irqthreads are on-topic there)

Actually it's probably not interesting for this case.

>
> On Wed, Jul 2, 2008 at 1:02 PM, Andi Kleen <[email protected]> wrote:
>> Benjamin Herrenschmidt <[email protected]> writes:
>>
>>>> how much of this would be obsoleted if we had irqthreads ?
>>> I'm not sure irqthreads is what I want...
>>>
>> I also think interrupts threads are a bad idea in many cases because
>> their whole "advantage" over classical interrupts is that they can
>> block. Now blocking can be usually take a unbounded potentially long
>> time.
>>
>> What do you do when there are more interrupts in that unbounded time?
>>
> If by irqthreads the -rt implementation is meant, isn't this what happens:
>
> irq kernel handler masks the source interrupt
> irq handler awakes the matching irqthread (they always are present)
> irqthread is scheduled, does work and returns
> irq kernel unmasks the source interrupt

I described this case. If the interrupt handler blocks for a long
time (as Ben asked for) then interrupts will not be handled
for a long time. Probably not what you want.

BTW, this was not a criticism of RT Linux (in whose context
irqthreads make sense, as I explained), just an explanation of why
they IMHO don't make sense in a non hard-RT kernel, and especially
not as a solution to Ben's issue.

>
>> Create more interrupt threads? At some point you'll have hundreds
>> of threads doing nothing when you're unlucky.
>>
> Each irqthread handles one irq,
> so no new irq thread would spawn for any interrupt.

It was a general description of all possible irqthreads.

-Andi

2008-07-02 14:11:51

by James Bottomley

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Wed, 2008-07-02 at 15:44 +1000, Benjamin Herrenschmidt wrote:
> > how much of this would be obsoleted if we had irqthreads ?
>
> I'm not sure irqthreads is what I want...
>
> First, can they call handle_mm_fault ? (ie, I'm not sure precisely what
> kind of context those operate into).
>
> But even if that's ok, it doesn't quite satisfy my primary needs unless
> we can fire off an irqthread per interrupt -occurence- rather than
> having an irqthread per source.
>
> There are two aspects to the problem. The less important one is that I need
> to be able to service other interrupts from that source
> after firing off the "job".
>
> For example, the GFX chip or the SPU in my case takes a page fault when
> accessing the user mm context it's attached to, I fire off a thread to
> handle it (which I attach/detach from the mm, catch signals, etc...),
> but that doesn't stop execution. Transfers to/from main memory on the
> SPU (and to some extent on graphic chips) are asynchronous and thus the
> SPU can still run and emit other interrupts representing different
> conditions (though not other page faults).
>
> The second aspect which is more important in the SPU case is that they
> context switch. While an SPU context causes a page fault, and I fire off
> that thread to service it, I want to be able to context switch some
> other context on the SPU which will itself emit interrupts etc... on
> that same source.
>
> I could get away by simply allocating a kernel thread per SPU context,
> and that's what we're going to do in our proof-of-concept
> implementation, but I was hoping to avoid it with the thread pools in
> the long run, thus saving a few resources left and right and loading the
> main scheduler lists less with huge amounts of mostly idle threads.
>
> Now regarding the other usage scenario mentioned here (XPC and the NFS
> server) that already have thread pools, how much of these would be also
> replaced by irqthreads ? I don't think many, offhand, but I can't say for
> sure until I have a look ... Again, that may be me just not
> understanding what irqthreads are but it looks to me that they are one
> thread per IRQ source or so, not the ability for a single IRQ source to
> fire off multiple threads. Maybe if irqthreads could fork() that would
> be an option...
>
> In any case, Dave's message implies we have at least two existing in-tree
> thread pool implementations for two users and possibly spufs being a 3rd
> one (I'm keeping graphics at bay for now as I see that being a more long
> term scenario). Probably worth looking at some consolidation.
>
> Anyway, time for me to go look at the XPC and NFS code and see if there
> is anything worth putting in common in there. Might take me a little
> while, there is nothing urgent (which is why I was thinking about a KS
> chat but the list is fine too), we are doing a proof-of-concept
> implementation using per-context threads in the meantime anyway.

If you really need the full scheduling capabilities of threads, then it
sounds like a threadpool is all you need (and we should just provide a
unified interface).

Initially you were implying you'd prefer some type of non-blocking
workqueue (i.e. a workqueue that shifts to the next work item when an
earlier item blocks). I can see this construct being useful because it
would have easier-to-use semantics and be more lightweight than a full
thread spawn. It strikes me we could use some of the syslets work to do
this ... all the queue needs is a "next activation head", which will be
the next job in the queue in the absence of blocking. When a job
blocks, syslets informs the workqueue and it moves on to the work at the
"next activation head". If a prior job unblocks, syslets informs the
queue and it moves the "next activation head" to the unblocked job.
What this does is implement a really simple scheduler within a
single workqueue, which I'm unsure is actually a good idea since
schedulers are complex and tricky things, but it is probably worthy of
discussion.

James

2008-07-02 14:27:41

by Hugh Dickins

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Wed, 2 Jul 2008, Benjamin Herrenschmidt wrote:
> On Tue, 2008-07-01 at 20:39 -0500, Dean Nelson wrote:
> > As Robin mentioned, XPC manages a pool of kthreads that can (for performance
> > reasons) be quickly awakened by an interrupt handler and that are able to
> > block for indefinite periods of time.
> >
> > In drivers/misc/sgi-xp/xpc_main.c you'll find a rather simplistic attempt
> > at maintaining this pool of kthreads.
> >
> > The kthreads are activated by calling xpc_activate_kthreads(). Either idle
> > kthreads are awakened or new kthreads are created if a sufficient number of
> > idle kthreads are not available.
> >
> > Once finished with current 'work' a kthread waits for new work by calling
> > wait_event_interruptible_exclusive(). (The call is found in
> > xpc_kthread_waitmsgs().)
> >
> > The number of idle kthreads is limited as is the total number of kthreads
> > allowed to exist concurrently.
> >
> > It's certainly not optimal in the way it maintains the number of kthreads
> > in the pool over time, but I've not had the time to spare to make it better.
> >
> > I'd love it if a general mechanism were provided so that XPC could get out
> > of maintaining its own pool.
>
> Thanks. That makes one existing in-tree user and one likely WIP user,
> probably enough to move forward :-)
>
> I'll look at your implementation and discuss internally see what our
> specific needs in term of number of threads etc... look like.
>
> I might come up with something simple first (ie, generalizing your
> current implementation for example) and then look at some smarter
> management of the thread pools.

Do the pdflush daemons (from mm/pdflush.c) provide another example?

Hugh

2008-07-02 20:00:57

by Steven Rostedt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Wed, Jul 02, 2008 at 09:11:36AM -0500, James Bottomley wrote:
>
> If you really need the full scheduling capabilities of threads, then it
> sounds like a threadpool is all you need (and we should just provide a
> unified interface).

Something like this may be useful for the RT kernel as well. Being
able to push off tasks that we could prioritize would be greatly
beneficial.

Too bad we don't have a lighter task struct. Looking at task_struct, it
looks quite heavy for storing lots of threads. Perhaps we can clean
it up some time and remove anything that is only useful for
userspace threads. Not sure how much that would save us.

As for interrupt threads, those would help for some non-RT issues
(having a better desktop feel) but not for the issue that Ben has been
stating. I would be interested in knowing exactly what is needed to
handle a page fault inside the kernel. If we need to do something for a
user space task, then as soon as that task is found the work should be
passed to that thread.

>
> Initially you were implying you'd prefer some type of non blockable
> workqueue (i.e. a workqueue that shifts to the next work item when and
> earlier item blocks). I can see this construct being useful because it
> would have easier to use semantics and be more lightweight than a full
> thread spawn. It strikes me we could use some of the syslets work to do
> this ... all the queue needs is an "next activation head", which will be
> the next job in the queue in the absence of blocking. When a job
> blocks, syslets informs the workqueue and it moves on to the work on the
> "next activation head". If a prior job unblocks, syslets informs the
> queue and it moves the "next activation head" to the unblocked job.
> What this is doing is implementing a really simple scheduler within a
> single workqueue, which I'm unsure is actually a good idea since
> schedulers are complex and tricky things, but it is probably worthy of
> discussion.

I think doing a "mini scheduler" inside a workqueue thread would be a
major hack. We would have to have hooks into the normal scheduler to
let the mini-scheduler know something is blocking, and then have that
scheduler do some work. Not to mention that we would need to handle
preemption.

Having a thread pool sounds much more reasonable and easier to
implement.

BTW, if something like this is implemented, I think that it should be a
replacement for softirqs and tasklets.

-- Steve

2008-07-02 20:22:20

by James Bottomley

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Wed, 2008-07-02 at 16:00 -0400, Steven Rostedt wrote:
> On Wed, Jul 02, 2008 at 09:11:36AM -0500, James Bottomley wrote:
> >
> > If you really need the full scheduling capabilities of threads, then it
> > sounds like a threadpool is all you need (and we should just provide a
> > unified interface).
>
> Something like this may also be useful for the RT kernel as well. Being
> able to push off tasks that we could prioritize would be greatly
> beneficial.
>
> Too bad we don't have a lighter task. Looking at task_struct, it
> looks quite heavy for storing lots of threads. Perhaps we can clean
> it up some time and remove anything that is only useful for
> userspace threads. Not sure how much that would save us.
>
> As for interrupt threads, those would help for some non-RT issues
> (having a better desktop feel) but not for the issue that Ben has been
> stating. I would be interested in knowing exactly what is needing to
> handle a page fault inside the kernel. If we need to do something for a
> user space task, as soon as that task is found the work should be passed
> to that thread.
>
> >
> > Initially you were implying you'd prefer some type of non-blocking
> > workqueue (i.e. a workqueue that shifts to the next work item when an
> > earlier item blocks). I can see this construct being useful because it
> > would have easier-to-use semantics and be more lightweight than a full
> > thread spawn. It strikes me we could use some of the syslets work to do
> > this ... all the queue needs is a "next activation head", which will be
> > the next job in the queue in the absence of blocking. When a job
> > blocks, syslets informs the workqueue and it moves on to the work at the
> > "next activation head". If a prior job unblocks, syslets informs the
> > queue and it moves the "next activation head" to the unblocked job.
> > What this is doing is implementing a really simple scheduler within a
> > single workqueue, which I'm unsure is actually a good idea since
> > schedulers are complex and tricky things, but it is probably worthy of
> > discussion.
>
> I think doing a "mini scheduler" inside a workqueue thread would be a
> major hack. We would have to have hooks into the normal scheduler to
> let the mini-scheduler know something is blocking, and then have that
> scheduler do some work. Not to mention that we need to handle
> preemption.

Not necessarily ... a simplistic round robin is fine.

The work to detect the "am I being blocked" has already been done for
some of the aio patches, so I'm merely suggesting another use for it.

Isn't preemption an orthogonal problem ... it will surely exist even in
the threadpool approach?

> Having a thread pool sounds much more reasonable and easier to
> implement.

Easier to implement, yes. Easier to program, unlikely, and coming with
a large amount of overhead, definitely.

> BTW, if something like this is implemented, I think that it should be a
> replacement for softirqs and tasklets.

James

2008-07-02 20:29:00

by Arjan van de Ven

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

James Bottomley wrote:
> Easier to implement, yes. Easier to program, unlikely, and coming with
> a large amount of overhead, definitely.
>
>> BTW, if something like this is implemented, I think that it should be a
>> replacement for softirqs and tasklets.
>
Under the "better steal right (from other open source) than invent wrong" mantra:
it's worth looking at what glib does here; I've used their thread pools before and
they worked really well for me... we could learn a lot from that.

2008-07-02 20:40:24

by Steven Rostedt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools


On Wed, 2 Jul 2008, James Bottomley wrote:
> >
> > I think doing a "mini scheduler" inside a workqueue thread would be a
> > major hack. We would have to have hooks into the normal scheduler to
> > let the mini-scheduler know something is blocking, and then have that
> > scheduler do some work. Not to mention that we need to handle
> > preemption.
>
> Not necessarily ... a simplistic round robin is fine.

Coming from the RT world, I was hoping for something that gives us
better control over prioritizing the tasks ;-)

>
> The work to detect the "am I being blocked" has already been done for
> some of the aio patches, so I'm merely suggesting another use for it.

Hmm, I didn't realize this. I'll have to go look at that code.

>
> Isn't preemption an orthogonal problem ... it will surely exist even in
> the threadpool approach?

I was just thinking that the scheduler would need to differentiate between
being blocked and being preempted. It seems that any time a task sleeps
(outside of preemption), the mini-scheduler would need to schedule the
next task.

>
> > Having a thread pool sounds much more reasonable and easier to
> > implement.
>
> Easier to implement, yes. Easier to program, unlikely, and coming with
> a large amount of overhead, definitely.

Hmm, I'd argue about the "easier to program" part, but on the overhead I,
unfortunately, have to agree with you.

>
> > BTW, if something like this is implemented, I think that it should be a
> > replacement for softirqs and tasklets.


-- Steve

2008-07-02 20:57:47

by Benjamin Herrenschmidt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Wed, 2008-07-02 at 13:02 +0200, Andi Kleen wrote:
> Benjamin Herrenschmidt <[email protected]> writes:
>
> >> how much of this would be obsoleted if we had irqthreads ?
> >
> > I'm not sure irqthreads is what I want...
> >
> > First, can they call handle_mm_fault ? (ie, I'm not sure precisely what
> > kind of context those operate into).
>
> Interrupt threads would be kernel threads and kernel threads
> run with lazy (= random) mm and calling handle_mm_fault on that
> wouldn't be very useful because you would affect a random mm.

That isn't a big issue. handle_mm_fault() takes the mm as an argument
(like when called from get_user_pages()), and if there's anything fishy I
can always attach/detach the mm to/from the thread. It's been done before
and works fine.

> Ok you could force them to run with a specific MM, but that would
> first cause lifetime issues with the original MM (how could you
> ever free it?) and also increase the interrupt handling latency,
> because the interrupt would then be a nearly full-blown VM context
> switch.

handle_mm_fault() shouldn't need an mm context switch. I can just hold a
reference count on the mm for as long as it sits in my queue. Lifetime
isn't a big issue; I can deal with it.

> I also think interrupts threads are a bad idea in many cases because
> their whole "advantage" over classical interrupts is that they can
> block. Now blocking can usually take an unbounded, potentially long
> time.

Yes, that's what I explain in the rest of my mail. That plus the fact
that I need to context switch the SPU to other contexts while we block.

.../...

I agree with most of your points, which is why I believe interrupt
threads aren't a good option for me.

Interrupts for "normal" events will be handled in a short/bounded time.

Interrupts coming from SPU page faults will be deferred to a thread from
a pool (which can take more time if none is available, i.e. the pool
allocates more threads, or we just wait for one to free up; the strategy
here is still to be defined).

It's not a problem to have them delayed. I can context switch a faulting
SPU to some other task and switch it back later when the fault is
serviced. Anything time critical shouldn't operate on fault-able memory
in the first place :-)

So I need at most one kernel thread per SPU context for handling the
faults. The idea of the thread pools is that most of the time, I don't
take faults, and thus I don't need nearly as many threads in practice.
Thus having a pool that can dynamically grow or shrink based on pressure
would make sense.

Ben.

2008-07-02 21:00:32

by Benjamin Herrenschmidt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools


> If you really need the full scheduling capabilities of threads, then it
> sounds like a threadpool is all you need (and we should just provide a
> unified interface).

That's my thinking nowadays.

> Initially you were implying you'd prefer some type of non-blocking
> workqueue (i.e. a workqueue that shifts to the next work item when an
> earlier item blocks).

That's also something I had in mind, I was tossing ideas around and
collecting feedback :-)

> I can see this construct being useful because it
> would have easier to use semantics and be more lightweight than a full
> thread spawn. It strikes me we could use some of the syslets work to do
> this ...

Precisely what I had in mind.

> all the queue needs is a "next activation head", which will be
> the next job in the queue in the absence of blocking. When a job
> blocks, syslets informs the workqueue and it moves on to the work at the
> "next activation head". If a prior job unblocks, syslets informs the
> queue and it moves the "next activation head" to the unblocked job.
> What this is doing is implementing a really simple scheduler within a
> single workqueue, which I'm unsure is actually a good idea since
> schedulers are complex and tricky things, but it is probably worthy of
> discussion.

The question is: is that significantly less overhead than just spawning
a new full-blown kernel thread? Enough to justify the complexity? At
the end of the day, it means allocating a stack (which on ppc64 is still
16K; I know it sucks)...

Ben.

2008-07-02 21:03:18

by Benjamin Herrenschmidt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Wed, 2008-07-02 at 16:00 -0400, Steven Rostedt wrote:
>
> As for interrupt threads, those would help for some non-RT issues
> (having a better desktop feel) but not for the issue that Ben has been
> stating. I would be interested in knowing exactly what is needing to
> handle a page fault inside the kernel. If we need to do something for a
> user space task, as soon as that task is found the work should be passed
> to that thread.

Not much is needed, as the mm is passed as an argument to
handle_mm_fault(). Page faults can already be handled by 'other'
processes (get_user_pages() doesn't have to be called in the context of
the target mm). I need to check that we don't get into funky issues down
at the VFS level when using a kernel thread without files, and I may need
to double-check whether anything in that path tries to signal, but that's
about it, AFAIK.

Ben.

2008-07-03 11:55:49

by Benjamin Herrenschmidt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Thu, 2008-07-03 at 03:12 -0700, Eric W. Biederman wrote:
> Benjamin Herrenschmidt <[email protected]> writes:
>
> > The question is: is that significantly less overhead than just spawning
> > a new full-blown kernel thread? Enough to justify the complexity? At
> > the end of the day, it means allocating a stack (which on ppc64 is still
> > 16K; I know it sucks)...
>
> I looked at this a while ago. And right now kernel_thread is fairly light.
> kthread_create has latency issues because we need to queue up a task on
> our kernel thread spawning daemon, and let it fork the child. Needing
> to go via the kthread spawning daemon didn't look fundamental, just something
> that was a challenge to sort out.

Yes. I was thinking that if it becomes an issue, we could special case
something in the scheduler to pop them.

Ben.

2008-07-03 13:12:05

by Eric W. Biederman

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

Benjamin Herrenschmidt <[email protected]> writes:

> The question is: is that significantly less overhead than just spawning
> a new full-blown kernel thread? Enough to justify the complexity? At
> the end of the day, it means allocating a stack (which on ppc64 is still
> 16K; I know it sucks)...

I looked at this a while ago. And right now kernel_thread is fairly light.
kthread_create has latency issues because we need to queue up a task on
our kernel thread spawning daemon, and let it fork the child. Needing
to go via the kthread spawning daemon didn't look fundamental, just something
that was a challenge to sort out.

Eric

2008-07-07 14:10:37

by Chris Mason

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Wed, 2008-07-02 at 09:11 -0500, James Bottomley wrote:

> If you really need the full scheduling capabilities of threads, then it
> sounds like a threadpool is all you need (and we should just provide a
> unified interface).
>

Workqueues weren't quite right for btrfs either, where I need to be able
to verify checksums after IO completes (among other things). So I also
ended up with a simple thread pool system that can add kthreads on
demand.

So, it sounds like we'd have a number of users for a unified interface.

> Initially you were implying you'd prefer some type of non-blocking
> workqueue (i.e. a workqueue that shifts to the next work item when an
> earlier item blocks). I can see this construct being useful because it
> would have easier-to-use semantics and be more lightweight than a full
> thread spawn. It strikes me we could use some of the syslets work to do
> this ... all the queue needs is a "next activation head", which will be
> the next job in the queue in the absence of blocking. When a job
> blocks, syslets informs the workqueue and it moves on to the work at the
> "next activation head". If a prior job unblocks, syslets informs the
> queue and it moves the "next activation head" to the unblocked job.
> What this is doing is implementing a really simple scheduler within a
> single workqueue, which I'm unsure is actually a good idea since
> schedulers are complex and tricky things, but it is probably worthy of
> discussion.

I have a few different users of the thread pools, and I ended up having
to create a number of pools to avoid deadlocks between different types
of operations on the same work list. Ideas like the next activation
head really sound cool, but the simplicity of just making dedicated
pools for dedicated tasks is much, much easier to debug.

If the pools are able to resize themselves sanely, it should perform
about the same as the fancy stuff ;)

-chris

2008-07-07 23:04:10

by Benjamin Herrenschmidt

Subject: Re: [Ksummit-2008-discuss] Delayed interrupt work, thread pools

On Mon, 2008-07-07 at 10:09 -0400, Chris Mason wrote:
> I have a few different users of the thread pools, and I ended up having
> to create a number of pools to avoid deadlocks between different types
> of operations on the same work list. Ideas like the next activation
> head really sound cool, but the simplicity of just making dedicated
> pools for dedicated tasks is much, much easier to debug.
>
> If the pools are able to resize themselves sanely, it should perform
> about the same as the fancy stuff ;)

Could be just like workqueues: a "default" common pool and the ability
to create specialized pools with possibly configurable constraints in
size etc...

I'm a bit too busy preparing for the merge window right now, along
with a few other things, so I haven't looked in detail at the existing
implementations yet. I'm in no hurry though, so feel free to beat me to
it; otherwise I'll dig into it later this month or so.

Ben.