ftrace may trigger rb_wakeups while holding pi_lock which will also be
requested via trace_...->...->ring_buffer_unlock_commit->...->
irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
deadlocks when trying to use ftrace under -rt.
Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
Signed-off-by: Jan Kiszka <[email protected]>
---
I'm not yet sure if this doesn't push work into hard-irq context that
is better not done there on -rt.
I'm also not sure if there aren't more such cases, given that -rt turns
the default irq_work wakeup policy around. But maybe we are lucky.
kernel/trace/ring_buffer.c | 1 +
1 file changed, 1 insertion(+)
diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
index f4fbbfc..6a1939e 100644
--- a/kernel/trace/ring_buffer.c
+++ b/kernel/trace/ring_buffer.c
@@ -1252,6 +1252,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
init_waitqueue_head(&cpu_buffer->irq_work.waiters);
init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
+ cpu_buffer->irq_work.work.flags = IRQ_WORK_HARD_IRQ;
bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
GFP_KERNEL, cpu_to_node(cpu));
On Thu, 16 Apr 2015 16:06:54 +0200
Jan Kiszka <[email protected]> wrote:
> ftrace may trigger rb_wakeups while holding pi_lock which will also be
> requested via trace_...->...->ring_buffer_unlock_commit->...->
> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
> deadlocks when trying to use ftrace under -rt.
>
> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
Hmm, I could have sworn I wrote a patch like this not too long ago.
I'll have to check that out.
Thanks,
-- Steve
>
> Signed-off-by: Jan Kiszka <[email protected]>
> ---
>
> I'm not yet sure if this doesn't push work into hard-irq context that
> is better not done there on -rt.
>
> I'm also not sure if there aren't more such cases, given that -rt turns
> the default irq_work wakeup policy around. But maybe we are lucky.
>
> kernel/trace/ring_buffer.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index f4fbbfc..6a1939e 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1252,6 +1252,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
> init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
> init_waitqueue_head(&cpu_buffer->irq_work.waiters);
> init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
> + cpu_buffer->irq_work.work.flags = IRQ_WORK_HARD_IRQ;
>
> bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> GFP_KERNEL, cpu_to_node(cpu));
On 04/16/2015 04:06 PM, Jan Kiszka wrote:
> ftrace may trigger rb_wakeups while holding pi_lock which will also be
> requested via trace_...->...->ring_buffer_unlock_commit->...->
> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
> deadlocks when trying to use ftrace under -rt.
>
> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
>
> Signed-off-by: Jan Kiszka <[email protected]>
> ---
>
> I'm not yet sure if this doesn't push work into hard-irq context that
> is better not done there on -rt.
everything should be done in the soft-irq.
>
> I'm also not sure if there aren't more such cases, given that -rt turns
> the default irq_work wakeup policy around. But maybe we are lucky.
The only thing that is getting done in the hardirq is the FULL_NO_HZ
thingy. I would be _very_ glad if we could keep it that way.
> kernel/trace/ring_buffer.c | 1 +
> 1 file changed, 1 insertion(+)
>
> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
> index f4fbbfc..6a1939e 100644
> --- a/kernel/trace/ring_buffer.c
> +++ b/kernel/trace/ring_buffer.c
> @@ -1252,6 +1252,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
> init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
> init_waitqueue_head(&cpu_buffer->irq_work.waiters);
> init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
> + cpu_buffer->irq_work.work.flags = IRQ_WORK_HARD_IRQ;
>
> bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
> GFP_KERNEL, cpu_to_node(cpu));
>
Sebastian
On 2015-04-16 16:26, Sebastian Andrzej Siewior wrote:
> On 04/16/2015 04:06 PM, Jan Kiszka wrote:
>> ftrace may trigger rb_wakeups while holding pi_lock which will also be
>> requested via trace_...->...->ring_buffer_unlock_commit->...->
>> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
>> deadlocks when trying to use ftrace under -rt.
>>
>> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
>>
>> Signed-off-by: Jan Kiszka <[email protected]>
>> ---
>>
>> I'm not yet sure if this doesn't push work into hard-irq context that
>> is better not done there on -rt.
>
> everything should be done in the soft-irq.
>
>>
>> I'm also not sure if there aren't more such cases, given that -rt turns
>> the default irq_work wakeup policy around. But maybe we are lucky.
>
> The only thing that is getting done in the hardirq is the FULL_NO_HZ
> thingy. I would be _very_ glad if we could keep it that way.
Then - to my current understanding - we need an NMI-safe trigger for
soft-irq work. Is there anything like this existing already? Or can we
still use the IPI-based kick without actually doing the work in hard-irq
context?
Jan
>
>> kernel/trace/ring_buffer.c | 1 +
>> 1 file changed, 1 insertion(+)
>>
>> diff --git a/kernel/trace/ring_buffer.c b/kernel/trace/ring_buffer.c
>> index f4fbbfc..6a1939e 100644
>> --- a/kernel/trace/ring_buffer.c
>> +++ b/kernel/trace/ring_buffer.c
>> @@ -1252,6 +1252,7 @@ rb_allocate_cpu_buffer(struct ring_buffer *buffer, int nr_pages, int cpu)
>> init_irq_work(&cpu_buffer->irq_work.work, rb_wake_up_waiters);
>> init_waitqueue_head(&cpu_buffer->irq_work.waiters);
>> init_waitqueue_head(&cpu_buffer->irq_work.full_waiters);
>> + cpu_buffer->irq_work.work.flags = IRQ_WORK_HARD_IRQ;
>>
>> bpage = kzalloc_node(ALIGN(sizeof(*bpage), cache_line_size()),
>> GFP_KERNEL, cpu_to_node(cpu));
>>
>
> Sebastian
>
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
On 04/16/2015 04:28 PM, Jan Kiszka wrote:
> On 2015-04-16 16:26, Sebastian Andrzej Siewior wrote:
>> On 04/16/2015 04:06 PM, Jan Kiszka wrote:
>>> ftrace may trigger rb_wakeups while holding pi_lock which will also be
>>> requested via trace_...->...->ring_buffer_unlock_commit->...->
>>> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
>>> deadlocks when trying to use ftrace under -rt.
>>>
>>> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
>>>
>>> Signed-off-by: Jan Kiszka <[email protected]>
>>> ---
>>>
>>> I'm not yet sure if this doesn't push work into hard-irq context that
>>> is better not done there on -rt.
>>
>> everything should be done in the soft-irq.
>>
>>>
>>> I'm also not sure if there aren't more such cases, given that -rt turns
>>> the default irq_work wakeup policy around. But maybe we are lucky.
>>
>> The only thing that is getting done in the hardirq is the FULL_NO_HZ
>> thingy. I would be _very_ glad if we could keep it that way.
>
> Then - to my current understanding - we need an NMI-safe trigger for
> soft-irq work. Is there anything like this existing already? Or can we
> still use the IPI-based kick without actually doing the work in hard-irq
> context?
But if you trigger it via IPI it will still run in hardirq context,
right? Can you describe how you run into this and try to think about it
in a quiet moment? Is it just enabling the function tracer and running it?
> Jan
Sebastian
On Thu, 16 Apr 2015 16:28:58 +0200
Jan Kiszka <[email protected]> wrote:
> On 2015-04-16 16:26, Sebastian Andrzej Siewior wrote:
> > On 04/16/2015 04:06 PM, Jan Kiszka wrote:
> >> ftrace may trigger rb_wakeups while holding pi_lock which will also be
> >> requested via trace_...->...->ring_buffer_unlock_commit->...->
> >> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
> >> deadlocks when trying to use ftrace under -rt.
> >>
> >> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
> >>
> >> Signed-off-by: Jan Kiszka <[email protected]>
> >> ---
> >>
> >> I'm not yet sure if this doesn't push work into hard-irq context that
> >> is better not done there on -rt.
> >
> > everything should be done in the soft-irq.
> >
> >>
> >> I'm also not sure if there aren't more such cases, given that -rt turns
> >> the default irq_work wakeup policy around. But maybe we are lucky.
> >
> > The only thing that is getting done in the hardirq is the FULL_NO_HZ
> > thingy. I would be _very_ glad if we could keep it that way.
tracing is special, even more so than NO_HZ_FULL, as it also traces
that as well (and even RCU). Tracing the kernel is like a debugger.
Ideally, it would not be part of the kernel, but just an external
observer. Without special hardware that is not the case, so we try to
be outside the main system as much as possible.
>
> Then - to my current understanding - we need an NMI-safe trigger for
> soft-irq work. Is there anything like this existing already? Or can we
> still use the IPI-based kick without actually doing the work in hard-irq
> context?
>
The reason why it uses irq_work() is because a simple wakeup can
deadlock the system if called by the tracing infrastructure (as we see
raise_softirq() does too).
But yeah, there's no real need to have the ring buffer irq work
handler run from hardirq context. The only requirement is that you
cannot do the raise from the irq_work_queue() call. If you want to have
the hardirq work handler do the raise_softirq(), that's fine. Perhaps
that's the solution? Have all irq_work_queue() calls always trigger the
hard irq, but the hard irq may just raise a softirq, or call the handler
directly if IRQ_WORK_HARD_IRQ is set.
-- Steve
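A minimal sketch of the dispatch Steve describes, for illustration only
(the list and helper names mirror kernel/irq_work.c; the follow-up patch
below ends up taking roughly this shape for the -rt case, so this is not
the final code):

void irq_work_run(void)
{
	/* items explicitly tagged as safe to run in hard-irq context on -rt */
	irq_work_run_list(this_cpu_ptr(&hirq_work_list));

	/* everything else is only kicked from here and executed in softirq */
	if (!llist_empty(this_cpu_ptr(&raised_list)))
		raise_softirq(TIMER_SOFTIRQ);
}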
On 2015-04-16 17:10, Steven Rostedt wrote:
> On Thu, 16 Apr 2015 16:28:58 +0200
> Jan Kiszka <[email protected]> wrote:
>
>> On 2015-04-16 16:26, Sebastian Andrzej Siewior wrote:
>>> On 04/16/2015 04:06 PM, Jan Kiszka wrote:
>>>> ftrace may trigger rb_wakeups while holding pi_lock which will also be
>>>> requested via trace_...->...->ring_buffer_unlock_commit->...->
>>>> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
>>>> deadlocks when trying to use ftrace under -rt.
>>>>
>>>> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
>>>>
>>>> Signed-off-by: Jan Kiszka <[email protected]>
>>>> ---
>>>>
>>>> I'm not yet sure if this doesn't push work into hard-irq context that
>>>> is better not done there on -rt.
>>>
>>> everything should be done in the soft-irq.
>>>
>>>>
>>>> I'm also not sure if there aren't more such cases, given that -rt turns
>>>> the default irq_work wakeup policy around. But maybe we are lucky.
>>>
>>> The only thing that is getting done in the hardirq is the FULL_NO_HZ
>>> thingy. I would be _very_ glad if we could keep it that way.
>
> tracing is special, even more so than NO_HZ_FULL, as it also traces
> that as well (and even RCU). Tracing the kernel is like a debugger.
> Ideally, it would not be part of the kernel, but just an external
> observer. Without special hardware that is not the case, so we try to
> be outside the main system as much as possible.
>
>
>>
>> Then - to my current understanding - we need an NMI-safe trigger for
>> soft-irq work. Is there anything like this existing already? Or can we
>> still use the IPI-based kick without actually doing the work in hard-irq
>> context?
>>
>
> The reason why it uses irq_work() is because a simple wakeup can
> deadlock the system if called by the tracing infrastructure (as we see
> raise_softirq() does too).
>
> But yeah, there's no real need to have the ring buffer irq work
> handler run from hardirq context. The only requirement is that you
> cannot do the raise from the irq_work_queue() call. If you want to have
> the hardirq work handler do the raise_softirq(), that's fine. Perhaps
> that's the solution? Have all irq_work_queue() calls always trigger the
> hard irq, but the hard irq may just raise a softirq, or call the handler
> directly if IRQ_WORK_HARD_IRQ is set.
I'll play with that.
My patch is definitely not OK. It causes
[ 380.372579] BUG: scheduling while atomic: trace-cmd/2149/0x00010004
...
[ 380.372604] Call Trace:
[ 380.372610] <IRQ> [<ffffffff81607694>] dump_stack+0x50/0x9f
[ 380.372613] [<ffffffff8160413c>] __schedule_bug+0x59/0x69
[ 380.372615] [<ffffffff8160a1d5>] __schedule+0x675/0x800
[ 380.372617] [<ffffffff8160a394>] schedule+0x34/0xa0
[ 380.372619] [<ffffffff8160bf7d>] rt_spin_lock_slowlock+0xcd/0x290
[ 380.372621] [<ffffffff8160d8b5>] rt_spin_lock+0x25/0x30
[ 380.372623] [<ffffffff8108fe39>] __wake_up+0x29/0x60
[ 380.372626] [<ffffffff81106960>] rb_wake_up_waiters+0x40/0x50
[ 380.372628] [<ffffffff8112cdbf>] irq_work_run_list+0x3f/0x60
[ 380.372630] [<ffffffff8112cdf9>] irq_work_run+0x19/0x20
[ 380.372632] [<ffffffff81008409>] smp_trace_irq_work_interrupt+0x39/0x120
[ 380.372633] [<ffffffff8160f8ef>] trace_irq_work_interrupt+0x6f/0x80
[ 380.372636] <EOI> [<ffffffff8103d66d>] ? native_apic_msr_write+0x2d/0x30
[ 380.372637] [<ffffffff8103d53d>] x2apic_send_IPI_self+0x1d/0x20
[ 380.372638] [<ffffffff8100851e>] arch_irq_work_raise+0x2e/0x40
[ 380.372639] [<ffffffff8112d025>] irq_work_queue+0xc5/0xf0
[ 380.372641] [<ffffffff81107d8a>] ring_buffer_unlock_commit+0x14a/0x2e0
[ 380.372643] [<ffffffff8110f894>] trace_buffer_unlock_commit+0x24/0x60
[ 380.372644] [<ffffffff8111f9da>] ftrace_event_buffer_commit+0x8a/0xc0
[ 380.372647] [<ffffffff811c58de>] ftrace_raw_event_writeback_dirty_inode_template+0x8e/0xc0
[ 380.372648] [<ffffffff811c8b21>] __mark_inode_dirty+0x1d1/0x310
[ 380.372650] [<ffffffff811d0ec8>] generic_write_end+0x78/0xb0
[ 380.372658] [<ffffffffa021c42b>] ext4_da_write_end+0x10b/0x2f0 [ext4]
[ 380.372661] [<ffffffff8116335e>] ? pagefault_enable+0x1e/0x20
[ 380.372662] [<ffffffff8113c337>] generic_perform_write+0x107/0x1b0
[ 380.372664] [<ffffffff8113e49f>] __generic_file_write_iter+0x15f/0x350
[ 380.372668] [<ffffffffa0210c91>] ext4_file_write_iter+0x101/0x3d0 [ext4]
[ 380.372670] [<ffffffff8118f59b>] ? __kmalloc+0x16b/0x250
[ 380.372672] [<ffffffff811ca96e>] ? iter_file_splice_write+0x8e/0x430
[ 380.372673] [<ffffffff811ca96e>] ? iter_file_splice_write+0x8e/0x430
[ 380.372674] [<ffffffff811cab35>] iter_file_splice_write+0x255/0x430
[ 380.372676] [<ffffffff811cc474>] SyS_splice+0x214/0x760
[ 380.372677] [<ffffffff81011fe7>] ? syscall_trace_enter_phase2+0xa7/0x1e0
[ 380.372679] [<ffffffff8160e266>] tracesys_phase2+0xd4/0xd9
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
On 2015-04-16 16:57, Sebastian Andrzej Siewior wrote:
> On 04/16/2015 04:28 PM, Jan Kiszka wrote:
>> On 2015-04-16 16:26, Sebastian Andrzej Siewior wrote:
>>> On 04/16/2015 04:06 PM, Jan Kiszka wrote:
>>>> ftrace may trigger rb_wakeups while holding pi_lock which will also be
>>>> requested via trace_...->...->ring_buffer_unlock_commit->...->
>>>> irq_work_queue->raise_softirq->try_to_wake_up. This quickly causes
>>>> deadlocks when trying to use ftrace under -rt.
>>>>
>>>> Resolve this by marking the ring buffer's irq_work as HARD_IRQ.
>>>>
>>>> Signed-off-by: Jan Kiszka <[email protected]>
>>>> ---
>>>>
>>>> I'm not yet sure if this doesn't push work into hard-irq context that
>>>> is better not done there on -rt.
>>>
>>> everything should be done in the soft-irq.
>>>
>>>>
>>>> I'm also not sure if there aren't more such cases, given that -rt turns
>>>> the default irq_work wakeup policy around. But maybe we are lucky.
>>>
>>> The only thing that is getting done in the hardirq is the FULL_NO_HZ
>>> thingy. I would be _very_ glad if we could keep it that way.
>>
>> Then - to my current understanding - we need an NMI-safe trigger for
>> soft-irq work. Is there anything like this existing already? Or can we
>> still use the IPI-based kick without actually doing the work in hard-irq
>> context?
>
> But if you trigger it via IPI it will still run in hardirq context,
> right? Can you describe how you run into this and try to think about it
> in a quiet moment? Is it just enabling the function tracer and running it?
# trace-cmd record -e sched
/sys/kernel/debug/tracing/events/sched/filter
/sys/kernel/debug/tracing/events/*/sched/filter
Hit Ctrl^C to stop recording
^C
and you are dead(locked).
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
On 04/16/2015 05:29 PM, Jan Kiszka wrote:
> My patch is definitely not OK. It causes
>
> [ 380.372579] BUG: scheduling while atomic: trace-cmd/2149/0x00010004
> ...
> [ 380.372604] Call Trace:
> [ 380.372610] <IRQ> [<ffffffff81607694>] dump_stack+0x50/0x9f
> [ 380.372613] [<ffffffff8160413c>] __schedule_bug+0x59/0x69
> [ 380.372615] [<ffffffff8160a1d5>] __schedule+0x675/0x800
> [ 380.372617] [<ffffffff8160a394>] schedule+0x34/0xa0
> [ 380.372619] [<ffffffff8160bf7d>] rt_spin_lock_slowlock+0xcd/0x290
> [ 380.372621] [<ffffffff8160d8b5>] rt_spin_lock+0x25/0x30
> [ 380.372623] [<ffffffff8108fe39>] __wake_up+0x29/0x60
right. you must not take any sleeping locks in the hardirq context :)
This works for the no-hz-work thingy. It grabs raw locks and may cancel
one hrtimer which is marked irqsafe.
Sebastian
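To make the constraint concrete, here is an illustrative snippet (the
lock, the callback and the irq_work instance are hypothetical, not taken
from any patch in this thread; it assumes an -rt tree where
IRQ_WORK_HARD_IRQ exists): under PREEMPT_RT_FULL a spinlock_t becomes a
sleeping rtmutex-based lock, so code reached from the hard-irq IPI may
only take raw_spinlock_t - which is exactly what the splat above trips
over when rb_wake_up_waiters() calls __wake_up().

#include <linux/spinlock.h>
#include <linux/irq_work.h>

/* a raw spinlock stays a real spinning lock even on -rt */
static DEFINE_RAW_SPINLOCK(hirq_ok_lock);

/* hypothetical callback meant to run from the hard-irq irq_work path */
static void hard_irq_safe_work(struct irq_work *work)
{
	unsigned long flags;

	/* allowed: raw_spinlock_t never sleeps */
	raw_spin_lock_irqsave(&hirq_ok_lock, flags);
	raw_spin_unlock_irqrestore(&hirq_ok_lock, flags);

	/*
	 * Not allowed: anything that ends up in spin_lock()/__wake_up()
	 * on a regular spinlock_t or waitqueue lock, since those sleep
	 * on -rt and give the "BUG: scheduling while atomic" shown above.
	 */
}

/* hypothetical work item tagged for hard-irq execution */
static struct irq_work hirq_example = {
	.flags = IRQ_WORK_HARD_IRQ,
	.func  = hard_irq_safe_work,
};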
Instead of turning all irq_work requests into lazy ones on -rt, just
move their execution from hard into soft-irq context.
This resolves deadlocks of ftrace which will queue work from arbitrary
contexts, including those that have locks held that are needed for
raising a soft-irq.
Signed-off-by: Jan Kiszka <[email protected]>
---
Second try, looks much better so far. And it also removes my concerns
regarding other potential cases besides ftrace.
kernel/irq_work.c | 26 ++++++++++----------------
1 file changed, 10 insertions(+), 16 deletions(-)
diff --git a/kernel/irq_work.c b/kernel/irq_work.c
index 9dda38a..3f6ffcd 100644
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -85,12 +85,9 @@ bool irq_work_queue_on(struct irq_work *work, int cpu)
raise_irqwork = llist_add(&work->llnode,
&per_cpu(hirq_work_list, cpu));
else
- raise_irqwork = llist_add(&work->llnode,
- &per_cpu(lazy_list, cpu));
-#else
+#endif
raise_irqwork = llist_add(&work->llnode,
&per_cpu(raised_list, cpu));
-#endif
if (raise_irqwork)
arch_send_call_function_single_ipi(cpu);
@@ -114,21 +111,20 @@ bool irq_work_queue(struct irq_work *work)
if (work->flags & IRQ_WORK_HARD_IRQ) {
if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
arch_irq_work_raise();
- } else {
+ } else
+#endif
+ if (work->flags & IRQ_WORK_LAZY) {
if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
tick_nohz_tick_stopped())
+#ifdef CONFIG_PREEMPT_RT_FULL
raise_softirq(TIMER_SOFTIRQ);
- }
#else
- if (work->flags & IRQ_WORK_LAZY) {
- if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
- tick_nohz_tick_stopped())
arch_irq_work_raise();
+#endif
} else {
if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
arch_irq_work_raise();
}
-#endif
preempt_enable();
@@ -202,6 +198,8 @@ void irq_work_run(void)
{
#ifdef CONFIG_PREEMPT_RT_FULL
irq_work_run_list(this_cpu_ptr(&hirq_work_list));
+ if (!llist_empty(this_cpu_ptr(&raised_list)))
+ raise_softirq(TIMER_SOFTIRQ);
#else
irq_work_run_list(this_cpu_ptr(&raised_list));
irq_work_run_list(this_cpu_ptr(&lazy_list));
@@ -211,15 +209,11 @@ EXPORT_SYMBOL_GPL(irq_work_run);
void irq_work_tick(void)
{
-#ifdef CONFIG_PREEMPT_RT_FULL
- irq_work_run_list(this_cpu_ptr(&lazy_list));
-#else
- struct llist_head *raised = &__get_cpu_var(raised_list);
+ struct llist_head *raised = this_cpu_ptr(&raised_list);
if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
irq_work_run_list(raised);
- irq_work_run_list(&__get_cpu_var(lazy_list));
-#endif
+ irq_work_run_list(this_cpu_ptr(&lazy_list));
}
/*
--
2.1.4
On Thu, 2015-04-16 at 18:28 +0200, Jan Kiszka wrote:
> Instead of turning all irq_work requests into lazy ones on -rt, just
> move their execution from hard into soft-irq context.
>
> This resolves deadlocks of ftrace which will queue work from
> arbitrary
> contexts, including those that have locks held that are needed for
> raising a soft-irq.
Yup, trace-cmd record -> dead-box fully repeatable, and now fixed.
> Signed-off-by: Jan Kiszka <[email protected]>
> ---
>
> Second try, looks much better so far. And it also removes my concerns
> regarding other potential cases besides ftrace.
>
> kernel/irq_work.c | 26 ++++++++++----------------
> 1 file changed, 10 insertions(+), 16 deletions(-)
>
> diff --git a/kernel/irq_work.c b/kernel/irq_work.c
> index 9dda38a..3f6ffcd 100644
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -85,12 +85,9 @@ bool irq_work_queue_on(struct irq_work *work, int
> cpu)
> raise_irqwork = llist_add(&work->llnode,
> &per_cpu(hirq_work_list, cpu));
> else
> - raise_irqwork = llist_add(&work->llnode,
> - &per_cpu(lazy_list, cpu));
> -#else
> +#endif
> raise_irqwork = llist_add(&work->llnode,
> &per_cpu(raised_list, cpu));
> -#endif
>
> if (raise_irqwork)
> arch_send_call_function_single_ipi(cpu);
> @@ -114,21 +111,20 @@ bool irq_work_queue(struct irq_work *work)
> if (work->flags & IRQ_WORK_HARD_IRQ) {
> if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
> arch_irq_work_raise();
> - } else {
> + } else
> +#endif
> + if (work->flags & IRQ_WORK_LAZY) {
> if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
> tick_nohz_tick_stopped())
> +#ifdef CONFIG_PREEMPT_RT_FULL
> raise_softirq(TIMER_SOFTIRQ);
> - }
> #else
> - if (work->flags & IRQ_WORK_LAZY) {
> - if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
> - tick_nohz_tick_stopped())
> arch_irq_work_raise();
> +#endif
> } else {
> if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
> arch_irq_work_raise();
> }
> -#endif
>
> preempt_enable();
>
> @@ -202,6 +198,8 @@ void irq_work_run(void)
> {
> #ifdef CONFIG_PREEMPT_RT_FULL
> irq_work_run_list(this_cpu_ptr(&hirq_work_list));
> + if (!llist_empty(this_cpu_ptr(&raised_list)))
> + raise_softirq(TIMER_SOFTIRQ);
> #else
> irq_work_run_list(this_cpu_ptr(&raised_list));
> irq_work_run_list(this_cpu_ptr(&lazy_list));
> @@ -211,15 +209,11 @@ EXPORT_SYMBOL_GPL(irq_work_run);
>
> void irq_work_tick(void)
> {
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - irq_work_run_list(this_cpu_ptr(&lazy_list));
> -#else
> - struct llist_head *raised = &__get_cpu_var(raised_list);
> + struct llist_head *raised = this_cpu_ptr(&raised_list);
>
> if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
> irq_work_run_list(raised);
> - irq_work_run_list(&__get_cpu_var(lazy_list));
> -#endif
> + irq_work_run_list(this_cpu_ptr(&lazy_list));
> }
>
> /*
On Mon, 2015-04-20 at 10:03 +0200, Mike Galbraith wrote:
> On Thu, 2015-04-16 at 18:28 +0200, Jan Kiszka wrote:
> > Instead of turning all irq_work requests into lazy ones on -rt,
> > just
> > move their execution from hard into soft-irq context.
> >
> > This resolves deadlocks of ftrace which will queue work from
> > arbitrary
> > contexts, including those that have locks held that are needed for
> > raising a soft-irq.
>
> Yup, trace-cmd record -> dead-box fully repeatable, and now fixed.
Except you kinda forgot to run the raised list. The reformatted
(which saved two whole lines;) patch below adds that to
irq_work_tick(), which fixes the livelock both powertop and perf top
otherwise meet.
Subject: [PATCH RT 3.18] irq_work: Provide a soft-irq based queue
Date: Thu, 16 Apr 2015 18:28:16 +0200
From: Jan Kiszka <[email protected]>
Instead of turning all irq_work requests into lazy ones on -rt, just
move their execution from hard into soft-irq context.
This resolves deadlocks of ftrace which will queue work from arbitrary
contexts, including those that have locks held that are needed for
raising a soft-irq.
Signed-off-by: Jan Kiszka <[email protected]>
---
Second try, looks much better so far. And it also removes my concerns
regarding other potential cases besides ftrace.
kernel/irq_work.c | 84 ++++++++++++++++++++++++++----------------------------
1 file changed, 41 insertions(+), 43 deletions(-)
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -80,17 +80,12 @@ bool irq_work_queue_on(struct irq_work *
if (!irq_work_claim(work))
return false;
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (work->flags & IRQ_WORK_HARD_IRQ)
+ if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && (work->flags & IRQ_WORK_HARD_IRQ))
raise_irqwork = llist_add(&work->llnode,
&per_cpu(hirq_work_list, cpu));
else
raise_irqwork = llist_add(&work->llnode,
- &per_cpu(lazy_list, cpu));
-#else
- raise_irqwork = llist_add(&work->llnode,
&per_cpu(raised_list, cpu));
-#endif
if (raise_irqwork)
arch_send_call_function_single_ipi(cpu);
@@ -103,6 +98,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
/* Enqueue the irq work @work on the current CPU */
bool irq_work_queue(struct irq_work *work)
{
+ bool realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
+ bool raise = false;
+
/* Only queue if not already pending */
if (!irq_work_claim(work))
return false;
@@ -110,25 +108,22 @@ bool irq_work_queue(struct irq_work *wor
/* Queue the entry and raise the IPI if needed. */
preempt_disable();
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (work->flags & IRQ_WORK_HARD_IRQ) {
+ if (realtime && (work->flags & IRQ_WORK_HARD_IRQ)) {
if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
- arch_irq_work_raise();
- } else {
- if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
- tick_nohz_tick_stopped())
- raise_softirq(TIMER_SOFTIRQ);
- }
-#else
- if (work->flags & IRQ_WORK_LAZY) {
+ raise = 1;
+ } else if (work->flags & IRQ_WORK_LAZY) {
if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
- tick_nohz_tick_stopped())
- arch_irq_work_raise();
- } else {
- if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
- arch_irq_work_raise();
- }
-#endif
+ tick_nohz_tick_stopped()) {
+ if (realtime)
+ raise_softirq(TIMER_SOFTIRQ);
+ else
+ raise = true;
+ }
+ } else if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
+ raise = true;
+
+ if (raise)
+ arch_irq_work_raise();
preempt_enable();
@@ -143,12 +138,13 @@ bool irq_work_needs_cpu(void)
raised = this_cpu_ptr(&raised_list);
lazy = this_cpu_ptr(&lazy_list);
- if (llist_empty(raised))
- if (llist_empty(lazy))
-#ifdef CONFIG_PREEMPT_RT_FULL
+ if (llist_empty(raised) && llist_empty(lazy)) {
+ if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL)) {
if (llist_empty(this_cpu_ptr(&hirq_work_list)))
-#endif
return false;
+ } else
+ return false;
+ }
/* All work should have been flushed before going offline */
WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
@@ -162,9 +158,7 @@ static void irq_work_run_list(struct lli
struct irq_work *work;
struct llist_node *llnode;
-#ifndef CONFIG_PREEMPT_RT_FULL
- BUG_ON(!irqs_disabled());
-#endif
+ BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !irqs_disabled());
if (llist_empty(list))
return;
@@ -200,26 +194,30 @@ static void irq_work_run_list(struct lli
*/
void irq_work_run(void)
{
-#ifdef CONFIG_PREEMPT_RT_FULL
- irq_work_run_list(this_cpu_ptr(&hirq_work_list));
-#else
- irq_work_run_list(this_cpu_ptr(&raised_list));
- irq_work_run_list(this_cpu_ptr(&lazy_list));
-#endif
+ if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL)) {
+ irq_work_run_list(this_cpu_ptr(&hirq_work_list));
+ /*
+ * NOTE: we raise softirq via IPI for safety,
+ * and execute in irq_work_tick() to move the
+ * overhead from hard to soft irq context.
+ */
+ if (!llist_empty(this_cpu_ptr(&raised_list)))
+ raise_softirq(TIMER_SOFTIRQ);
+ } else {
+ irq_work_run_list(this_cpu_ptr(&raised_list));
+ irq_work_run_list(this_cpu_ptr(&lazy_list));
+ }
}
EXPORT_SYMBOL_GPL(irq_work_run);
void irq_work_tick(void)
{
-#ifdef CONFIG_PREEMPT_RT_FULL
- irq_work_run_list(this_cpu_ptr(&lazy_list));
-#else
- struct llist_head *raised = &__get_cpu_var(raised_list);
+ struct llist_head *raised = this_cpu_ptr(&raised_list);
- if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
+ if (!llist_empty(raised) && (!arch_irq_work_has_interrupt() ||
+ IS_ENABLED(CONFIG_PREEMPT_RT_FULL)))
irq_work_run_list(raised);
- irq_work_run_list(&__get_cpu_var(lazy_list));
-#endif
+ irq_work_run_list(this_cpu_ptr(&lazy_list));
}
/*
On 2015-04-23 08:11, Mike Galbraith wrote:
> On Mon, 2015-04-20 at 10:03 +0200, Mike Galbraith wrote:
>> On Thu, 2015-04-16 at 18:28 +0200, Jan Kiszka wrote:
>>> Instead of turning all irq_work requests into lazy ones on -rt,
>>> just
>>> move their execution from hard into soft-irq context.
>>>
>>> This resolves deadlocks of ftrace which will queue work from
>>> arbitrary
>>> contexts, including those that have locks held that are needed for
>>> raising a soft-irq.
>>
>> Yup, trace-cmd record -> dead-box fully repeatable, and now fixed.
>
> Except you kinda forgot to run the raised list. The reformatted
> (which saved two whole lines;) patch below adds that to
> irq_work_tick(), which fixes the livelock both powertop and perf top
> otherwise meet.
>
> Subject: [PATCH RT 3.18] irq_work: Provide a soft-irq based queue
> Date: Thu, 16 Apr 2015 18:28:16 +0200
> From: Jan Kiszka <[email protected]>
>
> Instead of turning all irq_work requests into lazy ones on -rt, just
> move their execution from hard into soft-irq context.
>
> This resolves deadlocks of ftrace which will queue work from arbitrary
> contexts, including those that have locks held that are needed for
> raising a soft-irq.
>
> Signed-off-by: Jan Kiszka <[email protected]>
> ---
>
> Second try, looks much better so far. And it also removes my concerns
> regarding other potential cases besides ftrace.
>
> kernel/irq_work.c | 84 ++++++++++++++++++++++++++----------------------------
> 1 file changed, 41 insertions(+), 43 deletions(-)
>
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -80,17 +80,12 @@ bool irq_work_queue_on(struct irq_work *
> if (!irq_work_claim(work))
> return false;
>
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - if (work->flags & IRQ_WORK_HARD_IRQ)
> + if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && (work->flags & IRQ_WORK_HARD_IRQ))
> raise_irqwork = llist_add(&work->llnode,
> &per_cpu(hirq_work_list, cpu));
> else
> raise_irqwork = llist_add(&work->llnode,
> - &per_cpu(lazy_list, cpu));
> -#else
> - raise_irqwork = llist_add(&work->llnode,
> &per_cpu(raised_list, cpu));
> -#endif
>
> if (raise_irqwork)
> arch_send_call_function_single_ipi(cpu);
> @@ -103,6 +98,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
> /* Enqueue the irq work @work on the current CPU */
> bool irq_work_queue(struct irq_work *work)
> {
> + bool realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
> + bool raise = false;
> +
> /* Only queue if not already pending */
> if (!irq_work_claim(work))
> return false;
> @@ -110,25 +108,22 @@ bool irq_work_queue(struct irq_work *wor
> /* Queue the entry and raise the IPI if needed. */
> preempt_disable();
>
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - if (work->flags & IRQ_WORK_HARD_IRQ) {
> + if (realtime && (work->flags & IRQ_WORK_HARD_IRQ)) {
> if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
> - arch_irq_work_raise();
> - } else {
> - if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
> - tick_nohz_tick_stopped())
> - raise_softirq(TIMER_SOFTIRQ);
> - }
> -#else
> - if (work->flags & IRQ_WORK_LAZY) {
> + raise = 1;
> + } else if (work->flags & IRQ_WORK_LAZY) {
> if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
> - tick_nohz_tick_stopped())
> - arch_irq_work_raise();
> - } else {
> - if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
> - arch_irq_work_raise();
> - }
> -#endif
> + tick_nohz_tick_stopped()) {
> + if (realtime)
> + raise_softirq(TIMER_SOFTIRQ);
> + else
> + raise = true;
> + }
> + } else if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
> + raise = true;
> +
> + if (raise)
> + arch_irq_work_raise();
>
> preempt_enable();
>
> @@ -143,12 +138,13 @@ bool irq_work_needs_cpu(void)
> raised = this_cpu_ptr(&raised_list);
> lazy = this_cpu_ptr(&lazy_list);
>
> - if (llist_empty(raised))
> - if (llist_empty(lazy))
> -#ifdef CONFIG_PREEMPT_RT_FULL
> + if (llist_empty(raised) && llist_empty(lazy)) {
> + if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL)) {
> if (llist_empty(this_cpu_ptr(&hirq_work_list)))
> -#endif
> return false;
> + } else
> + return false;
> + }
>
> /* All work should have been flushed before going offline */
> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> @@ -162,9 +158,7 @@ static void irq_work_run_list(struct lli
> struct irq_work *work;
> struct llist_node *llnode;
>
> -#ifndef CONFIG_PREEMPT_RT_FULL
> - BUG_ON(!irqs_disabled());
> -#endif
> + BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !irqs_disabled());
>
> if (llist_empty(list))
> return;
> @@ -200,26 +194,30 @@ static void irq_work_run_list(struct lli
> */
> void irq_work_run(void)
> {
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - irq_work_run_list(this_cpu_ptr(&hirq_work_list));
> -#else
> - irq_work_run_list(this_cpu_ptr(&raised_list));
> - irq_work_run_list(this_cpu_ptr(&lazy_list));
> -#endif
> + if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL)) {
> + irq_work_run_list(this_cpu_ptr(&hirq_work_list));
> + /*
> + * NOTE: we raise softirq via IPI for safety,
> + * and execute in irq_work_tick() to move the
> + * overhead from hard to soft irq context.
> + */
> + if (!llist_empty(this_cpu_ptr(&raised_list)))
> + raise_softirq(TIMER_SOFTIRQ);
> + } else {
> + irq_work_run_list(this_cpu_ptr(&raised_list));
> + irq_work_run_list(this_cpu_ptr(&lazy_list));
> + }
> }
> EXPORT_SYMBOL_GPL(irq_work_run);
>
> void irq_work_tick(void)
> {
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - irq_work_run_list(this_cpu_ptr(&lazy_list));
> -#else
> - struct llist_head *raised = &__get_cpu_var(raised_list);
> + struct llist_head *raised = this_cpu_ptr(&raised_list);
>
> - if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
> + if (!llist_empty(raised) && (!arch_irq_work_has_interrupt() ||
> + IS_ENABLED(CONFIG_PREEMPT_RT_FULL)))
OK, that additional condition is addressing archs that don't have
irq_work support and fall back to the timer, right?
> irq_work_run_list(raised);
> - irq_work_run_list(&__get_cpu_var(lazy_list));
> -#endif
> + irq_work_run_list(this_cpu_ptr(&lazy_list));
> }
>
> /*
>
The patch is whitespace-damaged - could you resend it? I'm trying to
visualize the full diff for myself.
Thanks,
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
On 2015-04-23 08:11, Mike Galbraith wrote:
> @@ -103,6 +98,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
> /* Enqueue the irq work @work on the current CPU */
> bool irq_work_queue(struct irq_work *work)
> {
> + bool realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
> + bool raise = false;
> +
> /* Only queue if not already pending */
> if (!irq_work_claim(work))
> return false;
> @@ -110,25 +108,22 @@ bool irq_work_queue(struct irq_work *wor
> /* Queue the entry and raise the IPI if needed. */
> preempt_disable();
>
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - if (work->flags & IRQ_WORK_HARD_IRQ) {
> + if (realtime && (work->flags & IRQ_WORK_HARD_IRQ)) {
> if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
This boils down to
#ifdef CONFIG_X
some_type x;
#endif
...
if (IS_ENABLED(CONFIG_X) && ...)
use(x);
And here we even have an indirection for IS_ENABLED via that local bool
variable. Is that pattern OK for Linux? Does it compile in all supported
optimization levels of all supported compilers?
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
On Thu, 2015-04-23 at 08:29 +0200, Jan Kiszka wrote:
>
> > void irq_work_tick(void)
> > {
> > -#ifdef CONFIG_PREEMPT_RT_FULL
> > - irq_work_run_list(this_cpu_ptr(&lazy_list));
> > -#else
> > - struct llist_head *raised = &__get_cpu_var(raised_list);
> > + struct llist_head *raised = this_cpu_ptr(&raised_list);
> >
> > - if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
> > + if (!llist_empty(raised) &&
> > (!arch_irq_work_has_interrupt() ||
> > + IS_ENABLED(CONFIG_PREEMPT_RT_FULL)))
>
> OK, that additional condition is addressing archs that don't have
> irq_work support and fall back to the timer, right?
How will it ever run if it is not run in either irq_work_run() or
irq_work_tick()? There are two choices, we better pick one.
Attaching patch since either evolution fscked up again (it does that),
or someone has managed to turn it into a completely useless piece of
crap... if so, likely the same dipstick who made it save messages such
that you need fromdos to wipe away the shite it smears all over it.
-Mike
On Thu, 2015-04-23 at 08:50 +0200, Jan Kiszka wrote:
> On 2015-04-23 08:11, Mike Galbraith wrote:
> > @@ -103,6 +98,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
> > /* Enqueue the irq work @work on the current CPU */
> > bool irq_work_queue(struct irq_work *work)
> > {
> > + bool realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
> > + bool raise = false;
> > +
> > /* Only queue if not already pending */
> > if (!irq_work_claim(work))
> > return false;
> > @@ -110,25 +108,22 @@ bool irq_work_queue(struct irq_work *wor
> > /* Queue the entry and raise the IPI if needed. */
> > preempt_disable();
> >
> > -#ifdef CONFIG_PREEMPT_RT_FULL
> > - if (work->flags & IRQ_WORK_HARD_IRQ) {
> > + if (realtime && (work->flags & IRQ_WORK_HARD_IRQ)) {
> > if (llist_add(&work->llnode,
> > this_cpu_ptr(&hirq_work_list)))
>
> This boils down to
>
> #ifdef CONFIG_X
> some_type x;
> #endif
> ...
>
> if (IS_ENABLED(CONFIG_X) && ...)
> use(x);
>
> And here we even have an indirection for IS_ENABLED via that local
> bool
> variable. Is that pattern OK for Linux? Does it compile in all
> supported
> optimization levels of all supported compilers?
I hope it all goes away, that being what IS_ENABLED() is there for.
-Mike
On 2015-04-23 09:01, Mike Galbraith wrote:
> On Thu, 2015-04-23 at 08:50 +0200, Jan Kiszka wrote:
>> On 2015-04-23 08:11, Mike Galbraith wrote:
>>> @@ -103,6 +98,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
>>> /* Enqueue the irq work @work on the current CPU */
>>> bool irq_work_queue(struct irq_work *work)
>>> {
>>> + bool realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
>>> + bool raise = false;
>>> +
>>> /* Only queue if not already pending */
>>> if (!irq_work_claim(work))
>>> return false;
>>> @@ -110,25 +108,22 @@ bool irq_work_queue(struct irq_work *wor
>>> /* Queue the entry and raise the IPI if needed. */
>>> preempt_disable();
>>>
>>> -#ifdef CONFIG_PREEMPT_RT_FULL
>>> - if (work->flags & IRQ_WORK_HARD_IRQ) {
>>> + if (realtime && (work->flags & IRQ_WORK_HARD_IRQ)) {
>>> if (llist_add(&work->llnode,
>>> this_cpu_ptr(&hirq_work_list)))
>>
>> This boils down to
>>
>> #ifdef CONFIG_X
>> some_type x;
>> #endif
>> ...
>>
>> if (IS_ENABLED(CONFIG_X) && ...)
>> use(x);
>>
>> And here we even have an indirection for IS_ENABLED via that local
>> bool
>> variable. Is that pattern OK for Linux? Does it compile in all
>> supported
>> optimization levels of all supported compilers?
>
> I hope it all goes away, that being what IS_ENABLED() is there for.
Hope is good - but not enough here: it breaks the build under
!CONFIG_X, even the case without the bool var.
CC kernel/irq_work.o
In file included from ../include/asm-generic/percpu.h:6:0,
from ../arch/x86/include/asm/percpu.h:522,
from ../arch/x86/include/asm/current.h:5,
from ../arch/x86/include/asm/processor.h:15,
from ../arch/x86/include/asm/irq_work.h:4,
from ../include/linux/irq_work.h:47,
from ../kernel/irq_work.c:11:
../kernel/irq_work.c: In function ‘irq_work_queue_on’:
../kernel/irq_work.c:85:17: error: ‘hirq_work_list’ undeclared (first use in this function)
&per_cpu(hirq_work_list, cpu));
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
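For reference, a small sketch of the failure mode (CONFIG_X, x and y are
made-up names, as in Jan's example above): IS_ENABLED() folds to a
compile-time 0/1 and lets the optimizer discard the dead branch, but the
branch is still parsed, so every symbol it references has to exist in
both configurations.

#include <linux/kconfig.h>

#ifdef CONFIG_X
static int x;
#endif

static void broken(void)
{
	if (IS_ENABLED(CONFIG_X))
		x++;	/* 'x' undeclared when CONFIG_X is off - same error as above */
}

/* conventional fix: declare unconditionally and let the compiler prune uses */
static int y;

static void fixed(void)
{
	if (IS_ENABLED(CONFIG_X))
		y++;	/* compiles either way */
}

As the follow-ups below note, gcc may still emit the now-unused object
itself, which is why the eventual cleanup drops hirq_work_list entirely
instead of hiding it behind an #ifdef.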
On 2015-04-23 08:58, Mike Galbraith wrote:
> On Thu, 2015-04-23 at 08:29 +0200, Jan Kiszka wrote:
>>
>>> void irq_work_tick(void)
>>> {
>>> -#ifdef CONFIG_PREEMPT_RT_FULL
>>> - irq_work_run_list(this_cpu_ptr(&lazy_list));
>>> -#else
>>> - struct llist_head *raised = &__get_cpu_var(raised_list);
>>> + struct llist_head *raised = this_cpu_ptr(&raised_list);
>>>
>>> - if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
>>> + if (!llist_empty(raised) &&
>>> (!arch_irq_work_has_interrupt() ||
>>> + IS_ENABLED(CONFIG_PREEMPT_RT_FULL)))
>>
>> OK, that additional condition is addressing archs that don't have
>> irq_work support and fall back to the timer, right?
>
> How will it ever run if it is not run in either irq_work_run() or
> irq_work_tick()? There are two choices, we better pick one.
Ah, now I see it. Indeed.
OK, will run through your fix and suggestions and come up with a new
version.
Thanks,
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
On Thu, 2015-04-23 at 09:12 +0200, Jan Kiszka wrote:
> On 2015-04-23 09:01, Mike Galbraith wrote:
> > On Thu, 2015-04-23 at 08:50 +0200, Jan Kiszka wrote:
> > > On 2015-04-23 08:11, Mike Galbraith wrote:
> > > > @@ -103,6 +98,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
> > > > /* Enqueue the irq work @work on the current CPU */
> > > > bool irq_work_queue(struct irq_work *work)
> > > > {
> > > > + bool realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
> > > > + bool raise = false;
> > > > +
> > > > /* Only queue if not already pending */
> > > > if (!irq_work_claim(work))
> > > > return false;
> > > > @@ -110,25 +108,22 @@ bool irq_work_queue(struct irq_work *wor
> > > > /* Queue the entry and raise the IPI if needed. */
> > > > preempt_disable();
> > > >
> > > > -#ifdef CONFIG_PREEMPT_RT_FULL
> > > > - if (work->flags & IRQ_WORK_HARD_IRQ) {
> > > > + if (realtime && (work->flags & IRQ_WORK_HARD_IRQ)) {
> > > > if (llist_add(&work->llnode,
> > > > this_cpu_ptr(&hirq_work_list)))
> > >
> > > This boils down to
> > >
> > > #ifdef CONFIG_X
> > > some_type x;
> > > #endif
> > > ...
> > >
> > > if (IS_ENABLED(CONFIG_X) && ...)
> > > use(x);
> > >
> > > And here we even have an indirection for IS_ENABLED via that
> > > local
> > > bool
> > > variable. Is that pattern OK for Linux? Does it compile in all
> > > supported
> > > optimization levels of all supported compilers?
> >
> > I hope it all goes away, that being what IS_ENABLED() is there for.
>
> Hope is good - but not enough here: it breaks the build under
> !CONFIG_X, even the case without the bool var.
>
> CC kernel/irq_work.o
> In file included from ../include/asm-generic/percpu.h:6:0,
> from ../arch/x86/include/asm/percpu.h:522,
> from ../arch/x86/include/asm/current.h:5,
> from ../arch/x86/include/asm/processor.h:15,
> from ../arch/x86/include/asm/irq_work.h:4,
> from ../include/linux/irq_work.h:47,
> from ../kernel/irq_work.c:11:
> ../kernel/irq_work.c: In function ‘irq_work_queue_on’:
> ../kernel/irq_work.c:85:17: error: ‘hirq_work_list’ undeclared
> (first use in this function)
> &per_cpu(hirq_work_list, cpu));
Aw poo, so that's just what I _thought_ it was for.
-Mike
On Thu, 23 Apr 2015 09:19:26 +0200
Mike Galbraith <[email protected]> wrote:
> > CC kernel/irq_work.o
> > In file included from ../include/asm-generic/percpu.h:6:0,
> > from ../arch/x86/include/asm/percpu.h:522,
> > from ../arch/x86/include/asm/current.h:5,
> > from ../arch/x86/include/asm/processor.h:15,
> > from ../arch/x86/include/asm/irq_work.h:4,
> > from ../include/linux/irq_work.h:47,
> > from ../kernel/irq_work.c:11:
> > ../kernel/irq_work.c: In function ‘irq_work_queue_on’:
> > ../kernel/irq_work.c:85:17: error: ‘hirq_work_list’ undeclared
> > (first use in this function)
> > &per_cpu(hirq_work_list, cpu));
>
> Aw poo, so that's just what I _thought_ it was for.
It helps optimization but does nothing for undefined symbols.
That said, why don't we clean up that irq_work code and at least
declare both lists, and get rid of all the #ifdefs. I wonder if gcc is
smart enough to not allocate a static variable if it happens to be
optimized out?
-- Steve
On Thu, 2015-04-23 at 17:00 -0400, Steven Rostedt wrote:
> On Thu, 23 Apr 2015 09:19:26 +0200
> Mike Galbraith <[email protected]> wrote:
>
> > > CC kernel/irq_work.o
> > > In file included from ../include/asm-generic/percpu.h:6:0,
> > > from ../arch/x86/include/asm/percpu.h:522,
> > > from ../arch/x86/include/asm/current.h:5,
> > > from ../arch/x86/include/asm/processor.h:15,
> > > from ../arch/x86/include/asm/irq_work.h:4,
> > > from ../include/linux/irq_work.h:47,
> > > from ../kernel/irq_work.c:11:
> > > ../kernel/irq_work.c: In function ‘irq_work_queue_on’:
> > > ../kernel/irq_work.c:85:17: error: ‘hirq_work_list’ undeclared
> > > (first use in this function)
> > > &per_cpu(hirq_work_list, cpu));
> >
> > Aw poo, so that's just what I _thought_ it was for.
>
> It helps optimization but does nothing for undefined symbols.
>
> That said, why don't we clean up that irq_work code and at least
> declare both lists, and get rid of all the #ifdefs. I wonder if gcc is
> smart enough to not allocate a static variable if it happens to be
> optimized out?
Nope, it didn't notice a thing.
This is a stab at that cleanup. Usable as is with Jan's ok, or as
fodder for your bitmaster-9000 patch shredder, or whatever. Box works
and it makes line count shrink...
I downgraded evolution v3.16->v3.12 to restore its ability to read its
own fscking "Preformatted" switch, so whitespace should be fine.
Oh, btw, if anyone (else) makes a 4.1-rt, your rt push work will want
one of those nifty hirq tags lest box make boom due to trying to do that
not only way late, but with irqs enabled which pisses sched all off.
Subject: [PATCH RT 3.18] irq_work: Provide a soft-irq based queue
Date: Thu, 16 Apr 2015 18:28:16 +0200
From: Jan Kiszka <[email protected]>
Instead of turning all irq_work requests into lazy ones on -rt, just
move their execution from hard into soft-irq context.
This resolves deadlocks of ftrace which will queue work from arbitrary
contexts, including those that have locks held that are needed for
raising a soft-irq.
Mike: cleanup ifdef mess and kill hirq_work_list. We need two lists,
and already have them, merely need to select according to work type.
In -rt all work not tagged for hirq execution is queued to the lazy
list and runs via irq_work_tick(). Raising SOFTIRQ_TIMER is always
done via IPI for deadlock safety. If the work item is not a lazy work
or the tick is stopped, fire the IPI immediately; otherwise let it wait.
IOW, lazy work is lazy in -rt only until someone queues immediate work.
Signed-off-by: Jan Kiszka <[email protected]>
Signed-off-by: Mike Galbraith <[email protected]>
---
Second try, looks much better so far. And it also removes my concerns
regarding other potential cases besides ftrace.
kernel/irq_work.c | 85 ++++++++++++++++++++----------------------------------
1 file changed, 33 insertions(+), 52 deletions(-)
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -23,9 +23,7 @@
static DEFINE_PER_CPU(struct llist_head, raised_list);
static DEFINE_PER_CPU(struct llist_head, lazy_list);
-#ifdef CONFIG_PREEMPT_RT_FULL
-static DEFINE_PER_CPU(struct llist_head, hirq_work_list);
-#endif
+
/*
* Claim the entry so that no one else will poke at it.
*/
@@ -68,7 +66,7 @@ void __weak arch_irq_work_raise(void)
*/
bool irq_work_queue_on(struct irq_work *work, int cpu)
{
- bool raise_irqwork;
+ struct llist_head *list;
/* All work should have been flushed before going offline */
WARN_ON_ONCE(cpu_is_offline(cpu));
@@ -80,19 +78,12 @@ bool irq_work_queue_on(struct irq_work *
if (!irq_work_claim(work))
return false;
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (work->flags & IRQ_WORK_HARD_IRQ)
- raise_irqwork = llist_add(&work->llnode,
- &per_cpu(hirq_work_list, cpu));
+ if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !(work->flags & IRQ_WORK_HARD_IRQ))
+ list = &per_cpu(lazy_list, cpu);
else
- raise_irqwork = llist_add(&work->llnode,
- &per_cpu(lazy_list, cpu));
-#else
- raise_irqwork = llist_add(&work->llnode,
- &per_cpu(raised_list, cpu));
-#endif
+ list = &per_cpu(raised_list, cpu);
- if (raise_irqwork)
+ if (llist_add(&work->llnode, list))
arch_send_call_function_single_ipi(cpu);
return true;
@@ -103,6 +94,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
/* Enqueue the irq work @work on the current CPU */
bool irq_work_queue(struct irq_work *work)
{
+ struct llist_head *list;
+ bool lazy_work, realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
+
/* Only queue if not already pending */
if (!irq_work_claim(work))
return false;
@@ -110,25 +104,17 @@ bool irq_work_queue(struct irq_work *wor
/* Queue the entry and raise the IPI if needed. */
preempt_disable();
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (work->flags & IRQ_WORK_HARD_IRQ) {
- if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
- arch_irq_work_raise();
- } else {
- if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
- tick_nohz_tick_stopped())
- raise_softirq(TIMER_SOFTIRQ);
- }
-#else
- if (work->flags & IRQ_WORK_LAZY) {
- if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
- tick_nohz_tick_stopped())
- arch_irq_work_raise();
- } else {
- if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
+ lazy_work = work->flags & IRQ_WORK_LAZY;
+
+ if (lazy_work || (realtime && !(work->flags & IRQ_WORK_HARD_IRQ)))
+ list = this_cpu_ptr(&lazy_list);
+ else
+ list = this_cpu_ptr(&raised_list);
+
+ if (llist_add(&work->llnode, list)) {
+ if (!lazy_work || tick_nohz_tick_stopped())
arch_irq_work_raise();
}
-#endif
preempt_enable();
@@ -143,12 +129,8 @@ bool irq_work_needs_cpu(void)
raised = this_cpu_ptr(&raised_list);
lazy = this_cpu_ptr(&lazy_list);
- if (llist_empty(raised))
- if (llist_empty(lazy))
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (llist_empty(this_cpu_ptr(&hirq_work_list)))
-#endif
- return false;
+ if (llist_empty(raised) && llist_empty(lazy))
+ return false;
/* All work should have been flushed before going offline */
WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
@@ -162,9 +144,7 @@ static void irq_work_run_list(struct lli
struct irq_work *work;
struct llist_node *llnode;
-#ifndef CONFIG_PREEMPT_RT_FULL
- BUG_ON(!irqs_disabled());
-#endif
+ BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !irqs_disabled());
if (llist_empty(list))
return;
@@ -200,26 +180,27 @@ static void irq_work_run_list(struct lli
*/
void irq_work_run(void)
{
-#ifdef CONFIG_PREEMPT_RT_FULL
- irq_work_run_list(this_cpu_ptr(&hirq_work_list));
-#else
irq_work_run_list(this_cpu_ptr(&raised_list));
- irq_work_run_list(this_cpu_ptr(&lazy_list));
-#endif
+ if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL)) {
+ /*
+ * NOTE: we raise softirq via IPI for safety,
+ * and execute in irq_work_tick() to move the
+ * overhead from hard to soft irq context.
+ */
+ if (!llist_empty(this_cpu_ptr(&lazy_list)))
+ raise_softirq(TIMER_SOFTIRQ);
+ } else
+ irq_work_run_list(this_cpu_ptr(&lazy_list));
}
EXPORT_SYMBOL_GPL(irq_work_run);
void irq_work_tick(void)
{
-#ifdef CONFIG_PREEMPT_RT_FULL
- irq_work_run_list(this_cpu_ptr(&lazy_list));
-#else
- struct llist_head *raised = &__get_cpu_var(raised_list);
+ struct llist_head *raised = this_cpu_ptr(&raised_list);
if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
irq_work_run_list(raised);
- irq_work_run_list(&__get_cpu_var(lazy_list));
-#endif
+ irq_work_run_list(this_cpu_ptr(&lazy_list));
}
/*
On 2015-04-24 08:54, Mike Galbraith wrote:
> On Thu, 2015-04-23 at 17:00 -0400, Steven Rostedt wrote:
>> On Thu, 23 Apr 2015 09:19:26 +0200
>> Mike Galbraith <[email protected]> wrote:
>>
>>>> CC kernel/irq_work.o
>>>> In file included from ../include/asm-generic/percpu.h:6:0,
>>>> from ../arch/x86/include/asm/percpu.h:522,
>>>> from ../arch/x86/include/asm/current.h:5,
>>>> from ../arch/x86/include/asm/processor.h:15,
>>>> from ../arch/x86/include/asm/irq_work.h:4,
>>>> from ../include/linux/irq_work.h:47,
>>>> from ../kernel/irq_work.c:11:
>>>> ../kernel/irq_work.c: In function ‘irq_work_queue_on’:
>>>> ../kernel/irq_work.c:85:17: error: ‘hirq_work_list’ undeclared
>>>> (first use in this function)
>>>> &per_cpu(hirq_work_list, cpu));
>>>
>>> Aw poo, so that's just what I _thought_ it was for.
>>
>> It helps optimization but does nothing for undefined symbols.
>>
>> That said, why don't we clean up that irq_work code and at least
>> declare both lists, and get rid of all the #ifdefs. I wonder if gcc is
>> smart enough to not allocate a static variable if it happens to be
>> optimized out?
>
> Nope, it didn't notice a thing.
>
> This is a stab at that cleanup. Usable as is with Jan's ok, or as
> fodder for your bitmaster-9000 patch shredder, or whatever. Box works
> and it makes line count shrink...
>
> I downgraded evolution v3.16->v3.12 to restore its ability to read its
> own fscking "Preformatted" switch, so whitespace should be fine.
>
> Oh, btw, if anyone (else) makes a 4.1-rt, your rt push work will want
> one of those nifty hirq tags lest box make boom due to trying to do that
> not only way late, but with irqs enabled which pisses sched all off.
>
> Subject: [PATCH RT 3.18] irq_work: Provide a soft-irq based queue
> Date: Thu, 16 Apr 2015 18:28:16 +0200
> From: Jan Kiszka <[email protected]>
>
> Instead of turning all irq_work requests into lazy ones on -rt, just
> move their execution from hard into soft-irq context.
>
> This resolves deadlocks of ftrace which will queue work from arbitrary
> contexts, including those that have locks held that are needed for
> raising a soft-irq.
>
> Mike: cleanup ifdef mess and kill hirq_work_list. We need two lists,
> and already have them, merely need to select according to work type.
> In -rt all work not tagged for hirq execution is queued to the lazy
> list and runs via irq_work_tick(). Raising SOFTIRQ_TIMER is always
> done via IPI for deadlock safety. If the work item is not a lazy work
> or the tick is stopped, fire the IPI immediately; otherwise let it wait.
> IOW, lazy work is lazy in -rt only until someone queues immediate work.
The approach looks good to me, but the commit log deserves a rework now.
Jan
--
Siemens AG, Corporate Technology, CT RTC ITP SES-DE
Corporate Competence Center Embedded Linux
On Fri, 2015-04-24 at 11:00 +0200, Jan Kiszka wrote:
> The approach looks good to me, but the commit log deserves a rework now.
Yeah. While you're at it, you should change my chop to an ack or a
tested-by too, as it's your patch, I just rearranged a bit.
-Mike
On Fri, 2015-04-24 at 11:00 +0200, Jan Kiszka wrote:
> The approach looks good to me, but the commit log deserves a rework now.
Ok, we agree on the approach, and that the changelog wants a bit of
attention, so either you're gonna rewrite it to suit you, do a pretty
changelog, and I ack, or I take the blame for the posted form, scribble
something that I hope is a better log, and you ack. Either will work.
Here's my changelog+blame-taking, if you're ok with it, ack, and we can
call it a day, otherwise onward to plan B.
irq_work: Delegate non-immediate irq work to ksoftirqd
Based on a patch from Jan Kiszka.
Jan reported that ftrace queueing work from arbitrary contexts can
and does lead to deadlock. trace-cmd -e sched:* deadlocked in fact.
Resolve the problem by delegating all non-immediate work to ksoftirqd.
We need two lists to do this, one for hard irq, one for soft, so we
can use the two existing lists, eliminating the -rt specific list and
all of the ifdefery while we're at it.
Strategy: Queue work tagged for hirq invocation to the raised_list,
invoke via IPI as usual. If a work item being queued to lazy_list,
which becomes our all-others list, is not a lazy work item, or the
tick is stopped, fire an IPI to raise SOFTIRQ_TIMER immediately;
otherwise let ksoftirqd find it when the tick comes along. Raising
SOFTIRQ_TIMER via IPI even when queueing locally ensures delegation.
Signed-off-by: Mike Galbraith <[email protected]>
---
kernel/irq_work.c | 85 ++++++++++++++++++++----------------------------------
1 file changed, 33 insertions(+), 52 deletions(-)
--- a/kernel/irq_work.c
+++ b/kernel/irq_work.c
@@ -23,9 +23,7 @@
static DEFINE_PER_CPU(struct llist_head, raised_list);
static DEFINE_PER_CPU(struct llist_head, lazy_list);
-#ifdef CONFIG_PREEMPT_RT_FULL
-static DEFINE_PER_CPU(struct llist_head, hirq_work_list);
-#endif
+
/*
* Claim the entry so that no one else will poke at it.
*/
@@ -68,7 +66,7 @@ void __weak arch_irq_work_raise(void)
*/
bool irq_work_queue_on(struct irq_work *work, int cpu)
{
- bool raise_irqwork;
+ struct llist_head *list;
/* All work should have been flushed before going offline */
WARN_ON_ONCE(cpu_is_offline(cpu));
@@ -80,19 +78,12 @@ bool irq_work_queue_on(struct irq_work *
if (!irq_work_claim(work))
return false;
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (work->flags & IRQ_WORK_HARD_IRQ)
- raise_irqwork = llist_add(&work->llnode,
- &per_cpu(hirq_work_list, cpu));
+ if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !(work->flags & IRQ_WORK_HARD_IRQ))
+ list = &per_cpu(lazy_list, cpu);
else
- raise_irqwork = llist_add(&work->llnode,
- &per_cpu(lazy_list, cpu));
-#else
- raise_irqwork = llist_add(&work->llnode,
- &per_cpu(raised_list, cpu));
-#endif
+ list = &per_cpu(raised_list, cpu);
- if (raise_irqwork)
+ if (llist_add(&work->llnode, list))
arch_send_call_function_single_ipi(cpu);
return true;
@@ -103,6 +94,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
/* Enqueue the irq work @work on the current CPU */
bool irq_work_queue(struct irq_work *work)
{
+ struct llist_head *list;
+ bool lazy_work, realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
+
/* Only queue if not already pending */
if (!irq_work_claim(work))
return false;
@@ -110,25 +104,17 @@ bool irq_work_queue(struct irq_work *wor
/* Queue the entry and raise the IPI if needed. */
preempt_disable();
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (work->flags & IRQ_WORK_HARD_IRQ) {
- if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
- arch_irq_work_raise();
- } else {
- if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
- tick_nohz_tick_stopped())
- raise_softirq(TIMER_SOFTIRQ);
- }
-#else
- if (work->flags & IRQ_WORK_LAZY) {
- if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
- tick_nohz_tick_stopped())
- arch_irq_work_raise();
- } else {
- if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
+ lazy_work = work->flags & IRQ_WORK_LAZY;
+
+ if (lazy_work || (realtime && !(work->flags & IRQ_WORK_HARD_IRQ)))
+ list = this_cpu_ptr(&lazy_list);
+ else
+ list = this_cpu_ptr(&raised_list);
+
+ if (llist_add(&work->llnode, list)) {
+ if (!lazy_work || tick_nohz_tick_stopped())
arch_irq_work_raise();
}
-#endif
preempt_enable();
@@ -143,12 +129,8 @@ bool irq_work_needs_cpu(void)
raised = this_cpu_ptr(&raised_list);
lazy = this_cpu_ptr(&lazy_list);
- if (llist_empty(raised))
- if (llist_empty(lazy))
-#ifdef CONFIG_PREEMPT_RT_FULL
- if (llist_empty(this_cpu_ptr(&hirq_work_list)))
-#endif
- return false;
+ if (llist_empty(raised) && llist_empty(lazy))
+ return false;
/* All work should have been flushed before going offline */
WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
@@ -162,9 +144,7 @@ static void irq_work_run_list(struct lli
struct irq_work *work;
struct llist_node *llnode;
-#ifndef CONFIG_PREEMPT_RT_FULL
- BUG_ON(!irqs_disabled());
-#endif
+ BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !irqs_disabled());
if (llist_empty(list))
return;
@@ -200,26 +180,27 @@ static void irq_work_run_list(struct lli
*/
void irq_work_run(void)
{
-#ifdef CONFIG_PREEMPT_RT_FULL
- irq_work_run_list(this_cpu_ptr(&hirq_work_list));
-#else
irq_work_run_list(this_cpu_ptr(&raised_list));
- irq_work_run_list(this_cpu_ptr(&lazy_list));
-#endif
+ if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL)) {
+ /*
+ * NOTE: we raise softirq via IPI for safety,
+ * and execute in irq_work_tick() to move the
+ * overhead from hard to soft irq context.
+ */
+ if (!llist_empty(this_cpu_ptr(&lazy_list)))
+ raise_softirq(TIMER_SOFTIRQ);
+ } else
+ irq_work_run_list(this_cpu_ptr(&lazy_list));
}
EXPORT_SYMBOL_GPL(irq_work_run);
void irq_work_tick(void)
{
-#ifdef CONFIG_PREEMPT_RT_FULL
- irq_work_run_list(this_cpu_ptr(&lazy_list));
-#else
- struct llist_head *raised = &__get_cpu_var(raised_list);
+ struct llist_head *raised = this_cpu_ptr(&raised_list);
if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
irq_work_run_list(raised);
- irq_work_run_list(&__get_cpu_var(lazy_list));
-#endif
+ irq_work_run_list(this_cpu_ptr(&lazy_list));
}
/*
On 2015-04-25 09:20, Mike Galbraith wrote:
> On Fri, 2015-04-24 at 11:00 +0200, Jan Kiszka wrote:
>
>> The approach looks good to me, but the commit log deserves a rework now.
>
> Ok, we agree on the approach, and that the changelog wants a bit of
> attention, so either you're gonna rewrite it to suit you, do a pretty
> changelog, and I ack, or I take the blame for the posted form, scribble
> something that I hope is a better log, and you ack. Either will work.
>
> Here's my changelog+blame-taking; if you're ok with it, ack and we can
> call it a day, otherwise onward to plan B.
>
>
>
> irq_work: Delegate non-immediate irq work to ksoftirqd
>
> Based on a patch from Jan Kiszka.
>
> Jan reported that ftrace queueing work from arbitrary contexts can
> and does lead to deadlock. trace-cmd -e sched:* deadlocked in fact.
>
> Resolve the problem by delegating all non-immediate work to ksoftirqd.
>
> We need two lists to do this, one for hard irq and one for soft, and we
> already have two, so use the existing lists, eliminating the -rt
> specific list and all of the ifdefery while we're at it.
>
> Strategy: Queue work tagged for hirq invocation to the raised_list and
> invoke it via IPI as usual. If a work item being queued to lazy_list
> (which becomes our "all others" list) is not a lazy work item, or the
> tick is stopped, fire an IPI to raise TIMER_SOFTIRQ immediately;
> otherwise let ksoftirqd find it when the tick comes along. Raising
> TIMER_SOFTIRQ via IPI even when queueing locally ensures delegation.
>
> Signed-off-by: Mike Galbraith <[email protected]>
>
> ---
> kernel/irq_work.c | 85 ++++++++++++++++++++----------------------------------
> 1 file changed, 33 insertions(+), 52 deletions(-)
>
> --- a/kernel/irq_work.c
> +++ b/kernel/irq_work.c
> @@ -23,9 +23,7 @@
>
> static DEFINE_PER_CPU(struct llist_head, raised_list);
> static DEFINE_PER_CPU(struct llist_head, lazy_list);
> -#ifdef CONFIG_PREEMPT_RT_FULL
> -static DEFINE_PER_CPU(struct llist_head, hirq_work_list);
> -#endif
> +
> /*
> * Claim the entry so that no one else will poke at it.
> */
> @@ -68,7 +66,7 @@ void __weak arch_irq_work_raise(void)
> */
> bool irq_work_queue_on(struct irq_work *work, int cpu)
> {
> - bool raise_irqwork;
> + struct llist_head *list;
>
> /* All work should have been flushed before going offline */
> WARN_ON_ONCE(cpu_is_offline(cpu));
> @@ -80,19 +78,12 @@ bool irq_work_queue_on(struct irq_work *
> if (!irq_work_claim(work))
> return false;
>
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - if (work->flags & IRQ_WORK_HARD_IRQ)
> - raise_irqwork = llist_add(&work->llnode,
> - &per_cpu(hirq_work_list, cpu));
> + if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !(work->flags & IRQ_WORK_HARD_IRQ))
> + list = &per_cpu(lazy_list, cpu);
> else
> - raise_irqwork = llist_add(&work->llnode,
> - &per_cpu(lazy_list, cpu));
> -#else
> - raise_irqwork = llist_add(&work->llnode,
> - &per_cpu(raised_list, cpu));
> -#endif
> + list = &per_cpu(raised_list, cpu);
>
> - if (raise_irqwork)
> + if (llist_add(&work->llnode, list))
> arch_send_call_function_single_ipi(cpu);
>
> return true;
> @@ -103,6 +94,9 @@ EXPORT_SYMBOL_GPL(irq_work_queue_on);
> /* Enqueue the irq work @work on the current CPU */
> bool irq_work_queue(struct irq_work *work)
> {
> + struct llist_head *list;
> + bool lazy_work, realtime = IS_ENABLED(CONFIG_PREEMPT_RT_FULL);
> +
> /* Only queue if not already pending */
> if (!irq_work_claim(work))
> return false;
> @@ -110,25 +104,17 @@ bool irq_work_queue(struct irq_work *wor
> /* Queue the entry and raise the IPI if needed. */
> preempt_disable();
>
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - if (work->flags & IRQ_WORK_HARD_IRQ) {
> - if (llist_add(&work->llnode, this_cpu_ptr(&hirq_work_list)))
> - arch_irq_work_raise();
> - } else {
> - if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
> - tick_nohz_tick_stopped())
> - raise_softirq(TIMER_SOFTIRQ);
> - }
> -#else
> - if (work->flags & IRQ_WORK_LAZY) {
> - if (llist_add(&work->llnode, this_cpu_ptr(&lazy_list)) &&
> - tick_nohz_tick_stopped())
> - arch_irq_work_raise();
> - } else {
> - if (llist_add(&work->llnode, this_cpu_ptr(&raised_list)))
> + lazy_work = work->flags & IRQ_WORK_LAZY;
> +
> + if (lazy_work || (realtime && !(work->flags & IRQ_WORK_HARD_IRQ)))
> + list = this_cpu_ptr(&lazy_list);
> + else
> + list = this_cpu_ptr(&raised_list);
> +
> + if (llist_add(&work->llnode, list)) {
> + if (!lazy_work || tick_nohz_tick_stopped())
> arch_irq_work_raise();
> }
> -#endif
>
> preempt_enable();
>
> @@ -143,12 +129,8 @@ bool irq_work_needs_cpu(void)
> raised = this_cpu_ptr(&raised_list);
> lazy = this_cpu_ptr(&lazy_list);
>
> - if (llist_empty(raised))
> - if (llist_empty(lazy))
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - if (llist_empty(this_cpu_ptr(&hirq_work_list)))
> -#endif
> - return false;
> + if (llist_empty(raised) && llist_empty(lazy))
> + return false;
>
> /* All work should have been flushed before going offline */
> WARN_ON_ONCE(cpu_is_offline(smp_processor_id()));
> @@ -162,9 +144,7 @@ static void irq_work_run_list(struct lli
> struct irq_work *work;
> struct llist_node *llnode;
>
> -#ifndef CONFIG_PREEMPT_RT_FULL
> - BUG_ON(!irqs_disabled());
> -#endif
> + BUG_ON(!IS_ENABLED(CONFIG_PREEMPT_RT_FULL) && !irqs_disabled());
>
> if (llist_empty(list))
> return;
> @@ -200,26 +180,27 @@ static void irq_work_run_list(struct lli
> */
> void irq_work_run(void)
> {
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - irq_work_run_list(this_cpu_ptr(&hirq_work_list));
> -#else
> irq_work_run_list(this_cpu_ptr(&raised_list));
> - irq_work_run_list(this_cpu_ptr(&lazy_list));
> -#endif
> + if (IS_ENABLED(CONFIG_PREEMPT_RT_FULL)) {
> + /*
> + * NOTE: we raise softirq via IPI for safety,
> + * and execute in irq_work_tick() to move the
> + * overhead from hard to soft irq context.
> + */
> + if (!llist_empty(this_cpu_ptr(&lazy_list)))
> + raise_softirq(TIMER_SOFTIRQ);
> + } else
> + irq_work_run_list(this_cpu_ptr(&lazy_list));
> }
> EXPORT_SYMBOL_GPL(irq_work_run);
>
> void irq_work_tick(void)
> {
> -#ifdef CONFIG_PREEMPT_RT_FULL
> - irq_work_run_list(this_cpu_ptr(&lazy_list));
> -#else
> - struct llist_head *raised = &__get_cpu_var(raised_list);
> + struct llist_head *raised = this_cpu_ptr(&raised_list);
>
> if (!llist_empty(raised) && !arch_irq_work_has_interrupt())
> irq_work_run_list(raised);
> - irq_work_run_list(&__get_cpu_var(lazy_list));
> -#endif
> + irq_work_run_list(this_cpu_ptr(&lazy_list));
> }
>
> /*
>
>
>
Acked-by: Jan Kiszka <[email protected]>
This way around makes more sense, as you changed the patch significantly.
Thanks,
Jan
* Jan Kiszka | 2015-04-25 09:26:13 [+0200]:
>Acked-by: Jan Kiszka <[email protected]>
>
>This way around makes more sense as you changed the patch significantly.
Now I took the proper one.
>Thanks,
>Jan
Sebastian