MIME-Version: 1.0
In-Reply-To: <1515681091.3039.21.camel@arista.com>
References: <20180109133623.10711-1-dima@arista.com> <20180109133623.10711-2-dima@arista.com>
 <CANn89iK3M97MN0Pf3nXb+UAqqhUWOdSthHRBTYCwP75Ax_hO8Q@mail.gmail.com>
 <1515620880.3350.44.camel@arista.com> <CA+55aFyKKt4_5RT9RT8ZH-W26hC8=AvRYf8YxBm98dGSWwFs8g@mail.gmail.com>
 <20180111032232.GA11633@lerouge> <CA+55aFx_3zwQJ0YbDCL4YxpWEWhcEZfJnn42LzWBWDi3h1VdGA@mail.gmail.com>
 <20180111044456.GC11633@lerouge> <1515681091.3039.21.camel@arista.com>
From: Eric Dumazet <edumazet@google.com>
Date: Thu, 11 Jan 2018 08:20:18 -0800
Message-ID: <CANn89i+mVmzrZ14Kttt=J0wsDOMHhm8CHiMRLQwEZXMxiVpftg@mail.gmail.com>
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
To: Dmitry Safonov <dima@arista.com>
Cc: Frederic Weisbecker <frederic@kernel.org>,
        Linus Torvalds <torvalds@linux-foundation.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Dmitry Safonov <0x7f454c46@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        David Miller <davem@davemloft.net>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Hannes Frederic Sowa <hannes@stressinduktion.org>,
        Ingo Molnar <mingo@kernel.org>,
        "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>,
        Paolo Abeni <pabeni@redhat.com>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Peter Zijlstra <peterz@infradead.org>,
        Radu Rendec <rrendec@arista.com>,
        Rik van Riel <riel@redhat.com>,
        Stanislaw Gruszka <sgruszka@redhat.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Wanpeng Li <wanpeng.li@hotmail.com>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org

On Thu, Jan 11, 2018 at 6:31 AM, Dmitry Safonov <dima@arista.com> wrote:
> On Thu, 2018-01-11 at 05:44 +0100, Frederic Weisbecker wrote:
>> On Wed, Jan 10, 2018 at 08:19:49PM -0800, Linus Torvalds wrote:
>> > On Wed, Jan 10, 2018 at 7:22 PM, Frederic Weisbecker
>> > <frederic@kernel.org> wrote:
>> > >
>> > > Makes sense, but I think you need to keep the TASK_RUNNING check.
>> >
>> > Yes, good point.
>> >
>> > > So perhaps it should be:
>> > >
>> > > -       return tsk && (tsk->state == TASK_RUNNING);
>> > > +       return (tsk == current) && (tsk->state == TASK_RUNNING);
>> >
>> > Looks good to me - definitely worth trying.
>> >
>> > Maybe that weakens the thing so much that it doesn't actually help
>> > the
>> > UDP packet storm case?
>> >
>> > And maybe it's not sufficient for the dvb issue.
>> >
>> > But I think it's worth at least testing. Maybe it makes neither
>> > side
>> > entirely happy, but maybe it might be a good halfway point?
>>
>> Yes I believe Dmitry is facing a different problem where he would
>> rather
>> see ksoftirqd scheduled more often to handle the queue as a deferred
>> batch
>> instead of having it served one by one on the tails of IRQ storms.
>> (Dmitry correct me if I misunderstood).
>
> Quite so, what I see is that ksoftirqd is rarely (close to never)
> scheduled in case of UDP packet storm. That's because the up coming irq
> is too late in __do_softirq().
> So, there is no wakeup on UDP storm here:
> :        pending = local_softirq_pending();
> :        if (pending & mask) {
> :                if (time_before(jiffies, end) && !need_resched() &&
> :                    --max_restart)
> :                        goto restart;
> :
> :                wakeup_softirqd();
> :        }
> (as there is yet no pending softirq). It comes a bit late to schedule
> ksoftirqd and in result the next softirq is processed on the context of
> the task again, not in the scheduled ksoftirqd.
> That results in cpu-time starvation for the process on irq storm.
>
> While I saw that on out-of-tree driver, I believe that on some
> frequencies (lower than storm) one can observe the same on mainstream
> drivers. And I *think* that I've reproduced that on mainstream with
> virtio driver and package size of 1500 in VMs (thou I don't quite like
> the perf testing in VMs).
>
> So, ITOW, maybe there is a bit better way to *detect* that cpu time
> spent on serving softirqs is close to storm and that userspace starts
> starving? (and launch ksoftirqd in the result or balance between
> deferring and serving softirq right-there).
>
>> But your patch still seems to make sense for the case you described:
>> when
>> ksoftirqd is voluntarily preempted off and the current IRQ could
>> handle the
>> queue.


Note that ksoftirqd being kicked (TASK_RUNNING) is the sign of softirq pressure.
Or maybe we lack one bit to signal that __do_softirq() had to call
wakeup_softirqd() because of pressure.
(If I remember correctly, I added such a state when submitting my first patch,
https://www.spinics.net/lists/netdev/msg377172.html
then Peter suggested using tsk->state == TASK_RUNNING:
https://www.spinics.net/lists/netdev/msg377210.html )


Maybe the problem is not the new patch by itself, but the use of
need_resched() in __do_softirq() that I added in 2013
(commit c10d73671ad30f54692f7f69f0e09e75d3a8926a), combined with the
new patch.

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);

 /*
  * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
- * but break the loop if need_resched() is set or after 2 ms.
+ * but break the loop after 2 ms.
  * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
  * certain cases, such as stop_machine(), jiffies may cease to
  * increment and so we need the MAX_SOFTIRQ_RESTART limit as
@@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)

        pending = local_softirq_pending();
        if (pending) {
-               if (time_before(jiffies, end) && !need_resched() &&
-                   --max_restart)
+               if (time_before(jiffies, end) && --max_restart)
                        goto restart;

                wakeup_softirqd();
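
For readers who want to see what the hunk above changes, here is a minimal
userspace sketch of the two loop-exit conditions. It is not kernel code:
the struct, the enum and both function names are hypothetical, and the
boolean fields merely stand in for local_softirq_pending(),
time_before(jiffies, end) and need_resched().

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical model of the state inspected at the end of __do_softirq(). */
struct softirq_loop_state {
	bool pending;        /* stands in for local_softirq_pending() != 0 */
	bool within_budget;  /* stands in for time_before(jiffies, end)   */
	bool resched;        /* stands in for need_resched()              */
	int restarts_left;   /* stands in for max_restart                 */
};

enum softirq_action {
	SOFTIRQ_DONE,            /* nothing pending, just return       */
	SOFTIRQ_RESTART,         /* "goto restart" in the real code    */
	SOFTIRQ_WAKE_KSOFTIRQD,  /* wakeup_softirqd() in the real code */
};

/* Condition before the patch: need_resched() breaks the loop early. */
static enum softirq_action decide_with_resched(struct softirq_loop_state *s)
{
	if (!s->pending)
		return SOFTIRQ_DONE;
	if (s->within_budget && !s->resched && --s->restarts_left)
		return SOFTIRQ_RESTART;
	return SOFTIRQ_WAKE_KSOFTIRQD;
}

/* Condition after the patch: only the time and restart budgets apply. */
static enum softirq_action decide_without_resched(struct softirq_loop_state *s)
{
	if (!s->pending)
		return SOFTIRQ_DONE;
	if (s->within_budget && --s->restarts_left)
		return SOFTIRQ_RESTART;
	return SOFTIRQ_WAKE_KSOFTIRQD;
}
```

In this model, a state with softirqs pending, budget left and the resched
flag set hands off to ksoftirqd under the old condition but restarts the
loop under the new one. The handoff itself remains in both variants once
the time or restart budget runs out.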