MIME-Version: 1.0
In-Reply-To: <20180111163204.GE6176@hirez.programming.kicks-ass.net>
References: <20180109133623.10711-1-dima@arista.com> <20180109133623.10711-2-dima@arista.com>
 <CANn89iK3M97MN0Pf3nXb+UAqqhUWOdSthHRBTYCwP75Ax_hO8Q@mail.gmail.com>
 <1515620880.3350.44.camel@arista.com> <CA+55aFyKKt4_5RT9RT8ZH-W26hC8=AvRYf8YxBm98dGSWwFs8g@mail.gmail.com>
 <20180111032232.GA11633@lerouge> <CA+55aFx_3zwQJ0YbDCL4YxpWEWhcEZfJnn42LzWBWDi3h1VdGA@mail.gmail.com>
 <20180111044456.GC11633@lerouge> <1515681091.3039.21.camel@arista.com>
 <CANn89i+mVmzrZ14Kttt=J0wsDOMHhm8CHiMRLQwEZXMxiVpftg@mail.gmail.com> <20180111163204.GE6176@hirez.programming.kicks-ass.net>
From: Linus Torvalds <torvalds@linux-foundation.org>
Date: Thu, 11 Jan 2018 10:48:48 -0800
Message-ID: <CA+55aFwc3CP-sKOyVvaLab3azmr3LnPfADnGJXDcxYz9dT75=A@mail.gmail.com>
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
To: Peter Zijlstra <peterz@infradead.org>
Cc: Eric Dumazet <edumazet@google.com>,
        Dmitry Safonov <dima@arista.com>,
        Frederic Weisbecker <frederic@kernel.org>,
        LKML <linux-kernel@vger.kernel.org>,
        Dmitry Safonov <0x7f454c46@gmail.com>,
        Andrew Morton <akpm@linux-foundation.org>,
        David Miller <davem@davemloft.net>,
        Frederic Weisbecker <fweisbec@gmail.com>,
        Hannes Frederic Sowa <hannes@stressinduktion.org>,
        Ingo Molnar <mingo@kernel.org>,
        "Levin, Alexander (Sasha Levin)" <alexander.levin@verizon.com>,
        Paolo Abeni <pabeni@redhat.com>,
        "Paul E. McKenney" <paulmck@linux.vnet.ibm.com>,
        Radu Rendec <rrendec@arista.com>,
        Rik van Riel <riel@redhat.com>,
        Stanislaw Gruszka <sgruszka@redhat.com>,
        Thomas Gleixner <tglx@linutronix.de>,
        Wanpeng Li <wanpeng.li@hotmail.com>
Content-Type: text/plain; charset="UTF-8"
Sender: linux-kernel-owner@vger.kernel.org

On Thu, Jan 11, 2018 at 8:32 AM, Peter Zijlstra <peterz@infradead.org> wrote:
> On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote:
>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>> index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba
>> 100644
>> --- a/kernel/softirq.c
>> +++ b/kernel/softirq.c
>> @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
>>
>>  /*
>>   * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
>> - * but break the loop if need_resched() is set or after 2 ms.
>> + * but break the loop after 2 ms.
>>   * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>>   * certain cases, such as stop_machine(), jiffies may cease to
>>   * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>> @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>>
>>         pending = local_softirq_pending();
>>         if (pending) {
>> -               if (time_before(jiffies, end) && !need_resched() &&
>> -                   --max_restart)
>> +               if (time_before(jiffies, end) && --max_restart)
>>                         goto restart;
>>
>>                 wakeup_softirqd();
>
> You just introduced a 2ms preempt-disable region I think, that's not
> cool for PREEMPT and a plain bug on PREEMPT_RT.

I actually think that he whole "time_before()" is garbage and should be removed.

If doing 10 rounds of softirq processing takes several jiffies, you
have bigger issue.

Looking at the history, the original time_before() was _replacing_ the
"max_retry".  See commit c10d73671ad3 ("softirq: reduce latencies").

But then "max_restart" was put back in because jiffies just isn't
reliable in the first place.

And I think *this* is the real issue. Networking does a *shitton* of
things in softirq context, and the problem is not the softirq code,
it's the networking code. The softirq code doesn't understand that the
networking code does billions of operations, and one softirq can take
ages and ages because it is handling millions of packets.

My feeling is that it's the networking code that should notice "I have
a billion packets in my softirq queue, and I just keep getting more,
now *I* should start doing things in a thread instead".

Because none of those limits really make sense for any of the other
softirq users. Everybody else just wants a low-latency "soft
interrupt", not some stupid thread. Who knew? Maybe the name should
have given the thing away?

If you want a thread, use a tasklet and add_task_work().

Maybe networking should reconsider its use of softirqs entirely?

                Linus