From: Eric Dumazet
Date: Thu, 11 Jan 2018 11:15:49 -0800
Subject: Re: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
To: Linus Torvalds
Cc: Peter Zijlstra, Dmitry Safonov, Frederic Weisbecker, LKML,
 Dmitry Safonov <0x7f454c46@gmail.com>, Andrew Morton, David Miller,
 Hannes Frederic Sowa, Ingo Molnar, "Levin, Alexander (Sasha Levin)",
 Paolo Abeni, "Paul E. McKenney", Radu Rendec, Rik van Riel,
 Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Jan 11, 2018 at 10:48 AM, Linus Torvalds wrote:
> On Thu, Jan 11, 2018 at 8:32 AM, Peter Zijlstra wrote:
>> On Thu, Jan 11, 2018 at 08:20:18AM -0800, Eric Dumazet wrote:
>>> diff --git a/kernel/softirq.c b/kernel/softirq.c
>>> index 2f5e87f1bae22f3df44fa4493fcc8b255882267f..d2f20daf77d14dc8ebde00d7c4a0237152d082ba 100644
>>> --- a/kernel/softirq.c
>>> +++ b/kernel/softirq.c
>>> @@ -192,7 +192,7 @@ EXPORT_SYMBOL(__local_bh_enable_ip);
>>>
>>>  /*
>>>   * We restart softirq processing for at most MAX_SOFTIRQ_RESTART times,
>>> - * but break the loop if need_resched() is set or after 2 ms.
>>> + * but break the loop after 2 ms.
>>>   * The MAX_SOFTIRQ_TIME provides a nice upper bound in most cases, but in
>>>   * certain cases, such as stop_machine(), jiffies may cease to
>>>   * increment and so we need the MAX_SOFTIRQ_RESTART limit as
>>> @@ -299,8 +299,7 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
>>>
>>>  	pending = local_softirq_pending();
>>>  	if (pending) {
>>> -		if (time_before(jiffies, end) && !need_resched() &&
>>> -		    --max_restart)
>>> +		if (time_before(jiffies, end) && --max_restart)
>>>  			goto restart;
>>>
>>>  		wakeup_softirqd();
>>
>> You just introduced a 2ms preempt-disable region I think, that's not
>> cool for PREEMPT and a plain bug on PREEMPT_RT.
>
> I actually think that the whole "time_before()" is garbage and should be
> removed.
>
> If doing 10 rounds of softirq processing takes several jiffies, you
> have bigger issues.
>
> Looking at the history, the original time_before() was _replacing_ the
> "max_restart". See commit c10d73671ad3 ("softirq: reduce latencies").
>
> But then "max_restart" was put back in because jiffies just isn't
> reliable in the first place.
>
> And I think *this* is the real issue. Networking does a *shitton* of
> things in softirq context, and the problem is not the softirq code,
> it's the networking code. The softirq code doesn't understand that the
> networking code does billions of operations, and one softirq can take
> ages and ages because it is handling millions of packets.
>
> My feeling is that it's the networking code that should notice "I have
> a billion packets in my softirq queue, and I just keep getting more,
> now *I* should start doing things in a thread instead".
>
> Because none of those limits really make sense for any of the other
> softirq users. Everybody else just wants a low-latency "soft
> interrupt", not some stupid thread. Who knew? Maybe the name should
> have given the thing away?
>
> If you want a thread, use a tasklet and add_task_work().
>
> Maybe networking should reconsider its use of softirqs entirely?

The issue can happen when one packet is delivered (and handled) per NIC
IRQ. Not 'billions' :/

Suppose a host receives 99,000 packets per second on one queue (one CPU
handling the particular IRQ), each packet taking 10 usec to handle
(various cache misses and all the overhead).

How are we supposed to know this particular workload is problematic for
innocent user threads on the same CPU?

Is 'networking' supposed to perform some kind of mid-term analysis of
what happened in the past and/or predict the future?
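To put numbers on that example: 99,000 packets/second at 10 usec each is
0.99 seconds of CPU time per wall-clock second, i.e. roughly 99% of the
core spent in softirq, even though no single softirq run ever sees
"billions" of packets. A minimal back-of-the-envelope program, using only
the figures from the example above (illustration, nothing measured):

#include <stdio.h>

int main(void)
{
	double pkts_per_sec = 99000.0;	/* packets/second on one RX queue */
	double usec_per_pkt = 10.0;	/* per-packet handling cost */

	/* fraction of one CPU spent in softirq for this workload */
	double busy = pkts_per_sec * usec_per_pkt / 1e6;

	printf("softirq share of the CPU: ~%.0f%%\n", busy * 100.0);
	return 0;
}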
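For readers following the diff quoted above, here is a toy userspace model
of the restart policy being argued about. It is a sketch only, not the
kernel code: the constants mirror kernel/softirq.c (at most 10 restart
rounds, a 2 ms budget), the helpers are stand-ins, and want_resched() plays
the role of the need_resched() check the patch removes, which is what opens
the up-to-2 ms preempt-disable window Peter objects to.

#include <stdbool.h>
#include <stdio.h>
#include <time.h>

#define MAX_SOFTIRQ_RESTART	10			/* restart rounds, as in mainline */
#define MAX_SOFTIRQ_TIME_NS	(2 * 1000 * 1000)	/* 2 ms budget */

static long long now_ns(void)
{
	struct timespec ts;

	clock_gettime(CLOCK_MONOTONIC, &ts);
	return ts.tv_sec * 1000000000LL + ts.tv_nsec;
}

/* Stand-ins for the kernel helpers, so the policy can run in userspace. */
static bool pending(void)      { return true; }		/* pretend work keeps arriving */
static bool want_resched(void) { return false; }	/* the check the patch drops */
static void run_handlers(void) { }			/* softirq handlers would run here */

int main(void)
{
	long long end = now_ns() + MAX_SOFTIRQ_TIME_NS;
	int max_restart = MAX_SOFTIRQ_RESTART;
	int rounds = 0;

restart:
	run_handlers();
	rounds++;

	if (pending()) {
		/* mainline keeps all three exits; the patch removes want_resched() */
		if (now_ns() < end && !want_resched() && --max_restart)
			goto restart;
		/* leftover work is handed to ksoftirqd at this point */
	}

	printf("ran %d rounds before deferring to ksoftirqd\n", rounds);
	return 0;
}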