From: Dmitry Safonov
To: linux-kernel@vger.kernel.org
Cc: 0x7f454c46@gmail.com, Dmitry Safonov, Andrew Morton, David Miller,
    Eric Dumazet, Frederic Weisbecker, Hannes Frederic Sowa, Ingo Molnar,
    "Levin, Alexander (Sasha Levin)", Linus Torvalds, Paolo Abeni,
    "Paul E. McKenney", Peter Zijlstra, Radu Rendec, Rik van Riel,
    Stanislaw Gruszka, Thomas Gleixner, Wanpeng Li
Subject: [RFC 1/2] softirq: Defer net rx/tx processing to ksoftirqd context
Date: Tue, 9 Jan 2018 13:36:22 +0000
Message-Id: <20180109133623.10711-2-dima@arista.com>
X-Mailer: git-send-email 2.13.6
In-Reply-To: <20180109133623.10711-1-dima@arista.com>
References: <20180109133623.10711-1-dima@arista.com>

Warning: Not merge-ready

I. Current workflow of ksoftirqd.

Softirqs are processed in ksoftirqd context only if they are being raised
very frequently. How it works: do_softirq() and invoke_softirq() defer
pending softirqs if ksoftirqd is already in the runqueue. Ksoftirqd itself
is scheduled mostly at the end of softirq processing, when 2ms were not
enough to process all pending softirqs.

Here is a pseudo-picture of the workflow (for simplicity, on a UP system):

 -------------      ------------------      ------------------
 | ksoftirqd |      | User's process |      |    Softirqs    |
 -------------      ------------------      ------------------
 Not scheduled           Running
                            |
                            o-----------------------o
                                                     | __do_softirq()
                                                     |
                             2ms & softirq pending?
                             Schedule ksoftirqd      |
  Scheduled                 o-----------------------o
      |                     |
      o---------------------o
      |
   Running               Scheduled
      |
      o---------------------o
                            |
 Not scheduled           Running

Timegraph for the workflow; dash (-) means ksoftirqd is not scheduled,
equal (=) means ksoftirqd is scheduled and a softirq may still be pending:

 Pending       | | | |                 | | | |                |
 softirqs      v v v v                 | | | |                v
                                       | | | |
 Processing   o------o                 | | | |               o--o
 softirqs     |      |                 | | | |               |  |
              |      |                 | | | |               |  |
 Userspace  o-o      o=========o       v v v v         o-----o  o--------o
               <-2ms->         |                       |
                               |                       |
 Ksoftirqd                     o-----------------------o
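
For reference, the deferral described above boils down to roughly the
following mainline code (a simplified excerpt with added comments; the
exact lines are visible in the diff at the end of this mail).
invoke_softirq(), called on hardirq exit, performs the same
ksoftirqd_running() check:

static bool ksoftirqd_running(void)
{
	struct task_struct *tsk = __this_cpu_read(ksoftirqd);

	/* Is this CPU's ksoftirqd already runnable? */
	return tsk && (tsk->state == TASK_RUNNING);
}

asmlinkage __visible void do_softirq(void)
{
	__u32 pending;
	unsigned long flags;

	if (in_interrupt())
		return;

	local_irq_save(flags);

	pending = local_softirq_pending();

	/*
	 * Process pending softirqs in the current context unless
	 * ksoftirqd is already runnable - then leave them to it.
	 */
	if (pending && !ksoftirqd_running())
		do_softirq_own_stack();

	local_irq_restore(flags);
}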
II. Corner conditions.

While testing commit [1] on a non-mainstream driver, I found that, due to
platform specifics, the IRQ is raised too late (after the softirq has
already been processed). As a result, softirqs steal time from the
userspace process, leaving it starved of CPU time, while ksoftirqd is
never/rarely scheduled:

 Pending       |       |       |       |       |       |
 softirqs      v       v       v       v       v       v

 Processing   o-----o o-----o o-----o o-----o o-----o o ...
 softirqs     |     | |     | |     | |     | |     | |
              |     | |     | |     | |     | |     | |
 Userspace  o-o     o-o     o-o     o-o     o-o     o-o      (starving)

 Ksoftirqd                                                    (rarely scheduled)

Afterwards I realized that the same may happen on mainstream if the PPS
rate is such that an IRQ is raised just after the previous softirq has
been processed. I managed to reproduce the conjecture, see (IV).

III. RFC proposal.

At first, I tried to account all the time spent on softirq processing to
the ksoftirqd thread that serves the local CPU, and to compare the
vruntime of ksoftirqd and the current task to decide whether a softirq
should be delayed. You may imagine what disgraceful hacks were involved.

The current RFC has nothing of that kind and relies on fair scheduling of
ksoftirqd and other tasks. To do that, we check pending softirqs and serve
them in the current context only if there are non-net softirqs pending.
The following patch adds a mask to __do_softirq() to process net softirqs
only in ksoftirqd context when multiple softirqs are pending.

IV. Test results.

Unfortunately, I wasn't able to test it on hardware with a mainstream
kernel, so I only have results from QEMU VMs running Fedora 26. The first
VM stresses the second with UDP packets generated by pktgen. The receiver
VM runs the udp_sink [2] program, which prints the number of PPS served.
The VMs have virtio network cards, run with rt priority and are pinned to
different CPUs on the host. The host's CPU is an Intel Core i7-7600U @
2.80GHz. The RFC definitely needs testing on real HW (I don't expect
anyone to quite believe VM perf testing) - any help with testing it would
be appreciated.

 Source | Destination (pps)
  (pps) |------------------------------------
        |      master      |      RFC       |
        |    (4.15-rc4)    |                |
--------|------------------|----------------|
   5000 |    5000.7        |    4999.7      |
--------|------------------|----------------|
   7000 |   6997.42        |   6995.88      |
--------|------------------|----------------|
   8000 |   7999.55        |   7999.86      |
--------|------------------|----------------|
   9000 |   8951.37        |   8986.30      |
--------|------------------|----------------|
  10000 |   9864.96        |   9972.05      |
--------|------------------|----------------|
  11000 |  10711.92        |  10976.26      |
--------|------------------|----------------|
  12000 |  11494.79        |  11962.40      |
--------|------------------|----------------|
  13000 |  12161.76        |  12946.91      |
--------|------------------|----------------|
  14000 |  11152.07        |  13942.96      |
--------|------------------|----------------|
  15000 |   8650.22        |  14878.26      |
--------|------------------|----------------|
  16000 |   7662.55        |  15880.60      |
--------|------------------|----------------|
  17000 |   6485.49        |  16814.07      |
--------|------------------|----------------|
  18000 |   5489.48        |  17679.69      |
--------|------------------|----------------|
  19000 |   4679.59        |  18543.60      |
--------|------------------|----------------|
  20000 |   4738.24        |  19233.56      |
--------|------------------|----------------|
  21000 |   4015.00        |  20247.50      |
--------|------------------|----------------|
  22000 |   4376.99        |  20654.62      |
--------|------------------|----------------|
  23000 |   9429.80        |  20925.07      |
--------|------------------|----------------|
  24000 |   8872.33        |  21336.31      |
--------|------------------|----------------|
  25000 |  19824.67        |  21486.84      |
--------|------------------|----------------|
  30000 |  20779.49        |  21487.15      |
--------|------------------|----------------|
  40000 |  24559.83        |  21452.74      |
--------|------------------|----------------|
  50000 |  18469.20        |  21191.34      |
--------|------------------|----------------|
 100000 |  19773.00        |  22592.28      |
--------|------------------|----------------|

Note that I tested in VMs, and I found that if I produce more hw irqs on
the host, the results for master are not as dramatically bad, but still
much worse than with the RFC. For that reason I have qualms about whether
my test results are correct.
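
For clarity on the measurement itself, the receiver side does essentially
the following (an illustrative sketch only, not the actual udp_sink.c from
[2]; the port number and buffer size are arbitrary). A userspace counter
like this is sensitive to exactly the kind of starvation shown in (II),
which is what the test is meant to expose:

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <sys/socket.h>
#include <time.h>

int main(void)
{
	struct sockaddr_in addr = {
		.sin_family = AF_INET,
		.sin_addr.s_addr = htonl(INADDR_ANY),
		.sin_port = htons(9000),	/* arbitrary test port */
	};
	char buf[2048];
	unsigned long pkts = 0;
	time_t start = time(NULL);
	int fd = socket(AF_INET, SOCK_DGRAM, 0);

	if (fd < 0 || bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0) {
		perror("socket/bind");
		return 1;
	}

	for (;;) {
		time_t now;

		/* Drain one datagram; the payload itself is ignored. */
		if (recv(fd, buf, sizeof(buf), 0) < 0)
			continue;
		pkts++;

		/* Print packets-per-second roughly once a second. */
		now = time(NULL);
		if (now != start) {
			printf("%lu pps\n", pkts / (unsigned long)(now - start));
			pkts = 0;
			start = now;
		}
	}
}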
V. References:

[1] 4cd13c21b207 ("softirq: Let ksoftirqd do its job")
[2] https://github.com/netoptimizer/network-testing/blob/master/src/udp_sink.c

Signed-off-by: Dmitry Safonov
---
 kernel/softirq.c | 29 ++++++++++++++++++++++++-----
 1 file changed, 24 insertions(+), 5 deletions(-)

diff --git a/kernel/softirq.c b/kernel/softirq.c
index 2f5e87f1bae2..ee48f194dcec 100644
--- a/kernel/softirq.c
+++ b/kernel/softirq.c
@@ -88,6 +88,28 @@ static bool ksoftirqd_running(void)
 	return tsk && (tsk->state == TASK_RUNNING);
 }
 
+static bool defer_softirq(void)
+{
+	__u32 pending = local_softirq_pending();
+
+	if (!pending)
+		return true;
+
+	if (ksoftirqd_running())
+		return true;
+
+	/*
+	 * Defer net rx/tx softirqs to ksoftirqd processing, as they may
+	 * starve userspace of CPU time.
+	 */
+	if (pending & ((1 << NET_RX_SOFTIRQ) | (1 << NET_TX_SOFTIRQ))) {
+		wakeup_softirqd();
+		return true;
+	}
+
+	return false;
+}
+
 /*
  * preempt_count and SOFTIRQ_OFFSET usage:
  * - preempt_count is changed by SOFTIRQ_OFFSET on entering or leaving
@@ -315,7 +337,6 @@ asmlinkage __visible void __softirq_entry __do_softirq(void)
 
 asmlinkage __visible void do_softirq(void)
 {
-	__u32 pending;
 	unsigned long flags;
 
 	if (in_interrupt())
@@ -323,9 +344,7 @@ asmlinkage __visible void do_softirq(void)
 
 	local_irq_save(flags);
 
-	pending = local_softirq_pending();
-
-	if (pending && !ksoftirqd_running())
+	if (!defer_softirq())
 		do_softirq_own_stack();
 
 	local_irq_restore(flags);
@@ -352,7 +371,7 @@ void irq_enter(void)
 
 static inline void invoke_softirq(void)
 {
-	if (ksoftirqd_running())
+	if (defer_softirq())
 		return;
 
 	if (!force_irqthreads) {
-- 
2.13.6