Subject: Re: [PATCH 1/4] softirq: implement IRQ flood detection mechanism
From: Sagi Grimberg
To: Long Li, Ming Lei
Cc: Jens Axboe, Hannes Reinecke, John Garry, Bart Van Assche,
    linux-scsi@vger.kernel.org, Peter Zijlstra, Daniel Lezcano, LKML,
    linux-nvme@lists.infradead.org, Keith Busch, Ingo Molnar,
    Thomas Gleixner, Christoph Hellwig
Date: Fri, 20 Sep 2019 13:45:30 -0700
Message-ID: <100d001a-1dda-32ff-fa5e-c18b121444d9@grimberg.me>

>>> Sagi,
>>>
>>> Sorry it took a while to bring my system back online.
>>>
>>> With the patch, the IOPS drop is about the same as with the 1st patch.
>>> I think the excessive context switches are causing the drop in IOPS.
>>>
>>> The following were captured by "perf sched record" for 30 seconds
>>> during the tests.
>>>
>>> "perf sched latency"
>>> With patch:
>>>   fio:(82) |  937632.706 ms | 1782255 | avg: 0.209 ms | max: 63.123 ms | max at:  768.274023 s
>>>
>>> Without patch:
>>>   fio:(82) | 2348323.432 ms |   18848 | avg: 0.295 ms | max: 28.446 ms | max at: 6447.310255 s
>>
>> Without patch means the proposed hard-irq patch?
>
> It means the current upstream code without any patch. But it's prone to
> soft lockup.
>
> Ming's proposed hard-irq patch gets similar results to "without patch",
> but it fixes the soft lockup.

Thanks for the clarification.

The problem with what Ming is proposing, in my mind (and it's an
existing problem that exists today), is that nvme takes precedence over
anything else until it absolutely cannot hog the cpu in hardirq.

In the thread, Ming referenced a case where today, if the cpu core has
net softirq activity, it cannot make forward progress. So with Ming's
suggestion, net softirq will eventually make progress, but it creates an
inherent fairness issue. Who said that nvme completions should come
faster than the net rx/tx or another I/O device (or hrtimers or sched
events...)?

As much as I'd like nvme to complete as soon as possible, I might have
other activities in the system that are as important, if not more so. So
I don't think we can solve this with something that is not cooperative
or fair with the rest of the system.

>> If we are context switching too much, it means the soft-irq operation
>> is not efficient, not necessarily that the completion path is running
>> in soft-irq..
>>
>> Is your kernel compiled with full preemption or voluntary preemption?
>
> The tests are based on the Ubuntu 18.04 kernel configuration.
> Here are the parameters:
>
> # CONFIG_PREEMPT_NONE is not set
> CONFIG_PREEMPT_VOLUNTARY=y
> # CONFIG_PREEMPT is not set

I see, so it still seems that irq_poll_softirq is not efficient in
reaping completions. Reaping the completions itself is pretty much the
same in hard and soft irq, so it's really the scheduling part that is
creating the overhead (which does not exist in hard irq).

Question: when you test without the patch (completions are coming in
hard-irq), do the fio threads that run on the cores that are handling
interrupts get substantially lower throughput than the rest of the fio
threads?

I would expect the fio threads running on the first 32 cores to get very
low iops (overpowered by the nvme interrupts) and the rest to do much
more, given that nvme has almost no limit on how much time it can spend
processing completions.

If need_resched() is causing us to context switch too aggressively, does
changing that to local_softirq_pending() make things better?
--
diff --git a/lib/irq_poll.c b/lib/irq_poll.c
index d8eab563fa77..05d524fcaf04 100644
--- a/lib/irq_poll.c
+++ b/lib/irq_poll.c
@@ -116,7 +116,7 @@ static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
                /*
                 * If softirq window is exhausted then punt.
                 */
-               if (need_resched())
+               if (local_softirq_pending())
                        break;
        }
--

Although this can potentially prevent other threads from making forward
progress.. If it is better, perhaps we also need a time limit as well
(rough sketch below).

Perhaps we should also add statistics/tracing on how many completions we
are reaping per invocation (also sketched below)...
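
For the time limit, something along these lines, maybe. This is only an
untested sketch to illustrate the shape of it; the one-jiffy limit and
the start_time local are made up for illustration, they are not in the
existing code:

--
static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
{
        /* hypothetical: remember when this softirq invocation started */
        unsigned long start_time = jiffies;

        /* ... existing list and budget setup ... */

        while (!list_empty(list)) {
                /* ... reap completions via iop->poll() as today ... */

                /*
                 * Punt if another softirq is pending, or if we have
                 * been hogging the cpu for more than one jiffy.
                 */
                if (local_softirq_pending() ||
                    time_after(jiffies, start_time + 1))
                        break;
        }

        /* ... rearm/reschedule as today ... */
}
--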
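
And for the statistics, even something crude would do as a starting
point. Again just a sketch; "reaped" is a hypothetical counter
accumulated from the poll callbacks' return values, and trace_printk is
debug-only instrumentation:

--
static void __latent_entropy irq_poll_softirq(struct softirq_action *h)
{
        int reaped = 0;

        /* ... */
        while (!list_empty(list)) {
                /* work is what iop->poll(iop, weight) returned */
                reaped += work;

                /* ... budget accounting and punt checks as today ... */
        }

        /* how much did this invocation actually reap? */
        trace_printk("irq_poll: reaped %d completions this invocation\n",
                     reaped);
}
--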