Subject: Re: [PATCH 1/2] eventfd: Make wake counter work for single fd instead of all
To: He Zhe <zhe.he@windriver.com>, viro@zeniv.linux.org.uk, bcrl@kvack.org,
	linux-fsdevel@vger.kernel.org, linux-aio@kvack.org, linux-kernel@vger.kernel.org
References: <1586257192-58369-1-git-send-email-zhe.he@windriver.com>
	<3f395813-a497-aa25-71cc-8aed345b9f75@kernel.dk>
From: Jens Axboe <axboe@kernel.dk>
Message-ID: <3b4aa4cb-0e76-89c2-c48a-cf24e1a36bc2@kernel.dk>
Date: Thu, 9 Apr 2020 08:44:30 -0700

On 4/9/20 3:37 AM, He Zhe wrote:
>
>
> On 4/8/20 4:06 AM, Jens Axboe wrote:
>> On 4/7/20 3:59 AM, zhe.he@windriver.com wrote:
>>> From: He Zhe <zhe.he@windriver.com>
>>>
>>> commit b5e683d5cab8 ("eventfd: track eventfd_signal() recursion depth")
>>> introduces a percpu counter that tracks the percpu recursion depth and
>>> warns if it is greater than one, to avoid potential deadlock and stack
>>> overflow.
>>>
>>> However, sometimes different eventfds may be used in parallel.
>>> Specifically, when high network load goes through kvm and vhost, working
>>> as below, it triggers the following call trace.
>>>
>>> - 100.00%
>>>    - 66.51%
>>>         ret_from_fork
>>>         kthread
>>>       - vhost_worker
>>>          - 33.47% handle_tx_kick
>>>               handle_tx
>>>               handle_tx_copy
>>>               vhost_tx_batch.isra.0
>>>               vhost_add_used_and_signal_n
>>>               eventfd_signal
>>>          - 33.05% handle_rx_net
>>>               handle_rx
>>>               vhost_add_used_and_signal_n
>>>               eventfd_signal
>>>    - 33.49%
>>>         ioctl
>>>         entry_SYSCALL_64_after_hwframe
>>>         do_syscall_64
>>>         __x64_sys_ioctl
>>>         ksys_ioctl
>>>         do_vfs_ioctl
>>>         kvm_vcpu_ioctl
>>>         kvm_arch_vcpu_ioctl_run
>>>         vmx_handle_exit
>>>         handle_ept_misconfig
>>>         kvm_io_bus_write
>>>         __kvm_io_bus_write
>>>         eventfd_signal
>>>
>>> 001: WARNING: CPU: 1 PID: 1503 at fs/eventfd.c:73 eventfd_signal+0x85/0xa0
>>> ---- snip ----
>>> 001: Call Trace:
>>> 001:  vhost_signal+0x15e/0x1b0 [vhost]
>>> 001:  vhost_add_used_and_signal_n+0x2b/0x40 [vhost]
>>> 001:  handle_rx+0xb9/0x900 [vhost_net]
>>> 001:  handle_rx_net+0x15/0x20 [vhost_net]
>>> 001:  vhost_worker+0xbe/0x120 [vhost]
>>> 001:  kthread+0x106/0x140
>>> 001:  ? log_used.part.0+0x20/0x20 [vhost]
>>> 001:  ? kthread_park+0x90/0x90
>>> 001:  ret_from_fork+0x35/0x40
>>> 001: ---[ end trace 0000000000000003 ]---
>>>
>>> This patch moves the percpu counter into the eventfd control structure and
>>> does the related clean-ups, so that each eventfd is still protected from
>>> deadlock while different eventfds are allowed to work in parallel.
>>>
>>> As to potential stack overflow, we might want to figure out a better
>>> solution in the future that warns when the stack is about to overflow, so
>>> the stack can be better utilized, rather than break the working flow as
>>> soon as the second signal comes.
>> This doesn't work for the infinite recursion case, the state has to be
>> global, or per thread.
>
> Thanks, but I'm not very clear on why the counter has to be global or
> per thread.
>
> If the recursion happens on the same eventfd, the attempt to re-grab
> the same ctx->wqh.lock would be blocked by the fd-specific counter in
> this patch.
>
> If the recursion happens with a chain of different eventfds, that
> might lead to a stack overflow issue. The issue should be handled, but
> it seems unnecessary to stop at just the second ring of the chain
> (when the counter is about to reach 2).
>
> Specifically, in the vhost case it very likely runs under heavy
> network load, which generates loads of eventfd_signal() calls.
> Deferring the eventfd_signal() to worker threads would still end up
> tripping the global counter later and failing as above.
>
> So we might want to take care of the potential overflow later,
> hopefully with a measurement that can tell us whether the stack is
> about to overflow.

The worry is different eventfds; recursion on a single one could be
detected by keeping state in the ctx itself. And yeah, I agree that one
level isn't very deep, but wakeup chains can be deep and we can't allow
a whole lot more. I'm sure folks would be open to increasing it, if some
worst-case data was collected to prove it's fine to go deeper.

-- 
Jens Axboe
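
A minimal userspace sketch of the two guard schemes under discussion, for
reference. The names (evfd, signal_evfd, thread_depth) are invented for
illustration; this is not the kernel code. A per-thread depth counter,
standing in for the per-cpu one from commit b5e683d5cab8, trips on any
nested signal, including a chain A -> B across different eventfds, while a
per-object counter like the one the patch proposes only trips when the same
eventfd re-enters its own wakeup path:

#include <stdio.h>

struct evfd {
	const char *name;
	int in_signal;			/* per-object recursion counter */
};

static __thread int thread_depth;	/* per-thread, stands in for per-cpu */

/*
 * Model of eventfd_signal(): 'next' is an eventfd that the wakeup
 * handler of 'e' signals in turn, i.e. a wakeup chain.
 */
static void signal_evfd(struct evfd *e, struct evfd *next)
{
	if (thread_depth)
		printf("per-thread guard fires: nested signal inside %s\n",
		       e->name);
	if (e->in_signal)
		printf("per-object guard fires: recursion on %s\n", e->name);

	thread_depth++;
	e->in_signal++;

	if (next)
		signal_evfd(next, NULL);

	e->in_signal--;
	thread_depth--;
}

int main(void)
{
	struct evfd a = { "A", 0 }, b = { "B", 0 };

	/* Chain A -> B: only the per-thread (per-cpu style) guard fires. */
	signal_evfd(&a, &b);

	/* A -> A: both guards fire; this is the deadlock case. */
	signal_evfd(&a, &a);

	return 0;
}

Running the sketch prints one per-thread warning for the A -> B chain and
both warnings for the A -> A case, which is the distinction the thread is
arguing about: the per-object counter alone would let arbitrarily deep
cross-eventfd chains proceed.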