Received: by 2002:ad5:474a:0:0:0:0:0 with SMTP id i10csp3140461imu; Thu, 29 Nov 2018 16:28:52 -0800 (PST) X-Google-Smtp-Source: AFSGD/VgnGCS/ajeJYO5/lezBDPh1ZofStcg4AJvbRazas1T1NzZPa6BYK54YlbBu2dRgn5H6KTy X-Received: by 2002:a63:f444:: with SMTP id p4mr3045719pgk.124.1543537732312; Thu, 29 Nov 2018 16:28:52 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1543537732; cv=none; d=google.com; s=arc-20160816; b=ZTR7uRabQ3BvtF9TQJjXtTVpbl5n0XViN4MZla7+IKVK4OqfxEAITz3NBR0CfwuZiT LDMTlNcNIYIoUNvLnxrbHcXuCBEdCeI8oL0BokeLAFU9rJ5eEjKNJoSXSvN41/m5i8yv 2F80zYIfM4S2Zfs4wBHV2yPDtVzLN9PB2UZ3F9SffZTgMZp5gbrb8e46jfN66jBnUUo6 c6ucYda71yAs36KOfxETSSLxBLVRXIOG4ENZ23hTFaFLQ6mqZ/+As7ywj6r8PZMSc4GF 7QmX5DDH7owMvXU6KGM8jkHE6ukns7sa8fbji+9BNFdBNhYE1kUG5+tMFU0HlPHg47qk htvg== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:content-transfer-encoding :content-language:in-reply-to:mime-version:user-agent:date :message-id:references:cc:to:subject:from:dkim-signature; bh=2fYOi+yH2B427lRZccMPiL7YPi76VVmW8jxeNBI10Bo=; b=L6FTqMSKkEWf2rgdtIyXjmKloK0oLB3RwLZH+S9bTuFRnH3i2LYFtlnfsNReLKBiAT UczSXM9xdFggLfiXThjlRYNBrJGFLqlnvtYvAiywIG5mY1EFZ2DZ9S+Ss03MYmqx9ENF zRCJd5G8x/6Zl3YOaromx4T0Ep6ACXY5GMtS9gHKuM6UfM+7oY+4z+u6x9keFsejChoG T/UXKdcXL44coV/nNM5VYF2v1gVqaCbk2hpuNYxNpBW0Jt5EetZh+WAXulwTXOnPTwQI fyhSS/7pCy6Gruj2uISxXQrNsn9zB/mEMYJicsrNHoTU9Gz1Bnet+ttrr2IOuXpZTQRD mohw== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@cisco.com header.s=iport header.b=OEV408cX; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=cisco.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id a64si3296449pge.124.2018.11.29.16.28.37; Thu, 29 Nov 2018 16:28:52 -0800 (PST) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@cisco.com header.s=iport header.b=OEV408cX; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=QUARANTINE sp=QUARANTINE dis=NONE) header.from=cisco.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1727103AbeK3LfL (ORCPT + 99 others); Fri, 30 Nov 2018 06:35:11 -0500 Received: from alln-iport-2.cisco.com ([173.37.142.89]:8369 "EHLO alln-iport-2.cisco.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1726446AbeK3LfL (ORCPT ); Fri, 30 Nov 2018 06:35:11 -0500 DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=cisco.com; i=@cisco.com; l=8040; q=dns/txt; s=iport; t=1543537667; x=1544747267; h=from:subject:to:cc:references:message-id:date: mime-version:in-reply-to:content-transfer-encoding; bh=zAUoc1alI1jElF1lNzEfqKrkXoeuFhr63ENrhZqGeF4=; b=OEV408cXjAo7JMFCCvQxfR09Vh8sUUsMiQQ1r39kAJvrxXDSsfINOFGn MVZIWGgfhl4uv45EJ8RUzD9TO4zJcFS+L3JY3DpqF96Av0D9osph0P7qG JcVlL4J7OeDZ+s0gnQ+EMDL9hprdyNigWey/25xbGVQzIMoW0Y3ec5QCZ M=; X-IronPort-AV: E=Sophos;i="5.56,296,1539648000"; d="scan'208";a="206380669" Received: from alln-core-8.cisco.com ([173.36.13.141]) by alln-iport-2.cisco.com with ESMTP/TLS/DHE-RSA-AES256-GCM-SHA384; 30 Nov 2018 00:27:47 +0000 Received: from [10.156.165.82] ([10.156.165.82]) by alln-core-8.cisco.com (8.15.2/8.15.2) with ESMTP id wAU0RjIE000767; Fri, 30 Nov 2018 00:27:45 GMT From: Enke Chen Subject: Re: [PATCH v5 1/2] kernel/signal: Signal-based pre-coredump notification To: Dave Martin Cc: Oleg Nesterov , Thomas Gleixner , Ingo Molnar , Borislav Petkov , "H. Peter Anvin" , Peter Zijlstra , Arnd Bergmann , "Eric W. Biederman" , Khalid Aziz , Kate Stewart , Helge Deller , Greg Kroah-Hartman , Al Viro , Andrew Morton , Christian Brauner , Catalin Marinas , Will Deacon , Mauro Carvalho Chehab , Michal Hocko , Rik van Riel , "Kirill A. Shutemov" , Roman Gushchin , Marcos Paulo de Souza , Dominik Brodowski , Cyrill Gorcunov , Yang Shi , Jann Horn , Kees Cook , "linux-kernel@vger.kernel.org" , "linux-arch@vger.kernel.org" , "Victor Kamensky (kamensky)" , "xe-linux-external@cisco.com" , Stefan Strogin , Enke Chen References: <458c04d8-d189-4a26-729a-bb1d1d751534@cisco.com> <7741efa7-a3f8-62a1-ba52-613883164643@cisco.com> <84460a77-a111-404e-4bad-88104a6e246e@cisco.com> <20181026082812.GA10581@redhat.com> <21f678a8-4001-df36-c26e-e96cf203b1b1@cisco.com> <20181029111804.GA24820@redhat.com> <0c197608-3b7e-ffd1-8943-801a60beb917@cisco.com> <80e96710-f424-9b39-72ee-9cc7cbe7a5f7@cisco.com> <20181128151911.GN3505@e103592.cambridge.arm.com> <20181129115520.GO3505@e103592.cambridge.arm.com> Message-ID: Date: Thu, 29 Nov 2018 16:27:44 -0800 User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:52.0) Gecko/20100101 Thunderbird/52.9.1 MIME-Version: 1.0 In-Reply-To: <20181129115520.GO3505@e103592.cambridge.arm.com> Content-Type: text/plain; charset=utf-8 Content-Language: en-US Content-Transfer-Encoding: 7bit X-Auto-Response-Suppress: DR, OOF, AutoReply X-Outbound-SMTP-Client: 10.156.165.82, [10.156.165.82] X-Outbound-Node: alln-core-8.cisco.com Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org Hi, Dave: On 11/29/18 3:55 AM, Dave Martin wrote: >> Indeed, I defined the signal code CLD_PREDUMP for SIGCHLD initially, but it >> was removed after discussion: >> >> v3 --> v4: >> >> Addressed review comments from Oleg Nesterov, and Eric W. Biederman, >> including: >> o remove the definition CLD_PREDUMP. >> --- >> >> You can look at the discussions in the email thread, in particular several >> issues pointed out by Eric Biederman, and my reply to Eric. > > Ah, right. > >> There are two models 1:1 (one process manager with one child process), and 1:N >> (one process manager with many child processes). A legacy signal like SIGCHLD >> would not work in the 1:N model due to compression/loss of legacy signals. One >> need to use a RT signal in that case. > > SIGCHLD can be redirected to an RT signal via clone(). Are you saying > the signal is still not queued in that case? I had assumed that things > like pthreads rely on this working. > > However, one detail I had missed is that only child exits are reported > via the exit signal set by clone(). Other child status changes are > seem to be reported via SIGCHLD always. > > Making your supervised processes into clone-children might interact > badly with pthreads if it uses wait(__WCLONE) internally. I've not > looked into that. As Oleg commented before: And once again, SIGCHLD/SIGUSR do not queue, this means that PR_SET_PREDUMP_SIG is pointless if you have 2 or more children. In addition, there is really no need to introduce a new semantics to SIGCHLD. There are enough signals available for one to be designated in the parent process for the pre-coredump notification. > >> One more point in my reply: >> >> When an application chooses a signal for pre-coredump notification, it is much >> simpler and robust for the signal to be dedicated for that purpose (in the parent) >> and not be mixed with other semantics. The "signo + pid" should be sufficient for >> the parent process in both 1:1 and 1:N models. > > What if the signal queue overflows? sigqueue() returns EAGAIN, but I > think that signals queued by the kernel would simply be lost. This > probably won't happen in any non-pathological scenario, but the process > manager may just silently go wrong instead of failing cleanly when/if > this happens. As pointed out by Oleg: see the legacy_queue() check. Any signal < SIGRTMIN do not queue. IOW, if SIGCHLD is already pending, then next SIGCHLD is simply ignored. I went though the code and confirm it. > > SIGCHLD + wait() is immune to this problem for other child status > notifications (albeit with higher overhead). > > Unless I've missed something fundamental, signals simply aren't a > reliable data transport: if you need 100% reliability, you need to be > using another mechanism, either in combination with a signal, or by > itself. Given the right signo, e.g., a RT signal for both models, or SIGUSR1/SIGUSR2 for 1:1 model, the pre-coredump signal notification is 100% reliable, and it is the simplest solution. When there are many child processes for the 1:N model, if needed, there is an API for enlarging queue limit: setrlimit(RLIMIT_SIGPENDING, xxx); > >>>> >>>> Signed-off-by: Enke Chen >>>> Reviewed-by: Oleg Nesterov >>>> --- >>>> v4 -> v5: >>>> Addressed review comments from Oleg Nesterov: >>>> o use rcu_read_lock instead. >>>> o revert back to notify the real_parent. >>>> >>>> fs/coredump.c | 23 +++++++++++++++++++++++ >>>> fs/exec.c | 3 +++ >>>> include/linux/sched/signal.h | 3 +++ >>>> include/uapi/linux/prctl.h | 4 ++++ >>>> kernel/sys.c | 13 +++++++++++++ >>>> 5 files changed, 46 insertions(+) >>>> >>>> diff --git a/fs/coredump.c b/fs/coredump.c >>>> index e42e17e..740b1bb 100644 >>>> --- a/fs/coredump.c >>>> +++ b/fs/coredump.c >>>> @@ -536,6 +536,24 @@ static int umh_pipe_setup(struct subprocess_info *info, struct cred *new) >>>> return err; >>>> } >>>> >>>> +/* >>>> + * While do_notify_parent() notifies the parent of a child's death post >>>> + * its coredump, this function lets the parent (if so desired) know about >>>> + * the imminent death of a child just prior to its coredump. >>>> + */ >>>> +static void do_notify_parent_predump(void) >>>> +{ >>>> + struct task_struct *parent; >>>> + int sig; >>>> + >>>> + rcu_read_lock(); >>>> + parent = rcu_dereference(current->real_parent); >>>> + sig = parent->signal->predump_signal; >>>> + if (sig != 0) >>>> + do_send_sig_info(sig, SEND_SIG_NOINFO, parent, PIDTYPE_TGID); >>> >>> Doesn't this send si_code == SI_USER. That seems wrong: the receiving >>> process wouldn't not be able to distinguish a real pre-coredump >>> notification from a bogus one sent by kill(2) et >> The receiving process (i.e., the process manager) knows the PID of all >> its child processes. Thus it can easily tell whether the signal is from >> a child or not. > > But it can't tell whether the child sent the signal via kill(2) etc., or > whether the child started dumping core. Once a signal is chosen and designated for this special purpose, a child process would not send the signal to its parent via kill(2), right? > > It's cleaner to be able to filter reliably on si_code, especially if the > process you're supervising isn't fully under your control. For example, > even if the supervised process is considered trustworthy, it could still > be exploited by an attacker (or simply go wrong) in a way that causes > it do to a bogus kill(). > > (If you treat any incoming signal as anything more than a hint, failure > to check si_code is usually a bug in general.) > There is one signal that the application has designated for this special purpose. The application should take the appropriate action once the signal is received from its child. If it does not, that is a bug in the application and not in the kernel. >>> >>> SEND_SIG_PRIV also looks wrong, because it assumes that the sender is >>> "the kernel" so there is no si_pid. >> >> That is why SEND_SIG_NOINFO is chosen here for carrying the "pid". What >> matters to the parent process is the "signo + pid" for identifying the >> child process for the pre-coredump notification. > > I think that si_code should at least be SI_KERNEL, but how that is > achieved doesn't really matter. > Again, it is the "signo + pid" that matter to the parent process. There is no need to over-specify. >>>> diff --git a/kernel/sys.c b/kernel/sys.c >>>> index 123bd73..39aa3b8 100644 >>>> --- a/kernel/sys.c >>>> +++ b/kernel/sys.c >>>> @@ -2476,6 +2476,19 @@ int __weak arch_prctl_spec_ctrl_set(struct task_struct *t, unsigned long which, >>>> return -EINVAL; >>>> error = arch_prctl_spec_ctrl_set(me, arg2, arg3); >>>> break; >>>> + case PR_SET_PREDUMP_SIG: >>>> + if (arg3 || arg4 || arg5) >>> >>> glibc has >>> >>> int prctl(int option, ...); >>> >>> Some prctls() police extra arguments for zeros, but this means that >>> the userspace caller also has to supply pointless 0 arguments. >>> >>> It's debatable which is the preferred approach. Did you have any >>> particular rationale for your choice here? >>> >> >> The initial version did not check the values of these unused arguments. >> But Jann Horn pointed out the new convention is to enforce the 0 values >> so I followed ... > > Hmmm, I wasn't aware of this convention when I added PR_SVE_SET_VL etc., > and there is no clear pattern in sys.c, and nobody commented at the > time. > > Of course, it works either way. Here is Jann's comment: -- New prctl() calls should check that the unused arguments are zero, in order to make it possible to safely add more flags in the future. -- I can do it either way, but we do need to decide on a convention going forward. Thanks. -- Enke