Date: Fri, 29 Oct 2021 18:50:35 +0100
From: Catalin Marinas
To: Linus Torvalds
Cc: Andreas Gruenbacher, Paul Mackerras, Alexander Viro,
    Christoph Hellwig, "Darrick J. Wong", Jan Kara, Matthew Wilcox,
    cluster-devel, linux-fsdevel, Linux Kernel Mailing List,
    ocfs2-devel@oss.oracle.com, kvm-ppc@vger.kernel.org, linux-btrfs,
    Tony Luck, Andy Lutomirski
Subject: Re: [PATCH v8 00/17] gfs2: Fix mmap + page fault deadlocks
X-Mailing-List: linux-kernel@vger.kernel.org

On Thu, Oct 28, 2021 at 03:32:23PM -0700, Linus Torvalds wrote:
> The pointer color fault (or whatever some other architecture may do to
> generate sub-page faults) is not only not recoverable in the sense
> that we can't fix it up, it also ends up being a forced SIGSEGV (ie it
> can't be blocked - it has to either be caught or cause the process to
> be killed).
>
> And the thing is, I think we could just make the rule be that kernel
> code that has this kind of retry loop with fault_in_pages() would
> force an EFAULT on a pending SIGSEGV.
>
> IOW, the pending SIGSEGV could effectively be exactly that "thread flag".
>
> And that means that fault_in_xyz() wouldn't need to worry about this
> situation at all: the regular copy_from_user() (or whatever flavor it
> is - to/from/iter/whatever) would take the fault. And if it's a
> regular page fault, it would act exactly like it does now, so no
> changes.
>
> If it's a sub-page fault, we'd just make the rule be that we send a
> SIGSEGV even if the instruction in question has a user exception
> fixup.
>
> Then we just need to add the logic somewhere that does "if active
> pending SIGSEGV, return -EFAULT".
>
> Of course, that logic might be in fault_in_xyz(), but it might also be
> a separate function entirely.
>
> So this does effectively end up being a thread flag, but it's also
> slightly more than that - it's that a sub-page fault from kernel mode
> has semantics that a regular page fault does not.
>
> The whole "kernel access doesn't cause SIGSEGV, but returns -EFAULT
> instead" has always been an odd and somewhat wrong-headed thing. Of
> course it should cause a SIGSEGV, but that's not how Unix traditionally
> worked. We would just say "color faults always raise a signal, even if
> the color fault was triggered in a system call".

It's doable and, at least for MTE, people have asked for a signal even
when the fault was caused by a kernel uaccess. But there are some
potentially confusing aspects to sort out:

First of all, a uaccess in interrupt context should not force such a
signal, as it has nothing to do with the interrupted context. I guess we
can do an in_task() check in the fault handler.

Second, is there a chance that we enter the fault-in loop with a SIGSEGV
already pending?
Maybe it's not a problem: we just bail out of the loop early and
deliver the signal, though it is unrelated to the actual uaccess in the
loop.

Third is the sigcontext.pc presented to the signal handler. Normally for
SIGSEGV it points to the address of a load/store instruction and a
handler could disable MTE and restart from that point. With a syscall we
don't want it to point to the syscall instruction, as the syscall
shouldn't be restarted in case it copied something. Pointing it to the
next instruction after the syscall is backwards-compatible but it may
confuse the handler (if it does some reporting). I think we need to add
a new si_code that describes a fault in kernel mode, to differentiate it
from a genuine user access.

There was a discussion back in August on infinite loops with hwpoison
and Tony said that Andy convinced him that the kernel should not send a
SIGBUS for uaccess:

https://lore.kernel.org/linux-edac/20210823152437.GA1637466@agluck-desk2.amr.corp.intel.com/

I personally like the approach of a SIG{SEGV,BUS} on uaccess and I don't
think the ABI change is significant, but ideally we should have a unified
approach that's not just for MTE.

Adding Andy and Tony (the background is potentially infinite loops with
faults at sub-page granularity: arm64 MTE, hwpoison, sparc ADI).

Thanks.

-- 
Catalin
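
For illustration only, here is a rough userspace model of the "if pending
SIGSEGV, return -EFAULT" rule in a fault-in retry loop. It is not kernel
code and not a proposed patch. Every name in it (task_model, ubuf_model,
fault_in_model, copy_model, retry_loop_model, MODEL_WOULD_SPIN,
MODEL_MAX_RETRIES) is hypothetical. It assumes the probe only touches the
first byte of the range and that the copy is all-or-nothing, which is one
way the loop can fail to make progress on a sub-page fault.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical sentinel: the loop kept retrying without progress. */
#define MODEL_WOULD_SPIN (-1000L)
#define MODEL_MAX_RETRIES 8

/* Hypothetical stand-in for per-thread state. */
struct task_model {
    bool in_task;          /* false models a uaccess from interrupt context */
    bool sigsegv_pending;  /* stands in for a pending forced SIGSEGV */
};

/* Hypothetical user buffer with at most one sub-page fault. */
struct ubuf_model {
    size_t len;      /* buffer length in bytes */
    size_t bad_off;  /* offset that faults; >= len means no fault */
};

/*
 * Models a fault-in helper that probes only the first byte of the
 * range, so it can succeed even though a later byte would fault.
 */
static bool fault_in_model(const struct ubuf_model *u, size_t off)
{
    return off != u->bad_off;
}

/*
 * Models an all-or-nothing uaccess over [off, off + n). On a sub-page
 * fault it marks SIGSEGV pending, but only in task context (the
 * in_task() check suggested above). Returns true on success.
 */
static bool copy_model(struct task_model *t, const struct ubuf_model *u,
                       size_t off, size_t n)
{
    if (u->bad_off >= off && u->bad_off < off + n) {
        if (t->in_task)
            t->sigsegv_pending = true;
        return false;
    }
    return true;
}

/*
 * Models the retry loop. With honor_pending set, a pending SIGSEGV
 * forces -EFAULT. Without it, the loop is cut off after
 * MODEL_MAX_RETRIES and reports MODEL_WOULD_SPIN.
 */
static long retry_loop_model(struct task_model *t,
                             const struct ubuf_model *u,
                             size_t chunk, bool honor_pending)
{
    size_t done = 0;
    int retries = 0;

    assert(chunk > 0);
    while (done < u->len) {
        size_t left = u->len - done;
        size_t n = left < chunk ? left : chunk;

        if (honor_pending && t->sigsegv_pending)
            return -EFAULT;
        if (copy_model(t, u, done, n)) {
            done += n;
            retries = 0;
            continue;
        }
        if (!fault_in_model(u, done))
            return -EFAULT;  /* the probe itself caught the fault */
        if (++retries > MODEL_MAX_RETRIES)
            return MODEL_WOULD_SPIN;
    }
    return (long)done;
}
```

In this model, a fault in the middle of a chunk makes the loop spin unless
the pending signal is honoured. It also shows the second concern above: if
sigsegv_pending is already set on entry, the loop returns -EFAULT even for
a buffer with no fault at all.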