Received: by 10.213.65.68 with SMTP id h4csp850908imn; Wed, 4 Apr 2018 08:17:25 -0700 (PDT) X-Google-Smtp-Source: AIpwx4+qVYksaryud+B2qPKPKWkJ0lxqL+CMvN6AmV0myr4BmYEOpEksLqpk6Qy3NWByHKFdmB1C X-Received: by 2002:a17:902:85:: with SMTP id a5-v6mr18710245pla.99.1522855045908; Wed, 04 Apr 2018 08:17:25 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1522855045; cv=none; d=google.com; s=arc-20160816; b=xHsh5d1zYBvEsecqzv3JMT7MuErGKfTCsmxN7lvrS1RYT7v0krD0DTbdUp/zUXaXzN zxvgN18Xml5rMkndfrRCYssETyJ2grvdR4LQe8dymNwSFbJfXb63BeVyskqppebn6/l2 6UX7P4aKVzfM4xpB8/60W/szMMfnGjFwktJybt//gI0nXNChd3XqiiGkhZymhlU4y5xv urzknK9ApO2D0ZLo6Pt4JN5TmCexqUigeQY6AiOXr4mywFd5mX0jCJE/YoudVouUiPDT Zce/cn6mMLLjrMYKmZK1dT0NVSQH9AxBxxSkHloYRQI5opH3HUdpJsyHQshoTNxNTWpT S+NQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:cc:to:subject:message-id:date:from :references:in-reply-to:mime-version:dkim-signature :arc-authentication-results; bh=AUOTTFP2BkA8NgEvFY75A1AqubBDIZLGEAw+0xDLyxo=; b=rAmwldzI45gT2/+ELcKnTinO76GYM2dgVOxI7hUa9SY51T6R75J+6t34HI2eYvWmiv iEbW7qSXROovsAd+BMn18xAaBuepvsLAKxi6FDrFfaK2Hjr91PvkhJCPXQg7bJN0A/PX Q1eauh8EBXoG4pcf5FLn6rwT89w8Bl/8xm+343EfdKQva1zN+s03+YJDRi/oVRsamCf4 LJ/AMT3ItUqTSIY6lNLbY/PjY446llky1/3YgC/wYGJ0jZLclvsdLdZGZ9n6mGvUpqcN YWwucd8TQovdNEkf41R8mFIwg/TSI0VW17vo/McJ6u1AGxYJaTSrsx93U/Ky72F4peBS 7wJQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@ffwll.ch header.s=google header.b=kGXrWuZW; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e7si3793952pgq.348.2018.04.04.08.17.11; Wed, 04 Apr 2018 08:17:25 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@ffwll.ch header.s=google header.b=kGXrWuZW; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1751698AbeDDPPs (ORCPT + 99 others); Wed, 4 Apr 2018 11:15:48 -0400 Received: from mail-it0-f52.google.com ([209.85.214.52]:40773 "EHLO mail-it0-f52.google.com" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1751414AbeDDPPr (ORCPT ); Wed, 4 Apr 2018 11:15:47 -0400 Received: by mail-it0-f52.google.com with SMTP id u62-v6so17202348ita.5 for ; Wed, 04 Apr 2018 08:15:47 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=ffwll.ch; s=google; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :cc; bh=AUOTTFP2BkA8NgEvFY75A1AqubBDIZLGEAw+0xDLyxo=; b=kGXrWuZWwEPrTg/lQzBZYQWNlk+IlEbESa/AbTnbnOT7I/Kr8irmS7keUSeVx+Jecu L97fuWTfDYz+VMwFMpLOSncwAgzeZyOmnw1iFOvr+xR6U6Z88el8xrpgyt3FxJZshY3w ibkuFzjC5HxUcjivi8bgJmAT2gwNvT80w9WJw= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:mime-version:in-reply-to:references:from:date :message-id:subject:to:cc; bh=AUOTTFP2BkA8NgEvFY75A1AqubBDIZLGEAw+0xDLyxo=; b=IMTv9/e56oAiFfFJj9QdwC5KNMwY7BZiGg1pIom3DbLwL8CLdGs5bGejkhQb8Ifqsr b/Qj27+2LOonwhKhN8h8rk5gPwSSbYZZE7uQoILTEbW4frFpspuFUKMk89fHnM6jMUG9 yziPSM58BQC1tUuCv1fXvE0NllfUov22ljZq5ki4qmI500oFhpN6xz4IES2F4lCFZ6F9 N4P4PcH1Xzs05f7WMDyehAJhGmXORJ9yTMV79rvZeqBXqrUVpwmPu8pYd1VibZaNbvSz yGb5rikr4rth7C/QbL3gqqXhbCRSylH3XpqbPkpWSTDlL47l7+AeVZFLj6MY6HADV/hc RWJw== X-Gm-Message-State: AElRT7GZr17Wpmr8kQHwlRXoNhfAedu0yxWlb7Qr1fLSnltDAsmLAPeq V1drriiof7bJK2ulagk8ymLTVYqTHH5dY1BUricZQQ== X-Received: by 2002:a24:7904:: with SMTP id z4-v6mr9562392itc.63.1522854946622; Wed, 04 Apr 2018 08:15:46 -0700 (PDT) MIME-Version: 1.0 Received: by 10.79.40.129 with HTTP; Wed, 4 Apr 2018 08:15:46 -0700 (PDT) X-Originating-IP: [212.51.149.109] In-Reply-To: <20180404143900.GA1777@bombadil.infradead.org> References: <20180402141058.GL13332@bombadil.infradead.org> <20180404093254.GC3881@phenom.ffwll.local> <20180404143900.GA1777@bombadil.infradead.org> From: Daniel Vetter Date: Wed, 4 Apr 2018 17:15:46 +0200 Message-ID: Subject: Re: Signal handling in a page fault handler To: Matthew Wilcox Cc: dri-devel , Linux MM , Souptick Joarder , Linux Kernel Mailing List Content-Type: text/plain; charset="UTF-8" Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Wed, Apr 4, 2018 at 4:39 PM, Matthew Wilcox wrote: > On Wed, Apr 04, 2018 at 11:32:54AM +0200, Daniel Vetter wrote: >> So we've done some experiments for the case where the fault originated >> from kernel context (copy_to|from_user and friends). The fixup code seems >> to retry the copy once after the fault (in copy_user_handle_tail), if that >> fails again we get a short read/write. This might result in an -EFAULT, >> short read()/write() or anything else really, depending upon the syscall >> api. >> >> Except in some code paths in gpu drivers where we convert anything into >> -ERESTARTSYS/EINTR if there's a signal pending it won't ever result in the >> syscall getting restarted (well except maybe short read/writes if >> userspace bothers with that). >> >> So I guess gpu fault handlers indeed break the kernel's expectations, but >> then I think we're getting away with that because the inner workings of >> gpu memory objects is all heavily abstracted away by opengl/vulkan and >> friends. >> >> I guess what we could do is try to only do killable sleeps if it's a >> kernel fault, but that means wiring a flag through all the callchains. Not >> pretty. Except when there's a magic set of functions that would convert >> all interruptible sleeps to killable ones only for us. > > I actually have plans to allow mutex_lock_{interruptible,killable} to > return -EWOULDBLOCK if a flag is set. So this doesn't seem entirely > unrelated. Something like this perhaps: > > struct task_struct { > + unsigned int sleep_state; > }; > > static noinline int __sched > -__mutex_lock_interruptible_slowpath(struct mutex *lock) > +__mutex_lock_slowpath(struct mutex *lock, long state) > { > - return __mutex_lock(lock, TASK_INTERRUPTIBLE, 0, NULL, _RET_IP_); > + if (state == TASK_NOBLOCK) > + return -EWOULDBLOCK; > + return __mutex_lock(lock, state, 0, NULL, _RET_IP_); > } > > +int __sched mutex_lock_state(struct mutex *lock, long state) > +{ > + might_sleep(); > + > + if (__mutex_trylock_fast(lock)) > + return 0; > + > + return __mutex_lock_slowpath(lock, state); > +} > +EXPORT_SYMBOL(mutex_lock_state); > > Then the page fault handler can do something like: > > old_state = current->sleep_state; > current->sleep_state = TASK_INTERRUPTIBLE; > ... > current->sleep_state = old_state; > > > This has the page-fault-in-a-signal-handler problem. I don't know if > there's a way to determine if we're already in a signal handler and use > a different sleep_state ...? Not sure what problem you're trying to solve, but I don't think that's the one we have. The only way what we do goes wrong is if the fault originates from kernel context. For faults from the signal handler I think you just get to keep the pieces. Faults form kernel we can detect through FAULT_FLAG_USER. The issue I'm seeing is the following: 1. Some kernel code does copy_*_user, and it points at a gpu mmap region. 2. We fault and go into the gpu driver fault handler. That refuses to insert the pte because a signal is pending (because of all the interruptible waits and locks). 3. Fixup section runs, which afaict tries to do the copy once more using copy_user_handle_tail. 4. We fault again, because the pte is still not present. 5. GPU driver is still refusing to install the pte because signals are pending. 6. Fixup section for copy_user_handle_tail just bails out. 7. copy_*_user returns and indicates that that not all bytes have been copied. 8. syscall (or whatever it is) bails out and returns to userspace, most likely with -EFAULT (but this ofc depends upon the syscall and what it should do when userspace access faults. 9. Signal finally gets handled, but the syscall already failed, and no one will restart it. If userspace is prudent, it might fail (or maybe hit an assert or something). Or maybe I'm confused by your diff, since nothing seems to use current->sleep_state. The problem is also that it's any sleep we do (they all tend to be interruptible, at least when waiting for the gpu or taking any locks that might be held while waiting for the gpu, or anything else that might be blocked waiting for the gpu really). So only patching mutex_lock won't fix this. -Daniel -- Daniel Vetter Software Engineer, Intel Corporation +41 (0) 79 365 57 48 - http://blog.ffwll.ch