Received: by 2002:a05:6a10:5bc5:0:0:0:0 with SMTP id os5csp487801pxb; Mon, 25 Oct 2021 12:12:33 -0700 (PDT) X-Google-Smtp-Source: ABdhPJwKPgtIQBct8h9UrDFxFJe6Kt5M7Hfe2w6BzAJJZ39McGJTJs1YXIgCU0hemiZs6a8/PpnW X-Received: by 2002:a62:3808:0:b0:47b:d1da:e734 with SMTP id f8-20020a623808000000b0047bd1dae734mr17910197pfa.2.1635189152805; Mon, 25 Oct 2021 12:12:32 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1635189152; cv=none; d=google.com; s=arc-20160816; b=U90YYwsRxgvCNj08tzE4/UPh++vaRC6MwG/O4oZAuktYd/rX45XVY+mBM5XMOYN8QH ifes2yuLzVzoWcKZZ7kympXCFjKqjSGLs2jMbmxbyK7mhGmcd72yyUWSKdR5hZYwB3qu KL8VmCldm2mL0BfshC6KxZFpZukNSN+CLlJ5VCf9ZyZ9/TycoGNJ/l5UpHMQ0wYmQnjn zQXjID6TVU4fmTc4Hux23FcRlxZ2ucilo/fKIw8X5hUDy+Tycezs518PFOYQTeSZbK9y /zRWs2jGA8lkPo64xyBnRc1AA4GGlONLaDH91lFZAUs9riS0FKYDoKk3nuMU7e0eAeVh Ca0A== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:to:references:message-id :content-transfer-encoding:cc:date:in-reply-to:from:subject :mime-version:dkim-signature; bh=CV9xop57N44Xoc3ELL1tfkVbPt6ec4IJCZX741zNY1A=; b=FLhZs6UCZBqBChzKK0zmXn1Sd5qQQIMh0F2HaTMqKmWTzgGhZTfsOTwCJTTAYmHx6b hHiEdKVGqWGmvsoLa//XY02O40JyjnDjhYimtFSv04vjaMlchHqnvCi+2ZCXo7sPuoX4 tPQMi47QYH795Kl7l89Ueym+D1QyDmK4x9/pCheGFxtjZU/Z4BJuWPm856Zx98blrWhJ m25Np6YznOUwtE/D49nX8xYVzlMaP9bm0BMrvTMLfVeuHARJrDHwD10IANUsZdB+GpCR 7NbZReSJQwO8/O8/UTadthnr8xR0C/0HfkspSEpmmf5ybtmE7ntP4VZV8VAMwUqGHyhu cFsQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=IVxTXSin; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id h16si29814637pfh.288.2021.10.25.12.12.07; Mon, 25 Oct 2021 12:12:32 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@gmail.com header.s=20210112 header.b=IVxTXSin; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=QUARANTINE dis=NONE) header.from=gmail.com Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233909AbhJYQWC (ORCPT + 99 others); Mon, 25 Oct 2021 12:22:02 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:37720 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233494AbhJYQWB (ORCPT ); Mon, 25 Oct 2021 12:22:01 -0400 Received: from mail-pl1-x630.google.com (mail-pl1-x630.google.com [IPv6:2607:f8b0:4864:20::630]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 3FCDFC061745 for ; Mon, 25 Oct 2021 09:19:39 -0700 (PDT) Received: by mail-pl1-x630.google.com with SMTP id v20so8286792plo.7 for ; Mon, 25 Oct 2021 09:19:39 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=20210112; h=mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=CV9xop57N44Xoc3ELL1tfkVbPt6ec4IJCZX741zNY1A=; b=IVxTXSin5BglUw4bXJHX+dRiUU5dDRS/G8xMx7vrIAKQGjUrGJoafbgQcoyySh8izD QOHQyTU2CzqkphZ631WAQqBJFo3HqbzSPddnHSqGtyhAvIzzqqpRCnD9EcKXn5pQ0DZc gRWIcFSA9GRZLWqHqxTjLYOPVWcrdV453x9sZKnHvJgSJUqKgcIaTRph9nJ0bzV7w1Bz 6eL5sn21ezzZ2AMqBJOEyIr90Hhcdu/A3YvL5tFpGCLZjnWGCLhTTxwzL5SlOkXP4BbT IWeqM5ekgmOkOFHbVV8yyCIlg0Es5Ebss7lPx0ECXG302ElTCnfd2VLON4PqHymGPzgU 0Bbg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; h=x-gm-message-state:mime-version:subject:from:in-reply-to:date:cc :content-transfer-encoding:message-id:references:to; bh=CV9xop57N44Xoc3ELL1tfkVbPt6ec4IJCZX741zNY1A=; b=BOZ8ZRP4YvLTzD80lB1ytCkr3cmqnA+PK69T4bOcfsvy7U6W+QuP/UFpJ8xxwSpMHt f9TjYOQSFdxZugYuTgeOtgrPG0oycwpn4EQ+MJClFp0l1KyP0qiQg5zwnUgvDl8BZEhZ NORWqYgvyKEQP5HgEVG7dqRDt4ud8q4x3e00wwI9FlAoU8paR2Ege1Fp10j51nqerzxB BbaVhF6Qe3A3YipGqO/TFHUIYpdeJIVQNkuatQPxcRgP7qnd6U9f7TQksYavgGtkjUfm r8Wwgd3sJ5RmHIqcQpHl0jyJXTgwY5CtnzFq8XglkGBHj3UDEBnU4uaJf8pt+rijZERK mlKw== X-Gm-Message-State: AOAM531fD3k7b5efJwTKzN4mC1mOOryTtdv+loThk7z4gDKre5cs455O 7UNhi+ysQ4n5ox+JBylL5o0= X-Received: by 2002:a17:90a:c088:: with SMTP id o8mr17219078pjs.1.1635178778487; Mon, 25 Oct 2021 09:19:38 -0700 (PDT) Received: from smtpclient.apple (c-24-6-216-183.hsd1.ca.comcast.net. [24.6.216.183]) by smtp.gmail.com with ESMTPSA id gp21sm3727491pjb.47.2021.10.25.09.19.36 (version=TLS1_2 cipher=ECDHE-ECDSA-AES128-GCM-SHA256 bits=128/128); Mon, 25 Oct 2021 09:19:37 -0700 (PDT) Content-Type: text/plain; charset=utf-8 Mime-Version: 1.0 (Mac OS X Mail 14.0 \(3654.120.0.1.13\)) Subject: Re: [PATCH v2 3/5] x86/mm: check exec permissions on fault From: Nadav Amit In-Reply-To: Date: Mon, 25 Oct 2021 09:19:35 -0700 Cc: Linux-MM , LKML , Andrea Arcangeli , Andrew Cooper , Andrew Morton , Andy Lutomirski , Dave Hansen , Peter Xu , Peter Zijlstra , Thomas Gleixner , Will Deacon , Yu Zhao , Nick Piggin , x86@kernel.org Content-Transfer-Encoding: quoted-printable Message-Id: <00C2DC4B-A77D-4B32-B7F7-2291830BC2D2@gmail.com> References: <20211021122112.592634-1-namit@vmware.com> <20211021122112.592634-4-namit@vmware.com> To: Dave Hansen X-Mailer: Apple Mail (2.3654.120.0.1.13) Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org > On Oct 25, 2021, at 7:20 AM, Dave Hansen = wrote: >=20 > On 10/21/21 5:21 AM, Nadav Amit wrote: >> access_error() currently does not check for execution permission >> violation.=20 > Ye >=20 >> As a result, spurious page-faults due to execution permission >> violation cause SIGSEGV. >=20 > While I could totally believe that something is goofy when VMAs are > being changed underneath a page fault, I'm having trouble figuring out > why the "if (error_code & X86_PF_WRITE)" code is being modified. In the scenario I mentioned the VMAs are not changed underneath the page-fault. They change *before* the page-fault, but there are residues of the old PTE in the TLB.=20 >=20 >> It appears not to be an issue so far, but the next patches avoid TLB >> flushes on permission promotion, which can lead to this scenario. = nodejs >> for instance crashes when TLB flush is avoided on permission = promotion. >=20 > Just to be clear, "promotion" is going from something like: >=20 > W=3D0->W=3D1 > or > NX=3D1->NX=3D0 >=20 > right? I tend to call that "relaxing" permissions. I specifically talk about NX=3D1>NX=3D0. I can change the language to =E2=80=9Crelaxing=E2=80=9D. >=20 > Currently, X86_PF_WRITE faults are considered an access error unless = the > VMA to which the write occurred allows writes. Returning "no access > error" permits continuing and handling the copy-on-write. >=20 > It sounds like you want to expand that. You want to add a whole class > of new faults that can be ignored: not just that some COW handling = might > be necessary, but that the PTE itself might be out of date. Just = like > a "COW fault" may just result in setting the PTE.W=3D1 and moving on = with > our day, an instruction fault might now just end up with setting > PTE.NX=3D0 and also moving on with our day. You raise an interesting idea (which can easily be implemented with = uffd), but no - I had none of that in my mind. My only purpose is to deal with actual spurious page-faults that I encountered when I removed the TLB flush the happens after NX=3D1->NX=3D0.= I am actually surprised that the kernel makes such a strong assumption that every change of NX=3D1->NX=3D0 would be followed by a TLB flush, = and that during these changes the mm is locked for write. But that is the case. If you do not have this change and a PTE is changed from NX=3D1->NX=3D0 and *later* you access the page, you can have a = page-fault due to stale PTE, and get a SIGSEGV since access_error() is wrong to assume that this is an invalid access. I did not change and there are no changes to the VMA during the page-fault. The page-fault handler would do pretty much nothing and return to user-space which would retry the instruction. [ page-fault triggers an implicit TLB flush of the offending PTE ] >=20 > I'm really confused why the "error_code & X86_PF_WRITE" case is = getting > modified. I would have expected it to be something like just adding: >=20 > /* read, instruction fetch */ > if (error_code & X86_PF_INSN) { > /* Avoid enforcing access error if spurious: */ > if (unlikely(!(vma->vm_flags & VM_EXEC))) > return 1; > return 0; > } >=20 > I'm really confused what X86_PF_WRITE and X86_PF_INSN have in common > other than both being able to (now) be generated spuriously. That was my first version, but I was concerned that perhaps there is some strange scenario in which both X86_PF_WRITE and X86_PF_INSN can be set. That is the reason that Peter asked you whether this is something that might happen. If you confirm they cannot be both set, I would the version you just mentioned.