Subject: Re: [PATCH bpf 2/3] x86/mm: Disallow vsyscall page read for copy_from_kernel_nofault()
To: Sohil Mehta, x86@kernel.org, bpf@vger.kernel.org
Cc: Dave Hansen, Andy Lutomirski, Peter Zijlstra, Thomas Gleixner, Ingo Molnar, Borislav Petkov, "H. Peter Anvin", linux-kernel@vger.kernel.org, xingwei lee, Jann Horn, houtao1@huawei.com
References: <20240119073019.1528573-1-houtao@huaweicloud.com> <20240119073019.1528573-3-houtao@huaweicloud.com>
From: Hou Tao
Message-ID: <6f1aa71b-13f3-0972-3cb0-62f431de7e48@huaweicloud.com>
Date: Thu, 25 Jan 2024 15:18:03 +0800

Hi,

On 1/23/2024 8:18 AM, Sohil Mehta wrote:
> On 1/18/2024 11:30 PM, Hou Tao wrote:
>> From: Hou Tao
>>
>> When trying to use copy_from_kernel_nofault() to read vsyscall page
>> through a bpf program, the following oops was reported:
>>
>>   BUG: unable to handle page fault for address: ffffffffff600000
>>   #PF: supervisor read access in kernel mode
>>   #PF: error_code(0x0000) - not-present page
>>   PGD 3231067 P4D 3231067 PUD 3233067 PMD 3235067 PTE 0
>>   Oops: 0000 [#1] PREEMPT SMP PTI
>>   CPU: 1 PID: 20390 Comm: test_progs ...... 6.7.0+ #58
>>   Hardware name: QEMU Standard PC (i440FX + PIIX, 1996) ......
>>   RIP: 0010:copy_from_kernel_nofault+0x6f/0x110
>>   ......
>>   Call Trace:
>>
>>   ? copy_from_kernel_nofault+0x6f/0x110
>>   bpf_probe_read_kernel+0x1d/0x50
>>   bpf_prog_2061065e56845f08_do_probe_read+0x51/0x8d
>>   trace_call_bpf+0xc5/0x1c0
>>   perf_call_bpf_enter.isra.0+0x69/0xb0
>>   perf_syscall_enter+0x13e/0x200
>>   syscall_trace_enter+0x188/0x1c0
>>   do_syscall_64+0xb5/0xe0
>>   entry_SYSCALL_64_after_hwframe+0x6e/0x76
>>
>>   ......
>>   ---[ end trace 0000000000000000 ]---
>>
>> The oops happens as follows: A bpf program uses bpf_probe_read_kernel()
>> to read from vsyscall page, bpf_probe_read_kernel() invokes
>> copy_from_kernel_nofault() in turn and then invokes __get_user_asm(). A
>> page fault exception is triggered accordingly, but handle_page_fault()
>> considers the vsyscall page address as a userspace address instead of
>> a kernel space address, so the fix-up set-up by bpf isn't applied.
> This comment and the one in the code below seem contradictory and
> confusing. Do we want the vsyscall page address to be considered as a
> userspace address or not?

handle_page_fault() already treats the vsyscall page as a userspace
address, and this patch updates copy_from_kernel_nofault() to treat the
vsyscall page as a userspace address as well.

>
> IIUC, the issue here is that the vsyscall page (in xonly mode) is not
> really mapped and therefore running copy_from_kernel_nofault() on this
> address is incorrect. This patch fixes this by making
> copy_from_kernel_nofault() return an error for a vsyscall address.
>

Yes, but the issue may occur in the vsyscall=none case as well.
fault_in_kernel_space(), invoked by handle_page_fault(), returns false
for the vsyscall page, so the fault is routed to do_user_addr_fault().
When the SMAP feature is enabled, the access made by
copy_from_kernel_nofault() then triggers an oops due to the following
code snippet:

        if (unlikely(cpu_feature_enabled(X86_FEATURE_SMAP) &&
                     !(error_code & X86_PF_USER) &&
                     !(regs->flags & X86_EFLAGS_AC))) {
                /*
                 * No extable entry here.  This was a kernel access to an
                 * invalid pointer.  get_kernel_nofault() will not get here.
                 */
                page_fault_oops(regs, error_code, address);
                return;
        }

>> Because the exception happens in kernel space and page fault handling is
>> disabled, page_fault_oops() is invoked and an oops happens.
>>
>> Fix it by disallowing vsyscall page read for copy_from_kernel_nofault().
>>
> [Maybe I have misunderstood the issue here and following questions are
> not even relevant.]
>
> But, what about vsyscall=emulate? In that mode the page is actually
> mapped. Would we want the page read to go through then?

Er, since the vsyscall page is now considered a userspace address, I
think we should reject reads of it through copy_from_kernel_nofault()
even if it is mapped.

>
>> Originally-from: Thomas Gleixner
> Documentation/process/maintainer-tip.rst says to use "Originally-by:"

Thanks for the tip. Will update.
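
To make the mismatch discussed above concrete, here is a minimal
user-space sketch (not kernel code) of the two address checks involved.
The constants assume x86-64 with 4-level paging. nofault_allowed() and
its with_patch flag are hypothetical names used only in this model, and
the canonical-address and early-boot handling of the real
copy_from_kernel_nofault_allowed() is left out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed values for x86-64 with 4-level paging (model only). */
#define PAGE_SIZE      UINT64_C(4096)
#define PAGE_MASK      (~(PAGE_SIZE - 1))
#define TASK_SIZE_MAX  ((UINT64_C(1) << 47) - PAGE_SIZE)
#define VSYSCALL_ADDR  UINT64_C(0xffffffffff600000)

/* True when vaddr lies inside the single vsyscall page. */
bool is_vsyscall_vaddr(uint64_t vaddr)
{
	return (vaddr & PAGE_MASK) == VSYSCALL_ADDR;
}

/*
 * Model of the page-fault side: the vsyscall page sits above
 * TASK_SIZE_MAX but is not treated as a kernel address.
 */
bool fault_in_kernel_space(uint64_t address)
{
	if (is_vsyscall_vaddr(address))
		return false;
	return address >= TASK_SIZE_MAX;
}

/*
 * Hypothetical model of the access check. with_patch selects whether
 * the vsyscall rejection proposed in this series is applied.
 */
bool nofault_allowed(uint64_t vaddr, bool with_patch)
{
	/* Reject userspace addresses plus one guard page. */
	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
		return false;
	/* With the patch, the vsyscall page counts as userspace too. */
	if (with_patch && is_vsyscall_vaddr(vaddr))
		return false;
	return true;
}

/*
 * True when the two sides disagree: the read is allowed as a kernel
 * access, yet a fault on it would be handled as a user-address fault.
 */
bool checks_disagree(uint64_t vaddr, bool with_patch)
{
	return nofault_allowed(vaddr, with_patch) &&
	       !fault_in_kernel_space(vaddr);
}
```

In this model checks_disagree(VSYSCALL_ADDR, false) is true, which is
the situation that leads to the oops, while with with_patch set the
read is rejected up front regardless of the vsyscall= mode.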
>
>
>> diff --git a/arch/x86/mm/maccess.c b/arch/x86/mm/maccess.c
>> index 6993f026adec9..bb454e0abbfcf 100644
>> --- a/arch/x86/mm/maccess.c
>> +++ b/arch/x86/mm/maccess.c
>> @@ -3,6 +3,8 @@
>>  #include
>>  #include
>>
>> +#include "mm_internal.h"
>> +
>>  #ifdef CONFIG_X86_64
>>  bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>>  {
>> @@ -15,6 +17,10 @@ bool copy_from_kernel_nofault_allowed(const void *unsafe_src, size_t size)
>>  	if (vaddr < TASK_SIZE_MAX + PAGE_SIZE)
>>  		return false;
>>
>> +	/* vsyscall page is also considered as userspace address. */
> A bit more explanation about why this should happen might be useful.
>
>> +	if (is_vsyscall_vaddr(vaddr))
>> +		return false;
>> +
>>  	/*
>>  	 * Allow everything during early boot before 'x86_virt_bits'
>>  	 * is initialized. Needed for instruction decoding in early