Received: by 2002:a05:6a10:8395:0:0:0:0 with SMTP id n21csp639504pxh; Wed, 10 Nov 2021 07:15:13 -0800 (PST) X-Google-Smtp-Source: ABdhPJxW1Jju5LoPAyY/58/jAlQT0s5Gsjs4me9Gs4DHiFjH6JQva+MhLzI3uMWY4nxwxBsG/Gn9 X-Received: by 2002:a05:6a00:1946:b0:492:64f1:61b5 with SMTP id s6-20020a056a00194600b0049264f161b5mr407638pfk.52.1636557312837; Wed, 10 Nov 2021 07:15:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1636557312; cv=none; d=google.com; s=arc-20160816; b=EJ5riqhCuG0pa2cBShlSLCtH5QTSIlHAiGSOBcqksgVQIVWZNwerydDVCDm0EIwq3W ceZ/HRQYF8H9MAFm9agagyI7517nAU10sCmRs9OE6+KEEeiPURW2iK8w0EsRC8FdMDSU 1ASL+Irq/73m/bxk1cgfEkGim6TmfyHd5nKyN7lOYvZ9GDhBfSCgNaKcDo0FMf0k1FVm lxtX/enUkNEmjF77BxdU0BvirtImEcc1o6MQ3NMaPVDLbATKmo0dAjlq6nvf2dFSJat5 YovY+6JwNYAGHd5MXgBVq6nKKeBIvQBN13uz1SC9xVXMc1t5jUxuTxr+ERE/pCFSMW85 szBQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:mime-version:user-agent :content-transfer-encoding:references:in-reply-to:date:cc:to:from :subject:message-id:dkim-signature; bh=e9tnSW3iIAV4FoDSSI0tVaZqg1Wa20jPHGx0DLL4Krs=; b=mOHsUwjuucUlatKYvmgTcKeBle5fPzgQXBZyRmCm6Q54Xx8uZJP9bLKq4grNtLXOQz IYa2iZuPby7AZxOC9nbQv064P5gesPx9W00ogPUqZn+kz7qL02ZHoVBU9qtcIXt8SXEz gvnF7CE3bf6SSqoOmvHyUIWsbVh5gQGKZvDnRrSRSpaW9XsPqjU5yVOp01WYGaVXiAS5 OS4XtxyaqSCoyYqh7z/Gpx8Y1JHSxhOutBVJWkijckNB1uIsX3TGzHAhQwjgJuuks/cb g9zKev8RXMew+KnBbq6peWToGqQbBjXk/pe6Hd28+C/0/zDxI8aEt82uZkUbcwAswyQo HGVQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=MEXne+JQ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [23.128.96.18]) by mx.google.com with ESMTP id s8si46490pgs.318.2021.11.10.07.14.54; Wed, 10 Nov 2021 07:15:12 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) client-ip=23.128.96.18; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=k20201202 header.b=MEXne+JQ; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.18 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S232338AbhKJPOl (ORCPT + 99 others); Wed, 10 Nov 2021 10:14:41 -0500 Received: from mail.kernel.org ([198.145.29.99]:50418 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S232062AbhKJPOk (ORCPT ); Wed, 10 Nov 2021 10:14:40 -0500 Received: by mail.kernel.org (Postfix) with ESMTPSA id C8FD2610FF; Wed, 10 Nov 2021 15:11:52 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=k20201202; t=1636557113; bh=e9tnSW3iIAV4FoDSSI0tVaZqg1Wa20jPHGx0DLL4Krs=; h=Subject:From:To:Cc:Date:In-Reply-To:References:From; b=MEXne+JQRwVTeUO4JwxM09cK9a2fJEhawivcF/clegnPbuC4O8tcIbDnSuawh0umX 9n0RcoE9DwYYU9LKNgMzUwobIgtwE1dbI9KoMNzeWdwxTtbjPYceUIWar746qH/aXN GRC/foRb49jBvz4zEsnP8XPF8Yy/T4+ELUVJYdamKT/kVQGrbodv4aUybv7ui2gP13 u6hDPDAxOrag7qAl3oN5pggvNurs5Lf7I3qHii6Wbn6wYUpIWfnmdMR0oRDRXlM64O j/BBr05SlVPgjZNS8L9nwaEgMxq3E/9ckbV77kp1fSx7RcpUWGHS7yZabDoQmLb/oW h0jw721o1CBYQ== Message-ID: <8e0bb87f05b79317a06ed2d8ab5e2f5cf6132b6a.camel@kernel.org> Subject: Re: [PATCH V2] x86/sgx: Fix free page accounting From: Jarkko Sakkinen To: Reinette Chatre , dave.hansen@linux.intel.com, tglx@linutronix.de, bp@alien8.de, mingo@redhat.com, linux-sgx@vger.kernel.org, x86@kernel.org Cc: seanjc@google.com, tony.luck@intel.com, hpa@zytor.com, linux-kernel@vger.kernel.org, stable@vger.kernel.org Date: Wed, 10 Nov 2021 17:11:50 +0200 In-Reply-To: References: Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable User-Agent: Evolution 3.40.4-1 MIME-Version: 1.0 Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, 2021-11-09 at 12:00 -0800, Reinette Chatre wrote: > The SGX driver maintains a single global free page counter, > sgx_nr_free_pages, that reflects the number of free pages available > across all NUMA nodes. Correspondingly, a list of free pages is > associated with each NUMA node and sgx_nr_free_pages is updated > every time a page is added or removed from any of the free page > lists. The main usage of sgx_nr_free_pages is by the reclaimer > that will run when it (sgx_nr_free_pages) goes below a watermark > to ensure that there are always some free pages available to, for > example, support efficient page faults. >=20 > With sgx_nr_free_pages accessed and modified from a few places > it is essential to ensure that these accesses are done safely but > this is not the case. sgx_nr_free_pages is read without any > protection and updated with inconsistent protection by any one > of the spin locks associated with the individual NUMA nodes. > For example: >=20 > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 CPU_A=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 CPU_B > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 -----=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0 ----- > =C2=A0spin_lock(&nodeA->lock);=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_lock(&nodeB->lock); > =C2=A0...=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ... > =C2=A0sgx_nr_free_pages--;=C2=A0 /* NOT SAFE */=C2=A0 sgx_nr_free_pages--= ; >=20 > =C2=A0spin_unlock(&nodeA->lock);=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2= =A0=C2=A0=C2=A0=C2=A0=C2=A0 spin_unlock(&nodeB->lock); I don't understand the "NOT SAFE" part here. It is safe to access the variable, even when it is not atomic... I don't understand what the sequence above should tell me. > The consequence of sgx_nr_free_pages not being protected is that > its value may not accurately reflect the actual number of free > pages on the system, impacting the availability of free pages in > support of many flows. The problematic scenario isu when the In non-atomicity is not a problem, when it is not a problem :-) > reclaimer does not run because it believes there to be sufficient > free pages while any attempt to allocate a page fails because there > are no free pages available. >=20 > The worst scenario observed was a user space hang because of > repeated page faults caused by no free pages made available. >=20 > The following flow was encountered: > asm_exc_page_fault > =C2=A0... > =C2=A0=C2=A0 sgx_vma_fault() > =C2=A0=C2=A0=C2=A0=C2=A0 sgx_encl_load_page() > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 sgx_encl_eldu() // Encrypted page ne= eds to be loaded from backing > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 // storage int= o newly allocated SGX memory page > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 sgx_alloc_epc_page() // = Allocate a page of SGX memory > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 __sgx_alloc_= epc_page() // Fails, no free SGX memory > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 ... > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 if (sgx_shou= ld_reclaim(SGX_NR_LOW_PAGES)) // Wake reclaimer > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 = wake_up(&ksgxd_waitq); > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return -EBUS= Y; // Return -EBUSY giving reclaimer time to run > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return -EBUSY; > =C2=A0=C2=A0=C2=A0=C2=A0 return -EBUSY; > =C2=A0=C2=A0 return VM_FAULT_NOPAGE; >=20 > The reclaimer is triggered in above flow with the following code: >=20 > static bool sgx_should_reclaim(unsigned long watermark) > { > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0 return sgx_nr_free_pages < wat= ermark && > =C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0=C2=A0= =C2=A0=C2=A0 !list_empty(&sgx_active_page_list); > } >=20 > In the problematic scenario there were no free pages available yet the > value of sgx_nr_free_pages was above the watermark. The allocation of > SGX memory thus always failed because of a lack of free pages while no > free pages were made available because the reclaimer is never started > because of sgx_nr_free_pages' incorrect value. The consequence was that > user space kept encountering VM_FAULT_NOPAGE that caused the same > address to be accessed repeatedly with the same result. That causes sgx_should_reclaim() executed to be multiple times as the fault is retried. Eventually it should be successful. > Change the global free page counter to an atomic type that > ensures simultaneous updates are done safely. While doing so, move > the updating of the variable outside of the spin lock critical > section to which it does not belong. >=20 > Cc: stable@vger.kernel.org > Fixes: 901ddbb9ecf5 ("x86/sgx: Add a basic NUMA allocation scheme to sgx_= alloc_epc_page()") > Suggested-by: Dave Hansen > Reviewed-by: Tony Luck > Signed-off-by: Reinette Chatre I'm not yet sure if this a bug, even if it'd be a improvement to the performance. /Jarkko