To: "tj@kernel.org", "jarkko@kernel.org", "x86@kernel.org",
 "dave.hansen@linux.intel.com", "cgroups@vger.kernel.org", "hpa@zytor.com",
 "mingo@redhat.com", "linux-kernel@vger.kernel.org", "mkoutny@suse.com",
 "tglx@linutronix.de", "linux-sgx@vger.kernel.org", "Mehta, Sohil",
 "bp@alien8.de", "Huang, Kai"
Cc: "mikko.ylinen@linux.intel.com", "seanjc@google.com", "Zhang, Bo",
 "kristen@linux.intel.com", "yangjie@microsoft.com",
 "sean.j.christopherson@intel.com", "Li, Zhiquan1", "anakrish@microsoft.com"
Subject: Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality
References: <20231030182013.40086-1-haitao.huang@linux.intel.com>
 <20231030182013.40086-5-haitao.huang@linux.intel.com>
 <34a337b96a5a917612c4ec4eff2b5a378c21879b.camel@intel.com>
Date: Mon, 27 Nov 2023 00:01:56 +0800
From: "Haitao Huang"
Organization: Intel

On Mon, 20 Nov 2023 11:16:42 +0800, Huang, Kai wrote:

>> > >
>> >
>> > That's true. I was thinking there is no need to have them done in
>> > separate calls. The caller has to check the return value for the
>> > epc_cg instance first, then check the result of try_charge. But there
>> > is really only one caller, sgx_alloc_epc_page() below, so I don't have
>> > strong opinions now.
>> >
>> > With them separate, the checks will look like this:
>> >
>> >	if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled,
>> >						// should continue for allocation
>> >	{
>> >		if (ret = sgx_epc_cgroup_try_charge())
>> >			return ret;
>> >	}
>> >	// continue...
>> >
>> > I can go either way.
>
> Let's keep this aligned with other _try_charge() variants: return 'int' to
> indicate whether the charge is done or not.
>

Fine with me if no objections from the maintainers.

>> >
>> > >
>> > > > > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>> > > > > > {
>> > > > > > 	struct sgx_epc_page *page;
>> > > > > > +	struct sgx_epc_cgroup *epc_cg;
>> > > > > > +
>> > > > > > +	epc_cg = sgx_epc_cgroup_try_charge();
>> > > > > > +	if (IS_ERR(epc_cg))
>> > > > > > +		return ERR_CAST(epc_cg);
>> > > > > >
>> > > > > > 	for ( ; ; ) {
>> > > > > > 		page = __sgx_alloc_epc_page();
>> > > > > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
>> > > > > > 			break;
>> > > > > > 		}
>> > > > > >
>> > > > > > +		/*
>> > > > > > +		 * Need to do a global reclamation if cgroup was not full but
>> > > > > > +		 * free physical pages run out, causing __sgx_alloc_epc_page()
>> > > > > > +		 * to fail.
>> > > > > > +		 */
>> > > > > > 		sgx_reclaim_pages();
>> > > > >
>> > > > > What's the final behaviour? IIUC it should be reclaiming from the
>> > > > > *current* EPC cgroup? If so shouldn't we just pass the @epc_cg to
>> > > > > it here?
>> > > > >
>> > > > > I think we can make this patch a "structure" patch w/o actually
>> > > > > having EPC cgroup enabled, i.e., sgx_get_current_epc_cg() always
>> > > > > returns NULL.
>> > > > >
>> > > > > And we can have one patch to change sgx_reclaim_pages() to take a
>> > > > > 'struct sgx_epc_lru_list *' as argument:
>> > > > >
>> > > > > 	void sgx_reclaim_pages_lru(struct sgx_epc_lru_list *lru)
>> > > > > 	{
>> > > > > 		...
>> > > > > 	}
>> > > > >
>> > > > > Then here we can have something like:
>> > > > >
>> > > > > 	void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg)
>> > > > > 	{
>> > > > > 		struct sgx_epc_lru_list *lru =
>> > > > > 			epc_cg ? &epc_cg->lru : &sgx_global_lru;
>> > > > >
>> > > > > 		sgx_reclaim_pages_lru(lru);
>> > > > > 	}
>> > > > >
>> > > > > Makes sense?
>> > > > >

The reason we 'isolate' first and then do the real 'reclaim' is that the
actual reclaim is expensive, especially the EBLOCK, ETRACK, etc. steps.
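Roughly, the 'isolate first, then reclaim' split would look like the sketch
below. This is illustrative only: the helper names sgx_isolate_epc_pages()
and sgx_do_epc_reclamation() and the SGX_NR_TO_SCAN batch size are
placeholders for this discussion, not necessarily what the series ends up
with.

	static void sgx_reclaim_pages_lru(struct sgx_epc_lru_list *lru)
	{
		LIST_HEAD(iso);

		/*
		 * Cheap step: detach up to SGX_NR_TO_SCAN candidate pages
		 * from the LRU while holding its lock.
		 */
		sgx_isolate_epc_pages(lru, SGX_NR_TO_SCAN, &iso);

		/*
		 * Expensive step: EBLOCK/ETRACK/EWB plus backing-store I/O,
		 * done on the isolated list outside the LRU lock.
		 */
		sgx_do_epc_reclamation(&iso);
	}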
>> > > >
>> > > > This is purely global reclamation. No cgroup involved.
>> > >
>> > > Again why? Here you are allocating one EPC page for an enclave in a
>> > > particular EPC cgroup. When that fails, shouldn't you try only to
>> > > reclaim from the *current* EPC cgroup? Or at least you should try to
>> > > reclaim from the *current* EPC cgroup first?
>> > >
>> >
>> > Later sgx_epc_cg_try_charge() will take a 'reclaim' flag; if true, the
>> > cgroup reclaims synchronously, otherwise it reclaims in the background
>> > and returns -EBUSY in that case. This function also returns if no valid
>> > epc_cg pointer is returned.
>> >
>> > All reclamation for the *current* cgroup is done in
>> > sgx_epc_cg_try_charge().
>
> This is fine, but I believe my question above is about where to reclaim when
> "allocation" fails, not when "try charge" fails.
>

I meant "will be done" :-) Currently there is no reclaim in try_charge.

> And for "reclaim for current cgroup when charge fails", I don't think it's
> even necessary in this initial implementation of EPC cgroup. You can just
> fail the allocation when the charge fails (reaching the limit). Trying to
> reclaim when the limit is hit can be done later.
>

Yes. It is done later.

> Please see Dave and Michal's replies here:
>
> https://lore.kernel.org/lkml/7a1a5125-9da2-47b6-ba0f-cf24d84df16b@intel.com/#t
> https://lore.kernel.org/lkml/yz44wukoic3syy6s4fcrngagurkjhe2hzka6kvxbajdtro3fwu@zd2ilht7wcw3/
>
>> >
>> > So, by reaching this point, a valid epc_cg pointer was returned, which
>> > means allocation is allowed for the cgroup (it has reclaimed if
>> > necessary, and its usage is not above the limit after charging).
>
> I found memory cgroup uses different logic -- allocation first and then
> charge:
>
> For instance:
>
> 	static vm_fault_t do_anonymous_page(struct vm_fault *vmf)
> 	{
> 		......
>
> 		folio = vma_alloc_zeroed_movable_folio(vma, vmf->address);
> 		if (!folio)
> 			goto oom;
> 		if (mem_cgroup_charge(folio, vma->vm_mm, GFP_KERNEL))
> 			goto oom_free_page;
>
> 		......
> 	}
>
> Why does EPC need to "charge first" and "then allocate"?
>

EPC allocation can involve reclaiming, which is more expensive than regular
RAM reclamation. Also, the misc controller only has a max hard limit.

Thanks
Haitao

>> >
>> > But the system-level free count may be low (e.g., the limits of all
>> > cgroups may add up to more than the capacity), so we need to do a global
>> > reclamation here, which may involve reclaiming a few pages (from the
>> > current or other groups) so the system can stay in a performant state
>> > with a minimal free count (the current behavior of ksgxd).
>> >
>> I should have stuck to the original comment added in the code. Actually
>> __sgx_alloc_epc_page() can fail if the system runs out of EPC. That's the
>> real reason for the global reclaim.
>
> I don't see how this can work. With EPC cgroups, likely all EPC pages will
> go to the individual LRU of each cgroup, and the sgx_global_lru will
> basically be empty. How can you reclaim from the sgx_global_lru?

Currently, nothing is in the cgroup LRUs; all EPC pages are in the global
list.
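For reference, the charge-first flow being discussed looks roughly like the
following. This is a simplified sketch: error paths and the owner
bookkeeping are trimmed, the page->epc_cg field and
sgx_epc_cgroup_uncharge() name are assumptions based on this discussion, and
it is not the exact patch code.

	struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim)
	{
		struct sgx_epc_cgroup *epc_cg;
		struct sgx_epc_page *page;

		/* Charge the current cgroup before touching the free EPC pools. */
		epc_cg = sgx_epc_cgroup_try_charge();
		if (IS_ERR(epc_cg))
			return ERR_CAST(epc_cg);

		for ( ; ; ) {
			page = __sgx_alloc_epc_page();
			if (!IS_ERR(page)) {
				page->epc_cg = epc_cg;	/* assumed field, used to uncharge on free */
				break;
			}

			if (!reclaim || list_empty(&sgx_active_page_list)) {
				/* Cannot allocate: return the charge and bail out. */
				sgx_epc_cgroup_uncharge(epc_cg);
				return page;
			}

			/*
			 * The cgroup is not over its limit, but free physical EPC
			 * ran out: do a global reclamation and retry.
			 */
			sgx_reclaim_pages();
			cond_resched();
		}

		return page;
	}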