Received: by 2002:a05:7412:8521:b0:e2:908c:2ebd with SMTP id t33csp2509046rdf; Mon, 6 Nov 2023 17:17:12 -0800 (PST) X-Google-Smtp-Source: AGHT+IFcXGYsogW8Y5a+wktjoUrRIOEroqTG5RNJ3kOkCMUtAwcnL/QhFzyFBieA/rIX84HG+FTv X-Received: by 2002:a17:902:efc2:b0:1cc:d510:85fb with SMTP id ja2-20020a170902efc200b001ccd51085fbmr3861483plb.36.1699319832044; Mon, 06 Nov 2023 17:17:12 -0800 (PST) ARC-Seal: i=1; a=rsa-sha256; t=1699319832; cv=none; d=google.com; s=arc-20160816; b=Yy0L6vNGpBp2tmci/ma333OMlXo6gweDtLLTvWWcG/w3b09SwEytt0x/apIZnrylzz wjHEsHQSGfm+dm09dxmIhaLM2u+swQ5SPOJ5NCq02m/qh7nkg0QVxISPVeGarjzS/a6n axmzxt70fvMx+6CWp1ivZCW+9j00KU7lIzekLz8ONNk6t9a1O4e/Nbqds5EQeEN4+bhw sgj77xJCO4ADD7oD7TbCNxsaVPFiAxSzMNIfEiy2KbKTahnwnaHmdtm8zY1wlxpVvJOW XPSG5sSkCE/trUy8QJJmXY4DSxaF5Ls/9+AZoKrnhagoExqVY6QXuO4UAeUGBZit34Eb MHgw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:user-agent:in-reply-to:message-id:organization :from:content-transfer-encoding:mime-version:date:references:subject :cc:to:dkim-signature; bh=SxUQH3LH/SQBwlNxRgwdQliEbqfxSf1i5SBbMGVgV4M=; fh=Qgi0PiuOzrejhPFZ/RlBOTHdjTFwDOb4pxAqqGSiFmQ=; b=k8EDW5fYa+WKo9nj7o1jYvae96c9XBL8dKTJQ51L0mEQyu84O0mgEW6mWXTCDAZVAI KSFM3Z/A1ZE2vscPE7jgOhAYRM9WxqmJB2F6+UNUmQg/BEBhRrcMSJ0nr6LIRpXtOUf1 gC79k41fvoMMQ7g9OzfB+6WsI6v63f3VYie5LLBrDyy0vZz6TmlmODPTFZnmj0vBhR8c 92awnOCtl5WRN3qeTF0hUvoc9+5NeO90DMd5o+LOeF1KHK826jMZq5MVMuxD4YAy3BVI WNWeYO8ik3io3BS6vi696sDaOxPktPKu+78FG3Td3q7xaoiODGBsI+/r7bpETdlatqOG nurQ== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=NSoiFFUv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Return-Path: Received: from agentk.vger.email (agentk.vger.email. [23.128.96.32]) by mx.google.com with ESMTPS id a16-20020a170902ecd000b001c752577582si10130792plh.359.2023.11.06.17.17.10 (version=TLS1_3 cipher=TLS_AES_256_GCM_SHA384 bits=256/256); Mon, 06 Nov 2023 17:17:12 -0800 (PST) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) client-ip=23.128.96.32; Authentication-Results: mx.google.com; dkim=pass header.i=@intel.com header.s=Intel header.b=NSoiFFUv; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 23.128.96.32 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=intel.com Received: from out1.vger.email (depot.vger.email [IPv6:2620:137:e000::3:0]) by agentk.vger.email (Postfix) with ESMTP id 63581802EE66; Mon, 6 Nov 2023 17:17:07 -0800 (PST) X-Virus-Status: Clean X-Virus-Scanned: clamav-milter 0.103.10 at agentk.vger.email Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S233342AbjKGBQk (ORCPT + 99 others); Mon, 6 Nov 2023 20:16:40 -0500 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:46390 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S233338AbjKGBQi (ORCPT ); Mon, 6 Nov 2023 20:16:38 -0500 Received: from mgamail.intel.com (mgamail.intel.com [198.175.65.10]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id F10801BF; Mon, 6 Nov 2023 17:16:34 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=intel.com; i=@intel.com; q=dns/txt; s=Intel; t=1699319796; x=1730855796; h=to:cc:subject:references:date:mime-version: content-transfer-encoding:from:message-id:in-reply-to; bh=MDXh+A4zL9HW9l/AyyMH+wCbad2DITke0bugqK13VVk=; b=NSoiFFUvCAdml96wsN3aWb16whu4SwsK5kb6k9249dQMltZSVGyiLNmU TmZHG17933YkFT1HmBpXjZyMbmNcjHnAP28y5QuXYkNrsSszMzFuNXyJX BRwya3YJxEMp3d+UtzS1ZINPROkNtqTCfEldoh7zVEVa2QSqdMJJ+eY/j 1kQRj06mGiVqbciZGjxHBEjDqiGQ6lzY3pgW7EGj3qd/4ycKZMigUaDh6 yUFhKmLy9eWz58ijb/RLmDV4Jxz6yAVokn0q9dwdsCL7Lt4zEK6ngL101 9gW+HC+p9xnDmfmTOMvytCdGPaFKrGxVrwfroDQmn4rlLgAlk22pKX3ga Q==; X-IronPort-AV: E=McAfee;i="6600,9927,10886"; a="2402898" X-IronPort-AV: E=Sophos;i="6.03,282,1694761200"; d="scan'208";a="2402898" Received: from orsmga002.jf.intel.com ([10.7.209.21]) by orvoesa102.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-GCM-SHA384; 06 Nov 2023 17:16:35 -0800 X-ExtLoop1: 1 X-IronPort-AV: E=McAfee;i="6600,9927,10886"; a="762495599" X-IronPort-AV: E=Sophos;i="6.03,282,1694761200"; d="scan'208";a="762495599" Received: from hhuan26-mobl.amr.corp.intel.com ([10.93.50.175]) by orsmga002-auth.jf.intel.com with ESMTP/TLS/ECDHE-RSA-AES256-SHA; 06 Nov 2023 17:16:31 -0800 Content-Type: text/plain; charset=iso-8859-15; format=flowed; delsp=yes To: "mingo@redhat.com" , "jarkko@kernel.org" , "x86@kernel.org" , "dave.hansen@linux.intel.com" , "cgroups@vger.kernel.org" , "hpa@zytor.com" , "linux-kernel@vger.kernel.org" , "linux-sgx@vger.kernel.org" , "mkoutny@suse.com" , "tglx@linutronix.de" , "tj@kernel.org" , "Mehta, Sohil" , "bp@alien8.de" , "Huang, Kai" Cc: "mikko.ylinen@linux.intel.com" , "Christopherson,, Sean" , "Zhang, Bo" , "kristen@linux.intel.com" , "anakrish@microsoft.com" , "sean.j.christopherson@intel.com" , "Li, Zhiquan1" , "yangjie@microsoft.com" Subject: Re: [PATCH v6 04/12] x86/sgx: Implement basic EPC misc cgroup functionality References: <20231030182013.40086-1-haitao.huang@linux.intel.com> <20231030182013.40086-5-haitao.huang@linux.intel.com> <34a337b96a5a917612c4ec4eff2b5a378c21879b.camel@intel.com> Date: Mon, 06 Nov 2023 19:16:30 -0600 MIME-Version: 1.0 Content-Transfer-Encoding: 7bit From: "Haitao Huang" Organization: Intel Message-ID: In-Reply-To: <34a337b96a5a917612c4ec4eff2b5a378c21879b.camel@intel.com> User-Agent: Opera Mail/1.0 (Win32) X-Spam-Status: No, score=-0.8 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,HEADER_FROM_DIFFERENT_DOMAINS,MAILING_LIST_MULTI, SPF_HELO_NONE,SPF_PASS,T_SCC_BODY_TEXT_LINE autolearn=unavailable autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on agentk.vger.email Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org X-Greylist: Sender passed SPF test, not delayed by milter-greylist-4.6.4 (agentk.vger.email [0.0.0.0]); Mon, 06 Nov 2023 17:17:07 -0800 (PST) On Mon, 06 Nov 2023 16:18:30 -0600, Huang, Kai wrote: >> >> > > +/** >> > > + * sgx_epc_cgroup_try_charge() - hierarchically try to charge a >> single >> > > EPC page >> > > + * >> > > + * Returns EPC cgroup or NULL on success, -errno on failure. >> > > + */ >> > > +struct sgx_epc_cgroup *sgx_epc_cgroup_try_charge(void) >> > > +{ >> > > + struct sgx_epc_cgroup *epc_cg; >> > > + int ret; >> > > + >> > > + if (sgx_epc_cgroup_disabled()) >> > > + return NULL; >> > > + >> > > + epc_cg = sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); >> > > + ret = misc_cg_try_charge(MISC_CG_RES_SGX_EPC, epc_cg->cg, >> PAGE_SIZE); >> > > + >> > > + if (!ret) { >> > > + /* No epc_cg returned, release ref from get_current_misc_cg() */ >> > > + put_misc_cg(epc_cg->cg); >> > > + return ERR_PTR(-ENOMEM); >> > >> > misc_cg_try_charge() returns 0 when successfully charged, no? >> >> Right. I really made some mess in rebasing :-( >> >> > >> > > + } >> > > + >> > > + /* Ref released in sgx_epc_cgroup_uncharge() */ >> > > + return epc_cg; >> > > +} >> > >> > IMHO the above _try_charge() returning a pointer of EPC cgroup is a >> > little bit >> > odd, because it doesn't match the existing misc_cg_try_charge() which >> > returns >> > whether the charge is successful or not. sev_misc_cg_try_charge() >> > matches >> > misc_cg_try_charge() too. >> > >> > I think it's better to split "getting EPC cgroup" part out as a >> separate >> > helper, >> > and make this _try_charge() match existing pattern: >> > >> > struct sgx_epc_cgroup *sgx_get_current_epc_cg(void) >> > { >> > if (sgx_epc_cgroup_disabled()) >> > return NULL; >> > >> > return sgx_epc_cgroup_from_misc_cg(get_current_misc_cg()); >> > } >> > >> > int sgx_epc_cgroup_try_charge(struct sgx_epc_cgroup *epc_cg) >> > { >> > if (!epc_cg) >> > return -EINVAL; >> > >> > return misc_cg_try_charge(epc_cg->cg); >> > } >> > >> > Having sgx_get_current_epc_cg() also makes the caller easier to read, >> > because we >> > can immediately know we are going to charge the *current* EPC cgroup, >> > but not >> > some cgroup hidden within sgx_epc_cgroup_try_charge(). >> > >> >> Actually, unlike other misc controllers, we need charge and get the >> epc_cg >> reference at the same time. > > Can you elaborate? > > And in practice you always call sgx_epc_cgroup_try_charge() right after > sgx_get_current_epc_cg() anyway. The only difference is the whole thing > is done > in one function or in separate functions. > > [...] > That's true. I was thinking no need to have them done in separate calls. The caller has to check the return value for epc_cg instance first, then check result of try_charge. But there is really only one caller, sgx_alloc_epc_page() below, so I don't have strong opinions now. With them separate, the checks will look like this: if (epc_cg = sgx_get_current_epc_cg()) // NULL means cgroup disabled, should continue for allocation { if (ret = sgx_epc_cgroup_try_charge()) return ret } // continue... I can go either way. > >> > > struct sgx_epc_page *sgx_alloc_epc_page(void *owner, bool reclaim) >> > > { >> > > struct sgx_epc_page *page; >> > > + struct sgx_epc_cgroup *epc_cg; >> > > + >> > > + epc_cg = sgx_epc_cgroup_try_charge(); >> > > + if (IS_ERR(epc_cg)) >> > > + return ERR_CAST(epc_cg); >> > > >> > > for ( ; ; ) { >> > > page = __sgx_alloc_epc_page(); >> > > @@ -580,10 +587,21 @@ struct sgx_epc_page *sgx_alloc_epc_page(void >> > > *owner, bool reclaim) >> > > break; >> > > } >> > > >> > > + /* >> > > + * Need to do a global reclamation if cgroup was not full but >> free >> > > + * physical pages run out, causing __sgx_alloc_epc_page() to >> fail. >> > > + */ >> > > sgx_reclaim_pages(); >> > >> > What's the final behaviour? IIUC it should be reclaiming from the >> > *current* EPC >> > cgroup? If so shouldn't we just pass the @epc_cg to it here? >> > >> > I think we can make this patch as "structure" patch w/o actually >> having >> > EPC >> > cgroup enabled, i.e., sgx_get_current_epc_cg() always return NULL. >> > >> > And we can have one patch to change sgx_reclaim_pages() to take the >> > 'struct >> > sgx_epc_lru_list *' as argument: >> > >> > void sgx_reclaim_pages_lru(struct sgx_epc_lru_list * lru) >> > { >> > ... >> > } >> > >> > Then here we can have something like: >> > >> > void sgx_reclaim_pages(struct sgx_epc_cg *epc_cg) >> > { >> > struct sgx_epc_lru_list *lru = epc_cg ? &epc_cg->lru : >> > &sgx_global_lru; >> > >> > sgx_reclaim_pages_lru(lru); >> > } >> > >> > Makes sense? >> > >> >> This is purely global reclamation. No cgroup involved. > > Again why? Here you are allocating one EPC page for enclave in a > particular EPC > cgroup. When that fails, shouldn't you try only to reclaim from the > *current* > EPC cgroup? Or at least you should try to reclaim from the *current* > EPC cgroup > first? > Later sgx_epc_cg_try_charge will take a 'reclaim' flag, if true, cgroup reclaims synchronously, otherwise in background and returns -EBUSY in that case. This function also returns if no valid epc_cg pointer returned. All reclamation for *current* cgroup is done in sgx_epc_cg_try_charge(). So, by reaching to this point, a valid epc_cg pointer was returned, that means allocation is allowed for the cgroup (it has reclaimed if necessary, and its usage is not above limit after charging). But the system level free count may be low (e.g., limits of all cgroups may add up to be more than capacity). so we need to do a global reclamation here, which may involve reclaiming a few pages (from current or other groups) so the system can be at a performant state with minimal free count. (current behavior of ksgxd). Note this patch does not do per-cgroup reclamation. If we had stopped with this patch without next patches, cgroups will only block allocation by returning invalid epc_cg pointer (-ENOMEM) from sgx_epc_cg_try_charge(). Reclamation only happens when cgroup is not full but system level free count is below threshold. >> You can see it >> later in changes in patch 10/12. For now I just make a comment there but >> no real changes. Cgroup reclamation will be done as part of _try_charge >> call. >> >> > > cond_resched(); >> > > } >> > > >> > > + if (!IS_ERR(page)) { >> > > + WARN_ON_ONCE(page->epc_cg); >> > > + page->epc_cg = epc_cg; >> > > + } else { >> > > + sgx_epc_cgroup_uncharge(epc_cg); >> > > + } >> > > + >> > > if (sgx_should_reclaim(SGX_NR_LOW_PAGES)) >> > > wake_up(&ksgxd_waitq); >> > > >> > > @@ -604,6 +622,11 @@ void sgx_free_epc_page(struct sgx_epc_page >> *page) >> > > struct sgx_epc_section *section = >> &sgx_epc_sections[page->section]; >> > > struct sgx_numa_node *node = section->node; >> > > >> > > + if (page->epc_cg) { >> > > + sgx_epc_cgroup_uncharge(page->epc_cg); >> > > + page->epc_cg = NULL; >> > > + } >> > > + >> > > spin_lock(&node->lock); >> > > >> > > page->owner = NULL; >> > > @@ -643,6 +666,7 @@ static bool __init sgx_setup_epc_section(u64 >> > > phys_addr, u64 size, >> > > section->pages[i].flags = 0; >> > > section->pages[i].owner = NULL; >> > > section->pages[i].poison = 0; >> > > + section->pages[i].epc_cg = NULL; >> > > list_add_tail(§ion->pages[i].list, &sgx_dirty_page_list); >> > > } >> > > >> > > @@ -787,6 +811,7 @@ static void __init >> arch_update_sysfs_visibility(int >> > > nid) {} >> > > static bool __init sgx_page_cache_init(void) >> > > { >> > > u32 eax, ebx, ecx, edx, type; >> > > + u64 capacity = 0; >> > > u64 pa, size; >> > > int nid; >> > > int i; >> > > @@ -837,6 +862,7 @@ static bool __init sgx_page_cache_init(void) >> > > >> > > sgx_epc_sections[i].node = &sgx_numa_nodes[nid]; >> > > sgx_numa_nodes[nid].size += size; >> > > + capacity += size; >> > > >> > > sgx_nr_epc_sections++; >> > > } >> > > @@ -846,6 +872,8 @@ static bool __init sgx_page_cache_init(void) >> > > return false; >> > > } >> > > >> > > + misc_cg_set_capacity(MISC_CG_RES_SGX_EPC, capacity); > > Hmm.. I think this is why MISC_CG_RES_SGX_EPC is needed when > !CONFIG_CGROUP_SGX_EPC. right, that was it :-) > >> > > + >> > > return true; >> > > } >> > >> > I would separate setting up capacity as a separate patch. >> >> I thought about that, but again it was only 3-4 lines all in this >> function >> and it's also necessary part of basic setup for misc controller... > > Fine. Anyway it depends on what things you want to do on this patch. > It's fine > to include the capacity if this patch is some "structure" patch that > shows the > high level flow of how EPC cgroup works. Ok, I'll keep this way for now unless any objections. Thanks Haitao