From: "Haitao Huang"
Organization: Intel
To: "Kai Huang", "Sean Christopherson"
Cc: "hpa@zytor.com", "linux-sgx@vger.kernel.org", "x86@kernel.org", "dave.hansen@linux.intel.com", "cgroups@vger.kernel.org", "bp@alien8.de", "linux-kernel@vger.kernel.org", "jarkko@kernel.org", "tglx@linutronix.de", "Sohil Mehta", "tj@kernel.org", "mingo@redhat.com", "kristen@linux.intel.com", "yangjie@microsoft.com", "Zhiquan1 Li", "mikko.ylinen@linux.intel.com", "Bo Zhang", "anakrish@microsoft.com"
Subject: Re: [PATCH v5 12/18] x86/sgx: Add EPC OOM path to forcefully reclaim EPC
Date: Mon, 09 Oct 2023 20:42:45 -0500
References: <20230923030657.16148-1-haitao.huang@linux.intel.com> <20230923030657.16148-13-haitao.huang@linux.intel.com> <1b265d0c9dfe17de2782962ed26a99cc9d330138.camel@intel.com>
Hi Sean,

On Mon, 09 Oct 2023 19:23:04 -0500, Sean Christopherson wrote:

> On Mon, Oct 09, 2023, Kai Huang wrote:
>> On Fri, 2023-09-22 at 20:06 -0700, Haitao Huang wrote:
>> > +/**
>> > + * sgx_epc_oom() - invoke EPC out-of-memory handling on target LRU
>> > + * @lru: LRU that is low
>> > + *
>> > + * Return: %true if a victim was found and kicked.
>> > + */
>> > +bool sgx_epc_oom(struct sgx_epc_lru_lists *lru)
>> > +{
>> > +	struct sgx_epc_page *victim;
>> > +
>> > +	spin_lock(&lru->lock);
>> > +	victim = sgx_oom_get_victim(lru);
>> > +	spin_unlock(&lru->lock);
>> > +
>> > +	if (!victim)
>> > +		return false;
>> > +
>> > +	if (victim->flags & SGX_EPC_OWNER_PAGE)
>> > +		return sgx_oom_encl_page(victim->encl_page);
>> > +
>> > +	if (victim->flags & SGX_EPC_OWNER_ENCL)
>> > +		return sgx_oom_encl(victim->encl);
>>
>> I hate to bring this up, at least at this stage, but I am wondering why
>> we need to put VA and SECS pages to the unreclaimable list, but cannot
>> keep an "enclave_list" instead?
>
> The motivation for tracking EPC pages instead of enclaves was so that
> the EPC OOM-killer could "kill" VMs as well as host-owned enclaves. The
> virtual EPC code didn't actually kill the VM process, it instead just
> freed all of the EPC pages and abused the SGX architecture to
> effectively make the guest recreate all its enclaves (IIRC, QEMU does
> the same thing to "support" live migration).
>
> Looks like y'all punted on that with:
>
>   The EPC pages allocated for KVM guests by the virtual EPC driver are
>   not reclaimable by the host kernel [5].
>   Therefore they are not tracked by any LRU lists for reclaiming
>   purposes in this implementation, but they are charged toward the
>   cgroup of the user process (e.g., QEMU) launching the guest. And
>   when the cgroup EPC usage reaches its limit, the virtual EPC driver
>   will stop allocating more EPC for the VM, and return SIGBUS to the
>   user process which would abort the VM launch.
>
> which IMO is a hack, unless returning SIGBUS is actually enforced
> somehow. Relying on userspace to be kind enough to kill its VMs kinda
> defeats the purpose of cgroup enforcement. E.g. if the hard limit for
> a EPC cgroup is lowered, userspace running enclaves in a VM could
> continue on and refuse to give up its EPC, and thus run above its
> limit in perpetuity.
>

The cgroup would refuse to allocate more when the limit is reached, so
VMs cannot run above the limit. IIRC, VMs only support a static EPC size
right now, so reaching the limit at launch means the EPC size given on
the QEMU command line is not appropriate. So the VM should not launch,
hence the current behavior. [All EPC pages in the guest are allocated
via page faults caused by the sanitization process in the guest kernel
during init, which is part of the VM launch process. So the SIGBUS turns
into a failed VM launch.]

Once it is launched, the guest kernel has the 'total capacity' given by
the static value from the QEMU option. It starts paging when that is
used up, and never asks the host for more.

For a future with dynamic EPC for running guests, QEMU could handle the
allocation failure and pass SIGBUS to the running guest kernel. Is that
a correct understanding?

> I can see userspace wanting to explicitly terminate the VM instead of
> "silently" killing the VM's enclaves, but that seems like it should be
> a knob in the virtual EPC code.

If my understanding above is correct, and I'm reading your statement
correctly, then I don't see that we really need a separate knob for the
vEPC code. Reaching a cgroup limit with a running guest (assuming
dynamic allocation is implemented) should not automatically translate
into killing the VM. Instead, it is user space's job to work with the
guest to handle the allocation failure; the guest could page, or kill
enclaves.

Haitao