Message-ID: <56fc0ced-d8d2-146f-6ca8-b95bd7e0b4f5@intel.com>
Date: Tue, 15 Feb 2022 09:07:45 -0800
To: Brian Geffon, Thomas Gleixner
Cc: Willis Kung, Guenter Roeck, Borislav Petkov, Andy Lutomirski, stable@vger.kernel.org, x86@kernel.org, linux-kernel@vger.kernel.org
References: <20220215153644.3654582-1-bgeffon@google.com>
From: Dave Hansen
Subject: Re: [PATCH] x86/fpu: Correct pkru/xstate inconsistency
In-Reply-To: <20220215153644.3654582-1-bgeffon@google.com>
X-Mailing-List: linux-kernel@vger.kernel.org

On 2/15/22 07:36, Brian Geffon wrote:
> There are two issues with PKRU handling prior to 5.13.

Are you sure both of these issues were introduced by 0cecca9d03c? I'm
surprised that the get_xsave_addr() issue is not older.

Should this be two patches?

> The first is that when eagerly switching PKRU we check that current

Don't forget to write in imperative mood. No "we's", please.

	https://www.kernel.org/doc/html/latest/process/maintainer-tip.html

This goes for changelogs and comments too.

> is not a kernel thread as kernel threads will never use PKRU. It's
> possible that this_cpu_read_stable() on current_task (ie.
> get_current()) is returning an old cached value. By forcing the read
> with this_cpu_read() the correct task is used. Without this it's
> possible when switching from a kernel thread to a userspace thread
> that we'll still observe the PF_KTHREAD flag and never restore the
> PKRU. And as a result this issue only occurs when switching from a
> kernel thread to a userspace thread, switching from a non kernel
> thread works perfectly fine because all we consider in that situation
> is the flags from some other non kernel task and the next fpu is
> passed in to switch_fpu_finish().

It makes *sense* that there would be a place in the context switch code
where 'current' is wonky, but I never realized this. This seems really
fragile, but *also* trivially detectable.

Is the PKRU code really the only code to use 'current' in a buggy way
like this?

> The second issue is when using write_pkru() we only write to the
> xstate when the feature bit is set because get_xsave_addr() returns
> NULL when the feature bit is not set. This is problematic as the CPU
> is free to clear the feature bit when it observes the xstate in the
> init state, this behavior seems to be documented a few places throughout
> the kernel. If the bit was cleared then in write_pkru() we would happily
> write to PKRU without ever updating the xstate, and the FPU restore on
> return to userspace would load the old value agian.
                                               ^ again

It's probably worth noting that the AMD init tracker is a lot more
aggressive than Intel's. On Intel, I think XRSTOR is the only way to
get back to the init state. You're obviously hitting this on AMD.

It's also *very* unlikely that PKRU gets back to a value of 0. I think
we added a selftest for this case in later kernels.

That helps explain why this bug hung around for so long.

> diff --git a/arch/x86/include/asm/fpu/internal.h b/arch/x86/include/asm/fpu/internal.h
> index 03b3de491b5e..540bda5bdd28 100644
> --- a/arch/x86/include/asm/fpu/internal.h
> +++ b/arch/x86/include/asm/fpu/internal.h
> @@ -598,7 +598,7 @@ static inline void switch_fpu_finish(struct fpu *new_fpu)
>  	 * PKRU state is switched eagerly because it needs to be valid before we
>  	 * return to userland e.g. for a copy_to_user() operation.
>  	 */
> -	if (!(current->flags & PF_KTHREAD)) {
> +	if (!(this_cpu_read(current_task)->flags & PF_KTHREAD)) {

This really deserves a specific comment.

>  		/*
>  		 * If the PKRU bit in xsave.header.xfeatures is not set,
>  		 * then the PKRU component was in init state, which means
> diff --git a/arch/x86/include/asm/pgtable.h b/arch/x86/include/asm/pgtable.h
> index 9e71bf86d8d0..aa381b530de0 100644
> --- a/arch/x86/include/asm/pgtable.h
> +++ b/arch/x86/include/asm/pgtable.h
> @@ -140,16 +140,22 @@ static inline void write_pkru(u32 pkru)
>  	if (!boot_cpu_has(X86_FEATURE_OSPKE))
>  		return;
>
> -	pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> -
>  	/*
>  	 * The PKRU value in xstate needs to be in sync with the value that is
>  	 * written to the CPU. The FPU restore on return to userland would
>  	 * otherwise load the previous value again.
>  	 */
>  	fpregs_lock();
> -	if (pk)
> -		pk->pkru = pkru;
> +	/*
> +	 * The CPU is free to clear the feature bit when the xstate is in the
> +	 * init state. For this reason, we need to make sure the feature bit is
> +	 * reset when we're explicitly writing to pkru. If we did not then we
> +	 * would write to pkru and it would not be saved on a context switch.
> +	 */
> +	current->thread.fpu.state.xsave.header.xfeatures |= XFEATURE_MASK_PKRU;

I don't think we need to describe how the init optimization works again.
I'm also not sure it's worth mentioning context switches here. It's a
wider problem than that. Maybe:

	/*
	 * All fpregs will be XRSTOR'd from this buffer before returning
	 * to userspace. Ensure that XRSTOR does not init PKRU and that
	 * get_xsave_addr() will work.
	 */

> +	pk = get_xsave_addr(&current->thread.fpu.state.xsave, XFEATURE_PKRU);
> +	BUG_ON(!pk);

A BUG_ON() a line before a NULL pointer dereference doesn't tend to do
much good.

> +	pk->pkru = pkru;
>  	__write_pkru(pkru);
>  	fpregs_unlock();
>  }
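
For anyone following along, the second issue can be approximated in
user space. What follows is only a rough sketch, not kernel code: every
model_* name and MODEL_FEATURE_PKRU are hypothetical stand-ins, the
"CPU" is a plain struct, and the restore step assumes that a component
whose feature bit is clear comes back in its init state (0 for PKRU).
It compares a write path that skips the buffer when the bit is clear
with one that sets the bit first, as the patch does.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model; none of these names are real kernel interfaces. */
#define MODEL_FEATURE_PKRU (1u << 9)	/* PKRU is xfeature number 9 */

struct model_xsave {
	uint64_t xfeatures;	/* stand-in for xsave.header.xfeatures */
	uint32_t pkru;		/* stand-in for the PKRU slot in the buffer */
};

struct model_cpu {
	uint32_t pkru;		/* stand-in for the live PKRU register */
};

/* Mirrors the get_xsave_addr() behavior quoted above: NULL if bit clear. */
static uint32_t *model_get_addr(struct model_xsave *xs)
{
	return (xs->xfeatures & MODEL_FEATURE_PKRU) ? &xs->pkru : NULL;
}

/* Assumed restore rule: a component with a clear bit is set to init (0). */
static void model_xrstor(struct model_cpu *cpu, const struct model_xsave *xs)
{
	cpu->pkru = (xs->xfeatures & MODEL_FEATURE_PKRU) ? xs->pkru : 0;
}

/* Pre-patch shape: the buffer is only updated if the bit is already set. */
static void model_write_pkru_old(struct model_cpu *cpu,
				 struct model_xsave *xs, uint32_t pkru)
{
	uint32_t *pk = model_get_addr(xs);

	if (pk)
		*pk = pkru;
	cpu->pkru = pkru;
}

/* Patched shape: set the bit first, so the lookup cannot return NULL. */
static void model_write_pkru_new(struct model_cpu *cpu,
				 struct model_xsave *xs, uint32_t pkru)
{
	uint32_t *pk;

	xs->xfeatures |= MODEL_FEATURE_PKRU;
	pk = model_get_addr(xs);
	*pk = pkru;
	cpu->pkru = pkru;
}

/*
 * Start with the feature bit clear (component in init state), write a
 * value, restore from the buffer, and report what the "register" holds.
 */
static uint32_t model_roundtrip(int patched, uint32_t pkru)
{
	struct model_xsave xs = { .xfeatures = 0, .pkru = 0 };
	struct model_cpu cpu = { .pkru = 0 };

	if (patched)
		model_write_pkru_new(&cpu, &xs, pkru);
	else
		model_write_pkru_old(&cpu, &xs, pkru);

	model_xrstor(&cpu, &xs);
	return cpu.pkru;
}
```

In this model, model_roundtrip(0, val) loses a nonzero val on the
restore, while model_roundtrip(1, val) keeps it. Because the patched
path sets the bit before the lookup, the NULL check has nothing left to
catch.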