From: Andy Lutomirski
Date: Sun, 28 Jan 2018 11:21:35 -0800
Subject: Re: selftests/x86/fsgsbase_64 test problem
To: Andy Lutomirski
Cc: Borislav Petkov, "H. Peter Anvin", Dan Rue, Shuah Khan, Ingo Molnar, Dmitry Safonov, "open list:KERNEL SELFTEST FRAMEWORK", LKML
X-Mailing-List: linux-kernel@vger.kernel.org

On Fri, Jan 26, 2018 at 2:42 PM, Andy Lutomirski wrote:
> On Fri, Jan 26, 2018 at 2:38 PM, Andy Lutomirski wrote:
>> On Fri, Jan 26, 2018 at 11:46 AM, Andy Lutomirski wrote:
>>> On Fri, Jan 26, 2018 at 10:59 AM, Andy Lutomirski wrote:
>>>> On Fri, Jan 26, 2018 at 8:22 AM, Andy Lutomirski wrote:
>>>>> On Fri, Jan 26, 2018 at 7:36 AM, Dan Rue wrote:
>>>>>>
>>>>>> We've noticed that fsgsbase_64 can fail intermittently with the
>>>>>> following error:
>>>>>>
>>>>>> [RUN] ARCH_SET_GS(0x0) and clear gs, then schedule to 0x1
>>>>>> Before schedule, set selector to 0x1
>>>>>> other thread: ARCH_SET_GS(0x1) -- sel is 0x0
>>>>>> [FAIL] GS/BASE changed from 0x1/0x0 to 0x0/0x0
>>>>>>
>>>>>> This can be reliably reproduced by running fsgsbase_64 in a loop. i.e.
>>>>>>
>>>>>> for i in $(seq 1 10000); do ./fsgsbase_64 || break; done
>>>>>>
>>>>>> This problem isn't new - I've reproduced it on latest mainline and every
>>>>>> release going back to v4.12 (I did not try earlier). This was tested on
>>>>>> a Supermicro board with a Xeon E3-1220 as well as an Intel NUC with an
>>>>>> i3-5010U.
>>>>>>
>>>>>
>>>>> Hmm, I can reproduce it, too. I'll look in a bit.
>>>>
>>>> I'm triggering a different error, and I think what's going on is that
>>>> the kernel doesn't currently re-save GSBASE when a task switches out
>>>> and that task has saved gsbase != 0 and in-register GS == 0. This is
>>>> arguably a bug, but it's not an infoleak, and fixing it could be a wee
>>>> bit expensive. I'm not sure what, if anything, to do about this.
>>>> I suppose I could add some gross perf hackery to the test to detect this
>>>> case and suppress the error.
>>>>
>>>> I can also trigger the problem you're seeing, and I don't know what's
>>>> up. It may be related to an old problem I've seen that causes signal
>>>> delivery to sometimes corrupt %gs. It's deterministic, but it depends
>>>> in some odd way on register state. I can currently reproduce that
>>>> issue 100% of the time, and I'm trying to see if I can figure out
>>>> what's happening.
>>>
>>> I think it's a CPU bug, and I'm a bit mystified. I can trigger the
>>> following, plausibly related issue:
>>>
>>> Write a program that writes %gs = 1.
>>> Run that program under gdb
>>> break in which %gs == 1
>>> display/x $gs
>>> si
>>>
>>> Under QEMU TCG, gs stays equal to 1. On native or KVM, on Skylake, it
>>> changes to 0.
>>>
>>> On KVM or native, I do not observe do_debug getting called with %gs ==
>>> 1. On TCG, I do. I don't think that's precisely the problem that's
>>> causing the test to fail, since the test doesn't use TF or ptrace, but
>>> I wouldn't be shocked if it's related.
>>>
>>> hpa, any insight?
>>>
>>> (NB: if you want to play with this as I've described it, you may need
>>> to make invalid_selector() in ptrace.c always return false. The
>>> current implementation is too strict and causes problems.)
>>
>> Much simpler test. Run the attached program (gs1). It more or less
>> just sets %gs to 1 and spins until it stops being 1. Do it on a
>> kernel with the attached patch applied. I see stuff like this:
>>
>> # ./gs1
>> PID = 129
>> [ 15.703015] pid 129 saved gs = 1
>> [ 15.703517] pid 129 loaded gs = 1
>> [ 15.703973] pid 129 prepare_exit_to_usermode: gs = 1
>> ax = 0, cx = 0, dx = 0
>>
>> So we're interrupting the program, switching out, switching back in,
>> setting %gs to 1, observing that %gs is *still* 1 in
>> prepare_exit_to_usermode(), returning to usermode, and observing %gs
>> == 0.
>>
>> Presumably what's happening is that the IRET microcode matches the
>> SDM's pseudocode, which says:
>>
>> RETURN-TO-OUTER-PRIVILEGE-LEVEL:
>> ...
>> FOR each SegReg in (ES, FS, GS, and DS)
>> DO
>>   tempDesc ← descriptor cache for SegReg (* hidden part of segment register *)
>>   IF tempDesc(DPL) < CPL AND tempDesc(Type) is data or non-conforming code
>>     THEN (* Segment register invalid *)
>>       SegReg ← NULL;
>>   FI;
>> OD;
>>
>> But this is very odd. The actual permission checks (in the docs for MOV) are:
>>
>> IF DS, ES, FS, or GS is loaded with non-NULL selector
>> THEN
>>   IF segment selector index is outside descriptor table limits
>>   or segment is not a data or readable code segment
>>   or ((segment is a data or nonconforming code segment)
>>   or ((RPL > DPL) and (CPL > DPL))
>>     THEN #GP(selector); FI;
>>
>> ^^^^
>> This makes no sense. This says that the data segments cannot be
>> loaded with MOV. Empirically, it seems like MOV works if CPL <= DPL
>> and RPL <= DPL, but I haven't checked that hard.
>
> Surely Intel meant:
>
> ... or ((segment is a data segment or nonconforming code segment) and
> ((RPL > DPL) or (CPL > DPL))
>
> This would be consistent with the AMD APM #GP condition of "The DS,
> ES, FS, or GS register was loaded and the segment pointed to was a
> data or non-conforming code segment, but the RPL or CPL was greater
> than the DPL."
>
>>
>> IF segment not marked present
>>   THEN #NP(selector);
>> ELSE
>>   SegmentRegister ← segment selector;
>>   SegmentRegister ← segment descriptor; FI;
>> FI;
>>
>> IF DS, ES, FS, or GS is loaded with NULL selector
>> THEN
>>   SegmentRegister ← segment selector;
>>   SegmentRegister ← segment descriptor;
>> ^^^^
>> wtf? There is no "segment descriptor". Presumably what actually
>> gets written to segment.DPL is nonsense.
>> FI;
>
> I think the bug is here. I think that, when writing a NULL selector
> to DS, ES, FS, or GS, Intel CPUs incorrectly set DPL == RPL, whereas
> they should set DPL to 3.

As an experiment, I did this:

 DEFINE_PER_CPU_PAGE_ALIGNED(struct gdt_page, gdt_page) = { .gdt = {
+       [0] = { .dpl = 3, },
+

This had no apparent effect. I was hoping that maybe loading NULL into
a selector would copy DPL from gdt[0], but it seems like it doesn't.
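
For anyone following along, here is a small userspace model of the hypothesis
quoted above. It is a sketch, not kernel code and not a description of real
microcode. It assumes that a NULL selector load caches DPL == RPL and that
IRET to an outer privilege level applies the SDM's "DPL < CPL" check to that
cached value. The struct and function names are made up for illustration, and
the "Type is data or non-conforming code" half of the SDM condition is assumed
to be true for a NULL segment.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Simplified data segment register: the visible selector plus the one
 * hidden descriptor-cache field that matters here (DPL). Hypothetical. */
struct seg_reg {
	uint16_t selector;
	unsigned dpl;
};

/* Selectors 0..3 are NULL (index 0, TI 0); the low two bits are the RPL. */
bool selector_is_null(uint16_t sel)
{
	return (sel & ~3u) == 0;
}

unsigned selector_rpl(uint16_t sel)
{
	return sel & 3u;
}

/* ASSUMPTION (the hypothesis in this thread): a NULL load caches DPL == RPL. */
struct seg_reg load_null_hypothesized(uint16_t sel)
{
	struct seg_reg r = { sel, selector_rpl(sel) };

	assert(selector_is_null(sel));
	return r;
}

/* What the thread argues should happen instead: a NULL load caches DPL == 3. */
struct seg_reg load_null_expected(uint16_t sel)
{
	struct seg_reg r = { sel, 3 };

	assert(selector_is_null(sel));
	return r;
}

/* One iteration of the SDM's RETURN-TO-OUTER-PRIVILEGE-LEVEL loop:
 * if the cached DPL is below the new CPL, the register is nulled.
 * The segment-type half of the condition is assumed to hold. */
void iret_to_outer_level(struct seg_reg *r, unsigned new_cpl)
{
	assert(new_cpl <= 3);
	if (r->dpl < new_cpl)
		r->selector = 0;
}
```

In this model, load_null_hypothesized(1) followed by
iret_to_outer_level(&reg, 3) leaves reg.selector == 0, which matches the gs1
observation, while load_null_expected(1) keeps the selector at 1. The model
also suggests that %gs == 3 would survive IRET under the hypothesis; that is a
prediction of the sketch only and has not been checked on hardware here.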