Date: Tue, 22 Jan 2019 15:23:39 +0000
From: Mark Rutland
To: "Zhang, Lei"
Cc: "'catalin.marinas@arm.com'", "'will.deacon@arm.com'", "'linux-arm-kernel@lists.infradead.org'", "'linux-kernel@vger.kernel.org'"
Subject: Re: [PATCH] arm64 memory accesses may cause undefined fault on Fujitsu-A64FX
Message-ID: <20190122152339.GD52887@lakrids.cambridge.arm.com>
References: <8898674D84E3B24BA3A2D289B872026A6A29FA8F@G01JPEXMBKW03> <20190118141758.GC12256@lakrids.cambridge.arm.com> <8898674D84E3B24BA3A2D289B872026A6A2A2F44@G01JPEXMBKW03>
In-Reply-To: <8898674D84E3B24BA3A2D289B872026A6A2A2F44@G01JPEXMBKW03>
X-Mailing-List: linux-kernel@vger.kernel.org

On Tue, Jan 22, 2019 at 02:05:26AM +0000, Zhang, Lei wrote:
> Hi, Mark
>
> Thanks for your comments, and sorry for late.
>
> > -----Original Message-----
> > * Under what conditions can the fault occur? e.g. is this in place of
> >   some other fault, or completely spurious?
> This fault can occur completely spurious under a specific hardware
> condition and instructions order.

Ok. Can you be more specific regarding the conditions under which this
occurs? e.g. can this only occur with certain instruction sequences?

> > * Does this only occur for data abort? i.e. not instruction aborts?
> Yes. This fault only occurs for data abort.
>
> > * How often does this fault occur?
> In my test, this fault occurs once every several times in the OS boot
> sequence, and after the completion of OS boot, this fault have never
> occurred.
> In my opinion, this fault rarely occurs after the completion of OS
> boot.

I'm very concerned that this could occur during boot (even if rarely),
as that implies this is being taken EL1->EL1 or EL2->EL2.

Which exception levels can the fault be taken from? e.g. is it possible
for this fault to be taken from EL2 to EL2, or from EL3 to EL3?

> > * Does this only apply to Stage-1, or can the same faults be taken at
> >   Stage-2?
> This fault can be taken only at Stage-1.
>
> > I'm a bit surprised by the single retry. Is there any guarantee that a
> > thread will eventually stop delivering this fault code?
> I guarantee that a thread will stop delivering this fault code by the
> this patch.
> The hardware condition which cause this fault is reset at exception
> entry, therefore execution of at least one instruction is guaranteed
> by this single retry.

Ok, so we can guarantee forward progress, but in the worst case that's
down to single-step performance levels.

> > Note that all CPUs and threads share the do_bad_ignore_first variable,
> > so this is going to behave non-deterministically and kill threads in
> > some cases.

I see now that I'd misread the code, and we'll always retry the fault
(on A64FX), so this is not true.

> > This code is also preemptible, so checking the MIDR here doesn't make
> > much sense. Either this is always uniform (and we can check once in the
> > errata framework), or it's variable (e.g. on a big.LITTLE system)
> > and we need to avoid preemption up until this point.

... though this may be a problem if A64FX is integrated into a
non-uniform system (and we could unwittingly kill threads).

> > Rather than dynamically checking the MIDR, this should use the errata
> > framework, and if any A64FX CPU is discovered, set an erratum cap like
> > ARM64_WORKAROUND_CONFIG_FUJITSU_ERRATUM_010001, so we can do something
> > like:
> I try to provide a new patch to reflect your comments in today.
> Unfortunately this bug may occurs before init_cpu_hwcaps_indirect_list
> called.

As above, I'm very concerned that this could be taken from kernel
context. There are a number of cases where we cannot handle such faults:

* During boot, when we hand-over between agents (e.g. UEFI->kernel).

* Before VBAR_EL1 is initialized.

* During exception entry/return sequences (including when the KPTI
  trampoline vectors are installed).

* While the KVM vectors are installed (for VHE).

Are there any constraints on when the fault can be raised? Under which
conditions does this happen?

Thanks,
Mark.