Date: Thu, 9 Nov 2017 10:58:29 -0600 (CST)
From: Manoj Iyer
To: James Morse
Cc: Manoj Iyer, Shanker Donthineni, Will Deacon, Marc Zyngier,
    linux-arm-kernel@lists.infradead.org, Catalin Marinas,
    Ard Biesheuvel, Matt Fleming, Christoffer Dall,
    linux-kernel@vger.kernel.org, linux-efi@vger.kernel.org,
    kvmarm@lists.cs.columbia.edu
Subject: Re: [3/3] arm64: Add software workaround for Falkor erratum 1041
References: <1509679664-3749-4-git-send-email-shankerd@codeaurora.org>
    <5A04369A.2020405@arm.com>

James,

Looks like my VM test raised a false
alarm. I retested the stock Artful 4.13 kernel (no erratum 1041 patches
applied):

Host: Ubuntu Artful 4.13 kernel with *no* erratum 1041 patches applied.
Guest: Ubuntu Zesty (4.10) kernel.

- Created 20 VMs one at a time

In a loop:
- Stop (virsh destroy) 20 VMs one at a time
- Start (virsh start) 20 VMs one at a time.

I am still able to reproduce the system reset I reported previously. I
think the problem I reported with VMs might have nothing to do with the
erratum 1041 patches, and it probably needs to be root-caused separately.

With the stock 4.13 kernel (no erratum 1041 patches applied):

awrep6 login: [ 461.881379] ACPI CPPC: PCC check channel failed. Status=0
[ 462.051194] ACPI CPPC: PCC check channel failed. Status=0
[ 462.223137] ACPI CPPC: PCC check channel failed. Status=0
[ 462.633790] ACPI CPPC: PCC check channel failed. Status=0
[ 463.231971] ACPI CPPC: PCC check channel failed. Status=0
[ 463.403163] ACPI CPPC: PCC check channel failed. Status=0
[ 463.822936] ACPI CPPC: PCC check channel failed. Status=0
[ 463.995222] ACPI CPPC: PCC check channel failed. Status=0
[ 464.130962] ACPI CPPC: PCC check channel failed. Status=0
[ 464.258973] ACPI CPPC: PCC check channel failed. Status=0
[ 465.283028] ACPI CPPC: PCC check channel failed. Status=0

SYS_DBG: Running SDI image (immediate mode)
SYS_DBG: Ram Dump Init
SYS_DBG: Failed to init SD card
SYS_DBG: Resetting system!

On Thu, 9 Nov 2017, Manoj Iyer wrote:

>
> On Thu, 9 Nov 2017, Manoj Iyer wrote:
>
>> James,
>>
>> (sorry for top-posting)
>>
>> Applied patch 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>>
>> - Start 20 VMs one at a time
>>
>> In a loop:
>> - Stop (virsh destroy) 20 VMs one at a time
>> - Start (virsh start) 20 VMs one at a time.
>
> Fixing some confusion I might have introduced in my prev email.
>
> - Applied all 3 patches to Ubuntu Artful Kernel ( 4.13.0-16-generic )
>
> - Created 20 VMs one at a time
>
> In a loop:
> - Stop (virsh destroy) 20 VMs one at a time
> - Start (virsh start) 20 VMs one at a time.
>
>>
>> The system resets itself after starting the last VM on the 1st loop,
>> displaying the following:
>>
>> awrep6 login: [ 603.349141] ACPI CPPC: PCC check channel failed. Status=0
>> [ 603.765101] ACPI CPPC: PCC check channel failed. Status=0
>> [ 603.937389] ACPI CPPC: PCC check channel failed. Status=0
>> [ 608.285495] ACPI CPPC: PCC check channel failed. Status=0
>> [ 608.289481] ACPI CPPC: PCC check channel failed. Status=0
>>
>> SYS_DBG: Running SDI image (immediate mode)
>> SYS_DBG: Ram Dump Init
>> SYS_DBG: Failed to init SD card
>> SYS_DBG: Resetting system!
>>
>> This is followed by these messages on system reboot:
>> [ 6.616891] BERT: Error records from previous boot:
>> [ 6.621655] [Hardware Error]: event severity: fatal
>> [ 6.626516] [Hardware Error]: imprecise tstamp: 0000-00-00 00:00:00
>> [ 6.632851] [Hardware Error]: Error 0, type: fatal
>> [ 6.637713] [Hardware Error]: section type: unknown, d2e2621c-f936-468d-0d84-15a4ed015c8b
>> [ 6.646045] [Hardware Error]: section length: 0x238
>> [ 6.651082] [Hardware Error]: 00000000: 72724502 5220726f 6f736165 6e55206e  .Error Reason Un
>> [ 6.659761] [Hardware Error]: 00000010: 776f6e6b 0000006e 00000000 00000000  known...........
>> [ 6.668442] [Hardware Error]: 00000020: 00000000 00000000 00000000 00000000  ................
>> [ 6.677122] [Hardware Error]: 00000030: 00000000 00000000 00000000 00000000  ................
>>
>> On Thu, 9 Nov 2017, James Morse wrote:
>>
>>> Hi Manoj,
>>>
>>> On 08/11/17 19:05, Manoj Iyer wrote:
>>>> On Thu, 2 Nov 2017, Shanker Donthineni wrote:
>>>>> The ARM architecture defines the memory locations that are permitted
>>>>> to be accessed as the result of a speculative instruction fetch from
>>>>> an exception level for which all stages of translation are disabled.
>>>>> Specifically, the core is permitted to speculatively fetch from the
>>>>> 4KB region containing the current program counter and next 4KB.
>>>>>
>>>>> When translation is changed from enabled to disabled for the running
>>>>> exception level (SCTLR_ELn[M] changed from a value of 1 to 0), the
>>>>> Falkor core may errantly speculatively access memory locations outside
>>>>> of the 4KB region permitted by the architecture. The errant memory
>>>>> access may lead to one of the following unexpected behaviors.
>>>
>>>> I applied the 3 patches to Ubuntu 4.13.0-16-generic (Artful) kernel and
>>>> ran stress-ng cpu tests on QDF2400 server
>>>
>>> [...]
>>>
>>>> Where stress-ng would spawn N workers and test cpu offline/online, perform
>>>> matrix operations, do rapid context switches, and anonymous mmaps. Although
>>>> I was not able to reproduce the erratum on the stock 4.13 kernel using the
>>>> same test case, the patched kernel did not seem to introduce any
>>>> regressions either. I ran the stress-ng tests for over 8hrs and found the
>>>> system to be stable.
>>>
>>>
>>> Could you throw kexec and KVM into the mix? This issue only shows up when we
>>> disable the MMU, which we almost never do.
>>>
>>> For CPU offline/online we make the PSCI 'offline' call with the MMU enabled.
>>> When the CPU comes back, firmware has reset the EL2/EL1 SCTLR from a higher
>>> exception level, so it won't hit this issue.
>>>
>>> One place we do this is kexec, where we drop into purgatory with the MMU
>>> disabled.
>>>
>>> The other is KVM unloading itself to return to the hyp stub.
>>> You can stress this by starting and stopping a VM. When the number of VMs
>>> reaches 0, KVM should unload via 'kvm_arch_hardware_disable()'.
>>>
>>> Thanks,
>>>
>>> James
>>
>> --
>> ============================
>> Manoj Iyer
>> Ubuntu/Canonical
>> ARM Servers - Cloud
>> ============================
>
> --
> ============================
> Manoj Iyer
> Ubuntu/Canonical
> ARM Servers - Cloud
> ============================

--
============================
Manoj Iyer
Ubuntu/Canonical
ARM Servers - Cloud
============================
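For anyone trying to reproduce the reset, the stop/start loop described above can be scripted. This is a minimal Python sketch under stated assumptions, not the procedure actually used for the report: the guest names (`vm01` to `vm20`) and the `run_virsh`/`stress_loop` helpers are hypothetical, and it assumes the 20 guests are already defined in libvirt. Per James's note, once every guest is destroyed (and assuming nothing else is running on the host) the VM count reaches 0 and KVM should unload via `kvm_arch_hardware_disable()`, which is one of the paths that disables the MMU.

```python
import subprocess
from typing import Callable, List, Sequence

# Hypothetical guest names; the report does not say what the 20 VMs were called.
VM_NAMES = [f"vm{i:02d}" for i in range(1, 21)]


def run_virsh(args: Sequence[str]) -> None:
    """Run one virsh subcommand, raising CalledProcessError if it fails."""
    subprocess.run(["virsh", *args], check=True)


def stress_loop(names: Sequence[str], iterations: int,
                runner: Callable[[List[str]], None] = run_virsh) -> None:
    """Stop, then restart, every guest one at a time, `iterations` times."""
    for _ in range(iterations):
        for name in names:
            runner(["destroy", name])  # hard-stop each guest in turn
        # Assuming no other guests are running, the VM count is now 0,
        # so KVM should unload (kvm_arch_hardware_disable()).
        for name in names:
            runner(["start", name])  # boot each guest in turn
```

A real run would be something like `stress_loop(VM_NAMES, iterations=10)`; passing `runner=print` instead gives a dry run that only lists the virsh commands in order.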
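The speculative-fetch window described in the erratum text quoted above (the 4KB region containing the current program counter plus the next 4KB) can be written down as a small check. This is an illustrative Python sketch only: it assumes the 4KB regions are naturally aligned, and the function names are hypothetical. Going by the erratum description, a Falkor core may speculatively access addresses for which `is_permitted()` would return False right after SCTLR_ELn[M] changes from 1 to 0.

```python
from typing import Tuple

REGION_SIZE = 4 * 1024  # 4KB, as stated in the erratum description


def permitted_fetch_window(pc: int) -> Tuple[int, int]:
    """Return [start, end) of the region that may be speculatively fetched
    with all stages of translation disabled: the 4KB region containing
    `pc` plus the next 4KB. Assumes naturally aligned 4KB regions."""
    start = pc & ~(REGION_SIZE - 1)  # round pc down to its 4KB region
    return start, start + 2 * REGION_SIZE


def is_permitted(pc: int, addr: int) -> bool:
    """True if a speculative fetch of `addr` falls inside the permitted window."""
    start, end = permitted_fetch_window(pc)
    return start <= addr < end
```

For example, with the PC at 0x80001234 the permitted window runs from 0x80001000 up to, but not including, 0x80003000.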