Subject: Re: [PATCH 2/2] edac: add support for Amazon's Annapurna Labs EDAC
To: James Morse
From: "Hawa, Hanna"
Date: Mon, 17 Jun 2019 16:00:45 +0300
References: <1559211329-13098-1-git-send-email-hhhawa@amazon.com> <1559211329-13098-3-git-send-email-hhhawa@amazon.com> <3129ed19-0259-d227-0cff-e9f165ce5964@arm.com> <4514bfa2-68b2-2074-b817-2f5037650c4e@amazon.com>
X-Mailing-List: linux-kernel@vger.kernel.org

>>>> +static void al_a57_edac_l2merrsr(void *arg)
>>>> +{
>>>
>>>> +    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>>>
>>> How do we know this is corrected?
>
>>> It looks like L2CTLR_EL1[20] might force fatal 1/0 to map to uncorrected/corrected. Is
>>> this what you are depending on here?
>
>> No - not on this. Reporting all the errors as corrected seems bad.
>>
>> Can it depend on the fatal field?
>
> That is described as "set to 1 on the first memory error that caused a Data Abort". I
> assume this is one of the parity-error external-aborts.
>
> If the repeat counter shows, say, 2, and fatal is set, you only know that at least one of
> these errors caused an abort. But it could have been all three. The repeat counter only
> matches against the RAMID and friends; otherwise the error is counted in 'other'.
>
> I don't think there is a right thing to do here (other than increase the scrubbing
> frequency). As you can only feed one error into edac at a time, then:
>
>> if (fatal)
>>     edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
>> else
>>     edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");
>
> seems reasonable. You're reporting the most severe, and the 'other/repeat' counter values
> just go missing.

I'll print the values of 'other/repeat' so they are noticed.

>> How can L2CTLR_EL1[20] force fatal?
>
> I don't think it can; on a second reading, it looks to be even more complicated than I
> thought! That bit is described as disabling forwarding of uncorrected data, but it looks
> like the uncorrected data never actually reaches the other end. (I'm unsure what 'flush'
> means in this context.)
>
> I was looking for reasons you could 'know' that any reported error was corrected. This
> was just a bad suggestion!

Is there an interrupt for uncorrectable errors? Are the L2's 'asynchronous errors' used to
report UEs? If there is no interrupt, can we use the die-notifier subsystem to check
whether any error occurred while the system shuts down?

>>>> +        cluster = topology_physical_package_id(cpu);
>>>
>>> Hmm, I'm not sure cluster==package is guaranteed to be true forever.
>>>
>>> If you describe the L2MERRSR_EL1 cpu mapping in your DT you could use that. Otherwise
>>> pull it out of the DT using something like the arch code's parse_cluster().
>
>> I rely on this being an Alpine-SoC-specific driver.
>
> ... and that the topology code hasn't changed to really know what a package is:
> https://lore.kernel.org/lkml/20190529211340.17087-2-atish.patra@wdc.com/T/#u
>
> As what you really want to know is 'same L2?', and you're holding the cpu_read_lock(),
> would struct cacheinfo's shared_cpu_map be a better fit?
>
> This would be done by something like a cpu-mask of cache:shared_cpu_map's for the L2's
> you've visited. It removes the dependency on package==L2, and insulates you from the
> cpu-numbering not being exactly as you expect.

I'll add a DT property that points to the L2-cache node (a phandle); then it will be easy
to create a cpumask of all the cores that point to the same L2 cache. A rough sketch of
both pieces follows below.

Thanks,
Hanna
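
For illustration only, here is a rough, untested sketch of the handler shape discussed
above (report the most severe error, but log the repeat/other counts so they are not
silently lost). The s3_1_c15_c2_3 encoding and the bit positions are written from memory
of the Cortex-A57 TRM and must be verified against it; they are not taken from the posted
patch.

    #include <linux/bitfield.h>     /* FIELD_GET() */
    #include <linux/bits.h>         /* BIT_ULL(), GENMASK_ULL() */
    #include "edac_device.h"        /* edac_device_handle_{ce,ue}() */

    /* L2MERRSR_EL1 fields -- check against the Cortex-A57 TRM */
    #define L2MERRSR_VALID          BIT_ULL(31)
    #define L2MERRSR_FATAL          BIT_ULL(63)
    #define L2MERRSR_REPEAT         GENMASK_ULL(39, 32)
    #define L2MERRSR_OTHER          GENMASK_ULL(47, 40)

    static void al_a57_edac_l2merrsr(void *arg)
    {
            struct edac_device_ctl_info *edac_dev = arg;
            u64 l2merrsr;

            /* IMPLEMENTATION DEFINED register, read on a CPU of this L2 */
            asm volatile("mrs %0, s3_1_c15_c2_3" : "=r" (l2merrsr));

            if (!(l2merrsr & L2MERRSR_VALID))
                    return;

            /* Only one error can be fed into EDAC; print the counters */
            pr_warn("L2 error: repeat=%llu other=%llu\n",
                    FIELD_GET(L2MERRSR_REPEAT, l2merrsr),
                    FIELD_GET(L2MERRSR_OTHER, l2merrsr));

            if (l2merrsr & L2MERRSR_FATAL)
                    edac_device_handle_ue(edac_dev, 0, 0, "L2 Error");
            else
                    edac_device_handle_ce(edac_dev, 0, 0, "L2 Error");

            /* write to clear the recorded error */
            asm volatile("msr s3_1_c15_c2_3, %0" :: "r" (0UL));
            isb();
    }

And a sketch of grouping CPUs by shared L2 from the DT instead of assuming
cluster == package. The helper name is made up; the of_find_next_cache_node() walk stands
in for the new phandle property mentioned above (struct cacheinfo's shared_cpu_map would
be an alternative source for the same mask).

    #include <linux/cpumask.h>
    #include <linux/of.h>

    /* Set in 'mask' every possible CPU whose next-level cache is 'l2_node'. */
    static void al_l2_shared_cpus(struct device_node *l2_node, cpumask_t *mask)
    {
            int cpu;

            cpumask_clear(mask);

            for_each_possible_cpu(cpu) {
                    struct device_node *cpu_node = of_get_cpu_node(cpu, NULL);
                    struct device_node *cache;

                    if (!cpu_node)
                            continue;

                    /* first cache node above the CPU node, i.e. the L2 here */
                    cache = of_find_next_cache_node(cpu_node);
                    if (cache == l2_node)
                            cpumask_set_cpu(cpu, mask);

                    of_node_put(cache);
                    of_node_put(cpu_node);
            }
    }

With such a mask, something like smp_call_function_any(mask, al_a57_edac_l2merrsr,
edac_dev, 1) would run the L2MERRSR_EL1 read on one CPU per L2, without depending on the
cpu numbering or on cluster == package.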