Received: by 2002:ac0:a5a7:0:0:0:0:0 with SMTP id m36-v6csp90615imm; Tue, 17 Jul 2018 14:37:24 -0700 (PDT) X-Google-Smtp-Source: AAOMgpdSQUn/jWNgJR8xdaHLGufXNA4ChU7OQq2LAWur7imTMk/xtQn/+esopP1ZB2i+F89HdQtb X-Received: by 2002:a63:6004:: with SMTP id u4-v6mr3156273pgb.441.1531863444483; Tue, 17 Jul 2018 14:37:24 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1531863444; cv=none; d=google.com; s=arc-20160816; b=WBWRHN2gem+BAd96PmOvBA5i4SE1gMzX6RP8JmeEDCY5U5/6fkufvxHBR6tZ2MSSYs PrsL1lM15kf29UlvfogcMhlRFiC/+ageGr5M8tYQgZxs2h6l7FC61w8qYUFmkhTj8qlD 4+Fd03906RZaF0toIHzhVblfX8LupuT4rSYplHFRR1EPAedyfMeVYoFAVp3QfLHZtVxY 6f/7R5+D3G8wr51x6DzGbCJ2olExAkn26MQIMkjcbA9eqEOVt2IH4qcUnms0GhRVPLoU KZPfO99i8cT1nHkIIFDTZhvjgnAirCnlo9UmL98XMsHk4ky0Z+jMjq80OgOh0wsCM48E XKvQ== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:sender:user-agent:in-reply-to :content-disposition:mime-version:references:message-id:subject:cc :to:from:date:dkim-signature:arc-authentication-results; bh=/PazhWG16mnFi/cKA7iMuiXw7m3Y7VqDjvGZ9xyx+oA=; b=Elp7WRISlFoV28KU9LfrXwoCoacXmAyJ3yj41sfTJdDV+mcXgit7v1swBQLrNUDgC2 w9ZKTb4au9l7QhmcYEzVk4ubUUyCEI6B/7ohZFmU47i+ZeIex151GsdXuFglaU1tdP5A DaFgy3/bSjVwb3rVWPMIqQhTWRtemRZmz8gMIWJHrIuL0aEm6vUrt+mVwfkyvf7Mj8+f b3F4z2A+gpTN3bgLiwUdnR3HvSzBH38NMMO3JYTQaTHGU6TGMPy/z7VkTnia699sZ/kJ xNwSQL9Gtgk+Z6f+U72LOQLaj4QyPW/hL+s9Wrq49+4VlC0+0eUiq++YA/T9blInqHY0 M49g== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=0UBdtWud; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Return-Path: Received: from vger.kernel.org (vger.kernel.org. [209.132.180.67]) by mx.google.com with ESMTP id e125-v6si1898248pfh.334.2018.07.17.14.37.09; Tue, 17 Jul 2018 14:37:24 -0700 (PDT) Received-SPF: pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) client-ip=209.132.180.67; Authentication-Results: mx.google.com; dkim=pass header.i=@kernel.org header.s=default header.b=0UBdtWud; spf=pass (google.com: best guess record for domain of linux-kernel-owner@vger.kernel.org designates 209.132.180.67 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=kernel.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S1730828AbeGQWKj (ORCPT + 99 others); Tue, 17 Jul 2018 18:10:39 -0400 Received: from mail.kernel.org ([198.145.29.99]:40144 "EHLO mail.kernel.org" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S1729759AbeGQWKj (ORCPT ); Tue, 17 Jul 2018 18:10:39 -0400 Received: from localhost (unknown [69.71.4.100]) (using TLSv1.2 with cipher ECDHE-RSA-AES256-GCM-SHA384 (256/256 bits)) (No client certificate requested) by mail.kernel.org (Postfix) with ESMTPSA id 3037B206B8; Tue, 17 Jul 2018 21:36:04 +0000 (UTC) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/simple; d=kernel.org; s=default; t=1531863364; bh=CDcjd7SbHLYCfTvCU+qAr4k4diKnA3UPG9vqMfhZ7NI=; h=Date:From:To:Cc:Subject:References:In-Reply-To:From; b=0UBdtWudNNOVMkDgm0gDfWijaweRvwL4JF2BpTCguAWvPEjWDknBfcQzC/2SEceXs XAdbvQqzPct3N6AXPheebs8Iwd7/nwWTWcNJtq3kKhOgaFYaQvWHp/NFvVJtnDtEe1 0ITbe7nDRz/0jafXMBU1Zfg7b8v489YCHD0+4dH0= Date: Tue, 17 Jul 2018 16:36:01 -0500 From: Bjorn Helgaas To: Oza Pawandeep Cc: Bjorn Helgaas , Philippe Ombredanne , Thomas Gleixner , Greg Kroah-Hartman , Kate Stewart , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, Dongdong Liu , Keith Busch , Wei Zhang , Sinan Kaya , Timur Tabi Subject: Re: [PATCH v2 1/6] PCI/AER: Take severity mask into account while clearing error bits Message-ID: <20180717213601.GC128988@bhelgaas-glaptop.roam.corp.google.com> References: <1529661494-20936-1-git-send-email-poza@codeaurora.org> <1529661494-20936-2-git-send-email-poza@codeaurora.org> <20180717190329.GA128988@bhelgaas-glaptop.roam.corp.google.com> MIME-Version: 1.0 Content-Type: text/plain; charset=us-ascii Content-Disposition: inline In-Reply-To: <20180717190329.GA128988@bhelgaas-glaptop.roam.corp.google.com> User-Agent: Mutt/1.9.2 (2017-12-15) Sender: linux-kernel-owner@vger.kernel.org Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Tue, Jul 17, 2018 at 02:03:29PM -0500, Bjorn Helgaas wrote: > Hi Oza, > > Thanks for doing this! > > On Fri, Jun 22, 2018 at 05:58:09AM -0400, Oza Pawandeep wrote: > > pci_cleanup_aer_uncorrect_error_status() is called by different slot_reset > > callbacks in case of ERR_NONFATAL. > > IIRC, the current strategy is: > > ERR_COR: log only > ERR_NONFATAL: call driver callbacks (pci_error_handlers) > ERR_FATAL: remove device and re-enumerate > > So these slot_reset callbacks are only used for ERR_NONFATAL, which > are all uncorrectable errors, of course. > > This patch makes it so that when the slot_reset callbacks call > pci_cleanup_aer_uncorrect_error_status(), we only clear the > bits set by ERR_NONFATAL events (this is determined by > PCI_ERR_UNCOR_SEVER). > > That makes good sense to me. All these status bits are RW1CS, so they > will be preserved across a reset but will be cleared when we > re-enumerate, in this path: > > pci_init_capabilities > pci_aer_init > pci_cleanup_aer_error_status_regs > # clear all PCI_ERR_UNCOR_STATUS and PCI_ERR_COR_STATUS bits > > > AER uncorrectable error status should take severity into account in order > > to clear the bits, so that ERR_NONFATAL path does not clear the bit which > > are marked with severity fatal. > > Two comments: > > 1) Can you split this into two patches, one that changes > pci_cleanup_aer_uncorrect_error_status() so it looks like the error > clearing code in aer_error_resume(), and a second that factors out the > duplicate code? > > 2) Maybe use "sev" or "sever" instead of "mask" for the local > variable, since there is also a separate PCI_ERR_UNCOR_MASK register, > which is not involved here. Let me back up a little here: I'm not asking you to do the things below here. They're just possible future things, so we can think about them after this series. And the things above are things I can easily do myself. So no action required from you, unless you think I'm on the wrong track :) > 3) The "pci_cleanup_aer_uncorrect_error_status()" name no longer > really describes what this does. Something like > "pci_aer_clear_nonfatal_status()" would be more descriptive. But I > see you have a subsequent patch (which I haven't looked at yet) that > is related to this. > > 4) I don't think the driver slot_reset callbacks should be responsible > for clearing these AER status bits. Can we clear them somewhere in > the pcie_do_nonfatal_recovery() path and remove these calls from the > drivers? > > 5) In principle, we should only read PCI_ERR_UNCOR_STATUS *once* per > device when handling an error. We currently read it three times: > > aer_isr > aer_isr_one_error > find_source_device > find_device_iter > is_error_source > read PCI_ERR_UNCOR_STATUS # 1 > aer_process_err_devices > get_device_error_info(e_info->dev[i]) > read PCI_ERR_UNCOR_STATUS # 2 > handle_error_source > pcie_do_nonfatal_recovery > ... > report_slot_reset > driver->err_handler->slot_reset > pci_cleanup_aer_uncorrect_error_status > read PCI_ERR_UNCOR_STATUS # 3 > > OK, that was more than two comments :)