Received: by 2002:a05:6358:11c7:b0:104:8066:f915 with SMTP id i7csp1239691rwl; Fri, 7 Apr 2023 12:11:41 -0700 (PDT) X-Google-Smtp-Source: AKy350a2Wcj80I0GWyvW++uPcYYX6jb8aPBKMDNLKte6kPyL3Shx4SA2r9ekoqDwGKNhO3Lmy06f X-Received: by 2002:a05:6a20:4e03:b0:dd:ac3a:b798 with SMTP id gk3-20020a056a204e0300b000ddac3ab798mr3961341pzb.13.1680894701426; Fri, 07 Apr 2023 12:11:41 -0700 (PDT) ARC-Seal: i=1; a=rsa-sha256; t=1680894701; cv=none; d=google.com; s=arc-20160816; b=CDqZuVHEgfaVELCWBOq+NwnnymNcFGTFrUWHz8zJStSHz8GvRszps9TlajT2JgARf6 kU+s5xiOaCb27cX+XMnIFjMXAAT/TFAYrUiZIQOCwKqmf+WQdADdhmw8rRJ/kQOxnwOk L/Y1+x/zT35jx5wGJ/wsEsl/6Noc90fjr9YZYXakEo6ybkUxjXSmLIli/mEexTNUEADA HRBuaQ1iTElWu41ovRywaITRAmmXHAZ5TmvMypoqmNjvCHbWv52anBq1qOE+rwOdgT8D ONl/NsrPYoYPBXO+uQmyKTGJjH5VCLHvFoesmb6lbipdxuAgky8EsxiZG3lVAXTCY/yO uBdw== ARC-Message-Signature: i=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=arc-20160816; h=list-id:precedence:content-transfer-encoding:cc:to:subject :message-id:date:from:in-reply-to:references:mime-version :dkim-signature; bh=57OqYugB2/xhFeaBFwZjJ6ChuTE8oOboxRR8GaV91vE=; b=Xby8KBRJf87xok9SwzA6o0qRMb3zbn3+NhNSfO8ActjP911SQ3IrOeJ29Y+3BAPhlu kHH+4CA2b+oxY4avuTSOrzz9O8QDc0twxPjL8qJMfHxNOmYj/U0gbhzpYRQnxiXn9z3e u5nlFHyMMP+9tZvh9GYzcuRbhj7nUT5x4SqzUnrl8VEx1a4RAf3yc60m/uHn35hNbi+B aoEmkpihjlRkv6W8mNbCGxe4B9E+/d6FeQ4W/7gsF/IrE8yq+N8Cm2jtgFCRHrOBjt++ 804yBNqWHsrpnXzOcE6y+jUeoWICX3BIHDsW9ir1TQce6dcC6E/MGgN9JmnxO5LoiPZy eo5A== ARC-Authentication-Results: i=1; mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=TTSj0mzM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Return-Path: Received: from out1.vger.email (out1.vger.email. [2620:137:e000::1:20]) by mx.google.com with ESMTP id k28-20020aa79d1c000000b005f681d0192csi4310263pfp.27.2023.04.07.12.11.29; Fri, 07 Apr 2023 12:11:41 -0700 (PDT) Received-SPF: pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) client-ip=2620:137:e000::1:20; Authentication-Results: mx.google.com; dkim=pass header.i=@chromium.org header.s=google header.b=TTSj0mzM; spf=pass (google.com: domain of linux-kernel-owner@vger.kernel.org designates 2620:137:e000::1:20 as permitted sender) smtp.mailfrom=linux-kernel-owner@vger.kernel.org; dmarc=pass (p=NONE sp=NONE dis=NONE) header.from=chromium.org Received: (majordomo@vger.kernel.org) by vger.kernel.org via listexpand id S234675AbjDGS4m (ORCPT + 99 others); Fri, 7 Apr 2023 14:56:42 -0400 Received: from lindbergh.monkeyblade.net ([23.128.96.19]:43824 "EHLO lindbergh.monkeyblade.net" rhost-flags-OK-OK-OK-OK) by vger.kernel.org with ESMTP id S234670AbjDGS4X (ORCPT ); Fri, 7 Apr 2023 14:56:23 -0400 Received: from mail-qt1-x836.google.com (mail-qt1-x836.google.com [IPv6:2607:f8b0:4864:20::836]) by lindbergh.monkeyblade.net (Postfix) with ESMTPS id 25465D50A for ; Fri, 7 Apr 2023 11:54:25 -0700 (PDT) Received: by mail-qt1-x836.google.com with SMTP id d75a77b69052e-3e673b5f558so480631cf.1 for ; Fri, 07 Apr 2023 11:54:25 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=chromium.org; s=google; t=1680893619; x=1683485619; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:from:to:cc:subject:date :message-id:reply-to; bh=57OqYugB2/xhFeaBFwZjJ6ChuTE8oOboxRR8GaV91vE=; b=TTSj0mzMY+NkQG1mVVWkjrpeEJe22SBs+5ls6QopGa0lxcGn/GwU01nepzLSuVPzNB 8no7JrjKTpa0ymH1gaOkPBr6kopN/SCCyvRzZgtXou5sbAxWknszaGEFsDhSlpeHtuV+ kRi9QrjEzZqXk72HVh/58tb7Jt4q56Rci39NE= X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20210112; t=1680893619; x=1683485619; h=content-transfer-encoding:cc:to:subject:message-id:date:from :in-reply-to:references:mime-version:x-gm-message-state:from:to:cc :subject:date:message-id:reply-to; bh=57OqYugB2/xhFeaBFwZjJ6ChuTE8oOboxRR8GaV91vE=; b=qdoEroQcIZdplE87pbM5KHOzo2lQ5vjB42Npg7zsSUmo3wRAiLJCWhfruhTjkIHGx7 lzTUuHtcxqySLE0sgwYMAliYbDel6y3o77bzH2geUpJaE4vQZBeeR1v8/CSIL3E0uJGX wXQnSxFj4Hjl2p05sh9Of+AvFkk3eCwA4nc//d/5Bicaooq+Ez5+7kUe2+PqJsXQMuHe UDTUlYMXYkXqb04Qpx7JjRMNlpz3W5+Fw1lRTl2+It8TeqxLZ3cIm4IXwOrHuMZ1FczS tEvT7QuTecKuK8JtRIxf9tgVi2EeWp3Mw/x0U+DrqgG+8B7g7DCrUoIAUYfyqlSoQgTS 1EZA== X-Gm-Message-State: AAQBX9dkaYMMBkyexnun5V19A6dYf+u7zxVl2EF3DesuKT9oorJDO9fT YD8cR9H9huhrom1/5tnpvNgLvmYhbKeovXvWuuRXsw== X-Received: by 2002:a05:622a:1b86:b0:3e2:3de:3732 with SMTP id bp6-20020a05622a1b8600b003e203de3732mr27107qtb.7.1680893618577; Fri, 07 Apr 2023 11:53:38 -0700 (PDT) MIME-Version: 1.0 References: <20230317175109.3859943-2-grundler@chromium.org> <20230406195045.GA3729127@bhelgaas> In-Reply-To: <20230406195045.GA3729127@bhelgaas> From: Grant Grundler Date: Fri, 7 Apr 2023 11:53:27 -0700 Message-ID: Subject: Re: [PATCHv2 pci-next 2/2] PCI/AER: Rate limit the reporting of the correctable errors To: Bjorn Helgaas Cc: Grant Grundler , Mahesh J Salgaonkar , "Oliver O 'Halloran" , Bjorn Helgaas , Rajat Jain , Rajat Khandelwal , linux-pci@vger.kernel.org, linux-kernel@vger.kernel.org, linuxppc-dev@lists.ozlabs.org Content-Type: text/plain; charset="UTF-8" Content-Transfer-Encoding: quoted-printable X-Spam-Status: No, score=-7.4 required=5.0 tests=DKIMWL_WL_HIGH,DKIM_SIGNED, DKIM_VALID,DKIM_VALID_AU,HEADER_FROM_DIFFERENT_DOMAINS, RCVD_IN_DNSWL_NONE,SPF_HELO_NONE,SPF_PASS,USER_IN_DEF_SPF_WL autolearn=no autolearn_force=no version=3.4.6 X-Spam-Checker-Version: SpamAssassin 3.4.6 (2021-04-09) on lindbergh.monkeyblade.net Precedence: bulk List-ID: X-Mailing-List: linux-kernel@vger.kernel.org On Thu, Apr 6, 2023 at 12:50=E2=80=AFPM Bjorn Helgaas = wrote: > > On Fri, Mar 17, 2023 at 10:51:09AM -0700, Grant Grundler wrote: > > From: Rajat Khandelwal > > > > There are many instances where correctable errors tend to inundate > > the message buffer. We observe such instances during thunderbolt PCIe > > tunneling. > > > > It's true that they are mitigated by the hardware and are non-fatal > > but we shouldn't be spamming the logs with such correctable errors as i= t > > confuses other kernel developers less familiar with PCI errors, support > > staff, and users who happen to look at the logs, hence rate limit them. > > > > A typical example log inside an HP TBT4 dock: > > [54912.661142] pcieport 0000:00:07.0: AER: Multiple Corrected error rec= eived: 0000:2b:00.0 > > [54912.661194] igc 0000:2b:00.0: PCIe Bus Error: severity=3DCorrected, = type=3DData Link Layer, (Transmitter ID) > > [54912.661203] igc 0000:2b:00.0: device [8086:5502] error status/mask= =3D00001100/00002000 > > [54912.661211] igc 0000:2b:00.0: [ 8] Rollover > > [54912.661219] igc 0000:2b:00.0: [12] Timeout > > [54982.838760] pcieport 0000:00:07.0: AER: Corrected error received: 00= 00:2b:00.0 > > [54982.838798] igc 0000:2b:00.0: PCIe Bus Error: severity=3DCorrected, = type=3DData Link Layer, (Transmitter ID) > > [54982.838808] igc 0000:2b:00.0: device [8086:5502] error status/mask= =3D00001000/00002000 > > [54982.838817] igc 0000:2b:00.0: [12] Timeout > > The timestamps don't contribute to understanding the problem, so we > can omit them. Ok. > > This gets repeated continuously, thus inundating the buffer. > > > > Signed-off-by: Rajat Khandelwal > > Signed-off-by: Grant Grundler > > --- > > drivers/pci/pcie/aer.c | 42 ++++++++++++++++++++++++++++-------------- > > 1 file changed, 28 insertions(+), 14 deletions(-) > > > > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c > > index cb6b96233967..b592cea8bffe 100644 > > --- a/drivers/pci/pcie/aer.c > > +++ b/drivers/pci/pcie/aer.c > > @@ -706,8 +706,8 @@ static void __aer_print_error(struct pci_dev *dev, > > errmsg =3D "Unknown Error Bit"; > > > > if (info->severity =3D=3D AER_CORRECTABLE) > > - pci_info(dev, " [%2d] %-22s%s\n", i, errmsg, > > - info->first_error =3D=3D i ? " (First)" := ""); > > + pci_info_ratelimited(dev, " [%2d] %-22s%s\n", i= , errmsg, > > + info->first_error =3D=3D i ?= " (First)" : ""); > > I don't think this is going to reliably work the way we want. We have > a bunch of pci_info_ratelimited() calls, and each caller has its own > ratelimit_state data. Unless we call pci_info_ratelimited() exactly > the same number of times for each error, the ratelimit counters will > get out of sync and we'll end up printing fragments from error A mixed > with fragments from error B. Ok - what I'm reading between the lines here is the output should be emitted in one step, not multiple pci_info_ratelimited() calls. if the code built an output string (using sprintnf()), and then called pci_info_ratelimited() exactly once at the bottom, would that be sufficient? > I think we need to explicitly manage the ratelimiting ourselves, > similar to print_hmi_event_info() or print_extlog_rcd(). Then we can > have a *single* ratelimit_state, and we can check it once to determine > whether to log this correctable error. Is the rate limiting per call location or per device? From above, I understood rate limiting is "per call location". If the code only has one call location, it should achieve the same goal, ri= ght? cheers, grant > > > else > > pci_err(dev, " [%2d] %-22s%s\n", i, errmsg, > > info->first_error =3D=3D i ? " (First)" := ""); > > @@ -719,7 +719,6 @@ void aer_print_error(struct pci_dev *dev, struct ae= r_err_info *info) > > { > > int layer, agent; > > int id =3D ((dev->bus->number << 8) | dev->devfn); > > - const char *level; > > > > if (!info->status) { > > pci_err(dev, "PCIe Bus Error: severity=3D%s, type=3DInacc= essible, (Unregistered Agent ID)\n", > > @@ -730,14 +729,21 @@ void aer_print_error(struct pci_dev *dev, struct = aer_err_info *info) > > layer =3D AER_GET_LAYER_ERROR(info->severity, info->status); > > agent =3D AER_GET_AGENT(info->severity, info->status); > > > > - level =3D (info->severity =3D=3D AER_CORRECTABLE) ? KERN_INFO : K= ERN_ERR; > > + if (info->severity =3D=3D AER_CORRECTABLE) { > > + pci_info_ratelimited(dev, "PCIe Bus Error: severity=3D%s,= type=3D%s, (%s)\n", > > + aer_error_severity_string[info->seve= rity], > > + aer_error_layer[layer], aer_agent_st= ring[agent]); > > > > - pci_printk(level, dev, "PCIe Bus Error: severity=3D%s, type=3D%s,= (%s)\n", > > - aer_error_severity_string[info->severity], > > - aer_error_layer[layer], aer_agent_string[agent]); > > + pci_info_ratelimited(dev, " device [%04x:%04x] error sta= tus/mask=3D%08x/%08x\n", > > + dev->vendor, dev->device, info->stat= us, info->mask); > > + } else { > > + pci_err(dev, "PCIe Bus Error: severity=3D%s, type=3D%s, (= %s)\n", > > + aer_error_severity_string[info->severity], > > + aer_error_layer[layer], aer_agent_string[agent]); > > > > - pci_printk(level, dev, " device [%04x:%04x] error status/mask=3D= %08x/%08x\n", > > - dev->vendor, dev->device, info->status, info->mask); > > + pci_err(dev, " device [%04x:%04x] error status/mask=3D%0= 8x/%08x\n", > > + dev->vendor, dev->device, info->status, info->mas= k); > > + } > > > > __aer_print_error(dev, info); > > > > @@ -757,11 +763,19 @@ static void aer_print_port_info(struct pci_dev *d= ev, struct aer_err_info *info) > > u8 bus =3D info->id >> 8; > > u8 devfn =3D info->id & 0xff; > > > > - pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > > - info->multi_error_valid ? "Multiple " : "", > > - aer_error_severity_string[info->severity], > > - pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > > - PCI_FUNC(devfn)); > > + if (info->severity =3D=3D AER_CORRECTABLE) > > + pci_info_ratelimited(dev, "%s%s error received: %04x:%02x= :%02x.%d\n", > > + info->multi_error_valid ? "Multiple = " : "", > > + aer_error_severity_string[info->seve= rity], > > + pci_domain_nr(dev->bus), bus, PCI_SL= OT(devfn), > > + PCI_FUNC(devfn)); > > + else > > + pci_info(dev, "%s%s error received: %04x:%02x:%02x.%d\n", > > + info->multi_error_valid ? "Multiple " : "", > > + aer_error_severity_string[info->severity], > > + pci_domain_nr(dev->bus), bus, PCI_SLOT(devfn), > > + PCI_FUNC(devfn)); > > + > > } > > > > #ifdef CONFIG_ACPI_APEI_PCIEAER > > -- > > 2.40.0.rc1.284.g88254d51c5-goog > >