2022-12-29 12:47:32

by Rajat Khandelwal

Subject: [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

The CPU logs get flooded with replay rollover/timeout AER errors in
the system with i225_lmvp connected, usually inside thunderbolt devices.

One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
an Intel Foxville chipset, which uses the igc driver.
On connecting ethernet, CPU logs get inundated with these errors. The point
is we shouldn't be spamming the logs with such correctible errors as it
confuses other kernel developers less familiar with PCI errors, support
staff, and users who happen to look at the logs.

Signed-off-by: Rajat Khandelwal <[email protected]>
---
drivers/net/ethernet/intel/igc/igc_main.c | 28 +++++++++++++++++++++--
1 file changed, 26 insertions(+), 2 deletions(-)

diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
index ebff0e04045d..a3a6e8086c8d 100644
--- a/drivers/net/ethernet/intel/igc/igc_main.c
+++ b/drivers/net/ethernet/intel/igc/igc_main.c
@@ -6201,6 +6201,26 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
 	return value;
 }

+#ifdef CONFIG_PCIEAER
+static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
+{
+	struct pci_dev *pdev = adapter->pdev;
+	u32 aer_pos, corr_mask;
+
+	if (pdev->device != IGC_DEV_ID_I225_LMVP)
+		return;
+
+	aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
+	if (!aer_pos)
+		return;
+
+	pci_read_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, &corr_mask);
+
+	corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
+	pci_write_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, corr_mask);
+}
+#endif
+
 /**
  * igc_probe - Device Initialization Routine
  * @pdev: PCI device information struct
@@ -6236,8 +6256,6 @@ static int igc_probe(struct pci_dev *pdev,
 	if (err)
 		goto err_pci_reg;

-	pci_enable_pcie_error_reporting(pdev);
-
 	err = pci_enable_ptm(pdev, NULL);
 	if (err < 0)
 		dev_info(&pdev->dev, "PCIe PTM not supported by PCIe bus/controller\n");
@@ -6272,6 +6290,12 @@ static int igc_probe(struct pci_dev *pdev,
 	if (!adapter->io_addr)
 		goto err_ioremap;

+#ifdef CONFIG_PCIEAER
+	igc_mask_aer_replay_correctible(adapter);
+#endif
+
+	pci_enable_pcie_error_reporting(pdev);
+
 	/* hw->hw_addr can be zeroed, so use adapter->io_addr for unmap */
 	hw->hw_addr = adapter->io_addr;

--
2.34.1
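
Note: the read-modify-write that the patch performs on the AER Correctable Error Mask register can be tried out in userspace. The following is only a sketch, not driver code: the register offset and bit values are the ones from include/uapi/linux/pci_regs.h, while struct demo_aer_cap and the demo_* helpers are hypothetical stand-ins for the device's config space and for pci_read_config_dword()/pci_write_config_dword().

```c
#include <stdint.h>

/* Values as found in include/uapi/linux/pci_regs.h */
#define PCI_ERR_COR_MASK      0x14        /* Correctable Error Mask register */
#define PCI_ERR_COR_REP_ROLL  0x00000100  /* REPLAY_NUM Rollover */
#define PCI_ERR_COR_REP_TIMER 0x00001000  /* Replay Timer Timeout */

/* Hypothetical stand-in for the AER extended capability registers */
struct demo_aer_cap {
	uint32_t regs[16];
};

/* Hypothetical stand-in for pci_read_config_dword() */
static uint32_t demo_read_dword(const struct demo_aer_cap *cap, unsigned int off)
{
	return cap->regs[off / 4];
}

/* Hypothetical stand-in for pci_write_config_dword() */
static void demo_write_dword(struct demo_aer_cap *cap, unsigned int off,
			     uint32_t val)
{
	cap->regs[off / 4] = val;
}

/* Set only the two replay bits; every other mask bit keeps its value */
static void demo_mask_replay_errors(struct demo_aer_cap *cap)
{
	uint32_t corr_mask = demo_read_dword(cap, PCI_ERR_COR_MASK);

	corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
	demo_write_dword(cap, PCI_ERR_COR_MASK, corr_mask);
}
```

With a starting mask of 0x00000001 (receiver errors already masked), the register reads 0x00001101 afterwards, so bits set by firmware or by the PCI core are preserved.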


2023-01-01 08:26:03

by Sasha Neftin

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On 12/29/2022 14:26, Rajat Khandelwal wrote:
> The CPU logs get flooded with replay rollover/timeout AER errors in
> the system with i225_lmvp connected, usually inside thunderbolt devices.
>
> One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
> an Intel Foxville chipset, which uses the igc driver.
> On connecting ethernet, CPU logs get inundated with these errors. The point
> is we shouldn't be spamming the logs with such correctible errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs.
>
> Signed-off-by: Rajat Khandelwal <[email protected]>
> ---
> drivers/net/ethernet/intel/igc/igc_main.c | 28 +++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index ebff0e04045d..a3a6e8086c8d 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -6201,6 +6201,26 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
> return value;
> }
>
> +#ifdef CONFIG_PCIEAER
> +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
> +{
> + struct pci_dev *pdev = adapter->pdev;
> + u32 aer_pos, corr_mask;
> +
> + if (pdev->device != IGC_DEV_ID_I225_LMVP)
> + return;
> +
> + aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
> + if (!aer_pos)
> + return;
> +
> + pci_read_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, &corr_mask);
> +
> + corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
> + pci_write_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, corr_mask);
> +}
> +#endif
> +
Hello Rajat,
Could we use the private flag approach instead, giving the user control over
masking some of the advanced errors?
Although... why does this happen? Wouldn't you prefer to investigate it
rather than just mask it? (I have concerns about the PCIe link over the
Thunderbolt tunnel.)
> /**
> * igc_probe - Device Initialization Routine
> * @pdev: PCI device information struct
> @@ -6236,8 +6256,6 @@ static int igc_probe(struct pci_dev *pdev,
> if (err)
> goto err_pci_reg;
>
> - pci_enable_pcie_error_reporting(pdev);
> -
> err = pci_enable_ptm(pdev, NULL);
> if (err < 0)
> dev_info(&pdev->dev, "PCIe PTM not supported by PCIe bus/controller\n");
> @@ -6272,6 +6290,12 @@ static int igc_probe(struct pci_dev *pdev,
> if (!adapter->io_addr)
> goto err_ioremap;
>
> +#ifdef CONFIG_PCIEAER
> + igc_mask_aer_replay_correctible(adapter);
> +#endif
> +
> + pci_enable_pcie_error_reporting(pdev);
> +
> /* hw->hw_addr can be zeroed, so use adapter->io_addr for unmap */
> hw->hw_addr = adapter->io_addr;
>
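
Note: the user-controlled variant suggested above could look roughly like the following userspace sketch. Everything named demo_* or DEMO_* is hypothetical, including the flag bit; no such igc private flag exists in the posted patch. Only the two replay bit values are taken from include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>

/* Bit values as in include/uapi/linux/pci_regs.h */
#define DEMO_COR_REP_ROLL  0x00000100u  /* REPLAY_NUM Rollover */
#define DEMO_COR_REP_TIMER 0x00001000u  /* Replay Timer Timeout */

/* Hypothetical private flag: user asks for replay errors to be masked */
#define DEMO_PRIV_FLAG_MASK_REPLAY (1u << 0)

/*
 * Compute the new Correctable Error Mask value from the current one
 * and the user's private flags. Flag set: mask the replay errors.
 * Flag clear: unmask them again. Other mask bits are never touched.
 */
static uint32_t demo_apply_priv_flags(uint32_t cor_mask, uint32_t priv_flags)
{
	const uint32_t replay = DEMO_COR_REP_ROLL | DEMO_COR_REP_TIMER;

	if (priv_flags & DEMO_PRIV_FLAG_MASK_REPLAY)
		return cor_mask | replay;

	return cor_mask & ~replay;
}
```

One caveat of this sketch: clearing the flag unmasks the replay bits even if something other than the driver had masked them earlier. A real implementation would probably want to remember the original register value.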

2023-01-01 09:21:39

by Leon Romanovsky

Subject: Re: [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On Thu, Dec 29, 2022 at 05:56:40PM +0530, Rajat Khandelwal wrote:
> The CPU logs get flooded with replay rollover/timeout AER errors in
> the system with i225_lmvp connected, usually inside thunderbolt devices.
>
> One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
> an Intel Foxville chipset, which uses the igc driver.
> On connecting ethernet, CPU logs get inundated with these errors. The point
> is we shouldn't be spamming the logs with such correctible errors as it
> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs.
>
> Signed-off-by: Rajat Khandelwal <[email protected]>
> ---
> drivers/net/ethernet/intel/igc/igc_main.c | 28 +++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index ebff0e04045d..a3a6e8086c8d 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -6201,6 +6201,26 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
> return value;
> }
>
> +#ifdef CONFIG_PCIEAER
> +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
> +{
> + struct pci_dev *pdev = adapter->pdev;
> + u32 aer_pos, corr_mask;
> +
> + if (pdev->device != IGC_DEV_ID_I225_LMVP)
> + return;
> +
> + aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
> + if (!aer_pos)
> + return;
> +
> + pci_read_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, &corr_mask);
> +
> + corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
> + pci_write_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, corr_mask);

Shouldn't this igc_mask_aer_replay_correctible function be implemented
in drivers/pci/quirks.c and not in igc_probe()?

Thanks

2023-01-01 11:08:56

by Paul Menzel

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

[Cc: +Bjorn, +linux-pci]


Dear Rajat,


Thank you for your patch.

Am 29.12.22 um 13:26 schrieb Rajat Khandelwal:
> The CPU logs get flooded with replay rollover/timeout AER errors in
> the system with i225_lmvp connected, usually inside thunderbolt devices.

Please add one example log message to the commit message.

> One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates

I couldn’t find that device. Is that the correct name?

> an Intel Foxville chipset, which uses the igc driver.

Please add a blank line between paragraphs.

> On connecting ethernet, CPU logs get inundated with these errors. The point
> is we shouldn't be spamming the logs with such correctible errors as it

correctable

> confuses other kernel developers less familiar with PCI errors, support
> staff, and users who happen to look at the logs.

Please reference the bug reports (bug tracker and mailing list) you
know of where this was reported.

> Signed-off-by: Rajat Khandelwal <[email protected]>
> ---
> drivers/net/ethernet/intel/igc/igc_main.c | 28 +++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index ebff0e04045d..a3a6e8086c8d 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -6201,6 +6201,26 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
> return value;
> }
>
> +#ifdef CONFIG_PCIEAER
> +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)

correctable

> +{
> + struct pci_dev *pdev = adapter->pdev;
> + u32 aer_pos, corr_mask;

Instead of using the preprocessor, use a normal C conditional. From
`Documentation/process/coding-style.rst`:

> Within code, where possible, use the IS_ENABLED macro to convert a Kconfig
> symbol into a C boolean expression, and use it in a normal C conditional:
>
> .. code-block:: c
>
> if (IS_ENABLED(CONFIG_SOMETHING)) {
> ...
> }
>
> The compiler will constant-fold the conditional away, and include or exclude
> the block of code just as with an #ifdef, so this will not add any runtime
> overhead. However, this approach still allows the C compiler to see the code
> inside the block, and check it for correctness (syntax, types, symbol
> references, etc). Thus, you still have to use an #ifdef if the code inside the
> block references symbols that will not exist if the condition is not met.


> +
> + if (pdev->device != IGC_DEV_ID_I225_LMVP)
> + return;
> +
> + aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
> + if (!aer_pos)
> + return;
> +
> + pci_read_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, &corr_mask);
> +
> + corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
> + pci_write_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, corr_mask);
> +}
> +#endif
> +
> /**
> * igc_probe - Device Initialization Routine
> * @pdev: PCI device information struct
> @@ -6236,8 +6256,6 @@ static int igc_probe(struct pci_dev *pdev,
> if (err)
> goto err_pci_reg;
>
> - pci_enable_pcie_error_reporting(pdev);
> -
> err = pci_enable_ptm(pdev, NULL);
> if (err < 0)
> dev_info(&pdev->dev, "PCIe PTM not supported by PCIe bus/controller\n");
> @@ -6272,6 +6290,12 @@ static int igc_probe(struct pci_dev *pdev,
> if (!adapter->io_addr)
> goto err_ioremap;
>
> +#ifdef CONFIG_PCIEAER
> + igc_mask_aer_replay_correctible(adapter);
> +#endif
> +
> + pci_enable_pcie_error_reporting(pdev);
> +
> /* hw->hw_addr can be zeroed, so use adapter->io_addr for unmap */
> hw->hw_addr = adapter->io_addr;
>


Kind regards,

Paul
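
Note: the IS_ENABLED idiom quoted from coding-style.rst can be tried outside the kernel with a cut-down macro. The sketch below is an assumption-laden userspace imitation: DEMO_IS_ENABLED follows the same preprocessor trick as the kernel's macro in include/linux/kconfig.h but ignores modular (=m) options, and CONFIG_DEMO_AER and CONFIG_DEMO_OTHER are made-up symbols.

```c
/*
 * Expands to 1 if the option is defined to 1, and to 0 if it is not
 * defined at all. Simplified imitation of the kernel's IS_ENABLED().
 */
#define DEMO_ARG_PLACEHOLDER_1 0,
#define demo_take_second_arg(ignored, val, ...) val
#define demo_is_defined(x) demo_is_defined_1(x)
#define demo_is_defined_1(val) demo_is_defined_2(DEMO_ARG_PLACEHOLDER_##val)
#define demo_is_defined_2(arg1_or_junk) \
	demo_take_second_arg(arg1_or_junk 1, 0, 0)
#define DEMO_IS_ENABLED(option) demo_is_defined(option)

/* Pretend Kconfig enabled this option; CONFIG_DEMO_OTHER stays undefined */
#define CONFIG_DEMO_AER 1

/* The branch is a compile-time constant, so the compiler can drop it */
static int demo_aer_path_taken(void)
{
	if (DEMO_IS_ENABLED(CONFIG_DEMO_AER))
		return 1;
	return 0;
}

/* Same shape for an option that was never defined */
static int demo_other_path_taken(void)
{
	if (DEMO_IS_ENABLED(CONFIG_DEMO_OTHER))
		return 1;
	return 0;
}
```

Both bodies are always parsed and type-checked, which is the advantage the coding-style text describes over wrapping the call in #ifdef.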

2023-01-01 11:16:54

by Paul Menzel

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

[Cc: +Bjorn, +linux-pci]

Dear Leon, dear Rajat,


Am 01.01.23 um 09:32 schrieb Leon Romanovsky:
> On Thu, Dec 29, 2022 at 05:56:40PM +0530, Rajat Khandelwal wrote:
>> The CPU logs get flooded with replay rollover/timeout AER errors in
>> the system with i225_lmvp connected, usually inside thunderbolt devices.
>>
>> One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
>> an Intel Foxville chipset, which uses the igc driver.
>> On connecting ethernet, CPU logs get inundated with these errors. The point
>> is we shouldn't be spamming the logs with such correctible errors as it
>> confuses other kernel developers less familiar with PCI errors, support
>> staff, and users who happen to look at the logs.
>>
>> Signed-off-by: Rajat Khandelwal <[email protected]>
>> ---
>> drivers/net/ethernet/intel/igc/igc_main.c | 28 +++++++++++++++++++++--
>> 1 file changed, 26 insertions(+), 2 deletions(-)
>>
>> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
>> index ebff0e04045d..a3a6e8086c8d 100644
>> --- a/drivers/net/ethernet/intel/igc/igc_main.c
>> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
>> @@ -6201,6 +6201,26 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
>> return value;
>> }
>>
>> +#ifdef CONFIG_PCIEAER
>> +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
>> +{
>> + struct pci_dev *pdev = adapter->pdev;
>> + u32 aer_pos, corr_mask;
>> +
>> + if (pdev->device != IGC_DEV_ID_I225_LMVP)
>> + return;
>> +
>> + aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
>> + if (!aer_pos)
>> + return;
>> +
>> + pci_read_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, &corr_mask);
>> +
>> + corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
>> + pci_write_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, corr_mask);
>
> Shouldn't this igc_mask_aer_replay_correctible function be implemented
> in drivers/pci/quirks.c and not in igc_probe()?

Probably, though I think the PCI quirk file is getting too big.


Kind regards,

Paul

2023-01-02 18:11:26

by Rajat Khandelwal

Subject: RE: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

Hi Paul, Sasha,

Thanks for the acknowledgement!

-> I will add the example logs to the commit message.
-> Device: https://www.hp.com/us-en/monitors-accessories/computer-accessories/thunderbolt-G4-dock.html
-> correctible -> correctable
-> I guess that, according to the convention, I still have to use #ifdef around my function since it
references variables that won't exist if the condition is not met.
However, I have used the IS_ENABLED macro to guard the call to the function inside igc_probe().
I hope that's okay!

-> One last thing: I was also skeptical about the location of this function, but then I came across the
netxen_mask_aer_correctable() function in net/ethernet/qlogic/netxen/netxen_nic_main.c,
which masks the correctable errors in its PCIe device.
Also, I don’t see any function guarded by the CONFIG_PCIEAER macro in pci/quirks.c!
I would still prefer to keep the function in igc_main.c, but I am waiting for your judgement.

@Neftin, Sasha: my team and I prefer masking these errors rather than debugging them.
First, they are correctable and non-fatal. Second, these replay errors are observed in many of the
devices I have worked with. Maybe something universal has to be done for the
Thunderbolt domain regarding these specific replay errors in the long term?
Anyhow, we would like to mask these errors for now to avoid any confusion when ethernet gets
connected to the dock. I hope that will be okay? Waiting for your judgement :)

Let me know of any further questions or suggestions before I roll out v2.

Thanks
Rajat
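
Note: when a function body sits under #ifdef while its call site uses a plain C conditional, the call still has to compile with the option turned off. One common way to arrange that is an empty stub in the #else branch. The following userspace sketch shows that shape under stated assumptions: CONFIG_DEMO_AER and the demo_* names are hypothetical, and the mask bits are the replay bits from include/uapi/linux/pci_regs.h.

```c
#include <stdint.h>

#ifdef CONFIG_DEMO_AER
/* Real implementation: mask REPLAY_NUM Rollover and Replay Timer Timeout */
static void demo_mask_replay(uint32_t *cor_mask)
{
	*cor_mask |= 0x00000100u | 0x00001000u;
}
#else
/* Stub: keeps callers compiling when the option is off; does nothing */
static inline void demo_mask_replay(uint32_t *cor_mask)
{
	(void)cor_mask;
}
#endif

/* Caller needs no #ifdef of its own; returns the resulting mask value */
static uint32_t demo_probe(uint32_t cor_mask)
{
	demo_mask_replay(&cor_mask);
	return cor_mask;
}
```

As written, CONFIG_DEMO_AER is not defined, so demo_probe() returns its argument unchanged. Building with -DCONFIG_DEMO_AER switches in the real body.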

-----Original Message-----
From: Paul Menzel <[email protected]>
Sent: Sunday, January 1, 2023 4:02 PM
To: Rajat Khandelwal <[email protected]>
Cc: Brandeburg, Jesse <[email protected]>; Nguyen, Anthony L <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; [email protected]; Khandelwal, Rajat <[email protected]>; Bjorn Helgaas <[email protected]>; [email protected]
Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

[Cc: +Bjorn, +linux-pci]


Dear Rajat,


Thank you for your patch.

Am 29.12.22 um 13:26 schrieb Rajat Khandelwal:
> The CPU logs get flooded with replay rollover/timeout AER errors in
> the system with i225_lmvp connected, usually inside thunderbolt devices.

Please add one example log message to the commit message.

> One of the prominent TBT4 docks we use is HP G4 Hook2, which
> incorporates

I couldn’t find that device. Is that the correct name?

> an Intel Foxville chipset, which uses the igc driver.

Please add a blank line between paragraphs.

> On connecting ethernet, CPU logs get inundated with these errors. The
> point is we shouldn't be spamming the logs with such correctible
> errors as it

correctable

> confuses other kernel developers less familiar with PCI errors,
> support staff, and users who happen to look at the logs.

Please reference the bug reports (bug tracker and mailing list) you know of where this was reported.

> Signed-off-by: Rajat Khandelwal <[email protected]>
> ---
> drivers/net/ethernet/intel/igc/igc_main.c | 28 +++++++++++++++++++++--
> 1 file changed, 26 insertions(+), 2 deletions(-)
>
> diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> index ebff0e04045d..a3a6e8086c8d 100644
> --- a/drivers/net/ethernet/intel/igc/igc_main.c
> +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> @@ -6201,6 +6201,26 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
> return value;
> }
>
> +#ifdef CONFIG_PCIEAER
> +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)

correctable

> +{
> + struct pci_dev *pdev = adapter->pdev;
> + u32 aer_pos, corr_mask;

Instead of using the preprocessor, use a normal C conditional. From
`Documentation/process/coding-style.rst`:

> Within code, where possible, use the IS_ENABLED macro to convert a
> Kconfig symbol into a C boolean expression, and use it in a normal C conditional:
>
> .. code-block:: c
>
> if (IS_ENABLED(CONFIG_SOMETHING)) {
> ...
> }
>
> The compiler will constant-fold the conditional away, and include or exclude
> the block of code just as with an #ifdef, so this will not add any runtime
> overhead. However, this approach still allows the C compiler to see the code
> inside the block, and check it for correctness (syntax, types, symbol
> references, etc). Thus, you still have to use an #ifdef if the code inside the
> block references symbols that will not exist if the condition is not met.


> +
> + if (pdev->device != IGC_DEV_ID_I225_LMVP)
> + return;
> +
> + aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
> + if (!aer_pos)
> + return;
> +
> + pci_read_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, &corr_mask);
> +
> + corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
> + pci_write_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, corr_mask);
> +}
> +#endif
> +
> /**
> * igc_probe - Device Initialization Routine
> * @pdev: PCI device information struct
> @@ -6236,8 +6256,6 @@ static int igc_probe(struct pci_dev *pdev,
> if (err)
> goto err_pci_reg;
>
> - pci_enable_pcie_error_reporting(pdev);
> -
> err = pci_enable_ptm(pdev, NULL);
> if (err < 0)
> dev_info(&pdev->dev, "PCIe PTM not supported by PCIe bus/controller\n");
> @@ -6272,6 +6290,12 @@ static int igc_probe(struct pci_dev *pdev,
> if (!adapter->io_addr)
> goto err_ioremap;
>
> +#ifdef CONFIG_PCIEAER
> + igc_mask_aer_replay_correctible(adapter);
> +#endif
> +
> + pci_enable_pcie_error_reporting(pdev);
> +
> /* hw->hw_addr can be zeroed, so use adapter->io_addr for unmap */
> hw->hw_addr = adapter->io_addr;
>


Kind regards,

Paul

2023-01-03 10:21:21

by Leon Romanovsky

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On Sun, Jan 01, 2023 at 11:34:21AM +0100, Paul Menzel wrote:
> [Cc: +Bjorn, +linux-pci]
>
> Dear Leon, dear Rajat,
>
>
> Am 01.01.23 um 09:32 schrieb Leon Romanovsky:
> > On Thu, Dec 29, 2022 at 05:56:40PM +0530, Rajat Khandelwal wrote:
> > > The CPU logs get flooded with replay rollover/timeout AER errors in
> > > the system with i225_lmvp connected, usually inside thunderbolt devices.
> > >
> > > One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
> > > an Intel Foxville chipset, which uses the igc driver.
> > > On connecting ethernet, CPU logs get inundated with these errors. The point
> > > is we shouldn't be spamming the logs with such correctible errors as it
> > > confuses other kernel developers less familiar with PCI errors, support
> > > staff, and users who happen to look at the logs.
> > >
> > > Signed-off-by: Rajat Khandelwal <[email protected]>
> > > ---
> > > drivers/net/ethernet/intel/igc/igc_main.c | 28 +++++++++++++++++++++--
> > > 1 file changed, 26 insertions(+), 2 deletions(-)
> > >
> > > diff --git a/drivers/net/ethernet/intel/igc/igc_main.c b/drivers/net/ethernet/intel/igc/igc_main.c
> > > index ebff0e04045d..a3a6e8086c8d 100644
> > > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> > > @@ -6201,6 +6201,26 @@ u32 igc_rd32(struct igc_hw *hw, u32 reg)
> > > return value;
> > > }
> > > +#ifdef CONFIG_PCIEAER
> > > +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
> > > +{
> > > + struct pci_dev *pdev = adapter->pdev;
> > > + u32 aer_pos, corr_mask;
> > > +
> > > + if (pdev->device != IGC_DEV_ID_I225_LMVP)
> > > + return;
> > > +
> > > + aer_pos = pci_find_ext_capability(pdev, PCI_EXT_CAP_ID_ERR);
> > > + if (!aer_pos)
> > > + return;
> > > +
> > > + pci_read_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, &corr_mask);
> > > +
> > > + corr_mask |= PCI_ERR_COR_REP_ROLL | PCI_ERR_COR_REP_TIMER;
> > > + pci_write_config_dword(pdev, aer_pos + PCI_ERR_COR_MASK, corr_mask);
> >
> > Shouldn't this igc_mask_aer_replay_correctible function be implemented
> > in drivers/pci/quirks.c and not in igc_probe()?
>
> Probably, though I think the PCI quirk file is getting too big.

As long as that file is the right location, we should use it.
One can refactor the quirk file later.

Thanks

>
>
> Kind regards,
>
> Paul

2023-01-03 12:10:45

by Bjorn Helgaas

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On Tue, Jan 03, 2023 at 11:54:24AM +0200, Leon Romanovsky wrote:
> On Sun, Jan 01, 2023 at 11:34:21AM +0100, Paul Menzel wrote:
> > Am 01.01.23 um 09:32 schrieb Leon Romanovsky:
> > > On Thu, Dec 29, 2022 at 05:56:40PM +0530, Rajat Khandelwal wrote:
> > > > The CPU logs get flooded with replay rollover/timeout AER errors in
> > > > the system with i225_lmvp connected, usually inside thunderbolt devices.
> > > >
> > > > One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
> > > > an Intel Foxville chipset, which uses the igc driver.
> > > > On connecting ethernet, CPU logs get inundated with these errors. The point
> > > > is we shouldn't be spamming the logs with such correctible errors as it
> > > > confuses other kernel developers less familiar with PCI errors, support
> > > > staff, and users who happen to look at the logs.

> > > > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c

> > > > +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)

> > > Shouldn't this igc_mask_aer_replay_correctible function be implemented
> > > in drivers/pci/quirks.c and not in igc_probe()?
> >
> > Probably, though I think the PCI quirk file is getting too big.
>
> As long as that file is the right location, we should use it.
> One can refactor the quirk file later.

If a quirk like this is only needed when the driver is loaded, I think
the driver is a better place than drivers/pci/quirks.c. If it's in
quirks.c, either we have to replicate driver Kconfig via #ifdefs, or
the kernel contains the quirk for systems that don't need it.

I'm generally not a fan of simply masking errors because they're
annoying. I'd prefer to figure out the root cause and fix it if
possible. Or maybe we can tone down or rate-limit the logging so it's
not so alarming.

Bjorn
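
Note: the rate-limiting alternative mentioned above can be illustrated with a small interval/burst limiter. This is a userspace sketch only, loosely modelled on the idea behind the kernel's ratelimit helpers and not their implementation; struct demo_ratelimit and demo_ratelimit_allow() are hypothetical names, and the caller passes the current time in seconds to keep the example deterministic.

```c
/* Allow at most 'burst' messages per 'interval' seconds */
struct demo_ratelimit {
	long interval;  /* window length in seconds */
	int burst;      /* messages allowed per window */
	long begin;     /* start time of the current window */
	int printed;    /* messages allowed in the current window */
	int missed;     /* messages suppressed so far */
	int started;    /* non-zero once the first window has begun */
};

/* Returns 1 if a message may be logged at time 'now', 0 if suppressed */
static int demo_ratelimit_allow(struct demo_ratelimit *rl, long now)
{
	/* Open a new window on first use or once the old one has expired */
	if (!rl->started || now - rl->begin >= rl->interval) {
		rl->begin = now;
		rl->printed = 0;
		rl->started = 1;
	}

	if (rl->printed < rl->burst) {
		rl->printed++;
		return 1;
	}

	rl->missed++;
	return 0;
}
```

With interval = 5 and burst = 2, calls at t = 0 and t = 1 are allowed, a call at t = 2 is suppressed, and a call at t = 5 is allowed again because a new window starts. Unlike masking in hardware, the errors stay visible; only their volume in the log drops.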

2023-01-03 12:41:57

by Leon Romanovsky

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On Tue, Jan 03, 2023 at 05:54:02AM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 03, 2023 at 11:54:24AM +0200, Leon Romanovsky wrote:
> > On Sun, Jan 01, 2023 at 11:34:21AM +0100, Paul Menzel wrote:
> > > Am 01.01.23 um 09:32 schrieb Leon Romanovsky:
> > > > On Thu, Dec 29, 2022 at 05:56:40PM +0530, Rajat Khandelwal wrote:
> > > > > The CPU logs get flooded with replay rollover/timeout AER errors in
> > > > > the system with i225_lmvp connected, usually inside thunderbolt devices.
> > > > >
> > > > > One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
> > > > > an Intel Foxville chipset, which uses the igc driver.
> > > > > On connecting ethernet, CPU logs get inundated with these errors. The point
> > > > > is we shouldn't be spamming the logs with such correctible errors as it
> > > > > confuses other kernel developers less familiar with PCI errors, support
> > > > > staff, and users who happen to look at the logs.
>
> > > > > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > > > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c
>
> > > > > +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
>
> > > > Shouldn't this igc_mask_aer_replay_correctible function be implemented
> > > > in drivers/pci/quirks.c and not in igc_probe()?
> > >
> > > Probably, though I think the PCI quirk file is getting too big.
> >
> > As long as that file is the right location, we should use it.
> > One can refactor the quirk file later.
>
> If a quirk like this is only needed when the driver is loaded,

This is always the case with PCI devices managed through the kernel, isn't it?
Users don't care about, or aren't aware of, "broken" devices unless they start to use them.

Thanks

2023-01-03 14:24:08

by Bjorn Helgaas

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On Tue, Jan 03, 2023 at 02:00:04PM +0200, Leon Romanovsky wrote:
> On Tue, Jan 03, 2023 at 05:54:02AM -0600, Bjorn Helgaas wrote:
> > On Tue, Jan 03, 2023 at 11:54:24AM +0200, Leon Romanovsky wrote:
> > > On Sun, Jan 01, 2023 at 11:34:21AM +0100, Paul Menzel wrote:
> > > > Am 01.01.23 um 09:32 schrieb Leon Romanovsky:
> > > > > On Thu, Dec 29, 2022 at 05:56:40PM +0530, Rajat Khandelwal wrote:
> > > > > > The CPU logs get flooded with replay rollover/timeout AER errors in
> > > > > > the system with i225_lmvp connected, usually inside thunderbolt devices.
> > > > > >
> > > > > > One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
> > > > > > an Intel Foxville chipset, which uses the igc driver.
> > > > > > On connecting ethernet, CPU logs get inundated with these errors. The point
> > > > > > is we shouldn't be spamming the logs with such correctible errors as it
> > > > > > confuses other kernel developers less familiar with PCI errors, support
> > > > > > staff, and users who happen to look at the logs.
> >
> > > > > > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > > > > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> >
> > > > > > +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
> >
> > > > > Shouldn't this igc_mask_aer_replay_correctible function be implemented
> > > > > in drivers/pci/quirks.c and not in igc_probe()?
> > > >
> > > > Probably, though I think the PCI quirk file is getting too big.
> > >
> > > As long as that file is the right location, we should use it.
> > > One can refactor the quirk file later.
> >
> > If a quirk like this is only needed when the driver is loaded,
>
> This is always the case with PCI devices managed through the kernel, isn't it?
> Users don't care about, or aren't aware of, "broken" devices unless they start to use them.

Indeed, that's usually the case. There's a lot of stuff in quirks.c
that could probably be in drivers instead.

Bjorn

2023-01-03 18:26:53

by Leon Romanovsky

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On Tue, Jan 03, 2023 at 08:21:04AM -0600, Bjorn Helgaas wrote:
> On Tue, Jan 03, 2023 at 02:00:04PM +0200, Leon Romanovsky wrote:
> > On Tue, Jan 03, 2023 at 05:54:02AM -0600, Bjorn Helgaas wrote:
> > > On Tue, Jan 03, 2023 at 11:54:24AM +0200, Leon Romanovsky wrote:
> > > > On Sun, Jan 01, 2023 at 11:34:21AM +0100, Paul Menzel wrote:
> > > > > Am 01.01.23 um 09:32 schrieb Leon Romanovsky:
> > > > > > On Thu, Dec 29, 2022 at 05:56:40PM +0530, Rajat Khandelwal wrote:
> > > > > > > The CPU logs get flooded with replay rollover/timeout AER errors in
> > > > > > > the system with i225_lmvp connected, usually inside thunderbolt devices.
> > > > > > >
> > > > > > > One of the prominent TBT4 docks we use is HP G4 Hook2, which incorporates
> > > > > > > an Intel Foxville chipset, which uses the igc driver.
> > > > > > > On connecting ethernet, CPU logs get inundated with these errors. The point
> > > > > > > is we shouldn't be spamming the logs with such correctible errors as it
> > > > > > > confuses other kernel developers less familiar with PCI errors, support
> > > > > > > staff, and users who happen to look at the logs.
> > >
> > > > > > > --- a/drivers/net/ethernet/intel/igc/igc_main.c
> > > > > > > +++ b/drivers/net/ethernet/intel/igc/igc_main.c
> > >
> > > > > > > +static void igc_mask_aer_replay_correctible(struct igc_adapter *adapter)
> > >
> > > > > > Shouldn't this igc_mask_aer_replay_correctible function be implemented
> > > > > > in drivers/pci/quirks.c and not in igc_probe()?
> > > > >
> > > > > Probably, though I think the PCI quirk file is getting too big.
> > > >
> > > > As long as that file is the right location, we should use it.
> > > > One can refactor the quirk file later.
> > >
> > > If a quirk like this is only needed when the driver is loaded,
> >
> > This is always the case with PCI devices managed through the kernel, isn't it?
> > Users don't care about, or aren't aware of, "broken" devices unless they start to use them.
>
> Indeed, that's usually the case. There's a lot of stuff in quirks.c
> that could probably be in drivers instead.

NP, so either deprecate quirks.c and prohibit any change to that file, or
don't allow drivers to mangle PCI in their probe routines.
Anything in between will cause an enormous mess in the long run.

Thanks

>
> Bjorn

2023-01-04 07:00:03

by Leon Romanovsky

Subject: Re: [Intel-wired-lan] [PATCH] igc: Mask replay rollover/timeout errors in I225_LMVP

On Tue, Jan 03, 2023 at 07:16:58PM +0200, Leon Romanovsky wrote:
> On Tue, Jan 03, 2023 at 08:21:04AM -0600, Bjorn Helgaas wrote:

<...>

> > > > If a quirk like this is only needed when the driver is loaded,
> > >
> > > This is always the case with PCI devices managed through the kernel, isn't it?
> > > Users don't care about, or aren't aware of, "broken" devices unless they start to use them.
> >
> > Indeed, that's usually the case. There's a lot of stuff in quirks.c
> > that could probably be in drivers instead.
>
> NP, so either deprecate quirks.c and prohibit any change to that file, or
> don't allow drivers to mangle PCI in their probe routines.
> Anything in between will cause an enormous mess in the long run.

Another thing to consider: if you go with the "probe variant", users
will see behavioral differences between drivers and subsystems in
how these quirks are controlled.

As an example, see the proposal in this thread to add an ethtool private flag
to enable/disable the quirk. In other places, it will be a module parameter,
sysfs, or a tool specific to that subsystem.

Thanks

>
> Thanks
>
> >
> > Bjorn