2022-05-09 18:19:33

by Mohamed Khalfella

[permalink] [raw]
Subject: [PATCH] PCI/AER: Iterate over error counters instead of error strings

PCI AER stats counters sysfs attributes need to iterate over
stats counters instead of stats names. Also, added a build
time check to make sure all counters have entries in strings
array.

Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")
Cc: [email protected]
Reported-by: Meeta Saggi <[email protected]>
Signed-off-by: Mohamed Khalfella <[email protected]>
Reviewed-by: Meeta Saggi <[email protected]>
Reviewed-by: Eric Badger <[email protected]>
---
drivers/pci/pcie/aer.c | 7 ++++++-
1 file changed, 6 insertions(+), 1 deletion(-)

diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
index 9fa1f97e5b27..ce99a6d44786 100644
--- a/drivers/pci/pcie/aer.c
+++ b/drivers/pci/pcie/aer.c
@@ -533,7 +533,7 @@ static const char *aer_agent_string[] = {
u64 *stats = pdev->aer_stats->stats_array; \
size_t len = 0; \
\
- for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
+ for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
if (strings_array[i]) \
len += sysfs_emit_at(buf, len, "%s %llu\n", \
strings_array[i], \
@@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
struct device *device = &dev->device;
struct pci_dev *port = dev->port;

+ BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
+ AER_MAX_TYPEOF_COR_ERRS);
+ BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
+ AER_MAX_TYPEOF_UNCOR_ERRS);
+
/* Limit to Root Ports or Root Complex Event Collectors */
if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) &&
(pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))
--
2.29.0



2022-05-11 08:22:52

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] PCI/AER: Iterate over error counters instead of error strings

[+cc Rajat]

On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
> PCI AER stats counters sysfs attributes need to iterate over
> stats counters instead of stats names.

Thanks for catching this; it definitely looks like a real issue! I
guess you're probably seeing junk in the sysfs files?

It would be helpful to reviewers if this said *why* we need to iterate
over the counters instead of the names. I think the problem is that
the current code reads past the end of the stats counters.

There are parallel arrays here:

#define AER_MAX_TYPEOF_COR_ERRS 16
#define AER_MAX_TYPEOF_UNCOR_ERRS 27

aer_correctable_error_string[32] # 32
pdev->aer_stats->dev_cor_errs[AER_MAX_TYPEOF_COR_ERRS] # 16
aer_uncorrectable_error_string[32] # 32
pdev->aer_stats->dev_fatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS] # 27
pdev->aer_stats->dev_nonfatal_errs[AER_MAX_TYPEOF_UNCOR_ERRS] # 27

And here's the current use of them:

#define aer_stats_dev_attr(..., stats_array, strings_array, ...)
for (i = 0; i < ARRAY_SIZE(strings_array); i++) {
if (strings_array[i])
sysfs_emit_at(..., strings_array[i], stats[i]); (1)
else if (stats[i])
sysfs_emit_at(..., stats[i]); (2)

aer_stats_dev_attr(..., dev_cor_errs, aer_correctable_error_string,
aer_stats_dev_attr(..., dev_fatal_errs, aer_uncorrectable_error_string,
aer_stats_dev_attr(..., dev_nonfatal_errs, aer_uncorrectable_error_string,

The current loop iterates over 0..31, which is safe at (1) because the
non-NULL strings are at aer_correctable_error_string[0..15] and
aer_uncorrectable_error_string[0..26].

But it is unsafe at (2) because it references dev_cor_errs[16..31],
dev_fatal_errs[27..31], and dev_nonfatal_errs[27..31], which are past
the end of the arrays.

> Also, added a build time check to make sure all counters have
> entries in strings array.
>
> Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")

Yep, I blew it there. Rajat did it correctly when he added this with
81aa5206f9a7 ("PCI/AER: Add sysfs attributes to provide AER stats and
breakdown"), and I broke it by extending the string arrays to 32
entries.

> Cc: [email protected]
> Reported-by: Meeta Saggi <[email protected]>
> Signed-off-by: Mohamed Khalfella <[email protected]>
> Reviewed-by: Meeta Saggi <[email protected]>
> Reviewed-by: Eric Badger <[email protected]>
> ---
> drivers/pci/pcie/aer.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..ce99a6d44786 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = {
> u64 *stats = pdev->aer_stats->stats_array; \
> size_t len = 0; \
> \
> - for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
> + for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
> if (strings_array[i]) \
> len += sysfs_emit_at(buf, len, "%s %llu\n", \
> strings_array[i], \

I think maybe we should populate the currently NULL entries in the
string[] arrays and simplify the code here, e.g.,

static const char *aer_correctable_error_string[] = {
"RxErr", /* Bit Position 0 */
"dev_cor_errs_bit[1]",
...

if (stats[i])
len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);

It's a little more data space, but easier to verify.

> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
> struct device *device = &dev->device;
> struct pci_dev *port = dev->port;
>
> + BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
> + AER_MAX_TYPEOF_COR_ERRS);
> + BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
> + AER_MAX_TYPEOF_UNCOR_ERRS);

And make these check for "!=" instead of "<".

2022-05-11 08:54:00

by Mohamed Khalfella

[permalink] [raw]
Subject: Re: [PATCH] PCI/AER: Iterate over error counters instead of error

> Thanks for catching this; it definitely looks like a real issue! I
> guess you're probably seeing junk in the sysfs files?

That is correct. The initial report was seeing junk when reading sysfs
files. As descibed, this is happening because we reading data past the
end of the stats counters array.


> I think maybe we should populate the currently NULL entries in the
> string[] arrays and simplify the code here, e.g.,
>
> static const char *aer_correctable_error_string[] = {
> "RxErr", /* Bit Position 0 */
> "dev_cor_errs_bit[1]",
> ...
>
> if (stats[i])
> len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);

Doing it this way will change the output format. In this case we will show
stats only if their value is greater than zero. The current code shows all the
stats those have names (regardless of their value) plus those have non-zero
values.

>> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
>> struct device *device = &dev->device;
>> struct pci_dev *port = dev->port;
>>
>> + BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
>> + AER_MAX_TYPEOF_COR_ERRS);
>> + BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
>> + AER_MAX_TYPEOF_UNCOR_ERRS);
>
> And make these check for "!=" instead of "<".

This will require unnecessarily extending stats arrays to have 32 entries
in order to match names arrays. If you don't feel strogly about changing
"<" to "!=", I prefer to keep the code as it is.

2022-06-06 04:20:13

by Mohamed Khalfella

[permalink] [raw]
Subject: Re: [PATCH] PCI/AER: Iterate over error counters instead of error

Is there any chance for this to land in 5.19?

On 5/10/22 14:17, Mohamed Khalfella wrote:
> > Thanks for catching this; it definitely looks like a real issue! I
> > guess you're probably seeing junk in the sysfs files?
>
> That is correct. The initial report was seeing junk when reading sysfs
> files. As descibed, this is happening because we reading data past the
> end of the stats counters array.
>
>
> > I think maybe we should populate the currently NULL entries in the
> > string[] arrays and simplify the code here, e.g.,
> >
> > static const char *aer_correctable_error_string[] = {
> > "RxErr", /* Bit Position 0 */
> > "dev_cor_errs_bit[1]",
> > ...
> >
> > if (stats[i])
> > len += sysfs_emit_at(buf, len, "%s %llu\n", strings_array[i], stats[i]);
>
> Doing it this way will change the output format. In this case we will show
> stats only if their value is greater than zero. The current code shows all the
> stats those have names (regardless of their value) plus those have non-zero
> values.
>
> >> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
> >> struct device *device = &dev->device;
> >> struct pci_dev *port = dev->port;
> >>
> >> + BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
> >> + AER_MAX_TYPEOF_COR_ERRS);
> >> + BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
> >> + AER_MAX_TYPEOF_UNCOR_ERRS);
> >
> > And make these check for "!=" instead of "<".

I am happy to remove these BUILD_BUG_ON() if you think it is a good
idea to do so.

>
> This will require unnecessarily extending stats arrays to have 32 entries
> in order to match names arrays. If you don't feel strogly about changing
> "<" to "!=", I prefer to keep the code as it is.

2022-07-11 23:26:35

by Bjorn Helgaas

[permalink] [raw]
Subject: Re: [PATCH] PCI/AER: Iterate over error counters instead of error strings

On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
> PCI AER stats counters sysfs attributes need to iterate over
> stats counters instead of stats names. Also, added a build
> time check to make sure all counters have entries in strings
> array.
>
> Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")
> Cc: [email protected]
> Reported-by: Meeta Saggi <[email protected]>
> Signed-off-by: Mohamed Khalfella <[email protected]>
> Reviewed-by: Meeta Saggi <[email protected]>
> Reviewed-by: Eric Badger <[email protected]>

I added some info about why we need this to the commit log and applied
to pci/err for v5.20. Thank you!

> ---
> drivers/pci/pcie/aer.c | 7 ++++++-
> 1 file changed, 6 insertions(+), 1 deletion(-)
>
> diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> index 9fa1f97e5b27..ce99a6d44786 100644
> --- a/drivers/pci/pcie/aer.c
> +++ b/drivers/pci/pcie/aer.c
> @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = {
> u64 *stats = pdev->aer_stats->stats_array; \
> size_t len = 0; \
> \
> - for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
> + for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
> if (strings_array[i]) \
> len += sysfs_emit_at(buf, len, "%s %llu\n", \
> strings_array[i], \
> @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
> struct device *device = &dev->device;
> struct pci_dev *port = dev->port;
>
> + BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
> + AER_MAX_TYPEOF_COR_ERRS);
> + BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
> + AER_MAX_TYPEOF_UNCOR_ERRS);
> +
> /* Limit to Root Ports or Root Complex Event Collectors */
> if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) &&
> (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))
> --
> 2.29.0
>

2022-07-11 23:32:53

by Mohamed Khalfella

[permalink] [raw]
Subject: Re: [PATCH] PCI/AER: Iterate over error counters instead of error strings

On 2022-07-11 17:54:37 -0500, Bjorn Helgaas wrote:
> On Mon, May 09, 2022 at 06:14:41PM +0000, Mohamed Khalfella wrote:
> > PCI AER stats counters sysfs attributes need to iterate over
> > stats counters instead of stats names. Also, added a build
> > time check to make sure all counters have entries in strings
> > array.
> >
> > Fixes: 0678e3109a3c ("PCI/AER: Simplify __aer_print_error()")
> > Cc: [email protected]
> > Reported-by: Meeta Saggi <[email protected]>
> > Signed-off-by: Mohamed Khalfella <[email protected]>
> > Reviewed-by: Meeta Saggi <[email protected]>
> > Reviewed-by: Eric Badger <[email protected]>
>
> I added some info about why we need this to the commit log and applied
> to pci/err for v5.20. Thank you!
That is good news! Thank you for helping out.
>
> > ---
> > drivers/pci/pcie/aer.c | 7 ++++++-
> > 1 file changed, 6 insertions(+), 1 deletion(-)
> >
> > diff --git a/drivers/pci/pcie/aer.c b/drivers/pci/pcie/aer.c
> > index 9fa1f97e5b27..ce99a6d44786 100644
> > --- a/drivers/pci/pcie/aer.c
> > +++ b/drivers/pci/pcie/aer.c
> > @@ -533,7 +533,7 @@ static const char *aer_agent_string[] = {
> > u64 *stats = pdev->aer_stats->stats_array; \
> > size_t len = 0; \
> > \
> > - for (i = 0; i < ARRAY_SIZE(strings_array); i++) { \
> > + for (i = 0; i < ARRAY_SIZE(pdev->aer_stats->stats_array); i++) {\
> > if (strings_array[i]) \
> > len += sysfs_emit_at(buf, len, "%s %llu\n", \
> > strings_array[i], \
> > @@ -1342,6 +1342,11 @@ static int aer_probe(struct pcie_device *dev)
> > struct device *device = &dev->device;
> > struct pci_dev *port = dev->port;
> >
> > + BUILD_BUG_ON(ARRAY_SIZE(aer_correctable_error_string) <
> > + AER_MAX_TYPEOF_COR_ERRS);
> > + BUILD_BUG_ON(ARRAY_SIZE(aer_uncorrectable_error_string) <
> > + AER_MAX_TYPEOF_UNCOR_ERRS);
> > +
> > /* Limit to Root Ports or Root Complex Event Collectors */
> > if ((pci_pcie_type(port) != PCI_EXP_TYPE_RC_EC) &&
> > (pci_pcie_type(port) != PCI_EXP_TYPE_ROOT_PORT))
> > --
> > 2.29.0
> >