Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
granularity") changed nfit_handle_mce() callback to report badrange for
each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
instead of the hardcoded L1_CACHE_BYTES. However recently on a server
populated with Intel DCPMEM v2 dimms, it appears that
1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
8 poisons.
[29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
[..]
[29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
[29076.627519] mce: [Hardware Error]: Machine check events logged
[29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
[29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
[..]
[29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
[29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
[..]
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":33820770304,
"uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
"sector_size":512,
"align":2097152,
"blockdev":"pmem0",
"badblock_count":8,
"badblocks":[
{
"offset":8208,
"length":8,
"dimms":[
"nmem0"
]
}
]
}
So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
radius and shouldn't be used. More over, as each injected poison is being
reported independently, any alignment under 512-byte appear works:
L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
or 512-byte.
To get around this issue, 512-bytes is chosen as the alignment because
a. it happens to be the badblock granularity,
b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
c. architecture agnostic
Fixes: 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity")
Signed-off-by: Jane Chu <[email protected]>
---
drivers/acpi/nfit/mce.c | 4 ++--
1 file changed, 2 insertions(+), 2 deletions(-)
diff --git a/drivers/acpi/nfit/mce.c b/drivers/acpi/nfit/mce.c
index d48a388b796e..eeacc8eb807f 100644
--- a/drivers/acpi/nfit/mce.c
+++ b/drivers/acpi/nfit/mce.c
@@ -32,7 +32,6 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
*/
mutex_lock(&acpi_desc_lock);
list_for_each_entry(acpi_desc, &acpi_descs, list) {
- unsigned int align = 1UL << MCI_MISC_ADDR_LSB(mce->misc);
struct device *dev = acpi_desc->dev;
int found_match = 0;
@@ -64,7 +63,8 @@ static int nfit_handle_mce(struct notifier_block *nb, unsigned long val,
/* If this fails due to an -ENOMEM, there is little we can do */
nvdimm_bus_add_badrange(acpi_desc->nvdimm_bus,
- ALIGN_DOWN(mce->addr, align), align);
+ ALIGN(mce->addr, SECTOR_SIZE),
+ SECTOR_SIZE);
nvdimm_region_notify(nfit_spa->nd_region,
NVDIMM_REVALIDATE_POISON);
base-commit: e35e5b6f695d241ffb1d223207da58a1fbcdff4b
--
2.18.4
Jane Chu wrote:
> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> granularity") changed nfit_handle_mce() callback to report badrange for
> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
> populated with Intel DCPMEM v2 dimms, it appears that
> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
> 8 poisons.
>
> [29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
> [..]
> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> [29076.627519] mce: [Hardware Error]: Machine check events logged
> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> [..]
> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
> [..]
> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> [..]
> {
> "dev":"namespace0.0",
> "mode":"fsdax",
> "map":"dev",
> "size":33820770304,
> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
> "sector_size":512,
> "align":2097152,
> "blockdev":"pmem0",
> "badblock_count":8,
> "badblocks":[
> {
> "offset":8208,
> "length":8,
> "dimms":[
> "nmem0"
> ]
> }
> ]
> }
>
> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
> radius and shouldn't be used. More over, as each injected poison is being
> reported independently, any alignment under 512-byte appear works:
> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
> or 512-byte.
>
> To get around this issue, 512-bytes is chosen as the alignment because
> a. it happens to be the badblock granularity,
> b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
> c. architecture agnostic
I am failing to see the kernel bug? Yes, you injected less than 8
"badblocks" of poison and the hardware reported 8 blocks of poison, but
that's not the kernel's fault, that's the hardware. What happens when
hardware really does detect 8 blocks of consective poison and this
implementation decides to only record 1 at a time?
It seems the fix you want is for the hardware to report the precise
error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
that precision in this case.
However, the ARS engine likely can return the precise error ranges so I
think the fix is to just use the address range indicated by 1UL <<
MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
scrub request to ask the device for the precise error list.
On 7/12/2022 5:48 PM, Dan Williams wrote:
> Jane Chu wrote:
>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>> granularity") changed nfit_handle_mce() callback to report badrange for
>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>> populated with Intel DCPMEM v2 dimms, it appears that
>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
>> 8 poisons.
>>
>> [29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
>> [..]
>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>> [..]
>> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
>> [..]
>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>> [..]
>> {
>> "dev":"namespace0.0",
>> "mode":"fsdax",
>> "map":"dev",
>> "size":33820770304,
>> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>> "sector_size":512,
>> "align":2097152,
>> "blockdev":"pmem0",
>> "badblock_count":8,
>> "badblocks":[
>> {
>> "offset":8208,
>> "length":8,
>> "dimms":[
>> "nmem0"
>> ]
>> }
>> ]
>> }
>>
>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>> radius and shouldn't be used. More over, as each injected poison is being
>> reported independently, any alignment under 512-byte appear works:
>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>> or 512-byte.
>>
>> To get around this issue, 512-bytes is chosen as the alignment because
>> a. it happens to be the badblock granularity,
>> b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>> c. architecture agnostic
>
> I am failing to see the kernel bug? Yes, you injected less than 8
> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> that's not the kernel's fault, that's the hardware. What happens when
> hardware really does detect 8 blocks of consective poison and this
> implementation decides to only record 1 at a time?
In that case, there will be 8 reports of the poisons by APEI GHES, and
ARC scan will also report 8 poisons, each will get to be added to the
bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
In the above 2 poison example, the poison in 0x00000040a0602400 and in
0x00000040a0602600 were separately reported.
>
> It seems the fix you want is for the hardware to report the precise
> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> that precision in this case.
That field describes a 4K range even for a single poison, it confuses
people unnecessarily.
>
> However, the ARS engine likely can return the precise error ranges so I
> think the fix is to just use the address range indicated by 1UL <<
> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> scrub request to ask the device for the precise error list.
You mean for nfit_handle_mce() callback to issue a short ARS per each
poison report over a 4K range in order to decide the precise range as a
workaround of the hardware issue? if there are 8 poisoned detected,
there will be 8 short ARS, sure we want to do that? also, for now, is
it possible to log more than 1 poison per 512byte block?
thanks!
-jane
Jane Chu wrote:
> On 7/12/2022 5:48 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> >> granularity") changed nfit_handle_mce() callback to report badrange for
> >> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> >> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
> >> populated with Intel DCPMEM v2 dimms, it appears that
> >> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
> >> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
> >> 8 poisons.
> >>
> >> [29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
> >> [..]
> >> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> >> [29076.627519] mce: [Hardware Error]: Machine check events logged
> >> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >> [..]
> >> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
> >> [..]
> >> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >> [..]
> >> {
> >> "dev":"namespace0.0",
> >> "mode":"fsdax",
> >> "map":"dev",
> >> "size":33820770304,
> >> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
> >> "sector_size":512,
> >> "align":2097152,
> >> "blockdev":"pmem0",
> >> "badblock_count":8,
> >> "badblocks":[
> >> {
> >> "offset":8208,
> >> "length":8,
> >> "dimms":[
> >> "nmem0"
> >> ]
> >> }
> >> ]
> >> }
> >>
> >> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
> >> radius and shouldn't be used. More over, as each injected poison is being
> >> reported independently, any alignment under 512-byte appear works:
> >> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
> >> or 512-byte.
> >>
> >> To get around this issue, 512-bytes is chosen as the alignment because
> >> a. it happens to be the badblock granularity,
> >> b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
> >> c. architecture agnostic
> >
> > I am failing to see the kernel bug? Yes, you injected less than 8
> > "badblocks" of poison and the hardware reported 8 blocks of poison, but
> > that's not the kernel's fault, that's the hardware. What happens when
> > hardware really does detect 8 blocks of consective poison and this
> > implementation decides to only record 1 at a time?
>
> In that case, there will be 8 reports of the poisons by APEI GHES,
Why would there be 8 reports for just one poison consumption event?
> ARC scan will also report 8 poisons, each will get to be added to the
> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
Right, that's what I'm saying about the proposed change, trim the
reported poison by what is return from a "short" ARS. Recall that
short-ARS just reads from a staging buffer that the BIOS knows about, it
need not go all the way to hardware.
> In the above 2 poison example, the poison in 0x00000040a0602400 and in
> 0x00000040a0602600 were separately reported.
Separately reported, each with a 4K alignment?
> > It seems the fix you want is for the hardware to report the precise
> > error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> > that precision in this case.
>
> That field describes a 4K range even for a single poison, it confuses
> people unnecessarily.
I agree with you on the problem statement, it's the fix where I have
questions.
> > However, the ARS engine likely can return the precise error ranges so I
> > think the fix is to just use the address range indicated by 1UL <<
> > MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> > scrub request to ask the device for the precise error list.
>
> You mean for nfit_handle_mce() callback to issue a short ARS per each
> poison report over a 4K range
Over a L1_CACHE_BYTES range...
> in order to decide the precise range as a workaround of the hardware
> issue? if there are 8 poisoned detected, there will be 8 short ARS,
> sure we want to do that?
Seems ok to me, short ARS is meant to be cheap. I would hope there are
no latency concerns in this path.
> also, for now, is it possible to log more than 1 poison per 512byte
> block?
For the badrange tracking, no. So this would just be a check to say
"Yes, CPU I see you think the whole 4K is gone, but lets double check
with more precise information for what gets placed in the badrange
tracking".
On 7/13/2022 5:24 PM, Dan Williams wrote:
> Jane Chu wrote:
>> On 7/12/2022 5:48 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
>>>> granularity") changed nfit_handle_mce() callback to report badrange for
>>>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
>>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
>>>> populated with Intel DCPMEM v2 dimms, it appears that
>>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
>>>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
>>>> 8 poisons.
>>>>
>>>> [29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
>>>> [..]
>>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
>>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
>>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
>>>> [..]
>>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
>>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
>>>> [..]
>>>> {
>>>> "dev":"namespace0.0",
>>>> "mode":"fsdax",
>>>> "map":"dev",
>>>> "size":33820770304,
>>>> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
>>>> "sector_size":512,
>>>> "align":2097152,
>>>> "blockdev":"pmem0",
>>>> "badblock_count":8,
>>>> "badblocks":[
>>>> {
>>>> "offset":8208,
>>>> "length":8,
>>>> "dimms":[
>>>> "nmem0"
>>>> ]
>>>> }
>>>> ]
>>>> }
>>>>
>>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
>>>> radius and shouldn't be used. More over, as each injected poison is being
>>>> reported independently, any alignment under 512-byte appear works:
>>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
>>>> or 512-byte.
>>>>
>>>> To get around this issue, 512-bytes is chosen as the alignment because
>>>> a. it happens to be the badblock granularity,
>>>> b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
>>>> c. architecture agnostic
>>>
>>> I am failing to see the kernel bug? Yes, you injected less than 8
>>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
>>> that's not the kernel's fault, that's the hardware. What happens when
>>> hardware really does detect 8 blocks of consective poison and this
>>> implementation decides to only record 1 at a time?
>>
>> In that case, there will be 8 reports of the poisons by APEI GHES,
>
> Why would there be 8 reports for just one poison consumption event?
I meant to say there would be 8 calls to the nfit_handle_mce() callback,
one call for each poison with accurate address.
Also, short ARS would find 2 poisons.
I attached the console output, my annotation is prefixed with "<==".
It is from these information I concluded that no poison will be missed
in terms of reporting.
>
>> ARC scan will also report 8 poisons, each will get to be added to the
>> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
>
> Right, that's what I'm saying about the proposed change, trim the
> reported poison by what is return from a "short" ARS. Recall that
> short-ARS just reads from a staging buffer that the BIOS knows about, it
> need not go all the way to hardware.
Okay, that confirms my understanding of your proposal. More below.
>
>> In the above 2 poison example, the poison in 0x00000040a0602400 and in
>> 0x00000040a0602600 were separately reported.
>
> Separately reported, each with a 4K alignment?
Yes, and so twice nfit_handle_mce() call
nvdimm_bus_add_badrange() with addr: 0x40a0602000, length: 0x1000
complete overlap.
>
>>> It seems the fix you want is for the hardware to report the precise
>>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
>>> that precision in this case.
>>
>> That field describes a 4K range even for a single poison, it confuses
>> people unnecessarily.
>
> I agree with you on the problem statement, it's the fix where I have
> questions.
>
>>> However, the ARS engine likely can return the precise error ranges so I
>>> think the fix is to just use the address range indicated by 1UL <<
>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>> scrub request to ask the device for the precise error list.
>>
>> You mean for nfit_handle_mce() callback to issue a short ARS per each
>> poison report over a 4K range
>
> Over a L1_CACHE_BYTES range...
>
>> in order to decide the precise range as a workaround of the hardware
>> issue? if there are 8 poisoned detected, there will be 8 short ARS,
>> sure we want to do that?
>
> Seems ok to me, short ARS is meant to be cheap. I would hope there are
> no latency concerns in this path.
Yeah, accumulated latency is the concern here. I know the upstream
user call stack has timeout for getting a response for certain
database transaction. And the test folks might inject dozens of
back-to-back poisons to adjacent pages...
>
>> also, for now, is it possible to log more than 1 poison per 512byte
>> block?
>
> For the badrange tracking, no. So this would just be a check to say
> "Yes, CPU I see you think the whole 4K is gone, but lets double check
> with more precise information for what gets placed in the badrange
> tracking".
Okay, process-wise, this is what I am seeing -
- for each poison, nfit_handle_mce() issues a short ARS given (addr,
64bytes)
- and short ARS returns to say that's actually (addr, 256bytes),
- and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes)
anyway.
The precise badrange from short ARS is lost in the process, given the
time spent visiting the BIOS, what's the gain? Could we defer the
precise badrange until there is consumer of the information?
thanks!
-jane
Jane Chu wrote:
> On 7/13/2022 5:24 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> On 7/12/2022 5:48 PM, Dan Williams wrote:
> >>> Jane Chu wrote:
> >>>> Commit 7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison
> >>>> granularity") changed nfit_handle_mce() callback to report badrange for
> >>>> each poison at an alignment indicated by 1ULL << MCI_MISC_ADDR_LSB(mce->misc)
> >>>> instead of the hardcoded L1_CACHE_BYTES. However recently on a server
> >>>> populated with Intel DCPMEM v2 dimms, it appears that
> >>>> 1UL << MCI_MISC_ADDR_LSB(mce->misc) turns out is 4KiB, or 8 512-byte blocks.
> >>>> Consequently, injecting 2 back-to-back poisons via ndctl, and it reports
> >>>> 8 poisons.
> >>>>
> >>>> [29076.590281] {3}[Hardware Error]: physical_address: 0x00000040a0602400
> >>>> [..]
> >>>> [29076.619447] Memory failure: 0x40a0602: recovery action for dax page: Recovered
> >>>> [29076.627519] mce: [Hardware Error]: Machine check events logged
> >>>> [29076.634033] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >>>> [29076.648805] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >>>> [..]
> >>>> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600
> >>>> [..]
> >>>> [29079.595327] nfit ACPI0012:00: addr in SPA 1 (0x4080000000, 0x1f80000000)
> >>>> [29079.610106] nd_bus ndbus0: XXX nvdimm_bus_add_badrange: (0x40a0602000, 0x1000)
> >>>> [..]
> >>>> {
> >>>> "dev":"namespace0.0",
> >>>> "mode":"fsdax",
> >>>> "map":"dev",
> >>>> "size":33820770304,
> >>>> "uuid":"a1b0f07f-747f-40a8-bcd4-de1560a1ef75",
> >>>> "sector_size":512,
> >>>> "align":2097152,
> >>>> "blockdev":"pmem0",
> >>>> "badblock_count":8,
> >>>> "badblocks":[
> >>>> {
> >>>> "offset":8208,
> >>>> "length":8,
> >>>> "dimms":[
> >>>> "nmem0"
> >>>> ]
> >>>> }
> >>>> ]
> >>>> }
> >>>>
> >>>> So, 1UL << MCI_MISC_ADDR_LSB(mce->misc) is an unreliable indicator for poison
> >>>> radius and shouldn't be used. More over, as each injected poison is being
> >>>> reported independently, any alignment under 512-byte appear works:
> >>>> L1_CACHE_BYTES (though inaccurate), or 256-bytes (as ars->length reports),
> >>>> or 512-byte.
> >>>>
> >>>> To get around this issue, 512-bytes is chosen as the alignment because
> >>>> a. it happens to be the badblock granularity,
> >>>> b. ndctl inject-error cannot inject more than one poison to a 512-byte block,
> >>>> c. architecture agnostic
> >>>
> >>> I am failing to see the kernel bug? Yes, you injected less than 8
> >>> "badblocks" of poison and the hardware reported 8 blocks of poison, but
> >>> that's not the kernel's fault, that's the hardware. What happens when
> >>> hardware really does detect 8 blocks of consective poison and this
> >>> implementation decides to only record 1 at a time?
> >>
> >> In that case, there will be 8 reports of the poisons by APEI GHES,
> >
> > Why would there be 8 reports for just one poison consumption event?
>
> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> one call for each poison with accurate address.
>
> Also, short ARS would find 2 poisons.
>
> I attached the console output, my annotation is prefixed with "<==".
>
> It is from these information I concluded that no poison will be missed
> in terms of reporting.
>
> >
> >> ARC scan will also report 8 poisons, each will get to be added to the
> >> bad range via nvdimm_bus_add_badrange(), so none of them will be missed.
> >
> > Right, that's what I'm saying about the proposed change, trim the
> > reported poison by what is return from a "short" ARS. Recall that
> > short-ARS just reads from a staging buffer that the BIOS knows about, it
> > need not go all the way to hardware.
>
> Okay, that confirms my understanding of your proposal. More below.
>
> >
> >> In the above 2 poison example, the poison in 0x00000040a0602400 and in
> >> 0x00000040a0602600 were separately reported.
> >
> > Separately reported, each with a 4K alignment?
>
> Yes, and so twice nfit_handle_mce() call
> nvdimm_bus_add_badrange() with addr: 0x40a0602000, length: 0x1000
> complete overlap.
>
> >
> >>> It seems the fix you want is for the hardware to report the precise
> >>> error bounds and that 1UL << MCI_MISC_ADDR_LSB(mce->misc) does not have
> >>> that precision in this case.
> >>
> >> That field describes a 4K range even for a single poison, it confuses
> >> people unnecessarily.
> >
> > I agree with you on the problem statement, it's the fix where I have
> > questions.
> >
> >>> However, the ARS engine likely can return the precise error ranges so I
> >>> think the fix is to just use the address range indicated by 1UL <<
> >>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
> >>> scrub request to ask the device for the precise error list.
> >>
> >> You mean for nfit_handle_mce() callback to issue a short ARS per each
> >> poison report over a 4K range
> >
> > Over a L1_CACHE_BYTES range...
> >
> >> in order to decide the precise range as a workaround of the hardware
> >> issue? if there are 8 poisoned detected, there will be 8 short ARS,
> >> sure we want to do that?
> >
> > Seems ok to me, short ARS is meant to be cheap. I would hope there are
> > no latency concerns in this path.
>
> Yeah, accumulated latency is the concern here.
An actual concern in practice? I.e. you see the round trip call to the
BIOS violating application latency concerns?
> I know the upstream user call stack has timeout for getting a response
> for certain database transaction.
I would expect that's on the order of seconds, no?
> And the test folks might inject dozens of back-to-back poisons to
> adjacent pages...
I am not sure the extreme scenarios of test folks should be guiding
kernel design, the interfaces need to make sense first, and then the
performance can be fixed up if kernel architecture causes practical
problems.
> >
> >> also, for now, is it possible to log more than 1 poison per 512byte
> >> block?
> >
> > For the badrange tracking, no. So this would just be a check to say
> > "Yes, CPU I see you think the whole 4K is gone, but lets double check
> > with more precise information for what gets placed in the badrange
> > tracking".
>
> Okay, process-wise, this is what I am seeing -
>
> - for each poison, nfit_handle_mce() issues a short ARS given (addr,
> 64bytes)
Why would the short-ARS be performed over a 64-byte span when the MCE
reported a 4K aligned event?
> - and short ARS returns to say that's actually (addr, 256bytes),
> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes)
> anyway.
Right, I am reacting to the fact that the patch is picking 512 as an
arbtitrary blast radius. It's ok to expand the blast radius from
hardware when, for example, recording a 64-byte MCE in badrange which
only understands 512 byte records, but it's not ok to take a 4K MCE and
trim it to 512 bytes without asking hardware for a more precise report.
Recall that the NFIT driver supports platforms that may not offer ARS.
In that case the 4K MCE from the CPU is all that the driver gets and
there is no data source for a more precise answer.
So the ask is to avoid trimming the blast radius of MCE reports unless
and until a short-ARS says otherwise.
> The precise badrange from short ARS is lost in the process, given the
> time spent visiting the BIOS, what's the gain?
Generic support for not under-recording poison on platforms that do not
support ARS.
> Could we defer the precise badrange until there is consumer of the
> information?
Ideally the consumer is immediate and this precise information can make
it to the filesystem which might be able to make a better decision about
what data got clobbered.
See dax_notify_failure() infrastructure currently in linux-next that can
convey poison events to filesystems. That might be a path to start
tracking and reporting precise failure information to address the
constraints of the badrange implementation.
Jane Chu wrote:
> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> one call for each poison with accurate address.
>
> Also, short ARS would find 2 poisons.
>
> I attached the console output, my annotation is prefixed with "<==".
[29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600 <== 2nd poison @ 0x600
[29078.642200] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
is seeing a 256-byte address mask?
Sigh, is this "firmware-first" causing the kernel to get bad information
via the native mechanisms?
I would expect that if this test was truly worried about minimizing BIOS
latency it would disable firmware-first error reporting. I wonder if
that fixes the observed problem?
On 7/14/2022 5:58 PM, Dan Williams wrote:
[..]
>>>
>>>>> However, the ARS engine likely can return the precise error ranges so I
>>>>> think the fix is to just use the address range indicated by 1UL <<
>>>>> MCI_MISC_ADDR_LSB(mce->misc) to filter the results from a short ARS
>>>>> scrub request to ask the device for the precise error list.
>>>>
>>>> You mean for nfit_handle_mce() callback to issue a short ARS per each
>>>> poison report over a 4K range
>>>
>>> Over a L1_CACHE_BYTES range...
>>>
[..]
>>>
>>> For the badrange tracking, no. So this would just be a check to say
>>> "Yes, CPU I see you think the whole 4K is gone, but lets double check
>>> with more precise information for what gets placed in the badrange
>>> tracking".
>>
>> Okay, process-wise, this is what I am seeing -
>>
>> - for each poison, nfit_handle_mce() issues a short ARS given (addr,
>> 64bytes)
>
> Why would the short-ARS be performed over a 64-byte span when the MCE
> reported a 4K aligned event?
Cuz you said so, see above. :) Yes, 4K range as reported by the MCE
makes sense.
>
>> - and short ARS returns to say that's actually (addr, 256bytes),
>> - and then nvdimm_bus_add_badrange() logs the poison in (addr, 512bytes)
>> anyway.
>
> Right, I am reacting to the fact that the patch is picking 512 as an
> arbtitrary blast radius. It's ok to expand the blast radius from
> hardware when, for example, recording a 64-byte MCE in badrange which
> only understands 512 byte records, but it's not ok to take a 4K MCE and
> trim it to 512 bytes without asking hardware for a more precise report.
Agreed.
>
> Recall that the NFIT driver supports platforms that may not offer ARS.
> In that case the 4K MCE from the CPU is all that the driver gets and
> there is no data source for a more precise answer.
>
> So the ask is to avoid trimming the blast radius of MCE reports unless
> and until a short-ARS says otherwise.
>
What happens to short ARS on a platform that doesn't support ARS?
-EOPNOTSUPPORTED ?
>> The precise badrange from short ARS is lost in the process, given the
>> time spent visiting the BIOS, what's the gain?
>
> Generic support for not under-recording poison on platforms that do not
> support ARS.
>
>> Could we defer the precise badrange until there is consumer of the
>> information?
>
> Ideally the consumer is immediate and this precise information can make
> it to the filesystem which might be able to make a better decision about
> what data got clobbered.
>
> See dax_notify_failure() infrastructure currently in linux-next that can
> convey poison events to filesystems. That might be a path to start
> tracking and reporting precise failure information to address the
> constraints of the badrange implementation.
Yes, I'm aware of dax_notify_failure(), but would appreciate if you
don't mind to elaborate on how the code path could be leveraged for
precise badrange implementation.
My understanding is that dax_notify_failure() is in the path of
synchronous fault accompanied by SIGBUS with BUS_MCEERR_AR.
But badrange could be recorded without poison being consumed, even
without DAX filesystem in the picture.
thanks,
-jane
On 7/14/2022 6:19 PM, Dan Williams wrote:
> Jane Chu wrote:
>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>> one call for each poison with accurate address.
>>
>> Also, short ARS would find 2 poisons.
>>
>> I attached the console output, my annotation is prefixed with "<==".
>
> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600 <== 2nd poison @ 0x600
> [29078.642200] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
>
> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> is seeing a 256-byte address mask?
Good question! One would think both GHES reporting and
nfit_handle_mce() are consuming the same mce record...
Who might know?
>
> Sigh, is this "firmware-first" causing the kernel to get bad information
> via the native mechanisms >
> I would expect that if this test was truly worried about minimizing BIOS
> latency it would disable firmware-first error reporting. I wonder if
> that fixes the observed problem?
Could you elaborate on firmware-first error please? What are the
possible consequences disabling it? and how to disable it?
thanks!
-jane
[ add Tony ]
Jane Chu wrote:
> On 7/14/2022 6:19 PM, Dan Williams wrote:
> > Jane Chu wrote:
> >> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
> >> one call for each poison with accurate address.
> >>
> >> Also, short ARS would find 2 poisons.
> >>
> >> I attached the console output, my annotation is prefixed with "<==".
> >
> > [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600 <== 2nd poison @ 0x600
> > [29078.642200] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
> >
> > Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
> > is seeing a 256-byte address mask?
>
> Good question! One would think both GHES reporting and
> nfit_handle_mce() are consuming the same mce record...
> Who might know?
Did some grepping...
Have a look at: apei_mce_report_mem_error()
"The call is coming from inside the house!"
Luckily we do not need to contact a BIOS engineer to get this fixed.
> > Sigh, is this "firmware-first" causing the kernel to get bad information
> > via the native mechanisms >
> > I would expect that if this test was truly worried about minimizing BIOS
> > latency it would disable firmware-first error reporting. I wonder if
> > that fixes the observed problem?
>
> Could you elaborate on firmware-first error please? What are the
> possible consequences disabling it? and how to disable it?
With my Linux kernel developer hat on, firmware-first error handling is
really only useful for supporting legacy operating systems that do not
have native machine check handling, or for platforms that have bugs that
would otherwise cause OS native error handling to fail. Otherwise, for
modern Linux, firmware-first error handling is pure overhead and a
source of bugs.
In this case the bug is in the Linux code that translates the ACPI event
back into an MCE record.
On 7/15/2022 12:17 PM, Dan Williams wrote:
> [ add Tony ]
>
> Jane Chu wrote:
>> On 7/14/2022 6:19 PM, Dan Williams wrote:
>>> Jane Chu wrote:
>>>> I meant to say there would be 8 calls to the nfit_handle_mce() callback,
>>>> one call for each poison with accurate address.
>>>>
>>>> Also, short ARS would find 2 poisons.
>>>>
>>>> I attached the console output, my annotation is prefixed with "<==".
>>>
>>> [29078.634817] {4}[Hardware Error]: physical_address: 0x00000040a0602600 <== 2nd poison @ 0x600
>>> [29078.642200] {4}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
>>>
>>> Why is nfit_handle_mce() seeing a 4K address mask when the CPER record
>>> is seeing a 256-byte address mask?
>>
>> Good question! One would think both GHES reporting and
>> nfit_handle_mce() are consuming the same mce record...
>> Who might know?
>
> Did some grepping...
>
> Have a look at: apei_mce_report_mem_error()
>
> "The call is coming from inside the house!"
>
> Luckily we do not need to contact a BIOS engineer to get this fixed.
Great, thank you!
Just put together a quick fix for review after I tested it.
>
>>> Sigh, is this "firmware-first" causing the kernel to get bad information
>>> via the native mechanisms >
>>> I would expect that if this test was truly worried about minimizing BIOS
>>> latency it would disable firmware-first error reporting. I wonder if
>>> that fixes the observed problem?
>>
>> Could you elaborate on firmware-first error please? What are the
>> possible consequences disabling it? and how to disable it?
>
> With my Linux kernel developer hat on, firmware-first error handling is
> really only useful for supporting legacy operating systems that do not
> have native machine check handling, or for platforms that have bugs that
> would otherwise cause OS native error handling to fail. Otherwise, for
> modern Linux, firmware-first error handling is pure overhead and a
> source of bugs.
>
> In this case the bug is in the Linux code that translates the ACPI event
> back into an MCE record.
Thanks!
-jane
The following commit has been merged into the ras/core branch of tip:
Commit-ID: f9781bb18ed828e7b83b7bac4a4ad7cd497ee7d7
Gitweb: https://git.kernel.org/tip/f9781bb18ed828e7b83b7bac4a4ad7cd497ee7d7
Author: Jane Chu <[email protected]>
AuthorDate: Fri, 26 Aug 2022 17:38:51 -06:00
Committer: Borislav Petkov <[email protected]>
CommitterDate: Mon, 29 Aug 2022 09:33:42 +02:00
x86/mce: Retrieve poison range from hardware
When memory poison consumption machine checks fire, MCE notifier
handlers like nfit_handle_mce() record the impacted physical address
range which is reported by the hardware in the MCi_MISC MSR. The error
information includes data about blast radius, i.e. how many cachelines
did the hardware determine are impacted. A recent change
7917f9cdb503 ("acpi/nfit: rely on mce->misc to determine poison granularity")
updated nfit_handle_mce() to stop hard coding the blast radius value of
1 cacheline, and instead rely on the blast radius reported in 'struct
mce' which can be up to 4K (64 cachelines).
It turns out that apei_mce_report_mem_error() had a similar problem in
that it hard coded a blast radius of 4K rather than reading the blast
radius from the error information. Fix apei_mce_report_mem_error() to
convey the proper poison granularity.
Signed-off-by: Jane Chu <[email protected]>
Signed-off-by: Borislav Petkov <[email protected]>
Reviewed-by: Dan Williams <[email protected]>
Reviewed-by: Ingo Molnar <[email protected]>
Link: https://lore.kernel.org/r/[email protected]
Link: https://lore.kernel.org/r/[email protected]
---
arch/x86/kernel/cpu/mce/apei.c | 13 ++++++++++++-
1 file changed, 12 insertions(+), 1 deletion(-)
diff --git a/arch/x86/kernel/cpu/mce/apei.c b/arch/x86/kernel/cpu/mce/apei.c
index 7171929..8ed3417 100644
--- a/arch/x86/kernel/cpu/mce/apei.c
+++ b/arch/x86/kernel/cpu/mce/apei.c
@@ -29,15 +29,26 @@
void apei_mce_report_mem_error(int severity, struct cper_sec_mem_err *mem_err)
{
struct mce m;
+ int lsb;
if (!(mem_err->validation_bits & CPER_MEM_VALID_PA))
return;
+ /*
+ * Even if the ->validation_bits are set for address mask,
+ * to be extra safe, check and reject an error radius '0',
+ * and fall back to the default page size.
+ */
+ if (mem_err->validation_bits & CPER_MEM_VALID_PA_MASK)
+ lsb = find_first_bit((void *)&mem_err->physical_addr_mask, PAGE_SHIFT);
+ else
+ lsb = PAGE_SHIFT;
+
mce_setup(&m);
m.bank = -1;
/* Fake a memory read error with unknown channel */
m.status = MCI_STATUS_VAL | MCI_STATUS_EN | MCI_STATUS_ADDRV | MCI_STATUS_MISCV | 0x9f;
- m.misc = (MCI_MISC_ADDR_PHYS << 6) | PAGE_SHIFT;
+ m.misc = (MCI_MISC_ADDR_PHYS << 6) | lsb;
if (severity >= GHES_SEV_RECOVERABLE)
m.status |= MCI_STATUS_UC;