On a bare-metal Intel platform with DCPMEM installed and configured to
provision daxfs, say a poison was consumed by a load from a user thread,
and then daxfs takes action and clears the poison, confirmed by "ndctl
list -NM".
Now, depending on luck, after some time (from a few seconds to 5+ hours)
the ghost of the previous poison will surface, and it takes unloading
and reloading the libnvdimm drivers to drive the phantom poison
away, confirmed by ARS.
It turns out the issue is quite reproducible with the latest stable Linux.
Here are the relevant console messages after injecting 8 poisons in one
page via
# ndctl inject-error namespace0.0 -n 2 -B 8210
then clearing them all and waiting for 5+ hours; note the timestamps.
BTW, the system is otherwise idle.
[ 2439.742296] mce: Uncorrected hardware memory error in user-access at 1850602400
[ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to fsdax_poison_v1:8457 due to hardware memory corruption
[ 2439.761866] Memory failure: 0x1850602: recovery action for dax page: Recovered
[ 2439.769949] mce: [Hardware Error]: Machine check events logged
-1850603000 uncached-minus<->write-back
[ 2439.769984] x86/PAT: memtype_reserve failed [mem 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
[ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
[ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype [mem 0x1850602000-0x1850602fff]
At this point,
# ndctl list -NMu -r 0
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":"15.75 GiB (16.91 GB)",
"uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
"sector_size":4096,
"align":2097152,
"blockdev":"pmem0"
}
[21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
[21352.001528] {2}[Hardware Error]: event severity: recoverable
[21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
[21352.014156] {2}[Hardware Error]: section_type: memory error
[21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
[21352.027958] {2}[Hardware Error]: physical_address_mask: 0xffffffffffffff00
[21352.035827] {2}[Hardware Error]: node: 0 module: 1
[21352.041466] {2}[Hardware Error]: DIMM location: /SYS/MB/P0 D6
[21352.048277] Memory failure: 0x1850603: recovery action for dax page: Recovered
[21352.056346] mce: [Hardware Error]: Machine check events logged
[21352.056890] EDAC skx MC0: HANDLING MCE MEMORY ERROR
[21352.056892] EDAC skx MC0: CPU 0: Machine Check Event: 0x0 Bank 255: 0xbc0000000000009f
[21352.056894] EDAC skx MC0: TSC 0x0
[21352.056895] EDAC skx MC0: ADDR 0x1850603200
[21352.056897] EDAC skx MC0: MISC 0x8c
[21352.056898] EDAC skx MC0: PROCESSOR 0:0x50656 TIME 1642758243 SOCKET 0 APIC 0x0
[21352.056909] EDAC MC0: 1 UE memory read error on CPU_SrcID#0_MC#0_Chan#0_DIMM#1 (channel:0 slot:1 page:0x1850603 offset:0x200 grain:32 - err_code:0x0000:0x009f [..]
And now,
# ndctl list -NMu -r 0
{
"dev":"namespace0.0",
"mode":"fsdax",
"map":"dev",
"size":"15.75 GiB (16.91 GB)",
"uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
"sector_size":4096,
"align":2097152,
"blockdev":"pmem0",
"badblock_count":1,
"badblocks":[
{
"offset":8217,
"length":1,
"dimms":[
"nmem0"
]
}
]
}
According to my limited research, ghes_proc_in_irq() is fired to
report a delayed UE; it calls memory_failure() to take the page out,
which causes the driver to record a badblock, and that is how the
phantom poison appears.
Note: there is 1 phantom poison for 8 injected poisons, so the phantom
is not an accurate representation of what was injected.
But that aside, it seems that the GHES mechanism and the synchronous MCE
handling are totally at odds with each other, and that cannot be correct.
What is the right thing to do to fix the issue? Should the memory_failure()
handler second-guess the GHES report? Should the synchronous MCE
handling mechanism tell the firmware that a given memory UE
has been cleared, and hence clear the record in firmware? Other ideas?
Thanks!
-jane
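The sequence hypothesized above can be sketched as a toy model. This is
only an illustration of the hypothesis in this message, not kernel code:
the BadblockList class and the three event functions are hypothetical
stand-ins, and the choice of sector 8217 for the late report is taken
from the "badblocks" entry in the ndctl output above.

```python
# Toy model of the suspected ordering problem. Nothing here is a real
# kernel or ndctl interface; all names are hypothetical.

class BadblockList:
    """Stand-in for the driver's list of known-bad 512-byte sectors."""

    def __init__(self):
        self._sectors = set()

    def record(self, sector):
        self._sectors.add(sector)

    def clear(self, sector):
        self._sectors.discard(sector)

    def report(self):
        return sorted(self._sectors)


def inject(bb, first, count):
    # Models "ndctl inject-error -n <count> -B <first>".
    for sector in range(first, first + count):
        bb.record(sector)


def clear_all(bb, first, count):
    # Models the poison being cleared after the synchronous MCE.
    for sector in range(first, first + count):
        bb.clear(sector)


def delayed_firmware_report(bb, sector):
    # Models a late GHES report for an address that was already cleared.
    # The list has no way to know the report is stale, so it re-records it.
    bb.record(sector)


bb = BadblockList()
inject(bb, 8210, 8)
clear_all(bb, 8210, 8)
clean_report = bb.report()          # empty: everything was cleared

delayed_firmware_report(bb, 8217)   # hours later
phantom_report = bb.report()        # the cleared sector is back
print(clean_report, phantom_report)
```

In this model the stale report is indistinguishable from a fresh error,
which is the crux of the question asked above.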
On 1/21/2022 4:31 PM, Jane Chu wrote:
> On baremetal Intel platform with DCPMEM installed and configured to
> provision daxfs, say a poison was consumed by a load from a user thread,
> and then daxfs takes action and clears the poison, confirmed by "ndctl
> -NM".
>
> Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> the ghost of the previous poison will surface, and it takes
> unload/reload the libnvdimm drivers in order to drive the phantom poison
> away, confirmed by ARS.
>
> Turns out, the issue is quite reproducible with the latest stable Linux.
>
> Here is the relevant console message after injected 8 poisons in one
> page via
> # ndctl inject-error namespace0.0 -n 2 -B 8210
There is a cut-n-paste error, the above line should be
"# ndctl inject-error namespace0.0 -n 8 -B 8210"
-jane
On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
> On 1/21/2022 4:31 PM, Jane Chu wrote:
> > On baremetal Intel platform with DCPMEM installed and configured to
> > provision daxfs, say a poison was consumed by a load from a user thread,
> > and then daxfs takes action and clears the poison, confirmed by "ndctl
> > -NM".
> >
> > Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
> > the ghost of the previous poison will surface, and it takes
> > unload/reload the libnvdimm drivers in order to drive the phantom poison
> > away, confirmed by ARS.
> >
> > Turns out, the issue is quite reproducible with the latest stable Linux.
> >
> > Here is the relevant console message after injected 8 poisons in one
> > page via
> > # ndctl inject-error namespace0.0 -n 2 -B 8210
>
> There is a cut-n-paste error, the above line should be
> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
You say "in one page" here. What is the page size?
>
> -jane
>
> > then, cleared them all, and wait for 5+ hours, notice the time stamp.
> > BTW, the system is idle otherwise.
> >
> > [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
> > 1850602400
> > [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
> > fsdax_poison_v1:8457 due to hardware memory corruption
> > [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
> > Recovered
> > [ 2439.769949] mce: [Hardware Error]: Machine check events logged
> > -1850603000 uncached-minus<->write-back
> > [ 2439.769984] x86/PAT: memtype_reserve failed [mem
> > 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
> > [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
> > [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
> > [mem 0x1850602000-0x1850602fff]
This error is reported in PFN=1850602 (at offset 0x400 = 1K)
> >
> > At this point,
> > # ndctl list -NMu -r 0
> > {
> > "dev":"namespace0.0",
> > "mode":"fsdax",
> > "map":"dev",
> > "size":"15.75 GiB (16.91 GB)",
> > "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
> > "sector_size":4096,
> > "align":2097152,
> > "blockdev":"pmem0"
> > }
> >
> > [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
> > Hardware Error Source: 1
> > [21352.001528] {2}[Hardware Error]: event severity: recoverable
> > [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
> > [21352.014156] {2}[Hardware Error]: section_type: memory error
> > [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)
Is that what you mean by "phantom error" ... from a different
address from those that were injected?
-Tony
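The PFN and offset values quoted above follow from splitting each
physical address at the page boundary. A minimal sketch, assuming 4 KiB
pages (PAGE_SHIFT = 12), which is consistent with the PFNs the kernel
printed in the log; split_phys is a hypothetical helper name:

```python
PAGE_SHIFT = 12                      # assumption: 4 KiB base pages
PAGE_MASK = (1 << PAGE_SHIFT) - 1


def split_phys(addr):
    """Return (pfn, offset_in_page) for a physical address."""
    return addr >> PAGE_SHIFT, addr & PAGE_MASK


# Addresses taken from the MCE line and the GHES record in the log.
for addr in (0x1850602400, 0x1850603200):
    pfn, offset = split_phys(addr)
    print(f"{addr:#x}: pfn={pfn:#x} offset={offset:#x} ({offset} bytes)")
```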
Hi Jane,
Is a phantom error a poison that was injected and then cleared, but somehow shows up again?
How does "daxfs takes action and clears the poison" work: by mailbox commands or by writes?
Also how are you doing ARS?
Erwin
-----Original Message-----
From: Luck, Tony <[email protected]>
Sent: Friday, January 21, 2022 5:27 PM
To: chu, jane <[email protected]>
Cc: Williams, Dan J <[email protected]>; [email protected] >> Borislav Petkov <[email protected]>; [email protected]; [email protected]; [email protected]; [email protected]
Subject: Re: Phantom PMEM poison issue
On 1/21/2022 5:27 PM, Luck, Tony wrote:
> On Sat, Jan 22, 2022 at 12:40:18AM +0000, Jane Chu wrote:
>> On 1/21/2022 4:31 PM, Jane Chu wrote:
>>> On baremetal Intel platform with DCPMEM installed and configured to
>>> provision daxfs, say a poison was consumed by a load from a user thread,
>>> and then daxfs takes action and clears the poison, confirmed by "ndctl
>>> -NM".
>>>
>>> Now, depends on the luck, after sometime(from a few seconds to 5+ hours)
>>> the ghost of the previous poison will surface, and it takes
>>> unload/reload the libnvdimm drivers in order to drive the phantom poison
>>> away, confirmed by ARS.
>>>
>>> Turns out, the issue is quite reproducible with the latest stable Linux.
>>>
>>> Here is the relevant console message after injected 8 poisons in one
>>> page via
>>> # ndctl inject-error namespace0.0 -n 2 -B 8210
>>
>> There is a cut-n-paste error, the above line should be
>> "# ndctl inject-error namespace0.0 -n 8 -B 8210"
>
> You say "in one page" here. What is the page size?
The page size is 4K, the base page size on x86.
I said "one page" because 8 (poisons) * 256B = 2KiB, only half a page.
>>
>> -jane
>>
>>> then, cleared them all, and wait for 5+ hours, notice the time stamp.
>>> BTW, the system is idle otherwise.
>>>
>>> [ 2439.742296] mce: Uncorrected hardware memory error in user-access at
>>> 1850602400
>>> [ 2439.742420] Memory failure: 0x1850602: Sending SIGBUS to
>>> fsdax_poison_v1:8457 due to hardware memory corruption
>>> [ 2439.761866] Memory failure: 0x1850602: recovery action for dax page:
>>> Recovered
>>> [ 2439.769949] mce: [Hardware Error]: Machine check events logged
>>> -1850603000 uncached-minus<->write-back
>>> [ 2439.769984] x86/PAT: memtype_reserve failed [mem
>>> 0x1850602000-0x1850602fff], track uncached-minus, req uncached-minus
>>> [ 2439.769985] Could not invalidate pfn=0x1850602 from 1:1 map
>>> [ 2440.856351] x86/PAT: fsdax_poison_v1:8457 freeing invalid memtype
>>> [mem 0x1850602000-0x1850602fff]
>
> This error is reported in PFN=1850602 (at offset 0x400 = 1K)
yes.
>
>>>
>>> At this point,
>>> # ndctl list -NMu -r 0
>>> {
>>> "dev":"namespace0.0",
>>> "mode":"fsdax",
>>> "map":"dev",
>>> "size":"15.75 GiB (16.91 GB)",
>>> "uuid":"2ccc540a-3c7b-4b91-b87b-9e897ad0b9bb",
>>> "sector_size":4096,
>>> "align":2097152,
>>> "blockdev":"pmem0"
>>> }
>>>
>>> [21351.992296] {2}[Hardware Error]: Hardware error from APEI Generic
>>> Hardware Error Source: 1
>>> [21352.001528] {2}[Hardware Error]: event severity: recoverable
>>> [21352.007838] {2}[Hardware Error]: Error 0, type: recoverable
>>> [21352.014156] {2}[Hardware Error]: section_type: memory error
>>> [21352.020572] {2}[Hardware Error]: physical_address: 0x0000001850603200
>
> This error is in the following page: PFN=1850603 (at offset 0x200 = 512b)
>
I see, this is the next page... the issue is reproducible with
a single poison injection.
> Is that what you mean by "phantom error" ... from a different
> address from those that were injected?
All 8 poisons were cleared by the driver via DSM, and verified
by "ndctl -NMu -r 0", that covers every page in the 16GiB /dev/pmem.
It's a phantom because unloading and reloading libnvdimm, followed by a
full ARS scan, confirms the poison isn't there.
thanks,
-jane
>
> -Tony
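For reference, the relation between the injected block numbers and the
two logged physical addresses can be checked with a short sketch. Two
things here are assumptions not stated in this thread: that the -B
argument and the "badblocks" offsets count 512-byte units, and that the
namespace data base (DATA_BASE below) can be inferred by subtracting
block 8210's byte offset from the first MCE address. If either
assumption is wrong, the result does not apply.

```python
from collections import Counter

BLOCK_SIZE = 512                     # assumption: 512-byte block units
PAGE_SHIFT = 12                      # 4 KiB pages, as stated above

# Assumption: block 8210 is the address reported by the first MCE.
DATA_BASE = 0x1850602400 - 8210 * BLOCK_SIZE


def block_to_phys(block):
    """Physical address of a namespace block under the assumptions above."""
    return DATA_BASE + block * BLOCK_SIZE


def blocks_per_page(first, count):
    """Count how many of the blocks fall in each page frame."""
    return Counter(block_to_phys(b) >> PAGE_SHIFT
                   for b in range(first, first + count))


print(hex(DATA_BASE))
print(hex(block_to_phys(8217)))
for pfn, n in sorted(blocks_per_page(8210, 8).items()):
    print(f"pfn {pfn:#x}: {n} block(s)")
```

Under these assumptions, block 8217 maps to the address in the GHES
record, and the eight blocks would straddle PFN 0x1850602 and PFN
0x1850603 rather than sit in a single page. That would also be
consistent with the "offset":8217 badblock shown by ndctl, but it is an
inference from the logs, not something confirmed here.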
On 1/21/2022 5:51 PM, Tsaur, Erwin wrote:
> Hi Jane,
>
> Is a phantom error a poison that was injected and then cleared, but somehow shows up again?
> How does "daxfs takes action and clears the poison" work: by mailbox commands or by writes?
> Also how are you doing ARS?
The phantom shows up as soon as this console message from 'ghes' appears:
[Hardware Error]: Hardware error from APEI Generic Hardware Error Source: 1
The poisons were cleared via pmem_clear_poison().
ARS was run as
"ndctl start-scrub; ndctl wait-scrub -p 30"
thanks,
-jane
>
> Erwin
>