2013-04-11 03:28:34

by Mitsuhiro Tanino

[permalink] [raw]
Subject: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi All,
Please find a patch set that introduces the following new sysctl
interfaces for handling the case where a memory error is detected on a
dirty page cache.

- vm.memory_failure_dirty_panic
- vm.memory_failure_print_ratelimit
- vm.memory_failure_print_ratelimit_burst

Problem
---------
These days it is common for enterprise servers, especially in cloud
environments, to carry large amounts of memory. This increases the
probability of memory failures.

To handle memory failures, Linux has the hwpoison feature. When a memory
error is detected by the memory scrubber, the error is reported to the OS
as an uncorrected recoverable (UCR) machine check. The hwpoison code then
handles the error according to the type of the affected memory region,
such as kernel memory, dirty page cache, or clean page cache. If the
region can be isolated, the page is marked "hwpoison" and is never used
again.

When an SRAO machine check is reported on a page that holds dirty page
cache, the page is truncated, because the memory is corrupted and the
data in the page can no longer be written back to disk.

As a result, if the dirty cache contained user data, that data is lost,
and data corruption occurs if an application goes on to use the old
on-disk data.



Solution
---------
This patch set proposes a new sysctl interface, vm.memory_failure_dirty_panic,
to prevent the data corruption that follows from such data loss.
It also prints information about the affected file, such as the device
name, inode number, file offset, and file type, when the corrupted page
is dirty page cache backing a mapped file.

When an SRAO machine check hits a dirty page cache, the corresponding
data can no longer be recovered. The patch set therefore adds a kernel
option to either keep the system running or force a system panic, in
order to avoid further trouble such as application-level data corruption.

The system administrator can select the error action with this option
according to the characteristics of the target system.
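
For illustration, assuming the sysctl name proposed above, an
administrator would select the panic action like this (a sketch; the
knob only exists with this patch set applied):

```shell
# Runtime: panic when a memory error hits dirty page cache
sysctl -w vm.memory_failure_dirty_panic=1

# Persistent across reboots: append to /etc/sysctl.conf and reload
echo 'vm.memory_failure_dirty_panic = 1' >> /etc/sysctl.conf
sysctl -p
```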



Use Case
---------
This option is intended for KVM guests: Linux running as a KVM guest
typically operates a customer's business workload, so losing or
corrupting customer data through a memory failure has a large impact.

On the other hand, we do not recommend applying this option to a KVM
host, for the following reasons.

- Panicking a KVM host has a large impact, because all virtual guests
are affected by the host panic: they are forced to stop and have to be
restarted on another hypervisor.

- If qemu's disk cache mode is set to "none", the guests' I/O is done
with O_DIRECT and the KVM host does not cache guest disk I/O. In that
case, if an SRAO machine check is reported on a dirty page cache in the
KVM host, the virtual machines are not affected by it, so the host
should keep operating rather than panic.


Past discussion
--------------------
This problem was previously discussed in the kernel community; see the
mail threads around
http://marc.info/?l=linux-kernel&m=135187403804934&w=4.

> > - I worry that if a hardware error occurs, it might affect a large
> > amount of memory all at the same time. For example, if a 4G memory
> > block goes bad, this message will be printed a million times?


As Andrew pointed out in the thread above, if a 4GB memory block goes
bad, the error message would be printed a million times, which hurts
system reliability.

Therefore, the second patch introduces two sysctl parameters for the
__ratelimit() call used in mce_notify_irq(), which notifies the system
administrator of machine check events. By using __ratelimit(), the
patch can limit the number of messages output per interval to syslog
or the console.

If the system administrator needs to limit the number of messages,
the following parameters are available.

- vm.memory_failure_print_ratelimit:
Specifies the minimum length of time between messages.
By default the rate limiting is disabled.

- vm.memory_failure_print_ratelimit_burst:
Specifies the number of messages we can send before rate limiting.



Test Results
---------
These patches were tested on a 3.8.1 kernel (FC18) using software pseudo
MCE injection from a KVM host into a guest.


******** Host OS Screen logs(SRAO Machine Check injection) ********
Inject a software pseudo MCE into the guest's qemu process.

(1) Load mce-inject module
# modprobe mce-inject

(2) Find the PID of the target qemu-kvm process and a page struct
# ps -C qemu-kvm -o pid=
8176

(3) Edit the software pseudo MCE data
Choose an offset of a page struct and insert the offset into the ADDR line of mce-file.

# ./page-types -p 8176 -LN | grep "___UDlA____Ma_b___________________"
voffset offset flags
...
7fd25eb77 344d77 ___UDlA____Ma_b___________________
7fd25eb78 344d78 ___UDlA____Ma_b___________________
7fd25eb79 344d79 ___UDlA____Ma_b___________________
7fd25eb7a 344d7a ___UDlA____Ma_b___________________
7fd25eb7b 344d7b ___UDlA____Ma_b___________________
7fd25eb7c 344d7c ___UDlA____Ma_b___________________
7fd25eb7d 344d7d ___UDlA____Ma_b___________________
...

# vi mce-file
CPU 0 BANK 2
STATUS UNCORRECTED SRAO 0x17a
MCGSTATUS MCIP RIPV
MISC 0x8c
ADDR 0x344d77000
EOF

(4) Inject MCE
# mce-inject mce-file

Repeat steps (3) and (4) a couple of times.




******** Guest OS Screen logs(kdump) ********
Receive MCE from KVM host

(1) Set the vm.memory_failure_dirty_panic parameter to 1.

(2) When the guest catches the MCE injected from qemu and the MCE hits
a dirty page cache, the hwpoison dirty cache handler prints information
about the affected file, such as the device name, inode number, offset,
and file type. The system then panics.

ex.
-------------
[root@host /]# sysctl -a | grep memory_failure
vm.memory_failure_dirty_panic = 1
vm.memory_failure_early_kill = 0
vm.memory_failure_recovery = 1
[root@host /]#
[ 517.975220] MCE 0x326e6: clean LRU page recovery: Recovered
[ 521.969218] MCE 0x34df8: clean LRU page recovery: Recovered
[ 525.769171] MCE 0x37509: corrupted page was clean: dropped without side effects
[ 525.771070] MCE 0x37509: clean LRU page recovery: Recovered
[ 529.969246] MCE 0x39c18: File was corrupted: Dev:vda3 Inode:808998 Offset:6561
[ 529.969995] Kernel panic - not syncing: MCE 0x39c18: Force a panic because of dirty page cache was corrupted : File type:0x81a4
[ 529.969995]
[ 529.970055] Pid: 245, comm: kworker/0:2 Tainted: G M 3.8.1 #22
[ 529.970055] Call Trace:
[ 529.970055] [<ffffffff81645d1e>] panic+0xc1/0x1d0
[ 529.970055] [<ffffffff811991fa>] me_pagecache_dirty+0xda/0x1a0
[ 529.970055] [<ffffffff8119a0ab>] memory_failure+0x4eb/0xca0
[ 529.970055] [<ffffffff8102f34d>] mce_process_work+0x3d/0x60
[ 529.970055] [<ffffffff8107a5f7>] process_one_work+0x147/0x490
[ 529.970055] [<ffffffff8102f310>] ? mce_schedule_work+0x50/0x50
[ 529.970055] [<ffffffff8107ce8e>] worker_thread+0x15e/0x450
[ 529.970055] [<ffffffff8107cd30>] ? busy_worker_rebind_fn+0x110/0x110
[ 529.970055] [<ffffffff81081f50>] kthread+0xc0/0xd0
[ 529.970055] [<ffffffff81010000>] ? ftrace_define_fields_xen_mc_entry+0xa0/0xf0
[ 529.970055] [<ffffffff81081e90>] ? kthread_create_on_node+0x120/0x120
[ 529.970055] [<ffffffff81657cec>] ret_from_fork+0x7c/0xb0
[ 529.970055] [<ffffffff81081e90>] ? kthread_create_on_node+0x120/0x120
-------------

(3) When many MCEs occur
If many MCEs occur within a short interval, excess error messages are
suppressed by __ratelimit(), with the following message.

[ 414.815303] me_pagecache_dirty: 3 callbacks suppressed

ex.
-------------
[root@host /]# sysctl -a | grep memory_failure
vm.memory_failure_dirty_panic = 0
vm.memory_failure_early_kill = 0
vm.memory_failure_print_ratelimit = 30
vm.memory_failure_print_ratelimit_burst = 2
vm.memory_failure_recovery = 1

[root@host /]#

[ 181.565534] MCE 0xc38c: File was corrupted: Dev:vda3 Inode:808998 Offset:9566
[ 181.566310] MCE 0xc38c: dirty LRU page recovery: Recovered
[ 183.525425] MCE 0xc45a: Unknown page state
[ 183.527225] MCE 0xc45a: unknown page state page recovery: Failed
[ 183.527907] MCE 0xc45a: unknown page state page still referenced by -1 users
[ 185.000329] MCE 0xc524: dirty LRU page recovery: Recovered
[ 186.065231] MCE 0xc5ef: dirty LRU page recovery: Recovered
[ 188.054096] MCE 0xc6ba: clean LRU page recovery: Recovered
[ 189.565275] MCE 0xc783: clean LRU page recovery: Recovered
[ 191.692628] MCE 0xc84c: clean LRU page recovery: Recovered
[ 193.000257] MCE 0xc91d: File was corrupted: Dev:vda3 Inode:808998 Offset:6201
[ 193.001222] MCE 0xc91d: dirty LRU page recovery: Recovered
[ 194.065314] MCE 0xc9e6: dirty LRU page recovery: Recovered
[ 195.711211] MCE 0xcaaf: clean LRU page recovery: Recovered
[ 197.565339] MCE 0xcb78: dirty LRU page recovery: Recovered
[ 200.054177] MCE 0xcc41: dirty LRU page recovery: Recovered
[ 201.000272] MCE 0xcd0a: clean LRU page recovery: Recovered
[ 204.054109] MCE 0xcdd3: clean LRU page recovery: Recovered
[ 205.283189] MCE 0xcf65: clean LRU page recovery: Recovered
[ 207.110339] MCE 0xd02f: Unknown page state
[ 207.110787] MCE 0xd02f: unknown page state page recovery: Failed
[ 207.111427] MCE 0xd02f: unknown page state page still referenced by -1 users
[ 209.000134] MCE 0xd0f9: dirty LRU page recovery: Recovered
[ 210.106360] MCE 0xd1c5: dirty LRU page recovery: Recovered
[ 211.796333] me_pagecache_dirty: 3 callbacks suppressed
[ 211.796961] MCE 0xd296: File was corrupted: Dev:vda3 Inode:808998 Offset:9320
[ 211.798091] MCE 0xd296: dirty LRU page recovery: Recovered
[ 213.565288] MCE 0xd35f: clean LRU page recovery: Recovered
-------------


2013-04-11 03:53:39

by Simon Jeons

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi Mitsuhiro,
On 04/11/2013 11:26 AM, Mitsuhiro Tanino wrote:
> Hi All,
> Please find a patch set that introduces these new sysctl interfaces,
> to handle a case when an memory error is detected on dirty page cache.
>
> - vm.memory_failure_dirty_panic
> - vm.memory_failure_print_ratelimit
> - vm.memory_failure_print_ratelimit_burst
>
> Problem
> ---------
> Recently, it is common that enterprise servers likely have a large
> amount of memory, especially for cloud environment. This means that
> possibility of memory failures is increased.
>
> To handle memory failure, Linux has a hwpoison feature. When a memory
> error is detected by memory scrub, the error is reported as machine
> check, uncorrected recoverable (UCR), to OS. Then OS isolates the memory
> region with memory failure if the memory page can be isolated.
> The hwpoison handles it according to the memory region, such as kernel,
> dirty cache, clean cache. If the memory region can be isolated, the
> page is marked "hwpoison" and it is not used again.
>
> When SRAO machine check is reported on a page which is included dirty
> page cache, the page is truncated because the memory is corrupted and
> data of the page cannot be written to a disk any more.
>
> As a result, if the dirty cache includes user data, the data is lost,
> and data corruption occurs if an application uses old data.

One question about MCE rather than the patch set. ;-)

When is the memory checked for errors? Before a memory access? Is there
a process that scans it periodically?


2013-04-11 07:11:57

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi Tanino-san,

On Thu, Apr 11, 2013 at 12:26:19PM +0900, Mitsuhiro Tanino wrote:
...
> Solution
> ---------
> The patch proposes a new sysctl interface, vm.memory_failure_dirty_panic,
> in order to prevent data corruption comes from data lost problem.
> Also this patch displays information of affected file such as device name,
> inode number, file offset and file type if the file is mapped on a memory
> and the page is dirty cache.
>
> When SRAO machine check occurs on a dirty page cache, corresponding
> data cannot be recovered any more. Therefore, the patch proposes a kernel
> option to keep a system running or force system panic in order
> to avoid further trouble such as data corruption problem of application.
>
> System administrator can select an error action using this option
> according to characteristics of target system.

Can we do this in userspace?
mcelog can trigger scripts when an MCE matching user-configurable
conditions happens, so I think we can trigger a kernel panic by checking
the kernel messages from the triggered script.
For that purpose, I recently fixed the dirty/clean messaging in commit
ff604cf6d4 "mm: hwpoison: fix action_result() to print out dirty/clean".

>
> Use Case
> ---------
> This option is intended to be adopted in KVM guest because it is
> supposed that Linux on KVM guest operates customers business and
> it is big impact to lost or corrupt customers data by memory failure.
>
> On the other hand, this option does not recommend to apply KVM host
> as following reasons.
>
> - Making KVM host panic has a big impact because all virtual guests are
> affected by their host panic. Affected virtual guests are forced to stop
> and have to be restarted on the other hypervisor.

In this reasoning, you seem to assume that important (business) data is
only handled in the guest OS. That's true in most cases, but not always.
I think the more general approach for this use case is to trigger a
kernel panic if a memory error hits a dirty pagecache used by an
'important' process (for example via a process flag controlled by
prctl()), and to set that flag on the qemu processes.

> - If disk cached model of qemu is set to "none", I/O type of virtual
> guests becomes O_DIRECT and KVM host does not cache guest's disk I/O.
> Therefore, if SRAO machine check is reported on a dirty page cache
> in KVM host, its virtual machines are not affected by the machine check.
> So the host is expected to keep operating instead of kernel panic.

What if there are multiple guests, some with "none" cache mode and
others with other cache types?
I think we need more flexible settings for this use case.

>
> Past discussion
> --------------------
> This problem was previously discussed in the kernel community,
> (refer: mail threads pertaining to
> http://marc.info/?l=linux-kernel&m=135187403804934&w=4).
>
> > > - I worry that if a hardware error occurs, it might affect a large
> > > amount of memory all at the same time. For example, if a 4G memory
> > > block goes bad, this message will be printed a million times?
>
> As Andrew mentioned in the above threads, if 4GB memory blocks goes bad,
> error messages will be printed a million times and this behavior loses
> a system reliability.

Maybe "4G memory block goes bad" is not a MCE SRAO but a MCE with higher
severity, so we have no choice but to make kernel panic.

Thanks,
Naoya Horiguchi

2013-04-11 12:51:48

by Mitsuhiro Tanino

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

(2013/04/11 12:53), Simon Jeons wrote:
> One question against mce instead of the patchset. ;-)
>
> When check memory is bad? Before memory access? Is there a process scan it period?

Hi Simon-san,

Yes, there is a process that scans memory periodically.

Intel Nehalem-EX and later generations of CPUs support MCA recovery.
MCA recovery provides error detection and isolation features that work
together with the OS. One of the MCA recovery features is memory
scrubbing, which periodically checks memory in the background of the OS.

If memory scrubbing finds an uncorrectable error in memory before the
OS accesses the affected bits, MCA recovery reports an SRAO error to the
OS, and the OS handles the SRAO error with the hwpoison function.

Regards,
Mitsuhiro Tanino

2013-04-11 13:00:54

by Ric Mason

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi Mitsuhiro,
On 04/11/2013 08:51 PM, Mitsuhiro Tanino wrote:
> (2013/04/11 12:53), Simon Jeons wrote:
>> One question against mce instead of the patchset. ;-)
>>
>> When check memory is bad? Before memory access? Is there a process scan it period?
> Hi Simon-san,
>
> Yes, there is a process to scan memory periodically.
>
> At Intel Nehalem-EX and CPUs after Nehalem-EX generation, MCA recovery
> is supported. MCA recovery provides error detection and isolation
> features to work together with OS.
> One of the MCA Recovery features is Memory Scrubbing. It periodically
> checks memory in the background of OS.

Is memory scrubbing a kernel thread? Where is the code for memory scrubbing?

>
> If Memory Scrubbing find an uncorrectable error on a memory before
> OS accesses the memory bit, MCA recovery notifies SRAO error into OS

It may not find a memory error in time if it happens to be idle when the
error occurs. Can that case happen?

> and OS handles the SRAO error using hwpoison function.

2013-04-11 13:49:19

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

> As a result, if the dirty cache includes user data, the data is lost,
> and data corruption occurs if an application uses old data.

The application cannot use the old data; the kernel kills it if it
would do that. And if it's IO data, an EIO is triggered.

iirc the only concern in the past was that the application may miss
the asynchronous EIO because it's cleared on any fd access.

This is a general problem not specific to memory error handling,
as these asynchronous IO errors can happen due to other reason
(bad disk etc.)

If you're really concerned about this case, I think the solution is to
make the EIO more sticky so that there is a higher chance that it gets
returned. This will make your data much safer, as it will cover all
kinds of IO errors, not just the obscure memory errors.

Or maybe have a panic knob for any IO error, for the case where you
don't trust your applications to check IO syscalls. But I would rather
have better EIO reporting than just give up like this.

The problem with tying this to just any dirty data for memory errors is
that most anonymous data is dirty, and it doesn't have this problem at
all (the signals handle it and cannot be lost).

And that is a far more common case than this relatively unlikely case
of dirty IO data.

So just doing it for "dirty" is not the right knob.

Basically I'm saying if you worry about unreliable IO error reporting
fix IO error reporting, don't add random unnecessary panics to
the memory error handling.

BTW, my suspicion is that if you approach this from a data-driven
perspective, that is, measure how much such dirty data is typically
around in comparison to other data, it will turn out to be rare. Such
a study can be done with the "page-types" program in tools/vm.
-Andi

2013-04-11 15:15:32

by KOSAKI Motohiro

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

(4/10/13 11:26 PM), Mitsuhiro Tanino wrote:
> Hi All,
> Please find a patch set that introduces these new sysctl interfaces,
> to handle a case when an memory error is detected on dirty page cache.
>
> - vm.memory_failure_dirty_panic

A panic knob is OK with me. However, I agree with Andi: if we need a
panic knob, it should handle generic IO errors and data loss.


> - vm.memory_failure_print_ratelimit
> - vm.memory_failure_print_ratelimit_burst

But this is totally silly. print_ratelimit might omit important
messages. Please do this the right way.

2013-04-11 15:23:17

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On Thu, Apr 11, 2013 at 03:49:16PM +0200, Andi Kleen wrote:
> > As a result, if the dirty cache includes user data, the data is lost,
> > and data corruption occurs if an application uses old data.
>
> The application cannot use old data, the kernel code kills it if it
> would do that. And if it's IO data there is an EIO triggered.
>
> iirc the only concern in the past was that the application may miss
> the asynchronous EIO because it's cleared on any fd access.
>
> This is a general problem not specific to memory error handling,
> as these asynchronous IO errors can happen due to other reason
> (bad disk etc.)
>
> If you're really concerned about this case I think the solution
> is to make the EIO more sticky so that there is a higher chance
> than it gets returned. This will make your data much more safe,
> as it will cover all kinds of IO errors, not just the obscure memory
> errors.

I'm interested in this topic. In a previous discussion, what I was told
is that we can't expect user applications to change their behavior when
they get EIO, so globally changing EIO's stickiness is not a great
approach. I'm working on a new pagecache-tag-based mechanism to solve
this, but it needs time and more discussion.
So I guess Tanino-san is suggesting giving up on dirty pagecache errors
as a quick solution.

Thanks,
Naoya

2013-04-11 18:10:09

by Andi Kleen

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On Thu, Apr 11, 2013 at 11:23:08AM -0400, Naoya Horiguchi wrote:
> On Thu, Apr 11, 2013 at 03:49:16PM +0200, Andi Kleen wrote:
> > > As a result, if the dirty cache includes user data, the data is lost,
> > > and data corruption occurs if an application uses old data.
> >
> > The application cannot use old data, the kernel code kills it if it
> > would do that. And if it's IO data there is an EIO triggered.
> >
> > iirc the only concern in the past was that the application may miss
> > the asynchronous EIO because it's cleared on any fd access.
> >
> > This is a general problem not specific to memory error handling,
> > as these asynchronous IO errors can happen due to other reason
> > (bad disk etc.)
> >
> > If you're really concerned about this case I think the solution
> > is to make the EIO more sticky so that there is a higher chance
> > than it gets returned. This will make your data much more safe,
> > as it will cover all kinds of IO errors, not just the obscure memory
> > errors.
>
> I'm interested in this topic, and in previous discussion, what I was said
> is that we can't expect user applications to change their behaviors when
> they get EIO, so globally changing EIO's stickiness is not a great approach.

Not sure. Some of the current behavior may be dubious and it may be
possible to change it. But that would need more analysis.

I don't think we're that concerned about "correct" applications, but
more about ones that do not check everything. So returning more errors
should be safer.

For example, you could have a sysctl that makes IO errors always
sticky, so that the file keeps erroring until it is closed.

> I'm working on a new pagecache tag based mechanism to solve this.
> But it needs time and more discussions.
> So I guess Tanino-san suggests giving up on dirty pagecache errors
> as a quick solution.

A quick solution would be enabling panic for any asynchronous IO error.
I don't think the memory error code is the right point to hook into.

-Andi

2013-04-12 13:24:53

by Mitsuhiro Tanino

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

(2013/04/11 16:11), Naoya Horiguchi wrote:
> Hi Tanino-san,
>
> On Thu, Apr 11, 2013 at 12:26:19PM +0900, Mitsuhiro Tanino wrote:
> ...
>> Solution
>> ---------
>> The patch proposes a new sysctl interface, vm.memory_failure_dirty_panic,
>> in order to prevent data corruption that comes from the data loss problem.
>> Also this patch displays information about the affected file, such as device name,
>> inode number, file offset and file type, if the file is mapped in memory
>> and the page is a dirty cache page.
>>
>> When an SRAO machine check occurs on a dirty page cache, the corresponding
>> data cannot be recovered any more. Therefore, the patch proposes a kernel
>> option to keep the system running or force a system panic in order
>> to avoid further trouble such as application data corruption.
>>
>> The system administrator can select an error action using this option
>> according to the characteristics of the target system.
>
> Can we do this in userspace?
> mcelog can trigger scripts when an MCE that matches the user-configurable
> conditions happens, so I think that we can trigger a kernel panic by
> checking kernel messages from the triggered script.
> For that purpose, I recently fixed the dirty/clean messaging in commit
> ff604cf6d4 "mm: hwpoison: fix action_result() to print out dirty/clean".

Hi Horiguchi-san,

Thank you for your comment.
I know mcelog has error trigger scripts such as page-error-trigger.

However, if a userspace process triggers the kernel panic, I am afraid that
the following case is not handled:

- Several SRAO memory errors occur at the same time.
- Some of the memory errors hit the mcelog process itself, and the others
hit dirty page cache.

In my understanding, the mcelog process is killed if a memory error hits
the mcelog process, so mcelog cannot cause a kernel panic in this case.


>> Use Case
>> ---------
>> This option is intended to be adopted in KVM guests, because it is
>> assumed that Linux on a KVM guest runs customers' business workloads, and
>> losing or corrupting customer data through a memory failure has a big impact.
>>
>> On the other hand, this option is not recommended for the KVM host,
>> for the following reasons.
>>
>> - Making the KVM host panic has a big impact because all virtual guests are
>> affected by the host panic. The affected virtual guests are forced to stop
>> and have to be restarted on another hypervisor.
>
> In this reasoning, you seem to assume that important data (business data)
> are only handled on guest OS. That's true in most cases, but not always.
> I think that the more general approach for this use case is that
> we trigger kernel panic if memory errors happened on dirty pagecaches
> used by 'important' processes (for example by adding process flags
> controlled by prctl(),) and set it on qemu processes.
>
>> - If the disk cache mode of qemu is set to "none", the I/O of virtual
>> guests uses O_DIRECT and the KVM host does not cache the guests' disk I/O.
>> Therefore, if an SRAO machine check is reported on a dirty page cache
>> in the KVM host, its virtual machines are not affected by the machine check.
>> So the host is expected to keep operating instead of panicking.
>
> What to do if there're multiple guests, and some have "none" cache and
> others have other types?
> I think that we need more flexible settings for this use case.

OK. If I find another helpful use case, I would propose it.
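For reference, the cache mode discussed above is chosen per disk on the qemu command line. A configuration fragment (file path and guest settings are placeholders):

```shell
# cache=none opens the image with O_DIRECT, bypassing the host page cache,
# so a dirty-pagecache error on the host cannot hit this guest's disk data;
# cache=writeback would go through the host page cache and be exposed.
qemu-system-x86_64 -m 2048 \
    -drive file=/path/to/guest.img,if=virtio,cache=none
```

As Horiguchi-san notes, different guests on one host may mix these modes, which is why a single host-wide panic knob is a blunt instrument.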


>>
>> Past discussion
>> --------------------
>> This problem was previously discussed in the kernel community,
>> (refer: mail threads pertaining to
>> http://marc.info/?l=linux-kernel&m=135187403804934&w=4).
>>
>>>> - I worry that if a hardware error occurs, it might affect a large
>>>> amount of memory all at the same time. For example, if a 4G memory
>>>> block goes bad, this message will be printed a million times?
>>
>> As Andrew mentioned in the above threads, if a 4GB memory block goes bad,
>> error messages will be printed a million times and this behavior reduces
>> system reliability.
>
> Maybe "4G memory block goes bad" is not an SRAO MCE but an MCE with higher
> severity, so we have no choice but to make the kernel panic.

Yes. I agree with your opinion.

Regards,
Mitsuhiro Tanino

2013-04-12 13:38:47

by Mitsuhiro Tanino

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

(2013/04/12 3:10), Andi Kleen wrote:
> On Thu, Apr 11, 2013 at 11:23:08AM -0400, Naoya Horiguchi wrote:
>> On Thu, Apr 11, 2013 at 03:49:16PM +0200, Andi Kleen wrote:
>>>> As a result, if the dirty cache includes user data, the data is lost,
>>>> and data corruption occurs if an application uses old data.
>>>
>>> The application cannot use old data, the kernel code kills it if it
>>> would do that. And if it's IO data there is an EIO triggered.
>>>
>>> iirc the only concern in the past was that the application may miss
>>> the asynchronous EIO because it's cleared on any fd access.
>>>
>>> This is a general problem not specific to memory error handling,
>>> as these asynchronous IO errors can happen due to other reasons
>>> (bad disk etc.)
>>>
>>> If you're really concerned about this case I think the solution
>>> is to make the EIO more sticky so that there is a higher chance
>>> that it gets returned. This will make your data much safer,
>>> as it will cover all kinds of IO errors, not just the obscure memory
>>> errors.

I agree with Andi. We need to handle both memory errors and asynchronous
I/O errors.

>> I'm interested in this topic, and in a previous discussion, what I was told
>> is that we can't expect user applications to change their behaviors when
>> they get EIO, so globally changing EIO's stickiness is not a great approach.
>
> Not sure. Some of the current behavior may be dubious and it may
> be possible to change it. But would need more analysis.
>
> I don't think we're concerned that much about "correct" applications,
> but more ones that do not check everything. So returning more
> errors should be safer.
>
> For example, you could have a sysctl that makes IO errors always sticky
> -- the fd keeps returning the error until it is closed.
>
>> I'm working on a new pagecache tag based mechanism to solve this.
>> But it needs time and more discussions.
>> So I guess Tanino-san suggests giving up on dirty pagecache errors
>> as a quick solution.
>
> A quick solution would be enabling panic for any asynchronous IO error.
> I don't think the memory error code is the right point to hook into.

Yes. I think both a short-term solution and a long-term solution are necessary
in order to enable the hwpoison feature for Linux as a KVM hypervisor.

So my proposal is as follows.
For the short term, to handle both memory errors and I/O errors:
- I will resend a panic knob to handle data loss related to dirty cache
which is caused by memory errors and I/O errors.

For the long term:
- Andi's proposal or Horiguchi-san's new pagecache-tag-based mechanism

Regards,
Mitsuhiro Tanino

2013-04-12 13:43:56

by Mitsuhiro Tanino

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

(2013/04/11 22:00), Ric Mason wrote:
> Hi Mitsuhiro,
> On 04/11/2013 08:51 PM, Mitsuhiro Tanino wrote:
>> (2013/04/11 12:53), Simon Jeons wrote:
>>> One question against mce instead of the patchset. ;-)
>>>
>>> When is memory checked for errors? Before memory access? Is there a process that scans it periodically?
>> Hi Simon-san,
>>
>> Yes, there is a process to scan memory periodically.
>>
>> MCA recovery is supported on Intel Nehalem-EX and later generations of
>> CPUs. MCA recovery provides error detection and isolation
>> features that work together with the OS.
>> One of the MCA Recovery features is Memory Scrubbing. It periodically
>> checks memory in the background of the OS.
>
> Is Memory Scrubbing a kernel thread? Where is the code for memory scrubbing?

Hi Ric,

No. Memory Scrubbing is one of the MCA Recovery features,
and it is a hardware feature of Intel CPUs.

The OS has a hwpoison feature, which is included in mm/memory-failure.c.
The main function is memory_failure().

If Memory Scrubbing finds a memory error, MCA recovery reports an SRAO error
to the OS, and the OS handles the SRAO error using the hwpoison functions.


>> If Memory Scrubbing finds an uncorrectable error in memory before
>> the OS accesses the affected memory, MCA recovery reports an SRAO error to the OS
>
> It may not find the memory error in time, since it may not be running when the memory error occurs. Can this case happen?

Memory Scrubbing seems to run periodically, but I don't have
information about how often it is executed.

Regards,
Mitsuhiro Tanino

2013-04-12 14:45:38

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On Fri, Apr 12, 2013 at 10:24:48PM +0900, Mitsuhiro Tanino wrote:
> (2013/04/11 16:11), Naoya Horiguchi wrote:
> > Hi Tanino-san,
> >
> > On Thu, Apr 11, 2013 at 12:26:19PM +0900, Mitsuhiro Tanino wrote:
> > ...
> >> Solution
> >> ---------
> >> The patch proposes a new sysctl interface, vm.memory_failure_dirty_panic,
> >> in order to prevent data corruption that comes from the data loss problem.
> >> Also this patch displays information about the affected file, such as device name,
> >> inode number, file offset and file type, if the file is mapped in memory
> >> and the page is a dirty cache page.
> >>
> >> When an SRAO machine check occurs on a dirty page cache, the corresponding
> >> data cannot be recovered any more. Therefore, the patch proposes a kernel
> >> option to keep the system running or force a system panic in order
> >> to avoid further trouble such as application data corruption.
> >>
> >> The system administrator can select an error action using this option
> >> according to the characteristics of the target system.
> >
> > Can we do this in userspace?
> > mcelog can trigger scripts when an MCE that matches the user-configurable
> > conditions happens, so I think that we can trigger a kernel panic by
> > checking kernel messages from the triggered script.
> > For that purpose, I recently fixed the dirty/clean messaging in commit
> > ff604cf6d4 "mm: hwpoison: fix action_result() to print out dirty/clean".
>
> Hi Horiguchi-san,
>
> Thank you for your comment.
> I know mcelog has error trigger scripts such as page-error-trigger.
>
> However, if a userspace process triggers the kernel panic, I am afraid that
> the following case is not handled:
>
> - Several SRAO memory errors occur at the same time.
> - Some of the memory errors hit the mcelog process itself, and the others
> hit dirty page cache.
>
> In my understanding, the mcelog process is killed if a memory error hits
> the mcelog process, so mcelog cannot cause a kernel panic in this case.

mcelog doesn't handle important data itself, even if it suffers a memory
error on its dirty pagecache. No critical data is lost in that case,
so it doesn't seem to be a problem to me.
Or do you mean that two dirty pagecache errors hit the important process and
mcelog at just the same time? That is too rare to be worth adding a new sysctl knob.

Thanks,
Naoya

2013-04-12 15:13:12

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On Fri, Apr 12, 2013 at 10:38:43PM +0900, Mitsuhiro Tanino wrote:
> (2013/04/12 3:10), Andi Kleen wrote:
> > On Thu, Apr 11, 2013 at 11:23:08AM -0400, Naoya Horiguchi wrote:
> >> On Thu, Apr 11, 2013 at 03:49:16PM +0200, Andi Kleen wrote:
> >>>> As a result, if the dirty cache includes user data, the data is lost,
> >>>> and data corruption occurs if an application uses old data.
> >>>
> >>> The application cannot use old data, the kernel code kills it if it
> >>> would do that. And if it's IO data there is an EIO triggered.
> >>>
> >>> iirc the only concern in the past was that the application may miss
> >>> the asynchronous EIO because it's cleared on any fd access.
> >>>
> >>> This is a general problem not specific to memory error handling,
> >>> as these asynchronous IO errors can happen due to other reasons
> >>> (bad disk etc.)
> >>>
> >>> If you're really concerned about this case I think the solution
> >>> is to make the EIO more sticky so that there is a higher chance
> >>> that it gets returned. This will make your data much safer,
> >>> as it will cover all kinds of IO errors, not just the obscure memory
> >>> errors.
>
> I agree with Andi. We need to handle both memory errors and asynchronous
> I/O errors.
>
> >> I'm interested in this topic, and in a previous discussion, what I was told
> >> is that we can't expect user applications to change their behaviors when
> >> they get EIO, so globally changing EIO's stickiness is not a great approach.
> >
> > Not sure. Some of the current behavior may be dubious and it may
> > be possible to change it. But would need more analysis.
> >
> > I don't think we're concerned that much about "correct" applications,
> > but more ones that do not check everything. So returning more
> > errors should be safer.
> >
> > For example, you could have a sysctl that makes IO errors always sticky
> > -- the fd keeps returning the error until it is closed.
> >
> >> I'm working on a new pagecache tag based mechanism to solve this.
> >> But it needs time and more discussions.
> >> So I guess Tanino-san suggests giving up on dirty pagecache errors
> >> as a quick solution.
> >
> > A quick solution would be enabling panic for any asynchronous IO error.
> > I don't think the memory error code is the right point to hook into.
>
> Yes. I think both a short-term solution and a long-term solution are necessary
> in order to enable the hwpoison feature for Linux as a KVM hypervisor.
>
> So my proposal is as follows.
> For the short term, to handle both memory errors and I/O errors:
> - I will resend a panic knob to handle data loss related to dirty cache
> which is caused by memory errors and I/O errors.

Sorry, I still think "panic on dirty pagecache error" is feasible in userspace.
This new knob will be completely useless once memory error reporting is
fixed in the future, so I prefer the userspace solution whenever possible,
even for a short-term one.

Thanks,
Naoya

> For long term solution:
> - Andi's proposal or Horiguchi-san's new pagecache tag based mechanism

2013-04-17 05:30:50

by Simon Jeons

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On 04/11/2013 09:49 PM, Andi Kleen wrote:
>> As a result, if the dirty cache includes user data, the data is lost,
>> and data corruption occurs if an application uses old data.

Hi Andi,

Could you give me a link to your MCE testcase?

> The application cannot use old data, the kernel code kills it if it
> would do that. And if it's IO data there is an EIO triggered.
>
> iirc the only concern in the past was that the application may miss
> the asynchronous EIO because it's cleared on any fd access.
>
> This is a general problem not specific to memory error handling,
> as these asynchronous IO errors can happen due to other reasons
> (bad disk etc.)
>
> If you're really concerned about this case I think the solution
> is to make the EIO more sticky so that there is a higher chance
> that it gets returned. This will make your data much safer,
> as it will cover all kinds of IO errors, not just the obscure memory
> errors.
>
> Or maybe have a panic knob on any IO error for any case if you don't
> trust your application to check IO syscalls. But I would rather
> have better EIO reporting than just giving up like this.
>
> The problem of tying it just to any dirty data for memory errors
> is that most anonymous data is dirty and it doesn't have this problem
> at all (because the signals handle this and they cannot be lost)
>
> And that is a far more common case than this relatively unlikely
> case of dirty IO data.
>
> So just doing it for "dirty" is not the right knob.
>
> Basically I'm saying if you worry about unreliable IO error reporting
> fix IO error reporting, don't add random unnecessary panics to
> the memory error handling.
>
> BTW my suspicion is that if you approach this from a data driven
> perspective: that is measure how much such dirty data is typically
> around in comparison to other data it will be unlikely. Such
> a study can be done with the "page-types" program in tools/vm
>
> -Andi
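The data-driven measurement Andi suggests uses the page-types tool in the kernel tree. Roughly (requires root, since the tool reads /proc/kpageflags; exact flag names may vary by kernel version):

```shell
cd tools/vm && make page-types
# Histogram of all page-flag combinations on the system:
sudo ./page-types
# Restrict the histogram to pages with the dirty bit set, to estimate
# how much dirty pagecache is around relative to other page types:
sudo ./page-types -b dirty
```

Comparing the dirty file-backed count against the totals gives the likelihood estimate Andi describes for how often a random error would land on dirty IO data.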
>
> --
> To unsubscribe, send a message with 'unsubscribe linux-mm' in
> the body to [email protected]. For more info on Linux MM,
> see: http://www.linux-mm.org/ .
> Don't email: [email protected]

2013-04-17 05:49:56

by Simon Jeons

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi Mitsuhiro,
On 04/12/2013 09:43 PM, Mitsuhiro Tanino wrote:
> (2013/04/11 22:00), Ric Mason wrote:
>> Hi Mitsuhiro,
>> On 04/11/2013 08:51 PM, Mitsuhiro Tanino wrote:
>>> (2013/04/11 12:53), Simon Jeons wrote:
>>>> One question against mce instead of the patchset. ;-)
>>>>
>>>> When is memory checked for errors? Before memory access? Is there a process that scans it periodically?
>>> Hi Simon-san,
>>>
>>> Yes, there is a process to scan memory periodically.
>>>
>>> MCA recovery is supported on Intel Nehalem-EX and later generations of
>>> CPUs. MCA recovery provides error detection and isolation
>>> features that work together with the OS.
>>> One of the MCA Recovery features is Memory Scrubbing. It periodically
>>> checks memory in the background of the OS.
>> Is Memory Scrubbing a kernel thread? Where is the code for memory scrubbing?
> Hi Ric,
>
> No. One of the MCA Recovery features is Memory Scrubbing.

Is Memory Scrubbing a process inside the CPU?

> And Memory Scrubbing is a hardware feature of Intel CPUs.
>
> The OS has a hwpoison feature, which is included in mm/memory-failure.c.
> The main function is memory_failure().
>
> If Memory Scrubbing finds a memory error, MCA recovery reports an SRAO error
> to the OS, and the OS handles the SRAO error using the hwpoison functions.
>
>
>>> If Memory Scrubbing finds an uncorrectable error in memory before
>>> the OS accesses the affected memory, MCA recovery reports an SRAO error to the OS
>> It may not find the memory error in time, since it may not be running when the memory error occurs. Can this case happen?
> Memory Scrubbing seems to run periodically, but I don't have
> information about how often it is executed.

If Memory Scrubbing doesn't catch the memory error in time, who will report
an SRAR to the OS?

>
> Regards,
> Mitsuhiro Tanino
>

2013-04-17 06:43:08

by Simon Jeons

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi Naoya,
On 04/11/2013 11:23 PM, Naoya Horiguchi wrote:
> On Thu, Apr 11, 2013 at 03:49:16PM +0200, Andi Kleen wrote:
>>> As a result, if the dirty cache includes user data, the data is lost,
>>> and data corruption occurs if an application uses old data.
>> The application cannot use old data, the kernel code kills it if it
>> would do that. And if it's IO data there is an EIO triggered.
>>
>> iirc the only concern in the past was that the application may miss
>> the asynchronous EIO because it's cleared on any fd access.
>>
>> This is a general problem not specific to memory error handling,
>> as these asynchronous IO errors can happen due to other reasons
>> (bad disk etc.)
>>
>> If you're really concerned about this case I think the solution
>> is to make the EIO more sticky so that there is a higher chance
>> that it gets returned. This will make your data much safer,
>> as it will cover all kinds of IO errors, not just the obscure memory
>> errors.
> I'm interested in this topic, and in a previous discussion, what I was told
> is that we can't expect user applications to change their behaviors when
> they get EIO, so globally changing EIO's stickiness is not a great approach.

Will user applications get EIO first, or SIGKILL first?

> I'm working on a new pagecache tag based mechanism to solve this.
> But it needs time and more discussions.
> So I guess Tanino-san suggests giving up on dirty pagecache errors
> as a quick solution.
>
> Thanks,
> Naoya
>

2013-04-17 07:14:44

by Simon Jeons

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi Naoya,
On 04/11/2013 03:11 PM, Naoya Horiguchi wrote:
> Hi Tanino-san,
>
> On Thu, Apr 11, 2013 at 12:26:19PM +0900, Mitsuhiro Tanino wrote:
> ...
>> Solution
>> ---------
>> The patch proposes a new sysctl interface, vm.memory_failure_dirty_panic,
>> in order to prevent data corruption that comes from the data loss problem.
>> Also this patch displays information about the affected file, such as device name,
>> inode number, file offset and file type, if the file is mapped in memory
>> and the page is a dirty cache page.
>>
>> When an SRAO machine check occurs on a dirty page cache, the corresponding
>> data cannot be recovered any more. Therefore, the patch proposes a kernel
>> option to keep the system running or force a system panic in order
>> to avoid further trouble such as application data corruption.
>>
>> The system administrator can select an error action using this option
>> according to the characteristics of the target system.
> Can we do this in userspace?
> mcelog can trigger scripts when an MCE that matches the user-configurable
> conditions happens, so I think that we can trigger a kernel panic by
> checking kernel messages from the triggered script.
> For that purpose, I recently fixed the dirty/clean messaging in commit
> ff604cf6d4 "mm: hwpoison: fix action_result() to print out dirty/clean".

In your commit ff604cf6d4, you mentioned that "because when we check
PageDirty in action_result() it was cleared after page isolation even if
it's dirty before error handling." Could you point out where the page
is isolated and PageDirty is cleared? I don't think it is isolate_lru_pages.

>
>> Use Case
>> ---------
>> This option is intended to be adopted in KVM guests, because it is
>> assumed that Linux on a KVM guest runs customers' business workloads, and
>> losing or corrupting customer data through a memory failure has a big impact.
>>
>> On the other hand, this option is not recommended for the KVM host,
>> for the following reasons.
>>
>> - Making the KVM host panic has a big impact because all virtual guests are
>> affected by the host panic. The affected virtual guests are forced to stop
>> and have to be restarted on another hypervisor.
> In this reasoning, you seem to assume that important data (business data)
> are only handled on guest OS. That's true in most cases, but not always.
> I think that the more general approach for this use case is that
> we trigger kernel panic if memory errors happened on dirty pagecaches
> used by 'important' processes (for example by adding process flags
> controlled by prctl(),) and set it on qemu processes.
>
>> - If the disk cache mode of qemu is set to "none", the I/O of virtual
>> guests uses O_DIRECT and the KVM host does not cache the guests' disk I/O.
>> Therefore, if an SRAO machine check is reported on a dirty page cache
>> in the KVM host, its virtual machines are not affected by the machine check.
>> So the host is expected to keep operating instead of panicking.
> What to do if there're multiple guests, and some have "none" cache and
> others have other types?
> I think that we need more flexible settings for this use case.
>
>> Past discussion
>> --------------------
>> This problem was previously discussed in the kernel community,
>> (refer: mail threads pertaining to
>> http://marc.info/?l=linux-kernel&m=135187403804934&w=4).
>>
>>>> - I worry that if a hardware error occurs, it might affect a large
>>>> amount of memory all at the same time. For example, if a 4G memory
>>>> block goes bad, this message will be printed a million times?
>> As Andrew mentioned in the above threads, if a 4GB memory block goes bad,
>> error messages will be printed a million times and this behavior reduces
>> system reliability.
> Maybe "4G memory block goes bad" is not an SRAO MCE but an MCE with higher
> severity, so we have no choice but to make the kernel panic.
>
> Thanks,
> Naoya Horiguchi
>

2013-04-17 13:58:29

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On Fri, Apr 12, 2013 at 11:13:03AM -0400, Naoya Horiguchi wrote:
...
> > So my proposal is as follows.
> > For the short term, to handle both memory errors and I/O errors:
> > - I will resend a panic knob to handle data loss related to dirty cache
> > which is caused by memory errors and I/O errors.
>
> Sorry, I still think "panic on dirty pagecache error" is feasible in userspace.
> This new knob will be completely useless once memory error reporting is
> fixed in the future, so I prefer the userspace solution whenever possible,
> even for a short-term one.

My apologies; you mentioned both memory errors and I/O errors.
So I guess that in your next post a new sysctl knob will be implemented
around filemap_fdatawait_range() to make the kernel panic immediately
if a process finds AS_EIO set.
It is also effective for processes which handle EIO poorly, so it can
be useful even after error reporting is fixed in the future.
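To make the guess concrete, such a hook might look roughly like this. This is an illustrative sketch in kernel C style, not compilable code, and the sysctl variable name is an assumption modeled on the RFC, not an existing interface:

```c
/* Illustrative sketch only -- not actual kernel code. */
int filemap_fdatawait_range(struct address_space *mapping,
                            loff_t start_byte, loff_t end_byte)
{
        /* ... wait for writeback of pages in [start_byte, end_byte] ... */

        if (test_and_clear_bit(AS_EIO, &mapping->flags)) {
                /* hypothetical knob, default off */
                if (sysctl_memory_failure_dirty_panic)
                        panic("dirty pagecache data lost (AS_EIO)");
                return -EIO;
        }
        return 0;
}
```

Placing the check here would cover dirty-cache loss from both memory errors and ordinary I/O errors, which is the short-term behavior Tanino-san proposes.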

Anyway, my previous comment is pointless, so ignore it.

Thanks,
Naoya Horiguchi

2013-04-17 14:16:56

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On Wed, Apr 17, 2013 at 02:42:51PM +0800, Simon Jeons wrote:
> Hi Naoya,
> On 04/11/2013 11:23 PM, Naoya Horiguchi wrote:
> > On Thu, Apr 11, 2013 at 03:49:16PM +0200, Andi Kleen wrote:
> >>> As a result, if the dirty cache includes user data, the data is lost,
> >>> and data corruption occurs if an application uses old data.
> >> The application cannot use old data, the kernel code kills it if it
> >> would do that. And if it's IO data there is an EIO triggered.
> >>
> >> iirc the only concern in the past was that the application may miss
> >> the asynchronous EIO because it's cleared on any fd access.
> >>
> >> This is a general problem not specific to memory error handling,
> >> as these asynchronous IO errors can happen due to other reasons
> >> (bad disk etc.)
> >>
> >> If you're really concerned about this case I think the solution
> >> is to make the EIO more sticky so that there is a higher chance
> >> that it gets returned. This will make your data much safer,
> >> as it will cover all kinds of IO errors, not just the obscure memory
> >> errors.
> > I'm interested in this topic, and in a previous discussion, what I was told
> > is that we can't expect user applications to change their behaviors when
> > they get EIO, so globally changing EIO's stickiness is not a great approach.
>
> Will user applications get EIO first, or SIGKILL first?

That depends on how the process accesses the error page, so I can't
say which one comes first.

Thanks,
Naoya Horiguchi

2013-04-17 14:55:36

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

On Wed, Apr 17, 2013 at 03:14:36PM +0800, Simon Jeons wrote:
> Hi Naoya,
> On 04/11/2013 03:11 PM, Naoya Horiguchi wrote:
> > Hi Tanino-san,
> >
> > On Thu, Apr 11, 2013 at 12:26:19PM +0900, Mitsuhiro Tanino wrote:
> > ...
> >> Solution
> >> ---------
> >> The patch proposes a new sysctl interface, vm.memory_failure_dirty_panic,
> >> in order to prevent data corruption that comes from the data loss problem.
> >> Also this patch displays information about the affected file, such as device name,
> >> inode number, file offset and file type, if the file is mapped in memory
> >> and the page is a dirty cache page.
> >>
> >> When an SRAO machine check occurs on a dirty page cache, the corresponding
> >> data cannot be recovered any more. Therefore, the patch proposes a kernel
> >> option to keep the system running or force a system panic in order
> >> to avoid further trouble such as application data corruption.
> >>
> >> The system administrator can select an error action using this option
> >> according to the characteristics of the target system.
> > Can we do this in userspace?
> > mcelog can trigger scripts when an MCE that matches the user-configurable
> > conditions happens, so I think that we can trigger a kernel panic by
> > checking kernel messages from the triggered script.
> > For that purpose, I recently fixed the dirty/clean messaging in commit
> > ff604cf6d4 "mm: hwpoison: fix action_result() to print out dirty/clean".
>
> In your commit ff604cf6d4, you mentioned that "because when we check
> PageDirty in action_result() it was cleared after page isolation even if
> it's dirty before error handling." Could you point out where the page
> is isolated and PageDirty is cleared? I don't think it is isolate_lru_pages.

Here is the result of ftracing memory_failure().
cancel_dirty_page() is called inside me_pagecache_dirty(); that's it.

mceinj.sh-7662 [000] 154195.857024: funcgraph_entry: | memory_failure() {
mceinj.sh-7662 [000] 154195.857024: funcgraph_entry: 0.283 us | PageHuge();
mceinj.sh-7662 [000] 154195.857025: funcgraph_entry: 0.321 us | _cond_resched();
mceinj.sh-7662 [000] 154195.857025: funcgraph_entry: 0.348 us | hwpoison_filter();
mceinj.sh-7662 [000] 154195.857026: funcgraph_entry: 0.323 us | PageHuge();
mceinj.sh-7662 [000] 154195.857027: funcgraph_entry: 0.264 us | PageHuge();
mceinj.sh-7662 [000] 154195.857027: funcgraph_entry: | kmem_cache_alloc_trace() {
mceinj.sh-7662 [000] 154195.857028: funcgraph_entry: 0.254 us | _cond_resched();
mceinj.sh-7662 [000] 154195.857028: funcgraph_exit: 0.905 us | }
mceinj.sh-7662 [000] 154195.857029: funcgraph_entry: 0.308 us | _read_lock();
mceinj.sh-7662 [000] 154195.857029: funcgraph_entry: 0.326 us | _spin_lock();
mceinj.sh-7662 [000] 154195.857057: funcgraph_entry: | kfree() {
mceinj.sh-7662 [000] 154195.857057: funcgraph_entry: 0.252 us | __phys_addr();
mceinj.sh-7662 [000] 154195.857058: funcgraph_exit: 1.000 us | }
mceinj.sh-7662 [000] 154195.857058: funcgraph_entry: | try_to_unmap() {
mceinj.sh-7662 [000] 154195.857058: funcgraph_entry: | try_to_unmap_file() {
mceinj.sh-7662 [000] 154195.857059: funcgraph_entry: 0.430 us | _spin_lock();
mceinj.sh-7662 [000] 154195.857060: funcgraph_entry: 0.719 us | vma_prio_tree_next();
mceinj.sh-7662 [000] 154195.857061: funcgraph_entry: | try_to_unmap_one() {
mceinj.sh-7662 [000] 154195.857061: funcgraph_entry: | page_check_address() {
mceinj.sh-7662 [000] 154195.857061: funcgraph_entry: 0.256 us | PageHuge();
mceinj.sh-7662 [000] 154195.857062: funcgraph_entry: 0.419 us | _spin_lock();
mceinj.sh-7662 [000] 154195.857063: funcgraph_exit: 1.812 us | }
mceinj.sh-7662 [000] 154195.857063: funcgraph_entry: | flush_tlb_page() {
mceinj.sh-7662 [000] 154195.857064: funcgraph_entry: | native_flush_tlb_others() {
mceinj.sh-7662 [000] 154195.857064: funcgraph_entry: 0.286 us | is_uv_system();
mceinj.sh-7662 [000] 154195.857065: funcgraph_entry: | flush_tlb_others_ipi() {
mceinj.sh-7662 [000] 154195.857065: funcgraph_entry: 0.336 us | _spin_lock();
mceinj.sh-7662 [000] 154195.857066: funcgraph_entry: | physflat_send_IPI_mask() {
mceinj.sh-7662 [000] 154195.857066: funcgraph_entry: 0.405 us | default_send_IPI_mask_sequence_phys();
mceinj.sh-7662 [000] 154195.857067: funcgraph_exit: 1.032 us | }
mceinj.sh-7662 [000] 154195.857068: funcgraph_exit: 3.704 us | }
mceinj.sh-7662 [000] 154195.857069: funcgraph_exit: 5.000 us | }
mceinj.sh-7662 [000] 154195.857069: funcgraph_exit: 6.060 us | }
mceinj.sh-7662 [000] 154195.857070: funcgraph_entry: | set_page_dirty() {
mceinj.sh-7662 [000] 154195.857070: funcgraph_entry: | __set_page_dirty_buffers() {
mceinj.sh-7662 [000] 154195.857070: funcgraph_entry: 0.278 us | _spin_lock();
mceinj.sh-7662 [000] 154195.857071: funcgraph_exit: 0.972 us | }
mceinj.sh-7662 [000] 154195.857071: funcgraph_exit: 1.636 us | }
mceinj.sh-7662 [000] 154195.857072: funcgraph_entry: 0.269 us | native_set_pte_at();
mceinj.sh-7662 [000] 154195.857072: funcgraph_entry: | page_remove_rmap() {
mceinj.sh-7662 [000] 154195.857073: funcgraph_entry: 0.281 us | PageHuge();
mceinj.sh-7662 [000] 154195.857073: funcgraph_entry: | __dec_zone_page_state() {
mceinj.sh-7662 [000] 154195.857073: funcgraph_entry: 0.330 us | __dec_zone_state();
mceinj.sh-7662 [000] 154195.857074: funcgraph_exit: 0.991 us | }
mceinj.sh-7662 [000] 154195.857074: funcgraph_entry: | mem_cgroup_update_file_mapped() {
mceinj.sh-7662 [000] 154195.857075: funcgraph_entry: 0.278 us | lookup_page_cgroup();
mceinj.sh-7662 [000] 154195.857076: funcgraph_exit: 1.112 us | }
mceinj.sh-7662 [000] 154195.857076: funcgraph_exit: 3.668 us | }
mceinj.sh-7662 [000] 154195.857076: funcgraph_entry: 0.309 us | put_page();
mceinj.sh-7662 [000] 154195.857077: funcgraph_exit: + 16.206 us | }
mceinj.sh-7662 [000] 154195.857077: funcgraph_exit: + 18.641 us | }
mceinj.sh-7662 [000] 154195.857077: funcgraph_exit: + 19.336 us | }
mceinj.sh-7662 [000] 154195.857078: funcgraph_entry: | me_pagecache_dirty() {
mceinj.sh-7662 [000] 154195.857079: funcgraph_entry: | me_pagecache_clean() {
mceinj.sh-7662 [000] 154195.857079: funcgraph_entry: | delete_from_lru_cache() {
mceinj.sh-7662 [000] 154195.857080: funcgraph_entry: | isolate_lru_page() {
mceinj.sh-7662 [000] 154195.857080: funcgraph_entry: 0.424 us | _spin_lock_irq();
mceinj.sh-7662 [000] 154195.857081: funcgraph_entry: | mem_cgroup_lru_del_list() {
mceinj.sh-7662 [000] 154195.857081: funcgraph_entry: 0.278 us | lookup_page_cgroup();
mceinj.sh-7662 [000] 154195.857082: funcgraph_exit: 1.097 us | }
mceinj.sh-7662 [000] 154195.857082: funcgraph_entry: 0.381 us | __mod_zone_page_state();
mceinj.sh-7662 [000] 154195.857083: funcgraph_exit: 3.660 us | }
mceinj.sh-7662 [000] 154195.857084: funcgraph_entry: 0.384 us | put_page();
mceinj.sh-7662 [000] 154195.857084: funcgraph_exit: 5.176 us | }
mceinj.sh-7662 [000] 154195.857085: funcgraph_entry: | generic_error_remove_page() {
mceinj.sh-7662 [000] 154195.857086: funcgraph_entry: | truncate_inode_page() {
mceinj.sh-7662 [000] 154195.857086: funcgraph_entry: | do_invalidatepage() {
mceinj.sh-7662 [000] 154195.857087: funcgraph_entry: | ext4_da_invalidatepage() {
mceinj.sh-7662 [000] 154195.857087: funcgraph_entry: | ext4_invalidatepage() {
mceinj.sh-7662 [000] 154195.857088: funcgraph_entry: | jbd2_journal_invalidatepage() {
mceinj.sh-7662 [000] 154195.857088: funcgraph_entry: 0.281 us | _cond_resched();
mceinj.sh-7662 [000] 154195.857088: funcgraph_entry: | unlock_buffer() {
mceinj.sh-7662 [000] 154195.857089: funcgraph_entry: | wake_up_bit() {
mceinj.sh-7662 [000] 154195.857089: funcgraph_entry: | bit_waitqueue() {
mceinj.sh-7662 [000] 154195.857089: funcgraph_entry: 0.308 us | __phys_addr();
mceinj.sh-7662 [000] 154195.857090: funcgraph_exit: 1.005 us | }
mceinj.sh-7662 [000] 154195.857091: funcgraph_entry: 0.409 us | __wake_up_bit();
mceinj.sh-7662 [000] 154195.857091: funcgraph_exit: 2.495 us | }
mceinj.sh-7662 [000] 154195.857092: funcgraph_exit: 3.240 us | }
mceinj.sh-7662 [000] 154195.857092: funcgraph_entry: | try_to_free_buffers() {
mceinj.sh-7662 [000] 154195.857093: funcgraph_entry: 0.377 us | _spin_lock();
mceinj.sh-7662 [000] 154195.857093: funcgraph_entry: | drop_buffers() {
mceinj.sh-7662 [000] 154195.857094: funcgraph_entry: 0.427 us | put_page();
mceinj.sh-7662 [000] 154195.857095: funcgraph_exit: 1.378 us | }
mceinj.sh-7662 [000] 154195.857095: funcgraph_entry: | cancel_dirty_page() {
mceinj.sh-7662 [000] 154195.857096: funcgraph_entry: | dec_zone_page_state() {
mceinj.sh-7662 [000] 154195.857096: funcgraph_entry: | __dec_zone_page_state() {
mceinj.sh-7662 [000] 154195.857097: funcgraph_entry: 0.408 us | __dec_zone_state();
mceinj.sh-7662 [000] 154195.857097: funcgraph_exit: 1.198 us | }
mceinj.sh-7662 [000] 154195.857098: funcgraph_exit: 1.987 us | }
mceinj.sh-7662 [000] 154195.857099: funcgraph_exit: 3.303 us | }
mceinj.sh-7662 [000] 154195.857099: funcgraph_entry: | free_buffer_head() {
mceinj.sh-7662 [000] 154195.857099: funcgraph_entry: 0.579 us | kmem_cache_free();
mceinj.sh-7662 [000] 154195.857100: funcgraph_entry: 0.406 us | recalc_bh_state();
mceinj.sh-7662 [000] 154195.857101: funcgraph_exit: 2.269 us | }
mceinj.sh-7662 [000] 154195.857102: funcgraph_exit: 9.451 us | }
mceinj.sh-7662 [000] 154195.857102: funcgraph_exit: + 14.532 us | }
mceinj.sh-7662 [000] 154195.857102: funcgraph_exit: + 15.321 us | }
mceinj.sh-7662 [000] 154195.857103: funcgraph_exit: + 16.285 us | }
mceinj.sh-7662 [000] 154195.857103: funcgraph_exit: + 17.133 us | }
mceinj.sh-7662 [000] 154195.857104: funcgraph_entry: 0.439 us | cancel_dirty_page();
mceinj.sh-7662 [000] 154195.857105: funcgraph_entry: | remove_from_page_cache() {
mceinj.sh-7662 [000] 154195.857105: funcgraph_entry: 0.408 us | _spin_lock_irq();
mceinj.sh-7662 [000] 154195.857106: funcgraph_entry: | __remove_from_page_cache() {
mceinj.sh-7662 [000] 154195.857107: funcgraph_entry: | __dec_zone_page_state() {
mceinj.sh-7662 [000] 154195.857107: funcgraph_entry: 0.457 us | __dec_zone_state();
mceinj.sh-7662 [000] 154195.857108: funcgraph_exit: 1.224 us | }
mceinj.sh-7662 [000] 154195.857109: funcgraph_exit: 2.757 us | }
mceinj.sh-7662 [000] 154195.857109: funcgraph_entry: | mem_cgroup_uncharge_cache_page() {
mceinj.sh-7662 [000] 154195.857109: funcgraph_entry: | __mem_cgroup_uncharge_common() {
mceinj.sh-7662 [000] 154195.857110: funcgraph_entry: 0.421 us | lookup_page_cgroup();
mceinj.sh-7662 [000] 154195.857111: funcgraph_entry: 0.383 us | bit_spin_lock();
mceinj.sh-7662 [000] 154195.857112: funcgraph_exit: 2.119 us | }
mceinj.sh-7662 [000] 154195.857112: funcgraph_exit: 2.920 us | }
mceinj.sh-7662 [000] 154195.857112: funcgraph_exit: 7.783 us | }
mceinj.sh-7662 [000] 154195.857113: funcgraph_entry: 0.393 us | put_page();
mceinj.sh-7662 [000] 154195.857113: funcgraph_exit: + 27.960 us | }
mceinj.sh-7662 [000] 154195.857114: funcgraph_exit: + 29.017 us | }
mceinj.sh-7662 [000] 154195.857114: funcgraph_exit: + 35.595 us | }
mceinj.sh-7662 [000] 154195.857115: funcgraph_exit: + 36.476 us | }
mceinj.sh-7662 [000] 154195.857115: funcgraph_entry: | action_result() {
mceinj.sh-7662 [000] 154195.857116: funcgraph_entry: | vprintk() {

2013-04-18 00:35:27

by Simon Jeons

Subject: Re: [RFC Patch 0/2] mm: Add parameters to make kernel behavior at memory error on dirty cache selectable

Hi Naoya,
On 04/17/2013 10:55 PM, Naoya Horiguchi wrote:
> On Wed, Apr 17, 2013 at 03:14:36PM +0800, Simon Jeons wrote:
>> Hi Naoya,
>> On 04/11/2013 03:11 PM, Naoya Horiguchi wrote:
>>> Hi Tanino-san,
>>>
>>> On Thu, Apr 11, 2013 at 12:26:19PM +0900, Mitsuhiro Tanino wrote:
>>> ...
>>>> Solution
>>>> ---------
>>>> The patch proposes a new sysctl interface, vm.memory_failure_dirty_panic,
>>>> in order to prevent data corruption comes from data lost problem.
>>>> Also this patch displays information of affected file such as device name,
>>>> inode number, file offset and file type if the file is mapped on a memory
>>>> and the page is dirty cache.
>>>>
>>>> When SRAO machine check occurs on a dirty page cache, corresponding
>>>> data cannot be recovered any more. Therefore, the patch proposes a kernel
>>>> option to keep a system running or force system panic in order
>>>> to avoid further trouble such as data corruption problem of application.
>>>>
>>>> System administrator can select an error action using this option
>>>> according to characteristics of target system.
>>> Can we do this in userspace?
>>> mcelog can trigger scripts when a MCE which matches the user-configurable
>>> conditions happens, so I think that we can trigger a kernel panic by
>>> chekcing kernel messages from the triggered script.
>>> For that purpose, I recently fixed the dirty/clean messaging in commit
>>> ff604cf6d4 "mm: hwpoison: fix action_result() to print out dirty/clean".
>> In your commit ff604cf6d4, you mentioned that "because when we check
>> PageDirty in action_result() it was cleared after page isolation even if
>> it's dirty before error handling." Could you point out where the page
>> isolation happens and where PageDirty is cleared? I don't think it is
>> isolate_lru_pages.
> Here is the result of ftracing of memory_failure().
> cancel_dirty_page() is called inside me_pagecache_dirty(), that's it.

Cool! Which ftrace options did you use to capture this?

> [ftrace output of memory_failure() snipped]