2019-05-17 04:29:02

by Jane Chu

[permalink] [raw]
Subject: [PATCH] mm, memory-failure: clarify error message

Some user who install SIGBUS handler that does longjmp out
therefore keeping the process alive is confused by the error
message
"[188988.765862] Memory failure: 0x1840200: Killing
cellsrv:33395 due to hardware memory corruption"
Slightly modify the error message to improve clarity.

Signed-off-by: Jane Chu <[email protected]>
---
mm/memory-failure.c | 7 ++++---
1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/mm/memory-failure.c b/mm/memory-failure.c
index fc8b517..14de5e2 100644
--- a/mm/memory-failure.c
+++ b/mm/memory-failure.c
@@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
short addr_lsb = tk->size_shift;
int ret;

- pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
- pfn, t->comm, t->pid);
-
if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
+ pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory "
+ "corruption\n", pfn, t->comm, t->pid);
ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
addr_lsb, current);
} else {
@@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
* This could cause a loop when the user sets SIGBUS
* to SIG_IGN, but hopefully no one will do that?
*/
+ pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware "
+ "memory corruption\n", pfn, t->comm, t->pid);
ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
addr_lsb, t); /* synchronous? */
}
--
1.8.3.1


2019-05-17 05:37:48

by Anshuman Khandual

[permalink] [raw]
Subject: Re: [PATCH] mm, memory-failure: clarify error message



On 05/17/2019 09:38 AM, Jane Chu wrote:
> Some user who install SIGBUS handler that does longjmp out

What the longjmp about ? Are you referring to the mechanism of catching the
signal which was registered ?

> therefore keeping the process alive is confused by the error
> message
> "[188988.765862] Memory failure: 0x1840200: Killing
> cellsrv:33395 due to hardware memory corruption"

Its a valid point because those are two distinct actions.

> Slightly modify the error message to improve clarity.
>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> mm/memory-failure.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc8b517..14de5e2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> short addr_lsb = tk->size_shift;
> int ret;
>
> - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
> - pfn, t->comm, t->pid);
> -
> if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> + pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory "
> + "corruption\n", pfn, t->comm, t->pid);
> ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
> addr_lsb, current);
> } else {
> @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> * This could cause a loop when the user sets SIGBUS
> * to SIG_IGN, but hopefully no one will do that?
> */
> + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware "
> + "memory corruption\n", pfn, t->comm, t->pid);
> ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
> addr_lsb, t); /* synchronous? */

As both the pr_err() messages are very similar, could not we just switch between "Killing"
and "Sending SIGBUS to" based on a variable e.g action_[kill|sigbus] evaluated previously
with ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm).

2019-05-17 19:58:11

by Vishal Verma

[permalink] [raw]
Subject: Re: [PATCH] mm, memory-failure: clarify error message

On Thu, 2019-05-16 at 22:08 -0600, Jane Chu wrote:
> Some user who install SIGBUS handler that does longjmp out
> therefore keeping the process alive is confused by the error
> message
> "[188988.765862] Memory failure: 0x1840200: Killing
> cellsrv:33395 due to hardware memory corruption"
> Slightly modify the error message to improve clarity.
>
> Signed-off-by: Jane Chu <[email protected]>
> ---
> mm/memory-failure.c | 7 ++++---
> 1 file changed, 4 insertions(+), 3 deletions(-)
>
> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> index fc8b517..14de5e2 100644
> --- a/mm/memory-failure.c
> +++ b/mm/memory-failure.c
> @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> short addr_lsb = tk->size_shift;
> int ret;
>
> - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
> - pfn, t->comm, t->pid);
> -
> if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> + pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory "
> + "corruption\n", pfn, t->comm, t->pid);

Minor nit, but the string shouldn't be split over multiple lines to
preserve grep-ability. In such a case it is usually considered OK to
exceed 80 characters for the line if needed.

> ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
> addr_lsb, current);
> } else {
> @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> * This could cause a loop when the user sets SIGBUS
> * to SIG_IGN, but hopefully no one will do that?
> */
> + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware "
> + "memory corruption\n", pfn, t->comm, t->pid);
> ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
> addr_lsb, t); /* synchronous? */
> }

2019-05-20 14:22:31

by Naoya Horiguchi

[permalink] [raw]
Subject: Re: [PATCH] mm, memory-failure: clarify error message

On Fri, May 17, 2019 at 10:18:02AM +0530, Anshuman Khandual wrote:
>
>
> On 05/17/2019 09:38 AM, Jane Chu wrote:
> > Some user who install SIGBUS handler that does longjmp out
>
> What the longjmp about ? Are you referring to the mechanism of catching the
> signal which was registered ?

AFAIK, longjmp() might be useful for signal-based retrying, so highly
optimized applications like Oracle DB might want to utilize it to handle
memory errors in application level, I guess.

>
> > therefore keeping the process alive is confused by the error
> > message
> > "[188988.765862] Memory failure: 0x1840200: Killing
> > cellsrv:33395 due to hardware memory corruption"
>
> Its a valid point because those are two distinct actions.
>
> > Slightly modify the error message to improve clarity.
> >
> > Signed-off-by: Jane Chu <[email protected]>
> > ---
> > mm/memory-failure.c | 7 ++++---
> > 1 file changed, 4 insertions(+), 3 deletions(-)
> >
> > diff --git a/mm/memory-failure.c b/mm/memory-failure.c
> > index fc8b517..14de5e2 100644
> > --- a/mm/memory-failure.c
> > +++ b/mm/memory-failure.c
> > @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> > short addr_lsb = tk->size_shift;
> > int ret;
> >
> > - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
> > - pfn, t->comm, t->pid);
> > -
> > if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
> > + pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory "
> > + "corruption\n", pfn, t->comm, t->pid);
> > ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
> > addr_lsb, current);
> > } else {
> > @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
> > * This could cause a loop when the user sets SIGBUS
> > * to SIG_IGN, but hopefully no one will do that?
> > */
> > + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware "
> > + "memory corruption\n", pfn, t->comm, t->pid);
> > ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
> > addr_lsb, t); /* synchronous? */
>
> As both the pr_err() messages are very similar, could not we just switch between "Killing"
> and "Sending SIGBUS to" based on a variable e.g action_[kill|sigbus] evaluated previously
> with ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm).

That might need additional if sentence, which I'm not sure worth doing.
I think that the simplest fix for the reported problem (a confusing message)
is like below:

- pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
+ pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
pfn, t->comm, t->pid);

Or, if we have a good reason to separate the message for MF_ACTION_REQUIRED and
MF_ACTION_OPTIONAL, that might be OK.

Thanks,
Naoya Horiguchi

2019-05-21 01:52:38

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH] mm, memory-failure: clarify error message

Thanks Vishal and Naoya!

-jane

On 5/20/2019 3:21 AM, Naoya Horiguchi wrote:
> On Fri, May 17, 2019 at 10:18:02AM +0530, Anshuman Khandual wrote:
>>
>> On 05/17/2019 09:38 AM, Jane Chu wrote:
>>> Some user who install SIGBUS handler that does longjmp out
>> What the longjmp about ? Are you referring to the mechanism of catching the
>> signal which was registered ?
> AFAIK, longjmp() might be useful for signal-based retrying, so highly
> optimized applications like Oracle DB might want to utilize it to handle
> memory errors in application level, I guess.
>
>>> therefore keeping the process alive is confused by the error
>>> message
>>> "[188988.765862] Memory failure: 0x1840200: Killing
>>> cellsrv:33395 due to hardware memory corruption"
>> Its a valid point because those are two distinct actions.
>>
>>> Slightly modify the error message to improve clarity.
>>>
>>> Signed-off-by: Jane Chu <[email protected]>
>>> ---
>>> mm/memory-failure.c | 7 ++++---
>>> 1 file changed, 4 insertions(+), 3 deletions(-)
>>>
>>> diff --git a/mm/memory-failure.c b/mm/memory-failure.c
>>> index fc8b517..14de5e2 100644
>>> --- a/mm/memory-failure.c
>>> +++ b/mm/memory-failure.c
>>> @@ -216,10 +216,9 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
>>> short addr_lsb = tk->size_shift;
>>> int ret;
>>>
>>> - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
>>> - pfn, t->comm, t->pid);
>>> -
>>> if ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm) {
>>> + pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory "
>>> + "corruption\n", pfn, t->comm, t->pid);
>>> ret = force_sig_mceerr(BUS_MCEERR_AR, (void __user *)tk->addr,
>>> addr_lsb, current);
>>> } else {
>>> @@ -229,6 +228,8 @@ static int kill_proc(struct to_kill *tk, unsigned long pfn, int flags)
>>> * This could cause a loop when the user sets SIGBUS
>>> * to SIG_IGN, but hopefully no one will do that?
>>> */
>>> + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware "
>>> + "memory corruption\n", pfn, t->comm, t->pid);
>>> ret = send_sig_mceerr(BUS_MCEERR_AO, (void __user *)tk->addr,
>>> addr_lsb, t); /* synchronous? */
>> As both the pr_err() messages are very similar, could not we just switch between "Killing"
>> and "Sending SIGBUS to" based on a variable e.g action_[kill|sigbus] evaluated previously
>> with ((flags & MF_ACTION_REQUIRED) && t->mm == current->mm).
> That might need additional if sentence, which I'm not sure worth doing.
> I think that the simplest fix for the reported problem (a confusing message)
> is like below:
>
> - pr_err("Memory failure: %#lx: Killing %s:%d due to hardware memory corruption\n",
> + pr_err("Memory failure: %#lx: Sending SIGBUS to %s:%d due to hardware memory corruption\n",
> pfn, t->comm, t->pid);
>
> Or, if we have a good reason to separate the message for MF_ACTION_REQUIRED and
> MF_ACTION_OPTIONAL, that might be OK.
>
> Thanks,
> Naoya Horiguchi

2019-05-21 01:55:53

by Jane Chu

[permalink] [raw]
Subject: Re: [PATCH] mm, memory-failure: clarify error message

On 5/16/2019 9:48 PM, Anshuman Khandual wrote:

> On 05/17/2019 09:38 AM, Jane Chu wrote:
>> Some user who install SIGBUS handler that does longjmp out
> What the longjmp about ? Are you referring to the mechanism of catching the
> signal which was registered ?

Yes.

thanks,
-jane