2016-11-28 03:37:37

by Stephen Rothwell

[permalink] [raw]
Subject: linux-next: manual merge of the edac-amd tree with the edac tree

Hi Borislav,

Today's linux-next merge of the edac-amd tree got a conflict in:

drivers/edac/edac_mc.c

between commit:

ef91afa61088 ("edac: move documentation from edac_mc.c to edac_core.h")

from the edac tree and commit:

c73e8833bec5 ("EDAC, mc: Fix locking around mc_devices list")

from the edac-amd tree.

I fixed it up (see below - there may be more fixes needed in
edac_core.h) and can carry the fix as necessary. This is now fixed as
far as linux-next is concerned, but any non trivial conflicts should be
mentioned to your upstream maintainer when your tree is submitted for
merging. You may also want to consider cooperating with the maintainer
of the conflicting tree to minimise any particularly complex conflicts.

--
Cheers,
Stephen Rothwell

diff --cc drivers/edac/edac_mc.c
index d50653d36918,d2ea9c4f1824..000000000000
--- a/drivers/edac/edac_mc.c
+++ b/drivers/edac/edac_mc.c
@@@ -470,6 -499,24 +470,17 @@@ static struct mem_ctl_info *__find_mci_

return NULL;
}
+
-/**
- * find_mci_by_dev
- *
- * scan list of controllers looking for the one that manages
- * the 'dev' device
- * @dev: pointer to a struct device related with the MCI
- */
+ struct mem_ctl_info *find_mci_by_dev(struct device *dev)
+ {
+ struct mem_ctl_info *ret;
+
+ mutex_lock(&mem_ctls_mutex);
+ ret = __find_mci_by_dev(dev);
+ mutex_unlock(&mem_ctls_mutex);
+
+ return ret;
+ }
EXPORT_SYMBOL_GPL(find_mci_by_dev);

/*
@@@ -599,10 -646,18 +610,12 @@@ static int del_mc_from_global_list(stru
return handlers;
}

-/**
- * edac_mc_find: Search for a mem_ctl_info structure whose index is 'idx'.
- *
- * If found, return a pointer to the structure.
- * Else return NULL.
- */
struct mem_ctl_info *edac_mc_find(int idx)
{
+ struct mem_ctl_info *mci = NULL;
struct list_head *item;
- struct mem_ctl_info *mci;
+
+ mutex_lock(&mem_ctls_mutex);

list_for_each(item, &mc_devices) {
mci = list_entry(item, struct mem_ctl_info, link);


2016-11-28 08:27:45

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux-next: manual merge of the edac-amd tree with the edac tree

On Mon, Nov 28, 2016 at 02:37:26PM +1100, Stephen Rothwell wrote:
> Hi Borislav,
>
> Today's linux-next merge of the edac-amd tree got a conflict in:
>
> drivers/edac/edac_mc.c
>
> between commit:
>
> ef91afa61088 ("edac: move documentation from edac_mc.c to edac_core.h")
>
> from the edac tree and commit:
>
> c73e8833bec5 ("EDAC, mc: Fix locking around mc_devices list")
>
> from the edac-amd tree.
>
> I fixed it up (see below - there may be more fixes needed in
> edac_core.h) and can carry the fix as necessary. This is now fixed as
> far as linux-next is concerned, but any non trivial conflicts should be
> mentioned to your upstream maintainer when your tree is submitted for
> merging. You may also want to consider cooperating with the maintainer
> of the conflicting tree to minimise any particularly complex conflicts.

Just one issue which has nothing to do with linux-next. There's still
that in ef91afa61088:

> +/**
> + * edac_mc_find: Search for a mem_ctl_info structure whose index is @idx.
> + *
> + * @idx: index to be seek
> + *
> + * If found, return a pointer to the structure.
> + * Else return NULL.
> + *
> + * Caller must hold mem_ctls_mutex.
> + */

That last sentence in the comment is not true anymore - edac_mc_find()
is grabbing the mutex itself as it should be. Mauro, please fix that in
your tree.

Thanks.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-11-30 10:50:35

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: linux-next: manual merge of the edac-amd tree with the edac tree

Em Mon, 28 Nov 2016 09:27:35 +0100
Borislav Petkov <[email protected]> escreveu:

> On Mon, Nov 28, 2016 at 02:37:26PM +1100, Stephen Rothwell wrote:
> > Hi Borislav,
> >
> > Today's linux-next merge of the edac-amd tree got a conflict in:
> >
> > drivers/edac/edac_mc.c
> >
> > between commit:
> >
> > ef91afa61088 ("edac: move documentation from edac_mc.c to edac_core.h")
> >
> > from the edac tree and commit:
> >
> > c73e8833bec5 ("EDAC, mc: Fix locking around mc_devices list")
> >
> > from the edac-amd tree.
> >
> > I fixed it up (see below - there may be more fixes needed in
> > edac_core.h) and can carry the fix as necessary. This is now fixed as
> > far as linux-next is concerned, but any non trivial conflicts should be
> > mentioned to your upstream maintainer when your tree is submitted for
> > merging. You may also want to consider cooperating with the maintainer
> > of the conflicting tree to minimise any particularly complex conflicts.
>
> Just one issue which has nothing to do with linux-next. There's still
> that in ef91afa61088:
>
> > +/**
> > + * edac_mc_find: Search for a mem_ctl_info structure whose index is @idx.
> > + *
> > + * @idx: index to be seek
> > + *
> > + * If found, return a pointer to the structure.
> > + * Else return NULL.
> > + *
> > + * Caller must hold mem_ctls_mutex.
> > + */
>
> That last sentence in the comment is not true anymore - edac_mc_find()
> is grabbing the mutex itself as it should be. Mauro, please fix that in
> your tree.

Fixed. If you have a stable branch, I can rebase it on the top
of your patches, in order to avoid the confict at linux-next.

Regards,
Mauro

2016-12-01 10:48:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux-next: manual merge of the edac-amd tree with the edac tree

On Wed, Nov 30, 2016 at 08:50:13AM -0200, Mauro Carvalho Chehab wrote:
> Fixed. If you have a stable branch, I can rebase it on the top
> of your patches, in order to avoid the confict at linux-next.

http://git.kernel.org/cgit/linux/kernel/git/bp/bp.git/log/?h=for-next

but with that fixed, shouldn't the merge work without a conflict?

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 12:06:26

by Mauro Carvalho Chehab

[permalink] [raw]
Subject: Re: linux-next: manual merge of the edac-amd tree with the edac tree

Em Thu, 1 Dec 2016 11:48:46 +0100
Borislav Petkov <[email protected]> escreveu:

> On Wed, Nov 30, 2016 at 08:50:13AM -0200, Mauro Carvalho Chehab wrote:
> > Fixed. If you have a stable branch, I can rebase it on the top
> > of your patches, in order to avoid the confict at linux-next.
>
> http://git.kernel.org/cgit/linux/kernel/git/bp/bp.git/log/?h=for-next

Thanks!

> but with that fixed, shouldn't the merge work without a conflict?

Well, we're removing some lines from one file and adding them on another
one. Git is not smart enough to solve such conflicts automatically:

CONFLICT (content): Merge conflict in drivers/edac/edac_mc.c
Recorded preimage for 'drivers/edac/edac_mc.c'
Automatic merge failed; fix conflicts and then commit the result.

Anyway, I rebased my work on the top of your branch. The conflict
is now gone:

https://git.kernel.org/cgit/linux/kernel/git/mchehab/linux-edac.git/log/?h=linux_next

You can see the generated documentation at:
https://mchehab.fedorapeople.org/kernel_docs/admin-guide/ras.html
(user and admin guide)
and:
https://mchehab.fedorapeople.org/kernel_docs/driver-api/edac.html
(kernel API)

However, rebasing over your tree showed a new documentation gap:
./include/linux/edac.h:144: warning: Enum value 'HW_EVENT_ERR_DEFERRED' not described in enum 'hw_event_mc_err_type'

With was introduced by this commit:

commit d12a969ebbfcfc25853c4147d42b388f758e8784
Author: Yazen Ghannam <[email protected]>
Date: Thu Nov 17 17:57:32 2016 -0500

EDAC, amd64: Add Deferred Error type

Currently, deferred errors are classified as correctable in EDAC. Add a
new error type for deferred errors so that they are correctly reported
to the user.

Signed-off-by: Yazen Ghannam <[email protected]>
Cc: Aravind Gopalakrishnan <[email protected]>
Cc: linux-edac <[email protected]>
Link: http://lkml.kernel.org/r/[email protected]
Signed-off-by: Borislav Petkov <[email protected]>


Yazen introduced a "deferred error" code (whatever it means), but didn't
document what's that. Unfortunately, the patch description is also
not clear enough about what a "deferred error" means or how userspace
is supposed to handle it.

Yazen,

Could you please send us a patch adding a proper description for this
new error code?

Thanks!
Mauro

2016-12-01 16:02:14

by Yazen Ghannam

[permalink] [raw]
Subject: Re: linux-next: manual merge of the edac-amd tree with the edac tree

On Thu, Dec 01, 2016 at 10:06:17AM -0200, Mauro Carvalho Chehab wrote:
>
> However, rebasing over your tree showed a new documentation gap:
> ./include/linux/edac.h:144: warning: Enum value 'HW_EVENT_ERR_DEFERRED' not described in enum 'hw_event_mc_err_type'
>
> With was introduced by this commit:
>
> commit d12a969ebbfcfc25853c4147d42b388f758e8784
> Author: Yazen Ghannam <[email protected]>
> Date: Thu Nov 17 17:57:32 2016 -0500
>
> EDAC, amd64: Add Deferred Error type
>
> Currently, deferred errors are classified as correctable in EDAC. Add a
> new error type for deferred errors so that they are correctly reported
> to the user.
>
> Signed-off-by: Yazen Ghannam <[email protected]>
> Cc: Aravind Gopalakrishnan <[email protected]>
> Cc: linux-edac <[email protected]>
> Link: http://lkml.kernel.org/r/[email protected]
> Signed-off-by: Borislav Petkov <[email protected]>
>
>
> Yazen introduced a "deferred error" code (whatever it means), but didn't
> document what's that. Unfortunately, the patch description is also
> not clear enough about what a "deferred error" means or how userspace
> is supposed to handle it.
>
> Yazen,
>
> Could you please send us a patch adding a proper description for this
> new error code?
>

Hi Mauro,
A deferred error is an uncorrectable error whose handling can be
deferred, i.e. it's not urgent. This affects the system behavior, but
I'm now thinking that this shouldn't affect users' behavior. I think it
would be simpler to just classify deferred errors as uncorrectable
errors so that users treat them as such.

Boris,
Can we drop or revert commit d12a969ebbfc?

And can we apply a fixup like this to commit 713ad54675fd?

---
From: Yazen Ghannam <[email protected]>
Date: Thu, 1 Dec 2016 08:54:49 -0600
Subject: [PATCH] fixup! EDAC, amd64: Define and register UMC error decode
function

Signed-off-by: Yazen Ghannam <[email protected]>
---
drivers/edac/amd64_edac.c | 3 ++-
1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/edac/amd64_edac.c b/drivers/edac/amd64_edac.c
index 991b36c..245b9a0 100644
--- a/drivers/edac/amd64_edac.c
+++ b/drivers/edac/amd64_edac.c
@@ -2480,8 +2480,9 @@ static void decode_umc_error(int node_id, struct mce *m)

memset(&err, 0, sizeof(err));

+ /* Log deferred errors as uncorrectable errors. */
if (m->status & MCI_STATUS_DEFERRED)
- ecc_type = 3;
+ ecc_type = 1;

err.channel = find_umc_channel(pvt, m);
if (err.channel < 0) {
--
2.7.4

---

Thanks,
Yazen

2016-12-01 18:15:11

by Borislav Petkov

[permalink] [raw]
Subject: Re: linux-next: manual merge of the edac-amd tree with the edac tree

On Thu, Dec 01, 2016 at 11:02:04AM -0500, Yazen Ghannam wrote:
> A deferred error is an uncorrectable error whose handling can be
> deferred, i.e. it's not urgent. This affects the system behavior, but
> I'm now thinking that this shouldn't affect users' behavior. I think it
> would be simpler to just classify deferred errors as uncorrectable
> errors so that users treat them as such.

Why would we want to lie about deferred errors being uncorrectable?

And I believe deferred errors can be handled differently like freeze the
process using the page instead of killing it. And so on...

Why aren't you simply adding the documentation about
HW_EVENT_ERR_DEFERRED and be done with it? The downstream path like
tracepoint and all can handle all that just fine.

> Boris,
> Can we drop or revert commit d12a969ebbfc?

No can do. It is a public branch and there's no touching it.

--
Regards/Gruss,
Boris.

Good mailing practices for 400: avoid top-posting and trim the reply.

2016-12-01 19:58:10

by Yazen Ghannam

[permalink] [raw]
Subject: Re: linux-next: manual merge of the edac-amd tree with the edac tree

On Thu, Dec 01, 2016 at 07:15:01PM +0100, Borislav Petkov wrote:
> On Thu, Dec 01, 2016 at 11:02:04AM -0500, Yazen Ghannam wrote:
> > A deferred error is an uncorrectable error whose handling can be
> > deferred, i.e. it's not urgent. This affects the system behavior, but
> > I'm now thinking that this shouldn't affect users' behavior. I think it
> > would be simpler to just classify deferred errors as uncorrectable
> > errors so that users treat them as such.
>
> Why would we want to lie about deferred errors being uncorrectable?
>

They are uncorrectable errors that can be handled differently. If you
can't handle them then there's not much difference.

> And I believe deferred errors can be handled differently like freeze the
> process using the page instead of killing it. And so on...
>

If deferred errors can be handled differently in userspace, then you're
right we should maintain the distinction. I was thinking we'd only
handle them in the kernel.

> Why aren't you simply adding the documentation about
> HW_EVENT_ERR_DEFERRED and be done with it? The downstream path like
> tracepoint and all can handle all that just fine.
>

Okay, will do.

> > Boris,
> > Can we drop or revert commit d12a969ebbfc?
>
> No can do. It is a public branch and there's no touching it.
>

Okay, got it.

Thanks,
Yazen