LinuxLists.cc - Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

2021-05-13 14:35:38

Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 10:30 AM Borislav Petkov <[email protected]> wrote:
>
> On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> > The bad pages are stored in an EEPROM on the board and the next time
> > the driver loads it reads the EEPROM so that it can reserve the bad
> > pages at init time so they don't get used again.
>
> And that works automagically on the next boot? Because that sounds like
> the right thing to do.

Yes, or driver reload, suspend/resume, etc.

>
> So practically, what happens to a GPU in such a case where the VRAM
> starts going bad? It might get exhausted eventually and the driver will
> say something along the lines of:
>
> "VRAM bad pages: 80%, consider replacing the GPU. It is operating
> currently with degrated performance."
>
> or so?

Right. The sys admin can query the bad page count and decide when to
retire the card.

>
> Yap, from a RAS perspective, that makes good sense as you're prolonging
> the life of the component while still remains operational as good as it
> can and the only user interaction you need is she/he replacing it.
>
> Sounds good.

Yes. That's the idea.

Alex

>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2021-05-13 15:00:13

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> Right. The sys admin can query the bad page count and decide when to
> retire the card.

Yap, although the driver should actively "tell" the sysadmin when some
critical counts of retired VRAM pages are reached because I doubt all
admins would go look at those counts on their own.

Btw, you say "admin" - am I to understand that those are some high end
GPU cards with ECC memory? If consumer grade stuff has this too, then
the driver should very much warn on such levels on its own because
normal users won't know what and where to look.

Other than that, the big picture sounds good to me.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-13 22:33:50

by Alex Deucher

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 10:57 AM Borislav Petkov <[email protected]> wrote:
>
> On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> > Right. The sys admin can query the bad page count and decide when to
> > retire the card.
>
> Yap, although the driver should actively "tell" the sysadmin when some
> critical counts of retired VRAM pages are reached because I doubt all
> admins would go look at those counts on their own.

I think we print something in the log as well when we hit the
threshold. I need to double check the code.

>
> Btw, you say "admin" - am I to understand that those are some high end
> GPU cards with ECC memory? If consumer grade stuff has this too, then
> the driver should very much warn on such levels on its own because
> normal users won't know what and where to look.
>

Currently it's only available on workstation and datacenter boards.

> Other than that, the big picture sounds good to me.

Thanks!

Alex

>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette

2021-05-14 02:08:55

by Joshi, Mukul

[permalink] [raw]

Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only - Internal Distribution Only]

> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Thursday, May 13, 2021 10:58 AM
> To: Alex Deucher <[email protected]>
> Cc: Joshi, Mukul <[email protected]>; x86-ml <[email protected]>;
> Kasiviswanathan, Harish <[email protected]>; lkml <linux-
> [email protected]>; [email protected]
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> On Thu, May 13, 2021 at 10:32:45AM -0400, Alex Deucher wrote:
> > Right. The sys admin can query the bad page count and decide when to
> > retire the card.
>
> Yap, although the driver should actively "tell" the sysadmin when some critical
> counts of retired VRAM pages are reached because I doubt all admins would go
> look at those counts on their own.
>
> Btw, you say "admin" - am I to understand that those are some high end GPU
> cards with ECC memory? If consumer grade stuff has this too, then the driver
> should very much warn on such levels on its own because normal users won't
> know what and where to look.
>
> Other than that, the big picture sounds good to me.
>

Since now you are OK with how page retirement works, lets revisit the original
question.

Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us to use
something else?

Thanks,
Mukul

> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&data=04%7C01%7CMukul.Joshi%40amd.com%7C50588f11ed5
> 3456b03e008d9161f765c%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637565146658376385%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata
> =Es0FMDNzNEKgxvFiqe1kOo9aEPK6%2BOXrhI5aWs3QH9Q%3D&reserved=
> 0

2021-05-14 09:56:21

by Borislav Petkov

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 11:14:30PM +0000, Joshi, Mukul wrote:
> Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us to use
> something else?

I still don't know why a separate priority is needed. Maybe this still
needs answering:

> It is a deferred interrupt that generates an MCE.

Is that the same deferred interrupt which calls
amd_deferred_error_interrupt() ?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-28 01:22:05

by Joshi, Mukul

[permalink] [raw]

Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only]

> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Friday, May 14, 2021 3:04 AM
> To: Joshi, Mukul <[email protected]>
> Cc: Alex Deucher <[email protected]>; x86-ml <[email protected]>;
> Kasiviswanathan, Harish <[email protected]>; lkml <linux-
> [email protected]>; [email protected]
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> On Thu, May 13, 2021 at 11:14:30PM +0000, Joshi, Mukul wrote:
> > Are you OK with a new MCE priority (MCE_PRIO_ACCEL) or do you want us
> > to use something else?
>
> I still don't know why a separate priority is needed. Maybe this still needs
> answering:
>
> > It is a deferred interrupt that generates an MCE.
>
> Is that the same deferred interrupt which calls
> amd_deferred_error_interrupt() ?

Sorry picking this up after sometime. I thought I had replied to this email.
Yes it is the same deferred interrupt which calls amd_deferred_error_interrupt().

Thanks,
Mukul

>
> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&data=04%7C01%7CMukul.Joshi%40amd.com%7C7e04d0ad3c6
> 147a8dfd008d916a66bbc%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637565726326540654%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata
> =AWHJvqIbtkGl6OPC86alAtnxsleq6UVe9mcM1MjxpUQ%3D&reserved=0

2021-06-03 21:14:54

by Yazen Ghannam

[permalink] [raw]

Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
...
> > Is that the same deferred interrupt which calls
> > amd_deferred_error_interrupt() ?
>
> Sorry picking this up after sometime. I thought I had replied to this email.
> Yes it is the same deferred interrupt which calls amd_deferred_error_interrupt().
>

Mukul,

Do you expect that the driver will need to mark pages with high
correctable error counts as bad? I think the hardware folks may want the
GPU memory errors to be handled more aggressively than CPU memory
errors. The specific threshold may change from product to product, so it
may make sense to hardcode this in the driver.

We have similar functionality in the Correctable Errors Collector. But
enterprise users may prefer a direct approach done in the driver (based
on the hardware experts' guidance) instead of configuring the kernel at
runtime.

So I think having a separate priority may make sense if some special
functionality, or combination of behaviors, is needed which don't fall
under any exisiting things. In this case, "special functionality" could
be that the GPU memory needs to be handled differently than CPU memory.

Another thing is that this behavior is similar to the NFIT behavior,
i.e. there's a memory error on an external device that needs to be
handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT to
be generic (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with
the same priority is okay, right?

Thanks,
Yazen

2021-07-30 00:00:44

by Joshi, Mukul

[permalink] [raw]

Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only]

> -----Original Message-----
> From: Ghannam, Yazen <[email protected]>
> Sent: Thursday, June 3, 2021 5:13 PM
> To: Joshi, Mukul <[email protected]>
> Cc: Borislav Petkov <[email protected]>; Alex Deucher <[email protected]>;
> x86-ml <[email protected]>; Kasiviswanathan, Harish
> <[email protected]>; lkml <[email protected]>;
> [email protected]
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> ...
> > > Is that the same deferred interrupt which calls
> > > amd_deferred_error_interrupt() ?
> >
> > Sorry picking this up after sometime. I thought I had replied to this email.
> > Yes it is the same deferred interrupt which calls
> amd_deferred_error_interrupt().
> >
>
> Mukul,
>
> Do you expect that the driver will need to mark pages with high correctable
> error counts as bad? I think the hardware folks may want the GPU memory
> errors to be handled more aggressively than CPU memory errors. The specific
> threshold may change from product to product, so it may make sense to
> hardcode this in the driver.
>

Sorry I missed this email completely. Just saw it so responding now.

At the moment, we don't have a requirement to mark a page "bad" if there is a high correctable error counts.
Our previous GPU ASICs which support RAS, also do not have such a feature.
But you make a good point. It might be worthwhile to go and ask the hardware folks about it.

> We have similar functionality in the Correctable Errors Collector. But enterprise
> users may prefer a direct approach done in the driver (based on the hardware
> experts' guidance) instead of configuring the kernel at runtime.
>
> So I think having a separate priority may make sense if some special
> functionality, or combination of behaviors, is needed which don't fall under any
> exisiting things. In this case, "special functionality" could be that the GPU
> memory needs to be handled differently than CPU memory.
>
> Another thing is that this behavior is similar to the NFIT behavior, i.e. there's a
> memory error on an external device that needs to be handled by the device's
> driver. So maybe we can rename MCE_PRIO_NFIT to be generic
> (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same priority
> is okay, right?
>
With respect to MCE priority, I was thinking of using the MCE_PRIO_EDAC instead of creating a new priority as the code in the GPU driver is doing error detection and handling the uncorrectable errors.
Not sure if that aligns with the definition of EDAC device in the kernel.

What do you think?

Regards,
Mukul

> Thanks,
> Yazen

2021-09-13 01:33:25

by Joshi, Mukul

[permalink] [raw]

Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only]

> -----Original Message-----
> From: amd-gfx <[email protected]> On Behalf Of Joshi,
> Mukul
> Sent: Thursday, July 29, 2021 8:00 PM
> To: Ghannam, Yazen <[email protected]>
> Cc: x86-ml <[email protected]>; Kasiviswanathan, Harish
> <[email protected]>; lkml <[email protected]>;
> [email protected]; Borislav Petkov <[email protected]>; Alex Deucher
> <[email protected]>
> Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> [AMD Official Use Only]
>
>
>
> > -----Original Message-----
> > From: Ghannam, Yazen <[email protected]>
> > Sent: Thursday, June 3, 2021 5:13 PM
> > To: Joshi, Mukul <[email protected]>
> > Cc: Borislav Petkov <[email protected]>; Alex Deucher
> > <[email protected]>; x86-ml <[email protected]>; Kasiviswanathan,
> > Harish <[email protected]>; lkml
> > <[email protected]>; [email protected]
> > Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for
> > Aldebaran
> >
> > On Thu, May 27, 2021 at 03:54:27PM -0400, Joshi, Mukul wrote:
> > ...
> > > > Is that the same deferred interrupt which calls
> > > > amd_deferred_error_interrupt() ?
> > >
> > > Sorry picking this up after sometime. I thought I had replied to this email.
> > > Yes it is the same deferred interrupt which calls
> > amd_deferred_error_interrupt().
> > >
> >
> > Mukul,
> >
> > Do you expect that the driver will need to mark pages with high
> > correctable error counts as bad? I think the hardware folks may want
> > the GPU memory errors to be handled more aggressively than CPU memory
> > errors. The specific threshold may change from product to product, so
> > it may make sense to hardcode this in the driver.
> >
>
> Sorry I missed this email completely. Just saw it so responding now.
>
> At the moment, we don't have a requirement to mark a page "bad" if there is a
> high correctable error counts.
> Our previous GPU ASICs which support RAS, also do not have such a feature.
> But you make a good point. It might be worthwhile to go and ask the hardware
> folks about it.
>
> > We have similar functionality in the Correctable Errors Collector. But
> > enterprise users may prefer a direct approach done in the driver
> > (based on the hardware experts' guidance) instead of configuring the kernel at
> runtime.
> >
> > So I think having a separate priority may make sense if some special
> > functionality, or combination of behaviors, is needed which don't fall
> > under any exisiting things. In this case, "special functionality"
> > could be that the GPU memory needs to be handled differently than CPU
> memory.
> >
> > Another thing is that this behavior is similar to the NFIT behavior,
> > i.e. there's a memory error on an external device that needs to be
> > handled by the device's driver. So maybe we can rename MCE_PRIO_NFIT
> > to be generic
> > (MCE_PRIO_EXTERNAL?) and use that? Multiple notifiers with the same
> > priority is okay, right?
> >
> With respect to MCE priority, I was thinking of using the MCE_PRIO_EDAC
> instead of creating a new priority as the code in the GPU driver is doing error
> detection and handling the uncorrectable errors.
> Not sure if that aligns with the definition of EDAC device in the kernel.
>
> What do you think?
>
> Regards,
> Mukul
>

After talking to Yazen, MCE_PRIO_UC might be a better choice for the MCE priority as we are dealing
only with uncorrectable errors.
I will be sending out a v2 patch with changes to use the MCE_PRIO_UC and drop the MCE_PRIO_ACCEL and see what the feedback is.

Thanks,
Mukul

> > Thanks,
> > Yazen
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Flists.fre
> edesktop.org%2Fmailman%2Flistinfo%2Famd-
> gfx&data=04%7C01%7Cmukul.joshi%40amd.com%7C7d32897fddef448ab0
> aa08d952ecf41f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%7C6376
> 31999953383488%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwMDAiLCJ
> QIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&sdata=YWZz9
> OYTMOhBl4183kV5ZYj01yw0xwNj%2BjTdXejFKH8%3D&reserved=0