2021-05-12 09:38:27

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

Hi,

so this is a drive-by review using the lore.kernel.org mail because I
wasn't CCed on this.

On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> + unsigned long val, void *data)
> +{
> + struct mce *m = (struct mce *)data;
> + struct amdgpu_device *adev = NULL;
> + uint32_t gpu_id = 0;
> + uint32_t umc_inst = 0;
> + uint32_t chan_index = 0;
> + struct ras_err_data err_data = {0, 0, 0, NULL};
> + struct eeprom_table_record err_rec;
> + uint64_t retired_page;
> +
> + /*
> + * If the error was generated in UMC_V2, which belongs to GPU UMCs,

Why does it matter if the error is a v2 UMC generated one?

IOW, can you detect the type of errors in GPU memory - I assume this
is what this is supposed to handle - from the error signature itself
instead of doing is_smca_umc_v2?

> + * and error occurred in DramECC (Extended error code = 0) then only
> + * process the error, else bail out.
> + */
> + if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> + return NOTIFY_DONE;
> +
> + gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> +
> + /*
> + * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> + */
> + gpu_id -= GPU_ID_OFFSET;
> +
> + adev = find_adev(gpu_id);
> + if (!adev) {
> + dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> + __func__, gpu_id);
> + return NOTIFY_DONE;
> + }
> +
> + /*
> + * If it is correctable error, then print a message and return.
> + */
> + if (mce_is_correctable(m)) {
> + dev_info(adev->dev, "%s: UMC Correctable error detected.",
> + __func__);

So put yourself in the shoes of a support engineer who starts getting
all those calls about people's hardware getting correctable errors and
should they replace it and should AMD RMA the GPUs and so on and so
on..? Do you really wanna be on the receiving side of that call?

IOW, whom does printing the fact that the GPU had a corrected error
which got corrected by the hardware, help and do you really want to
upset people needlessly?

> + return NOTIFY_OK;
> + }
> +
> + /*
> + * If it is uncorrectable error, then find out UMC instance and
> + * channel index.
> + */
> + find_umc_inst_chan_index(m, &umc_inst, &chan_index);

That's a void function but it could return a pointer to the instance and
channel bundled in a struct maybe...

> +
> + dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d,"
> + " chan_idx: %d", umc_inst, chan_index);
> +
> + memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> +
> + /*
> + * Translate UMC channel address to Physical address
> + */
> + retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> + ADDR_OF_256B_BLOCK(chan_index) |
> + OFFSET_IN_256B_BLOCK(m->addr);
> +
> + err_rec.address = m->addr;
> + err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> + err_rec.ts = (uint64_t)ktime_get_real_seconds();
> + err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> + err_rec.cu = 0;
> + err_rec.mem_channel = chan_index;
> + err_rec.mcumc_id = umc_inst;
> +
> + err_data.err_addr = &err_rec;
> + err_data.err_addr_cnt = 1;
> +
> + if (amdgpu_bad_page_threshold != 0) {
> + amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> + err_data.err_addr_cnt);
> + amdgpu_ras_save_bad_pages(adev);
> + }
> +
> + return NOTIFY_OK;
> +}
> +
> +static struct notifier_block amdgpu_bad_page_nb = {
> + .notifier_call = amdgpu_bad_page_notifier,
> + .priority = MCE_PRIO_ACCEL,

Nothing ever explains why this needs to be a separate priority. So how
about it?

> +};
> +
> +static void amdgpu_register_bad_pages_mca_notifier(void)
> +{
> + /*
> + * Register the x86 notifier with MCE subsystem.
> + * Please note a notifier can be registered only once
> + * with the MCE subsystem.
> + */

Why do you need this? Other users don't need that. Do you need to call
mce_unregister_decode_chain() when the driver gets removed or so?

> + if (notifier_registered == false) {

This is just silly and should be fixed properly once the issue is
understood.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette


2021-05-12 20:38:00

by Joshi, Mukul

[permalink] [raw]
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only - Internal Distribution Only]


> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Wednesday, May 12, 2021 5:37 AM
> To: Joshi, Mukul <[email protected]>
> Cc: [email protected]; Kasiviswanathan, Harish
> <[email protected]>; x86-ml <[email protected]>; lkml <linux-
> [email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> Hi,
>
> so this is a drive-by review using the lore.kernel.org mail because I wasn't CCed
> on this.
>
> On Tue, May 11, 2021 at 09:30:58PM -0400, Mukul Joshi wrote:
> > +static int amdgpu_bad_page_notifier(struct notifier_block *nb,
> > + unsigned long val, void *data) {
> > + struct mce *m = (struct mce *)data;
> > + struct amdgpu_device *adev = NULL;
> > + uint32_t gpu_id = 0;
> > + uint32_t umc_inst = 0;
> > + uint32_t chan_index = 0;
> > + struct ras_err_data err_data = {0, 0, 0, NULL};
> > + struct eeprom_table_record err_rec;
> > + uint64_t retired_page;
> > +
> > + /*
> > + * If the error was generated in UMC_V2, which belongs to GPU
> > + UMCs,
>
> Why does it matter if the error is a v2 UMC generated one?
>
> IOW, can you detect the type of errors in GPU memory - I assume this is what
> this is supposed to handle - from the error signature itself instead of doing
> is_smca_umc_v2?

SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is only interested in errors on GPU UMC.
We cannot know this without is_smca_umc_v2.

>
> > + * and error occurred in DramECC (Extended error code = 0) then only
> > + * process the error, else bail out.
> > + */
> > + if (!m || !(is_smca_umc_v2(m->bank) && (XEC(m->status, 0x1f) == 0x0)))
> > + return NOTIFY_DONE;
> > +
> > + gpu_id = GET_MCA_IPID_GPUID(m->ipid);
> > +
> > + /*
> > + * GPU Id is offset by GPU_ID_OFFSET in MCA_IPID_UMC register.
> > + */
> > + gpu_id -= GPU_ID_OFFSET;
> > +
> > + adev = find_adev(gpu_id);
> > + if (!adev) {
> > + dev_warn(adev->dev, "%s: Unable to find adev for gpu_id: %d\n",
> > + __func__, gpu_id);
> > + return NOTIFY_DONE;
> > + }
> > +
> > + /*
> > + * If it is correctable error, then print a message and return.
> > + */
> > + if (mce_is_correctable(m)) {
> > + dev_info(adev->dev, "%s: UMC Correctable error detected.",
> > + __func__);
>
> So put yourself in the shoes of a support engineer who starts getting all those
> calls about people's hardware getting correctable errors and should they replace
> it and should AMD RMA the GPUs and so on and so on..? Do you really wanna be
> on the receiving side of that call?
>
> IOW, whom does printing the fact that the GPU had a corrected error which got
> corrected by the hardware, help and do you really want to upset people
> needlessly?

Agree. I will remove this debug print.

>
> > + return NOTIFY_OK;
> > + }
> > +
> > + /*
> > + * If it is uncorrectable error, then find out UMC instance and
> > + * channel index.
> > + */
> > + find_umc_inst_chan_index(m, &umc_inst, &chan_index);
>
> That's a void function but it could return a pointer to the instance and channel
> bundled in a struct maybe...

Maybe. I hope its not too much of a concern if it stays the way it is.

>
> > +
> > + dev_info(adev->dev, "Uncorrectable error detected in UMC inst: %d,"
> > + " chan_idx: %d", umc_inst, chan_index);
> > +
> > + memset(&err_rec, 0x0, sizeof(struct eeprom_table_record));
> > +
> > + /*
> > + * Translate UMC channel address to Physical address
> > + */
> > + retired_page = ADDR_OF_8KB_BLOCK(m->addr) |
> > + ADDR_OF_256B_BLOCK(chan_index) |
> > + OFFSET_IN_256B_BLOCK(m->addr);
> > +
> > + err_rec.address = m->addr;
> > + err_rec.retired_page = retired_page >> AMDGPU_GPU_PAGE_SHIFT;
> > + err_rec.ts = (uint64_t)ktime_get_real_seconds();
> > + err_rec.err_type = AMDGPU_RAS_EEPROM_ERR_NON_RECOVERABLE;
> > + err_rec.cu = 0;
> > + err_rec.mem_channel = chan_index;
> > + err_rec.mcumc_id = umc_inst;
> > +
> > + err_data.err_addr = &err_rec;
> > + err_data.err_addr_cnt = 1;
> > +
> > + if (amdgpu_bad_page_threshold != 0) {
> > + amdgpu_ras_add_bad_pages(adev, err_data.err_addr,
> > + err_data.err_addr_cnt);
> > + amdgpu_ras_save_bad_pages(adev);
> > + }
> > +
> > + return NOTIFY_OK;
> > +}
> > +
> > +static struct notifier_block amdgpu_bad_page_nb = {
> > + .notifier_call = amdgpu_bad_page_notifier,
> > + .priority = MCE_PRIO_ACCEL,
>
> Nothing ever explains why this needs to be a separate priority. So how about it?

I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
are two different class of devices.
That is the reason I create a new one.
I am OK to use EDAC priority if that is acceptable.

>
> > +};
> > +
> > +static void amdgpu_register_bad_pages_mca_notifier(void)
> > +{
> > + /*
> > + * Register the x86 notifier with MCE subsystem.
> > + * Please note a notifier can be registered only once
> > + * with the MCE subsystem.
> > + */
>
> Why do you need this? Other users don't need that. Do you need to call
> mce_unregister_decode_chain() when the driver gets removed or so?
>
> > + if (notifier_registered == false) {
>
> This is just silly and should be fixed properly once the issue is understood.

A system can have multiple GPUs and we only want a single notifier registered.
I will change the comment to explicitly state this.

Thanks,
Mukul

>
> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7Cmukul.joshi%40amd.com%7C1c231207786
> 446018e7208d915297942%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637564090158211378%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =7LSOnoJWrTf5z96YFACxuRKZP1z4E4zkvtrNzjbTaPs%3D&amp;reserved=0

2021-05-12 22:23:50

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> only interested in errors on GPU UMC.

So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.

> We cannot know this without is_smca_umc_v2.

You don't need it - just export smca_get_bank_type() and test the bank
type at the call site.

> Maybe. I hope its not too much of a concern if it stays the way it is.

That was just a suggestion anyway - it is not code I maintain so not my
call.

> I wasn't really sure if I should use the EDAC priority here or create a new one for Accelerator devices.
> I thought using EDAC priority might not be accepted by the maintainers as EDAC and GPU (Accelerator) devices
> are two different class of devices.
> That is the reason I create a new one.
> I am OK to use EDAC priority if that is acceptable.

I don't know what's acceptable because I still am unclear as to what
that thing is supposed to do. It seems you are interested only in
uncorrectable errors.

How are those errors reported? #MC exception, deferred interrupt, simply
logged in the bank and we find them by polling?

Then, the commit message is talking about some "bad page retirement".
What does that do? What can the user do when she sees the

"Uncorrectable error detected in UMC..."

message?

It depends on what "retiring" of GPU pages means...

In any case, dmesg should issue a human-understandable message about the
recovery action being done and what that means for the user: should she
replace the GPU, should she ignore, etc, etc.

> A system can have multiple GPUs and we only want a single notifier
> registered. I will change the comment to explicitly state this.

Actually, the notifier registration should be able to return a different
retval to state that a callback has already been registered but that
warns only currently so I'm guessing we're stuck with such ugly
"workarounds" for their shortcomings.

I'm gonna take a look whether they can be fixed though so that you don't
have to do this notifier_registered thing.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-13 03:24:54

by Joshi, Mukul

[permalink] [raw]
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Wednesday, May 12, 2021 5:06 PM
> To: Joshi, Mukul <[email protected]>
> Cc: [email protected]; Kasiviswanathan, Harish
> <[email protected]>; x86-ml <[email protected]>; lkml <linux-
> [email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> On Wed, May 12, 2021 at 07:00:58PM +0000, Joshi, Mukul wrote:
> > SMCA UMCv2 corresponds to GPU's UMC MCA bank and the GPU driver is
> > only interested in errors on GPU UMC.
>
> So that thing should be called SMCA_GPU_UMC not SMCA_UMC_V2.
>
> > We cannot know this without is_smca_umc_v2.
>
> You don't need it - just export smca_get_bank_type() and test the bank type at
> the call site.

Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
driver when CONFIG_X86_MCE_AMD is not defined.
I can avoid all that by using is_smca_umc_v2().
I think it would be cleaner with using is_smca_umc_v2().

>
> > Maybe. I hope its not too much of a concern if it stays the way it is.
>
> That was just a suggestion anyway - it is not code I maintain so not my call.
>
> > I wasn't really sure if I should use the EDAC priority here or create a new one
> for Accelerator devices.
> > I thought using EDAC priority might not be accepted by the maintainers
> > as EDAC and GPU (Accelerator) devices are two different class of devices.
> > That is the reason I create a new one.
> > I am OK to use EDAC priority if that is acceptable.
>
> I don't know what's acceptable because I still am unclear as to what that thing is
> supposed to do. It seems you are interested only in uncorrectable errors.
>

You can think of GPU device as a EDAC device here. It is mainly interested in
handling uncorrectable errors.

> How are those errors reported? #MC exception, deferred interrupt, simply
> logged in the bank and we find them by polling?
>

It is a deferred interrupt that generates an MCE.

> Then, the commit message is talking about some "bad page retirement".
> What does that do? What can the user do when she sees the
>
> "Uncorrectable error detected in UMC..."
>
> message?
>
> It depends on what "retiring" of GPU pages means...
>
> In any case, dmesg should issue a human-understandable message about the
> recovery action being done and what that means for the user: should she
> replace the GPU, should she ignore, etc, etc.
>

When an uncorrectable error is detected on the GPU UMC, all we are doing is
determining the physical address where the error occurred and then "retiring"
the page that address belongs to.
By retiring, we mean we reserve the page so that it is not available for allocations
to any applications.
We are providing information to the user by storing all the information about the
retired pages in EEPROM. This can be accessed through sysfs.

Hope it clears what "bad page retirement" is achieving.

Thanks,
Mukul

> > A system can have multiple GPUs and we only want a single notifier
> > registered. I will change the comment to explicitly state this.
>
> Actually, the notifier registration should be able to return a different retval to
> state that a callback has already been registered but that warns only currently so
> I'm guessing we're stuck with such ugly "workarounds" for their shortcomings.
>
> I'm gonna take a look whether they can be fixed though so that you don't have
> to do this notifier_registered thing.
>
> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C5e305845504
> 14bc53c3008d91589b50b%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637564503481206042%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =fF2iaSQxAREAV2OwLi%2F8cN9uRuqNNKim2FpWJUBQS98%3D&amp;reserved=
> 0

2021-05-13 10:16:09

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
> I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
> driver when CONFIG_X86_MCE_AMD is not defined.
> I can avoid all that by using is_smca_umc_v2().
> I think it would be cleaner with using is_smca_umc_v2().

See how smca_get_long_name() is exported and export that function the
same way.

To save you some energy: is_smca_umc_v2() is not going to happen.

> You can think of GPU device as a EDAC device here. It is mainly
> interested in handling uncorrectable errors.

An EDAC "device", as you call it, is not interested in handling UEs. If
anything, it counts them.

> It is a deferred interrupt that generates an MCE.

Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?

> When an uncorrectable error is detected on the GPU UMC, all we are
> doing is determining the physical address where the error occurred and
> then "retiring" the page that address belongs to.

What page is that? Normal DRAM page or a page in some special GPU memory?

> By retiring, we mean we reserve the page so that it is not available
> for allocations to any applications.

We do that for normal DRAM memory pages by poisoning them. I hope you
don't mean that.

Looking at

amdgpu_ras_add_bad_pages
|-> amdgpu_vram_mgr_reserve_range

that's some VRAM thing so I'm guessing special memory on the GPU.

If so, what happens with all those "retired" pages when you reboot?
They're getting used again and potentially trigger the same UEs and the
same retiring happens?

> We are providing information to the user by storing all the
> information about the retired pages in EEPROM. This can be accessed
> through sysfs.

Ok, I'm a user and I can access that information through sysfs. What can
I do with it?

> Hope it clears what "bad page retirement" is achieving.

It is getting there.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-13 21:12:21

by Alex Deucher

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 9:26 AM Borislav Petkov <[email protected]> wrote:
>
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile the amdgpu
> > driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
>
> See how smca_get_long_name() is exported and export that function the
> same way.
>
> To save you some energy: is_smca_umc_v2() is not going to happen.
>
> > You can think of GPU device as a EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
>
> An EDAC "device", as you call it, is not interested in handling UEs. If
> anything, it counts them.
>
> > It is a deferred interrupt that generates an MCE.
>
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
>
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
>
> What page is that? Normal DRAM page or a page in some special GPU memory?
>

GPU memory.

> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
>
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
>
> Looking at
>
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
>
> that's some VRAM thing so I'm guessing special memory on the GPU.
>

Yes.

> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the
> same retiring happens?

The bad pages are stored in an EEPROM on the board and the next time
the driver loads it reads the EEPROM so that it can reserve the bad
pages at init time so they don't get used again.

Alex


>
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
>
> Ok, I'm a user and I can access that information through sysfs. What can
> I do with it?
>
> > Hope it clears what "bad page retirement" is achieving.
>
> It is getting there.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://people.kernel.org/tglx/notes-about-netiquette
> _______________________________________________
> amd-gfx mailing list
> [email protected]
> https://lists.freedesktop.org/mailman/listinfo/amd-gfx

2021-05-13 21:17:15

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 10:17:47AM -0400, Alex Deucher wrote:
> The bad pages are stored in an EEPROM on the board and the next time
> the driver loads it reads the EEPROM so that it can reserve the bad
> pages at init time so they don't get used again.

And that works automagically on the next boot? Because that sounds like
the right thing to do.

So practically, what happens to a GPU in such a case where the VRAM
starts going bad? It might get exhausted eventually and the driver will
say something along the lines of:

"VRAM bad pages: 80%, consider replacing the GPU. It is operating
currently with degrated performance."

or so?

Yap, from a RAS perspective, that makes good sense as you're prolonging
the life of the component while still remains operational as good as it
can and the only user interaction you need is she/he replacing it.

Sounds good.

Thx.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-14 02:08:25

by Joshi, Mukul

[permalink] [raw]
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Thursday, May 13, 2021 5:53 AM
> To: Joshi, Mukul <[email protected]>
> Cc: [email protected]; Kasiviswanathan, Harish
> <[email protected]>; x86-ml <[email protected]>; lkml <linux-
> [email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> On Thu, May 13, 2021 at 03:20:36AM +0000, Joshi, Mukul wrote:
> > Exporting smca_get_bank_type() works fine when CONFIG_X86_MCE_AMD is
> defined.
> > I would need to put #ifdef CONFIG_X86_MCE_AMD in my code to compile
> > the amdgpu driver when CONFIG_X86_MCE_AMD is not defined.
> > I can avoid all that by using is_smca_umc_v2().
> > I think it would be cleaner with using is_smca_umc_v2().
>
> See how smca_get_long_name() is exported and export that function the same
> way.
>

That's probably not the best example to look at.
smca_get_long_name() is used in drivers/edac/mce_amd.c and this file doesn't
get compiled when CONFIG_X86_MCE_AMD is not defined.

And amdgpu driver has no dependency on CONFIG_X86_MCE_AMD.

So here is one option that we can try:
1. Export smca_get_bank_type().
2. I wrap my entire code in GPU driver with #ifdef CONFIG_X86_MCE_AMD

Will that work for you?

Thanks,
Mukul

> To save you some energy: is_smca_umc_v2() is not going to happen.


>
> > You can think of GPU device as a EDAC device here. It is mainly
> > interested in handling uncorrectable errors.
>
> An EDAC "device", as you call it, is not interested in handling UEs. If anything, it
> counts them.
>
> > It is a deferred interrupt that generates an MCE.
>
> Is that the same deferred interrupt which calls amd_deferred_error_interrupt() ?
>
> > When an uncorrectable error is detected on the GPU UMC, all we are
> > doing is determining the physical address where the error occurred and
> > then "retiring" the page that address belongs to.
>
> What page is that? Normal DRAM page or a page in some special GPU memory?
>
> > By retiring, we mean we reserve the page so that it is not available
> > for allocations to any applications.
>
> We do that for normal DRAM memory pages by poisoning them. I hope you
> don't mean that.
>
> Looking at
>
> amdgpu_ras_add_bad_pages
> |-> amdgpu_vram_mgr_reserve_range
>
> that's some VRAM thing so I'm guessing special memory on the GPU.
>
> If so, what happens with all those "retired" pages when you reboot?
> They're getting used again and potentially trigger the same UEs and the same
> retiring happens?
>
> > We are providing information to the user by storing all the
> > information about the retired pages in EEPROM. This can be accessed
> > through sysfs.
>
> Ok, I'm a user and I can access that information through sysfs. What can I do
> with it?
>
> > Hope it clears what "bad page retirement" is achieving.
>
> It is getting there.
>
> Thx.
>
> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7Cd8c660fce3a2
> 4ce3c6d408d915f4efa6%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0%
> 7C637564964013263414%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAwM
> DAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata=
> %2BnJ%2B99N%2FRljoHGALimZHZG%2Bmf9jL5zP2eA44I6pbzFY%3D&amp;reser
> ved=0

2021-05-14 09:57:08

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Thu, May 13, 2021 at 11:10:34PM +0000, Joshi, Mukul wrote:
> That's probably not the best example to look at.

Oh, it is the *perfect* example but...

> smca_get_long_name() is used in drivers/edac/mce_amd.c and this file
> doesn't get compiled when CONFIG_X86_MCE_AMD is not defined.
>
> And amdgpu driver has no dependency on CONFIG_X86_MCE_AMD.

... maybe this will make you see it the right way: how much of the
amdgpu RAS functionality you're adding, is going to even function
without CONFIG_X86_MCE_AMD?

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette

2021-05-14 18:32:57

by Joshi, Mukul

[permalink] [raw]
Subject: RE: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

[AMD Official Use Only - Internal Distribution Only]



> -----Original Message-----
> From: Borislav Petkov <[email protected]>
> Sent: Friday, May 14, 2021 3:06 AM
> To: Joshi, Mukul <[email protected]>
> Cc: [email protected]; Kasiviswanathan, Harish
> <[email protected]>; x86-ml <[email protected]>; lkml <linux-
> [email protected]>
> Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran
>
> [CAUTION: External Email]
>
> On Thu, May 13, 2021 at 11:10:34PM +0000, Joshi, Mukul wrote:
> > That's probably not the best example to look at.
>
> Oh, it is the *perfect* example but...
>

I would prefer not to get into an argument of if it is the perfect example or not.
I would prefer to get the job done and move forward.

> > smca_get_long_name() is used in drivers/edac/mce_amd.c and this file
> > doesn't get compiled when CONFIG_X86_MCE_AMD is not defined.
> >
> > And amdgpu driver has no dependency on CONFIG_X86_MCE_AMD.
>
> ... maybe this will make you see it the right way: how much of the amdgpu RAS
> functionality you're adding, is going to even function without
> CONFIG_X86_MCE_AMD?
>

We have RAS functionality in other ASICs that is not dependent on CONFIG_X86_MCE_AMD.
So, I don't think we would want to do that just for one ASIC.
Maybe Alex D. has an opinion on it.

As I had mentioned in my previous comment, I am fine with wrapping around my code with
#ifdef CONFIG_X86_MCE_AMD and exporting smca_get_bank_type().
I hope we can move forward with this approach.

Thanks,
Mukul

> --
> Regards/Gruss,
> Boris.
>
> https://nam11.safelinks.protection.outlook.com/?url=https%3A%2F%2Fpeople.
> kernel.org%2Ftglx%2Fnotes-about-
> netiquette&amp;data=04%7C01%7CMukul.Joshi%40amd.com%7C7fded9165e6
> 64dde943808d916a6b33f%7C3dd8961fe4884e608e11a82d994e183d%7C0%7C0
> %7C637565727509040995%7CUnknown%7CTWFpbGZsb3d8eyJWIjoiMC4wLjAw
> MDAiLCJQIjoiV2luMzIiLCJBTiI6Ik1haWwiLCJXVCI6Mn0%3D%7C1000&amp;sdata
> =5hNyAEla4H%2BN0kMdCDs6iL0Dr8mVQJNxKA0UXJU7%2F8c%3D&amp;reserv
> ed=0

2021-05-14 22:53:17

by Borislav Petkov

[permalink] [raw]
Subject: Re: [PATCH] drm/amdgpu: Register bad page handler for Aldebaran

On Fri, May 14, 2021 at 01:06:33PM +0000, Joshi, Mukul wrote:
> We have RAS functionality in other ASICs that is not dependent on
> CONFIG_X86_MCE_AMD. So, I don't think we would want to do that just
> for one ASIC.

Lemme try again: you said that those errors do get reported through a
deferred interrupt. Which is likely amd_deferred_error_interrupt().

If it is that interrupt and you don't have CONFIG_X86_MCE_AMD enabled,
then you won't get any errors reported and your RAS functionality will
simply sit there inactive.

So if that above is true - something to which I'm still not getting
an answer but maybe one fine day... - so if that above is true, your
RAS functionality *needs* CONFIG_X86_MCE_AMD to be enabled in order to
*actually* function.

So you *must* make your RAS functionality depend on CONFIG_X86_MCE_AMD
- otherwise no deferred interrupts and no errors reported. It is that
simple.

--
Regards/Gruss,
Boris.

https://people.kernel.org/tglx/notes-about-netiquette