2024-02-27 21:00:29

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH 1/4] devcoredump: Add dev_coredump_put()

This is useful for drivers that don't want to keep a coredump
available after unloading.
Otherwise, the coredump would only be removed after DEVCD_TIMEOUT
seconds.

Cc: Rodrigo Vivi <[email protected]>
Cc: Mukesh Ojha <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/base/devcoredump.c | 22 ++++++++++++++++++++++
include/linux/devcoredump.h | 5 +++++
2 files changed, 27 insertions(+)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 7e2d1f0d903a6..e96427411b87c 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -304,6 +304,28 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
offset);
}

+/**
+ * dev_coredump_put - remove device coredump
+ * @dev: the struct device for the crashed device
+ *
+ * If giving device has a coredump this removes it from file system and free
+ * associated data otherwise does nothing.
+ * This is useful for drivers that don't want to keep a coredump
+ * available after unloading.
+ */
+void dev_coredump_put(struct device *dev)
+{
+ struct device *existing;
+
+ existing = class_find_device(&devcd_class, NULL, dev,
+ devcd_match_failing);
+ if (existing) {
+ devcd_free(existing, NULL);
+ put_device(existing);
+ }
+}
+EXPORT_SYMBOL_GPL(dev_coredump_put);
+
/**
* dev_coredumpm - create device coredump with read/free methods
* @dev: the struct device for the crashed device
diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
index c008169ed2c6f..c8f7eb6cc1915 100644
--- a/include/linux/devcoredump.h
+++ b/include/linux/devcoredump.h
@@ -63,6 +63,8 @@ void dev_coredumpm(struct device *dev, struct module *owner,

void dev_coredumpsg(struct device *dev, struct scatterlist *table,
size_t datalen, gfp_t gfp);
+
+void dev_coredump_put(struct device *dev);
#else
static inline void dev_coredumpv(struct device *dev, void *data,
size_t datalen, gfp_t gfp)
@@ -85,6 +87,9 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
{
_devcd_free_sgtable(table);
}
+static inline void dev_coredump_put(struct device *dev)
+{
+}
#endif /* CONFIG_DEV_COREDUMP */

#endif /* __DEVCOREDUMP_H */
--
2.44.0



2024-02-27 21:00:40

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH 2/4] devcoredump: Add dev_coredump_timeout_set()

Add function to set a custom coredump timeout.

Current 5-minute timeout may be too short for users to search and
understand what needs to be done to capture coredump to report bugs.

Cc: Rodrigo Vivi <[email protected]>
Cc: Mukesh Ojha <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/base/devcoredump.c | 23 +++++++++++++++++++++++
include/linux/devcoredump.h | 6 ++++++
2 files changed, 29 insertions(+)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index e96427411b87c..2857ceb3eb3aa 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -326,6 +326,29 @@ void dev_coredump_put(struct device *dev)
}
EXPORT_SYMBOL_GPL(dev_coredump_put);

+/**
+ * dev_coredump_timeout_set - set coredump timeout
+ * @dev: the struct device for the crashed device
+ * @secs_timeout: new timeout in seconds
+ *
+ * If giving device has a coredump this sets a new timeout for coredump.
+ */
+void dev_coredump_timeout_set(struct device *dev, unsigned long secs_timeout)
+{
+ struct device *existing;
+ struct devcd_entry *devcd;
+
+ existing = class_find_device(&devcd_class, NULL, dev,
+ devcd_match_failing);
+ if (!existing)
+ return;
+
+ devcd = dev_to_devcd(existing);
+ mod_delayed_work(system_wq, &devcd->del_wk, HZ * secs_timeout);
+ put_device(existing);
+}
+EXPORT_SYMBOL_GPL(dev_coredump_timeout_set);
+
/**
* dev_coredumpm - create device coredump with read/free methods
* @dev: the struct device for the crashed device
diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
index c8f7eb6cc1915..70fe72a5c6d36 100644
--- a/include/linux/devcoredump.h
+++ b/include/linux/devcoredump.h
@@ -65,6 +65,7 @@ void dev_coredumpsg(struct device *dev, struct scatterlist *table,
size_t datalen, gfp_t gfp);

void dev_coredump_put(struct device *dev);
+void dev_coredump_timeout_set(struct device *dev, unsigned long secs_timeout);
#else
static inline void dev_coredumpv(struct device *dev, void *data,
size_t datalen, gfp_t gfp)
@@ -90,6 +91,11 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
static inline void dev_coredump_put(struct device *dev)
{
}
+
+static inline void dev_coredump_timeout_set(struct device *dev,
+ unsigned long secs_timeout)
+{
+}
#endif /* CONFIG_DEV_COREDUMP */

#endif /* __DEVCOREDUMP_H */
--
2.44.0


2024-02-27 21:41:50

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH 3/4] drm/xe: Remove devcoredump during driver release

This will remove devcoredump from file system and free its resources
during driver unload.

This fix the driver unload after gpu hang happened, otherwise this
it would report that Xe KMD is still in use and it would leave the
kernel in a state that Xe KMD can't be unload without a reboot.

Cc: Rodrigo Vivi <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 13 ++++++++++++-
drivers/gpu/drm/xe/xe_devcoredump.h | 5 +++++
drivers/gpu/drm/xe/xe_device.c | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 68d3d623a05bf..3fef81396fe8a 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -9,6 +9,8 @@
#include <linux/devcoredump.h>
#include <generated/utsrelease.h>

+#include <drm/drm_managed.h>
+
#include "xe_device.h"
#include "xe_exec_queue.h"
#include "xe_force_wake.h"
@@ -231,5 +233,14 @@ void xe_devcoredump(struct xe_sched_job *job)
dev_coredumpm(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
xe_devcoredump_read, xe_devcoredump_free);
}
-#endif

+static void xe_driver_devcoredump_fini(struct drm_device *drm, void *arg)
+{
+ dev_coredump_put(drm->dev);
+}
+
+int xe_devcoredump_init(struct xe_device *xe)
+{
+ return drmm_add_action_or_reset(&xe->drm, xe_driver_devcoredump_fini, xe);
+}
+#endif
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.h b/drivers/gpu/drm/xe/xe_devcoredump.h
index df8671f0b5eb2..9eba67f37234f 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.h
+++ b/drivers/gpu/drm/xe/xe_devcoredump.h
@@ -11,10 +11,15 @@ struct xe_sched_job;

#ifdef CONFIG_DEV_COREDUMP
void xe_devcoredump(struct xe_sched_job *job);
+int xe_devcoredump_init(struct xe_device *xe);
#else
static inline void xe_devcoredump(struct xe_sched_job *job)
{
}
+static inline int xe_devcoredump_init(struct xe_device *xe)
+{
+ return 0;
+}
#endif

#endif
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 919ad88f0495a..22be76537c7da 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -20,6 +20,7 @@
#include "regs/xe_regs.h"
#include "xe_bo.h"
#include "xe_debugfs.h"
+#include "xe_devcoredump.h"
#include "xe_dma_buf.h"
#include "xe_drm_client.h"
#include "xe_drv.h"
@@ -502,6 +503,9 @@ int xe_device_probe(struct xe_device *xe)
return err;
}

+ err = xe_devcoredump_init(xe);
+ if (err)
+ return err;
err = drmm_add_action_or_reset(&xe->drm, xe_driver_flr_fini, xe);
if (err)
return err;
--
2.44.0


2024-02-27 21:45:21

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH 4/4] drm/xe: Increase devcoredump timeout

5 minutes is too short for a regular user to search and understand
what he needs to do to report capture devcoredump and report a bug to
us, so here increasing this timeout to 1 hour.

Cc: Rodrigo Vivi <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 3fef81396fe8a..f1a8ccbed7f13 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -52,6 +52,8 @@

#ifdef CONFIG_DEV_COREDUMP

+#define XE_COREDUMP_TIMEOUT_SECS (60 * 60)
+
static struct xe_device *coredump_to_xe(const struct xe_devcoredump *coredump)
{
return container_of(coredump, struct xe_device, devcoredump);
@@ -232,6 +234,7 @@ void xe_devcoredump(struct xe_sched_job *job)

dev_coredumpm(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
xe_devcoredump_read, xe_devcoredump_free);
+ dev_coredump_timeout_set(xe->drm.dev, XE_COREDUMP_TIMEOUT_SECS);
}

static void xe_driver_devcoredump_fini(struct drm_device *drm, void *arg)
--
2.44.0


2024-02-28 08:27:38

by Mukesh Ojha

[permalink] [raw]
Subject: Re: [PATCH 2/4] devcoredump: Add dev_coredump_timeout_set()



On 2/28/2024 2:30 AM, José Roberto de Souza wrote:
> Add function to set a custom coredump timeout.
>
> Current 5-minute timeout may be too short for users to search and
> understand what needs to be done to capture coredump to report bugs.
>
> Cc: Rodrigo Vivi <[email protected]>
> Cc: Mukesh Ojha <[email protected]>
> Cc: Johannes Berg <[email protected]>
> Cc: Jonathan Cavitt <[email protected]>
> Signed-off-by: José Roberto de Souza <[email protected]>
> ---
> drivers/base/devcoredump.c | 23 +++++++++++++++++++++++
> include/linux/devcoredump.h | 6 ++++++
> 2 files changed, 29 insertions(+)
>
> diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
> index e96427411b87c..2857ceb3eb3aa 100644
> --- a/drivers/base/devcoredump.c
> +++ b/drivers/base/devcoredump.c
> @@ -326,6 +326,29 @@ void dev_coredump_put(struct device *dev)
> }
> EXPORT_SYMBOL_GPL(dev_coredump_put);
>
> +/**
> + * dev_coredump_timeout_set - set coredump timeout
> + * @dev: the struct device for the crashed device
> + * @secs_timeout: new timeout in seconds
> + *
> + * If giving device has a coredump this sets a new timeout for coredump.
> + */
> +void dev_coredump_timeout_set(struct device *dev, unsigned long secs_timeout)
> +{
> + struct device *existing;
> + struct devcd_entry *devcd;
> +
> + existing = class_find_device(&devcd_class, NULL, dev,
> + devcd_match_failing);
> + if (!existing)
> + return;
> +
> + devcd = dev_to_devcd(existing);
> + mod_delayed_work(system_wq, &devcd->del_wk, HZ * secs_timeout);
> + put_device(existing);
> +}
> +EXPORT_SYMBOL_GPL(dev_coredump_timeout_set);

Why don't we provide a new exported function like [1] as calling
dev_coredumpm() and just another function to configure the timeout
does not look good.

[1]

dev_coredump_timeout(..., timeout);
__dev_coredump_timeout(...);

dev_coredumpm(..)
__dev_coredump_timeout(..., DEVCD_TIMEOUT);

dev_coredumpv()
__dev_coredump_timeout(..., DEVCD_TIMEOUT);

dev_coredumpsg()
__dev_coredump_timeout(..., DEVCD_TIMEOUT);

-Mukesh

> +
> /**
> * dev_coredumpm - create device coredump with read/free methods
> * @dev: the struct device for the crashed device
> diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
> index c8f7eb6cc1915..70fe72a5c6d36 100644
> --- a/include/linux/devcoredump.h
> +++ b/include/linux/devcoredump.h
> @@ -65,6 +65,7 @@ void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> size_t datalen, gfp_t gfp);
>
> void dev_coredump_put(struct device *dev);
> +void dev_coredump_timeout_set(struct device *dev, unsigned long secs_timeout);
> #else
> static inline void dev_coredumpv(struct device *dev, void *data,
> size_t datalen, gfp_t gfp)
> @@ -90,6 +91,11 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> static inline void dev_coredump_put(struct device *dev)
> {
> }
> +
> +static inline void dev_coredump_timeout_set(struct device *dev,
> + unsigned long secs_timeout)
> +{
> +}
> #endif /* CONFIG_DEV_COREDUMP */
>
> #endif /* __DEVCOREDUMP_H */

2024-02-28 13:12:52

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH 2/4] devcoredump: Add dev_coredump_timeout_set()

On Wed, 2024-02-28 at 13:56 +0530, Mukesh Ojha wrote:
>
> On 2/28/2024 2:30 AM, José Roberto de Souza wrote:
> > Add function to set a custom coredump timeout.
> >
> > Current 5-minute timeout may be too short for users to search and
> > understand what needs to be done to capture coredump to report bugs.
> >
> > Cc: Rodrigo Vivi <[email protected]>
> > Cc: Mukesh Ojha <[email protected]>
> > Cc: Johannes Berg <[email protected]>
> > Cc: Jonathan Cavitt <[email protected]>
> > Signed-off-by: José Roberto de Souza <[email protected]>
> > ---
> > drivers/base/devcoredump.c | 23 +++++++++++++++++++++++
> > include/linux/devcoredump.h | 6 ++++++
> > 2 files changed, 29 insertions(+)
> >
> > diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
> > index e96427411b87c..2857ceb3eb3aa 100644
> > --- a/drivers/base/devcoredump.c
> > +++ b/drivers/base/devcoredump.c
> > @@ -326,6 +326,29 @@ void dev_coredump_put(struct device *dev)
> > }
> > EXPORT_SYMBOL_GPL(dev_coredump_put);
> >
> > +/**
> > + * dev_coredump_timeout_set - set coredump timeout
> > + * @dev: the struct device for the crashed device
> > + * @secs_timeout: new timeout in seconds
> > + *
> > + * If giving device has a coredump this sets a new timeout for coredump.
> > + */
> > +void dev_coredump_timeout_set(struct device *dev, unsigned long secs_timeout)
> > +{
> > + struct device *existing;
> > + struct devcd_entry *devcd;
> > +
> > + existing = class_find_device(&devcd_class, NULL, dev,
> > + devcd_match_failing);
> > + if (!existing)
> > + return;
> > +
> > + devcd = dev_to_devcd(existing);
> > + mod_delayed_work(system_wq, &devcd->del_wk, HZ * secs_timeout);
> > + put_device(existing);
> > +}
> > +EXPORT_SYMBOL_GPL(dev_coredump_timeout_set);
>
> Why don't we provide a new exported function like [1] as calling
> dev_coredumpm() and just another function to configure the timeout
> does not look good.

sounds good too, tried to be less intrusive.

will wait feedback in the first patch to send a new version.

thank you

>
> [1]
>
> dev_coredump_timeout(..., timeout);
> __dev_coredump_timeout(...);
>
> dev_coredumpm(..)
> __dev_coredump_timeout(..., DEVCD_TIMEOUT);
>
> dev_coredumpv()
> __dev_coredump_timeout(..., DEVCD_TIMEOUT);
>
> dev_coredumpsg()
> __dev_coredump_timeout(..., DEVCD_TIMEOUT);
>
> -Mukesh
>
> > +
> > /**
> > * dev_coredumpm - create device coredump with read/free methods
> > * @dev: the struct device for the crashed device
> > diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
> > index c8f7eb6cc1915..70fe72a5c6d36 100644
> > --- a/include/linux/devcoredump.h
> > +++ b/include/linux/devcoredump.h
> > @@ -65,6 +65,7 @@ void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> > size_t datalen, gfp_t gfp);
> >
> > void dev_coredump_put(struct device *dev);
> > +void dev_coredump_timeout_set(struct device *dev, unsigned long secs_timeout);
> > #else
> > static inline void dev_coredumpv(struct device *dev, void *data,
> > size_t datalen, gfp_t gfp)
> > @@ -90,6 +91,11 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> > static inline void dev_coredump_put(struct device *dev)
> > {
> > }
> > +
> > +static inline void dev_coredump_timeout_set(struct device *dev,
> > + unsigned long secs_timeout)
> > +{
> > +}
> > #endif /* CONFIG_DEV_COREDUMP */
> >
> > #endif /* __DEVCOREDUMP_H */

2024-02-28 13:50:59

by Mukesh Ojha

[permalink] [raw]
Subject: Re: [PATCH 1/4] devcoredump: Add dev_coredump_put()



On 2/28/2024 2:30 AM, José Roberto de Souza wrote:
> This is useful for drivers that don't want to keep a coredump
> available after unloading.
> Otherwise, the coredump would only be removed after DEVCD_TIMEOUT
> seconds.
>
> Cc: Rodrigo Vivi <[email protected]>
> Cc: Mukesh Ojha <[email protected]>
> Cc: Johannes Berg <[email protected]>
> Cc: Jonathan Cavitt <[email protected]>
> Signed-off-by: José Roberto de Souza <[email protected]>
> ---
> drivers/base/devcoredump.c | 22 ++++++++++++++++++++++
> include/linux/devcoredump.h | 5 +++++
> 2 files changed, 27 insertions(+)
>
> diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
> index 7e2d1f0d903a6..e96427411b87c 100644
> --- a/drivers/base/devcoredump.c
> +++ b/drivers/base/devcoredump.c
> @@ -304,6 +304,28 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
> offset);
> }
>
[...]

> +/**
> + * dev_coredump_put - remove device coredump
> + * @dev: the struct device for the crashed device
> + *
> + * If giving device has a coredump this removes it from file system and free
> + * associated data otherwise does nothing.
> + * This is useful for drivers that don't want to keep a coredump
> + * available after unloading.
> + */

[...]

Slight rephrasing..

""
dev_coredump_put() removes coredump, if exists, for a given device from
the file system and free its associated data otherwise, does nothing.

It is useful for modules that do not want to keep coredump
available after its unload.
""

Overall, LGTM..

-Mukesh

> +void dev_coredump_put(struct device *dev)
> +{
> + struct device *existing;
> +
> + existing = class_find_device(&devcd_class, NULL, dev,
> + devcd_match_failing);
> + if (existing) {
> + devcd_free(existing, NULL);
> + put_device(existing);
> + }
> +}
> +EXPORT_SYMBOL_GPL(dev_coredump_put);
> +
> /**
> * dev_coredumpm - create device coredump with read/free methods
> * @dev: the struct device for the crashed device
> diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
> index c008169ed2c6f..c8f7eb6cc1915 100644
> --- a/include/linux/devcoredump.h
> +++ b/include/linux/devcoredump.h
> @@ -63,6 +63,8 @@ void dev_coredumpm(struct device *dev, struct module *owner,
>
> void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> size_t datalen, gfp_t gfp);
> +
> +void dev_coredump_put(struct device *dev);
> #else
> static inline void dev_coredumpv(struct device *dev, void *data,
> size_t datalen, gfp_t gfp)
> @@ -85,6 +87,9 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> {
> _devcd_free_sgtable(table);
> }
> +static inline void dev_coredump_put(struct device *dev)
> +{
> +}
> #endif /* CONFIG_DEV_COREDUMP */
>
> #endif /* __DEVCOREDUMP_H */