2024-03-04 14:39:31

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v3 1/4] devcoredump: Add dev_coredump_put()

It is useful for modules that do not want to keep coredump available
after its unload.
Otherwise, the coredump would only be removed after DEVCD_TIMEOUT
seconds.

v2:
- dev_coredump_put() documentation updated (Mukesh)

Cc: Rodrigo Vivi <[email protected]>
Cc: Mukesh Ojha <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Acked-by: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/base/devcoredump.c | 23 +++++++++++++++++++++++
include/linux/devcoredump.h | 5 +++++
2 files changed, 28 insertions(+)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 7e2d1f0d903a6..82aeb09b3d1b5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -304,6 +304,29 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
offset);
}

+/**
+ * dev_coredump_put - remove device coredump
+ * @dev: the struct device for the crashed device
+ *
+ * dev_coredump_put() removes coredump, if exists, for a given device from
+ * the file system and free its associated data otherwise, does nothing.
+ *
+ * It is useful for modules that do not want to keep coredump
+ * available after its unload.
+ */
+void dev_coredump_put(struct device *dev)
+{
+ struct device *existing;
+
+ existing = class_find_device(&devcd_class, NULL, dev,
+ devcd_match_failing);
+ if (existing) {
+ devcd_free(existing, NULL);
+ put_device(existing);
+ }
+}
+EXPORT_SYMBOL_GPL(dev_coredump_put);
+
/**
* dev_coredumpm - create device coredump with read/free methods
* @dev: the struct device for the crashed device
diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
index c008169ed2c6f..c8f7eb6cc1915 100644
--- a/include/linux/devcoredump.h
+++ b/include/linux/devcoredump.h
@@ -63,6 +63,8 @@ void dev_coredumpm(struct device *dev, struct module *owner,

void dev_coredumpsg(struct device *dev, struct scatterlist *table,
size_t datalen, gfp_t gfp);
+
+void dev_coredump_put(struct device *dev);
#else
static inline void dev_coredumpv(struct device *dev, void *data,
size_t datalen, gfp_t gfp)
@@ -85,6 +87,9 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
{
_devcd_free_sgtable(table);
}
+static inline void dev_coredump_put(struct device *dev)
+{
+}
#endif /* CONFIG_DEV_COREDUMP */

#endif /* __DEVCOREDUMP_H */
--
2.44.0



2024-03-04 14:40:17

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v3 2/4] devcoredump: Add dev_coredumpm_timeout()

Add function to set a custom coredump timeout.

Current 5-minute timeout may be too short for users to search and
understand what needs to be done to capture coredump to report bugs.

v2:
- replace dev_coredump_timeout_set() by dev_coredumpm_timeout() (Mukesh)

v3:
- make dev_coredumpm() static inline (Johannes)

Cc: Rodrigo Vivi <[email protected]>
Cc: Mukesh Ojha <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Acked-by: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/base/devcoredump.c | 23 ++++++++--------
include/linux/devcoredump.h | 54 ++++++++++++++++++++++++++++---------
2 files changed, 54 insertions(+), 23 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 82aeb09b3d1b5..c795edad1b969 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -18,9 +18,6 @@ static struct class devcd_class;
/* global disable flag, for security purposes */
static bool devcd_disabled;

-/* if data isn't read by userspace after 5 minutes then delete it */
-#define DEVCD_TIMEOUT (HZ * 60 * 5)
-
struct devcd_entry {
struct device devcd_dev;
void *data;
@@ -328,7 +325,8 @@ void dev_coredump_put(struct device *dev)
EXPORT_SYMBOL_GPL(dev_coredump_put);

/**
- * dev_coredumpm - create device coredump with read/free methods
+ * dev_coredumpm_timeout - create device coredump with read/free methods with a
+ * custom timeout.
* @dev: the struct device for the crashed device
* @owner: the module that contains the read/free functions, use %THIS_MODULE
* @data: data cookie for the @read/@free functions
@@ -336,17 +334,20 @@ EXPORT_SYMBOL_GPL(dev_coredump_put);
* @gfp: allocation flags
* @read: function to read from the given buffer
* @free: function to free the given buffer
+ * @timeout: time in jiffies to remove coredump
*
* Creates a new device coredump for the given device. If a previous one hasn't
* been read yet, the new coredump is discarded. The data lifetime is determined
* by the device coredump framework and when it is no longer needed the @free
* function will be called to free the data.
*/
-void dev_coredumpm(struct device *dev, struct module *owner,
- void *data, size_t datalen, gfp_t gfp,
- ssize_t (*read)(char *buffer, loff_t offset, size_t count,
- void *data, size_t datalen),
- void (*free)(void *data))
+void dev_coredumpm_timeout(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset,
+ size_t count, void *data,
+ size_t datalen),
+ void (*free)(void *data),
+ unsigned long timeout)
{
static atomic_t devcd_count = ATOMIC_INIT(0);
struct devcd_entry *devcd;
@@ -403,7 +404,7 @@ void dev_coredumpm(struct device *dev, struct module *owner,
dev_set_uevent_suppress(&devcd->devcd_dev, false);
kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD);
INIT_DELAYED_WORK(&devcd->del_wk, devcd_del);
- schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT);
+ schedule_delayed_work(&devcd->del_wk, timeout);
mutex_unlock(&devcd->mutex);
return;
put_device:
@@ -414,7 +415,7 @@ void dev_coredumpm(struct device *dev, struct module *owner,
free:
free(data);
}
-EXPORT_SYMBOL_GPL(dev_coredumpm);
+EXPORT_SYMBOL_GPL(dev_coredumpm_timeout);

/**
* dev_coredumpsg - create device coredump that uses scatterlist as data
diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
index c8f7eb6cc1915..56e606eb4640b 100644
--- a/include/linux/devcoredump.h
+++ b/include/linux/devcoredump.h
@@ -12,6 +12,9 @@
#include <linux/scatterlist.h>
#include <linux/slab.h>

+/* if data isn't read by userspace after 5 minutes then delete it */
+#define DEVCOREDUMP_TIMEOUT (HZ * 60 * 5)
+
/*
* _devcd_free_sgtable - free all the memory of the given scatterlist table
* (i.e. both pages and scatterlist instances)
@@ -50,16 +53,17 @@ static inline void _devcd_free_sgtable(struct scatterlist *table)
kfree(delete_iter);
}

-
#ifdef CONFIG_DEV_COREDUMP
void dev_coredumpv(struct device *dev, void *data, size_t datalen,
gfp_t gfp);

-void dev_coredumpm(struct device *dev, struct module *owner,
- void *data, size_t datalen, gfp_t gfp,
- ssize_t (*read)(char *buffer, loff_t offset, size_t count,
- void *data, size_t datalen),
- void (*free)(void *data));
+void dev_coredumpm_timeout(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset,
+ size_t count, void *data,
+ size_t datalen),
+ void (*free)(void *data),
+ unsigned long timeout);

void dev_coredumpsg(struct device *dev, struct scatterlist *table,
size_t datalen, gfp_t gfp);
@@ -72,12 +76,13 @@ static inline void dev_coredumpv(struct device *dev, void *data,
vfree(data);
}

-static inline void
-dev_coredumpm(struct device *dev, struct module *owner,
- void *data, size_t datalen, gfp_t gfp,
- ssize_t (*read)(char *buffer, loff_t offset, size_t count,
- void *data, size_t datalen),
- void (*free)(void *data))
+void dev_coredumpm_timeout(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset,
+ size_t count, void *data,
+ size_t datalen),
+ void (*free)(void *data),
+ unsigned long timeout)
{
free(data);
}
@@ -92,4 +97,29 @@ static inline void dev_coredump_put(struct device *dev)
}
#endif /* CONFIG_DEV_COREDUMP */

+/**
+ * dev_coredumpm - create device coredump with read/free methods
+ * @dev: the struct device for the crashed device
+ * @owner: the module that contains the read/free functions, use %THIS_MODULE
+ * @data: data cookie for the @read/@free functions
+ * @datalen: length of the data
+ * @gfp: allocation flags
+ * @read: function to read from the given buffer
+ * @free: function to free the given buffer
+ *
+ * Creates a new device coredump for the given device. If a previous one hasn't
+ * been read yet, the new coredump is discarded. The data lifetime is determined
+ * by the device coredump framework and when it is no longer needed the @free
+ * function will be called to free the data.
+ */
+static inline void dev_coredumpm(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen),
+ void (*free)(void *data))
+{
+ dev_coredumpm_timeout(dev, owner, data, datalen, gfp, read, free,
+ DEVCOREDUMP_TIMEOUT);
+}
+
#endif /* __DEVCOREDUMP_H */
--
2.44.0


2024-03-04 14:40:18

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v3 4/4] drm/xe: Increase devcoredump timeout

5 minutes is too short for a regular user to search and understand
what he needs to do to report capture devcoredump and report a bug to
us, so here increasing this timeout to 1 hour.

Cc: Rodrigo Vivi <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Acked-by: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 109c027188bd9..08c60531e0db4 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -52,6 +52,9 @@

#ifdef CONFIG_DEV_COREDUMP

+/* 1 hour timeout */
+#define XE_COREDUMP_TIMEOUT_JIFFIES (60 * 60 * HZ)
+
static struct xe_device *coredump_to_xe(const struct xe_devcoredump *coredump)
{
return container_of(coredump, struct xe_device, devcoredump);
@@ -229,8 +232,9 @@ void xe_devcoredump(struct xe_sched_job *job)
drm_info(&xe->drm, "Check your /sys/class/drm/card%d/device/devcoredump/data\n",
xe->drm.primary->index);

- dev_coredumpm(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
- xe_devcoredump_read, xe_devcoredump_free);
+ dev_coredumpm_timeout(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
+ xe_devcoredump_read, xe_devcoredump_free,
+ XE_COREDUMP_TIMEOUT_JIFFIES);
}

static void xe_driver_devcoredump_fini(struct drm_device *drm, void *arg)
--
2.44.0


2024-03-04 15:12:06

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v3 3/4] drm/xe: Remove devcoredump during driver release

This will remove devcoredump from file system and free its resources
during driver unload.

This fix the driver unload after gpu hang happened, otherwise this
it would report that Xe KMD is still in use and it would leave the
kernel in a state that Xe KMD can't be unload without a reboot.

Cc: Rodrigo Vivi <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Acked-by: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 13 ++++++++++++-
drivers/gpu/drm/xe/xe_devcoredump.h | 5 +++++
drivers/gpu/drm/xe/xe_device.c | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 0fcd306803236..109c027188bd9 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -9,6 +9,8 @@
#include <linux/devcoredump.h>
#include <generated/utsrelease.h>

+#include <drm/drm_managed.h>
+
#include "xe_device.h"
#include "xe_exec_queue.h"
#include "xe_force_wake.h"
@@ -230,5 +232,14 @@ void xe_devcoredump(struct xe_sched_job *job)
dev_coredumpm(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
xe_devcoredump_read, xe_devcoredump_free);
}
-#endif

+static void xe_driver_devcoredump_fini(struct drm_device *drm, void *arg)
+{
+ dev_coredump_put(drm->dev);
+}
+
+int xe_devcoredump_init(struct xe_device *xe)
+{
+ return drmm_add_action_or_reset(&xe->drm, xe_driver_devcoredump_fini, xe);
+}
+#endif
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.h b/drivers/gpu/drm/xe/xe_devcoredump.h
index df8671f0b5eb2..9eba67f37234f 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.h
+++ b/drivers/gpu/drm/xe/xe_devcoredump.h
@@ -11,10 +11,15 @@ struct xe_sched_job;

#ifdef CONFIG_DEV_COREDUMP
void xe_devcoredump(struct xe_sched_job *job);
+int xe_devcoredump_init(struct xe_device *xe);
#else
static inline void xe_devcoredump(struct xe_sched_job *job)
{
}
+static inline int xe_devcoredump_init(struct xe_device *xe)
+{
+ return 0;
+}
#endif

#endif
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 919ad88f0495a..22be76537c7da 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -20,6 +20,7 @@
#include "regs/xe_regs.h"
#include "xe_bo.h"
#include "xe_debugfs.h"
+#include "xe_devcoredump.h"
#include "xe_dma_buf.h"
#include "xe_drm_client.h"
#include "xe_drv.h"
@@ -502,6 +503,9 @@ int xe_device_probe(struct xe_device *xe)
return err;
}

+ err = xe_devcoredump_init(xe);
+ if (err)
+ return err;
err = drmm_add_action_or_reset(&xe->drm, xe_driver_flr_fini, xe);
if (err)
return err;
--
2.44.0


2024-03-04 23:57:43

by Lucas De Marchi

[permalink] [raw]
Subject: Re: [PATCH v3 2/4] devcoredump: Add dev_coredumpm_timeout()

On Mon, Mar 04, 2024 at 06:39:03AM -0800, Jose Souza wrote:
>Add function to set a custom coredump timeout.
>
>Current 5-minute timeout may be too short for users to search and
>understand what needs to be done to capture coredump to report bugs.
>
>v2:
>- replace dev_coredump_timeout_set() by dev_coredumpm_timeout() (Mukesh)
>
>v3:
>- make dev_coredumpm() static inline (Johannes)


but let's wait the discussion settle on v2.

Lucas De Marchi

2024-03-11 20:56:53

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v3 2/4] devcoredump: Add dev_coredumpm_timeout()

Hi Johannes and Mukesh

On Mon, 2024-03-04 at 17:56 -0600, Lucas De Marchi wrote:
> On Mon, Mar 04, 2024 at 06:39:03AM -0800, Jose Souza wrote:
> > Add function to set a custom coredump timeout.
> >
> > Current 5-minute timeout may be too short for users to search and
> > understand what needs to be done to capture coredump to report bugs.
> >
> > v2:
> > - replace dev_coredump_timeout_set() by dev_coredumpm_timeout() (Mukesh)
> >
> > v3:
> > - make dev_coredumpm() static inline (Johannes)
>
>
> but let's wait the discussion settle on v2.

Lucas is fine with this(see v2 thread).
Can you please take another look at this series?

thank you


>
> Lucas De Marchi

2024-03-25 17:43:38

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] devcoredump: Add dev_coredump_put()

On Mon, 2024-03-04 at 06:39 -0800, José Roberto de Souza wrote:
> It is useful for modules that do not want to keep coredump available
> after its unload.

Why not though? Maybe if this is a common case we should have devm_...
coredump functions? Dunno.

Anyway, I'm fine with this, even though I'd like to see a bit more
discussion than "do not want". Code looks good.

Reviewed-by: Johannes Berg <[email protected]>

johannes

2024-03-25 17:58:11

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v3 2/4] devcoredump: Add dev_coredumpm_timeout()

On Mon, 2024-03-04 at 06:39 -0800, José Roberto de Souza wrote:
> Add function to set a custom coredump timeout.
>
> Current 5-minute timeout may be too short for users to search and
> understand what needs to be done to capture coredump to report bugs.

> + */
> +static inline void dev_coredumpm(struct device *dev, struct module *owner,
> + void *data, size_t datalen, gfp_t gfp,
> + ssize_t (*read)(char *buffer, loff_t offset, size_t count,
> + void *data, size_t datalen),
> + void (*free)(void *data))

nit: looks like you missed a space on the 'free' line

I don't think we'll actually _solve_ the discussion of whether or not
this makes sense.

I still think it's a bad idea to hang on to the dumps in core kernel
memory since they can be big (I would've said ath12k is big with 55MB,
but Rodrigo said graphics could be up to 2GB!), without being able to
page it out, etc. That's just a waste of memory, for what I don't think
is even a good reason.

So dunno.

However, I also don't like to exercise any power that I might randomly
hold just because I happened to write the code in the first place... And
if you want to shoot yourselves in the foot with any of this, should I
really disagree? I feel I've voiced my objections enough, and Lucas has
also tried to find ways of making a userspace implementation work for
you.

I'd perhaps argue that the documentation for the functions should be
more opinionated and actually recommend against using a large timeout
(like you want) for all these reasons, but other than that, the code
looks fine to me.

johannes

2024-03-25 18:42:37

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v3 2/4] devcoredump: Add dev_coredumpm_timeout()

On Mon, 2024-03-25 at 18:00 +0100, Johannes Berg wrote:
> On Mon, 2024-03-04 at 06:39 -0800, José Roberto de Souza wrote:
> > Add function to set a custom coredump timeout.
> >
> > Current 5-minute timeout may be too short for users to search and
> > understand what needs to be done to capture coredump to report bugs.
>
> > + */
> > +static inline void dev_coredumpm(struct device *dev, struct module *owner,
> > + void *data, size_t datalen, gfp_t gfp,
> > + ssize_t (*read)(char *buffer, loff_t offset, size_t count,
> > + void *data, size_t datalen),
> > + void (*free)(void *data))
>
> nit: looks like you missed a space on the 'free' line
>
> I don't think we'll actually _solve_ the discussion of whether or not
> this makes sense.
>
> I still think it's a bad idea to hang on to the dumps in core kernel
> memory since they can be big (I would've said ath12k is big with 55MB,
> but Rodrigo said graphics could be up to 2GB!), without being able to
> page it out, etc. That's just a waste of memory, for what I don't think
> is even a good reason.
>

There is a discussion to have a udev script to copy dump from memory to disk and then the script can write 0 to coredump/data and erase it from
memory. But if distro don't have this udev script it is still good to have larger timeout to allow user to capture it.

The 2GB usage are for cases when UMD developers enables the capture of every single buffer, regular capture size depends on the complexity of the
application but it is generally round 2MB.

> So dunno.
>
> However, I also don't like to exercise any power that I might randomly
> hold just because I happened to write the code in the first place... And
> if you want to shoot yourselves in the foot with any of this, should I
> really disagree? I feel I've voiced my objections enough, and Lucas has
> also tried to find ways of making a userspace implementation work for
> you.
>
> I'd perhaps argue that the documentation for the functions should be
> more opinionated and actually recommend against using a large timeout
> (like you want) for all these reasons, but other than that, the code
> looks fine to me.

@timeout: time in jiffies to remove coredump. DEVCOREDUMP_TIMEOUT is the value for dev_coredumpm() and it should be used unless it is absolutely
necessary a larger timeout.

Or do you have a better suggestion?

>
> johannes

2024-03-25 18:56:08

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] devcoredump: Add dev_coredump_put()

On Mon, 2024-03-25 at 17:48 +0100, Johannes Berg wrote:
> On Mon, 2024-03-04 at 06:39 -0800, José Roberto de Souza wrote:
> > It is useful for modules that do not want to keep coredump available
> > after its unload.
>
> Why not though? Maybe if this is a common case we should have devm_...
> coredump functions? Dunno.
>
> Anyway, I'm fine with this, even though I'd like to see a bit more
> discussion than "do not want". Code looks good.
>
> Reviewed-by: Johannes Berg <[email protected]>

Thank you, can you please add this patch to your next pull request? Or should us add it to the next drm/intel-drm pull request?

Again thank you, just this patch will already unblock some work for us.

>
> johannes

2024-04-05 16:02:04

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v3 1/4] devcoredump: Add dev_coredump_put()

On Mon, 2024-03-25 at 11:36 -0700, José Roberto de Souza wrote:
> On Mon, 2024-03-25 at 17:48 +0100, Johannes Berg wrote:
> > On Mon, 2024-03-04 at 06:39 -0800, José Roberto de Souza wrote:
> > > It is useful for modules that do not want to keep coredump available
> > > after its unload.
> >
> > Why not though? Maybe if this is a common case we should have devm_...
> > coredump functions? Dunno.
> >
> > Anyway, I'm fine with this, even though I'd like to see a bit more
> > discussion than "do not want". Code looks good.
> >
> > Reviewed-by: Johannes Berg <[email protected]>
>
> Thank you, can you please add this patch to your next pull request? Or should us add it to the next drm/intel-drm pull request?

ping

>
> Again thank you, just this patch will already unblock some work for us.
>
> >
> > johannes
>