2024-02-28 16:57:30

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v2 1/4] devcoredump: Add dev_coredump_put()

It is useful for modules that do not want to keep coredump available
after its unload.
Otherwise, the coredump would only be removed after DEVCD_TIMEOUT
seconds.

v2:
- dev_coredump_put() documentation updated (Mukesh)

Cc: Rodrigo Vivi <[email protected]>
Cc: Mukesh Ojha <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/base/devcoredump.c | 23 +++++++++++++++++++++++
include/linux/devcoredump.h | 5 +++++
2 files changed, 28 insertions(+)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 7e2d1f0d903a6..82aeb09b3d1b5 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -304,6 +304,29 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
offset);
}

+/**
+ * dev_coredump_put - remove device coredump
+ * @dev: the struct device for the crashed device
+ *
+ * dev_coredump_put() removes coredump, if exists, for a given device from
+ * the file system and free its associated data otherwise, does nothing.
+ *
+ * It is useful for modules that do not want to keep coredump
+ * available after its unload.
+ */
+void dev_coredump_put(struct device *dev)
+{
+ struct device *existing;
+
+ existing = class_find_device(&devcd_class, NULL, dev,
+ devcd_match_failing);
+ if (existing) {
+ devcd_free(existing, NULL);
+ put_device(existing);
+ }
+}
+EXPORT_SYMBOL_GPL(dev_coredump_put);
+
/**
* dev_coredumpm - create device coredump with read/free methods
* @dev: the struct device for the crashed device
diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
index c008169ed2c6f..c8f7eb6cc1915 100644
--- a/include/linux/devcoredump.h
+++ b/include/linux/devcoredump.h
@@ -63,6 +63,8 @@ void dev_coredumpm(struct device *dev, struct module *owner,

void dev_coredumpsg(struct device *dev, struct scatterlist *table,
size_t datalen, gfp_t gfp);
+
+void dev_coredump_put(struct device *dev);
#else
static inline void dev_coredumpv(struct device *dev, void *data,
size_t datalen, gfp_t gfp)
@@ -85,6 +87,9 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
{
_devcd_free_sgtable(table);
}
+static inline void dev_coredump_put(struct device *dev)
+{
+}
#endif /* CONFIG_DEV_COREDUMP */

#endif /* __DEVCOREDUMP_H */
--
2.44.0



2024-02-28 16:57:37

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

Add function to set a custom coredump timeout.

Current 5-minute timeout may be too short for users to search and
understand what needs to be done to capture coredump to report bugs.

v2:
- replace dev_coredump_timeout_set() by dev_coredumpm_timeout() (Mukesh)

Cc: Rodrigo Vivi <[email protected]>
Cc: Mukesh Ojha <[email protected]>
Cc: Johannes Berg <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/base/devcoredump.c | 44 +++++++++++++++++++++++++++++++------
include/linux/devcoredump.h | 18 +++++++++++++++
2 files changed, 55 insertions(+), 7 deletions(-)

diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
index 82aeb09b3d1b5..07b89f7f735d8 100644
--- a/drivers/base/devcoredump.c
+++ b/drivers/base/devcoredump.c
@@ -328,7 +328,8 @@ void dev_coredump_put(struct device *dev)
EXPORT_SYMBOL_GPL(dev_coredump_put);

/**
- * dev_coredumpm - create device coredump with read/free methods
+ * dev_coredumpm_timeout - create device coredump with read/free methods with a
+ * custom timeout.
* @dev: the struct device for the crashed device
* @owner: the module that contains the read/free functions, use %THIS_MODULE
* @data: data cookie for the @read/@free functions
@@ -336,17 +337,20 @@ EXPORT_SYMBOL_GPL(dev_coredump_put);
* @gfp: allocation flags
* @read: function to read from the given buffer
* @free: function to free the given buffer
+ * @timeout: time in jiffies to remove coredump
*
* Creates a new device coredump for the given device. If a previous one hasn't
* been read yet, the new coredump is discarded. The data lifetime is determined
* by the device coredump framework and when it is no longer needed the @free
* function will be called to free the data.
*/
-void dev_coredumpm(struct device *dev, struct module *owner,
- void *data, size_t datalen, gfp_t gfp,
- ssize_t (*read)(char *buffer, loff_t offset, size_t count,
- void *data, size_t datalen),
- void (*free)(void *data))
+void dev_coredumpm_timeout(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset,
+ size_t count, void *data,
+ size_t datalen),
+ void (*free)(void *data),
+ unsigned long timeout)
{
static atomic_t devcd_count = ATOMIC_INIT(0);
struct devcd_entry *devcd;
@@ -403,7 +407,7 @@ void dev_coredumpm(struct device *dev, struct module *owner,
dev_set_uevent_suppress(&devcd->devcd_dev, false);
kobject_uevent(&devcd->devcd_dev.kobj, KOBJ_ADD);
INIT_DELAYED_WORK(&devcd->del_wk, devcd_del);
- schedule_delayed_work(&devcd->del_wk, DEVCD_TIMEOUT);
+ schedule_delayed_work(&devcd->del_wk, timeout);
mutex_unlock(&devcd->mutex);
return;
put_device:
@@ -414,6 +418,32 @@ void dev_coredumpm(struct device *dev, struct module *owner,
free:
free(data);
}
+EXPORT_SYMBOL_GPL(dev_coredumpm_timeout);
+
+/**
+ * dev_coredumpm - create device coredump with read/free methods
+ * @dev: the struct device for the crashed device
+ * @owner: the module that contains the read/free functions, use %THIS_MODULE
+ * @data: data cookie for the @read/@free functions
+ * @datalen: length of the data
+ * @gfp: allocation flags
+ * @read: function to read from the given buffer
+ * @free: function to free the given buffer
+ *
+ * Creates a new device coredump for the given device. If a previous one hasn't
+ * been read yet, the new coredump is discarded. The data lifetime is determined
+ * by the device coredump framework and when it is no longer needed the @free
+ * function will be called to free the data.
+ */
+void dev_coredumpm(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset, size_t count,
+ void *data, size_t datalen),
+ void (*free)(void *data))
+{
+ dev_coredumpm_timeout(dev, owner, data, datalen, gfp, read, free,
+ DEVCD_TIMEOUT);
+}
EXPORT_SYMBOL_GPL(dev_coredumpm);

/**
diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
index c8f7eb6cc1915..5e1e4deab07a6 100644
--- a/include/linux/devcoredump.h
+++ b/include/linux/devcoredump.h
@@ -55,6 +55,14 @@ static inline void _devcd_free_sgtable(struct scatterlist *table)
void dev_coredumpv(struct device *dev, void *data, size_t datalen,
gfp_t gfp);

+void dev_coredumpm_timeout(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset,
+ size_t count, void *data,
+ size_t datalen),
+ void (*free)(void *data),
+ unsigned long timeout);
+
void dev_coredumpm(struct device *dev, struct module *owner,
void *data, size_t datalen, gfp_t gfp,
ssize_t (*read)(char *buffer, loff_t offset, size_t count,
@@ -72,6 +80,16 @@ static inline void dev_coredumpv(struct device *dev, void *data,
vfree(data);
}

+void dev_coredumpm_timeout(struct device *dev, struct module *owner,
+ void *data, size_t datalen, gfp_t gfp,
+ ssize_t (*read)(char *buffer, loff_t offset,
+ size_t count, void *data,
+ size_t datalen),
+ void (*free)(void *data),
+ unsigned long timeout)
+{
+}
+
static inline void
dev_coredumpm(struct device *dev, struct module *owner,
void *data, size_t datalen, gfp_t gfp,
--
2.44.0


2024-02-28 16:58:25

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v2 4/4] drm/xe: Increase devcoredump timeout

5 minutes is too short for a regular user to search and understand
what he needs to do to report capture devcoredump and report a bug to
us, so here increasing this timeout to 1 hour.

Cc: Rodrigo Vivi <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 8 ++++++--
1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 3fef81396fe8a..a26aac63d6ecf 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -52,6 +52,9 @@

#ifdef CONFIG_DEV_COREDUMP

+/* 1 hour timeout */
+#define XE_COREDUMP_TIMEOUT_JIFFIES (60 * 60 * HZ)
+
static struct xe_device *coredump_to_xe(const struct xe_devcoredump *coredump)
{
return container_of(coredump, struct xe_device, devcoredump);
@@ -230,8 +233,9 @@ void xe_devcoredump(struct xe_sched_job *job)
drm_info(&xe->drm, "Check your /sys/class/drm/card%d/device/devcoredump/data\n",
xe->drm.primary->index);

- dev_coredumpm(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
- xe_devcoredump_read, xe_devcoredump_free);
+ dev_coredumpm_timeout(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
+ xe_devcoredump_read, xe_devcoredump_free,
+ XE_COREDUMP_TIMEOUT_JIFFIES);
}

static void xe_driver_devcoredump_fini(struct drm_device *drm, void *arg)
--
2.44.0


2024-02-28 17:02:28

by Cavitt, Jonathan

[permalink] [raw]
Subject: RE: [PATCH v2 1/4] devcoredump: Add dev_coredump_put()

-----Original Message-----
From: Souza, Jose <[email protected]>
Sent: Wednesday, February 28, 2024 8:57 AM
To: [email protected]; [email protected]
Cc: Vivi, Rodrigo <[email protected]>; Mukesh Ojha <[email protected]>; Johannes Berg <[email protected]>; Cavitt, Jonathan <[email protected]>; Souza, Jose <[email protected]>
Subject: [PATCH v2 1/4] devcoredump: Add dev_coredump_put()
>
> It is useful for modules that do not want to keep coredump available
> after its unload.
> Otherwise, the coredump would only be removed after DEVCD_TIMEOUT
> seconds.
>
> v2:
> - dev_coredump_put() documentation updated (Mukesh)
>
> Cc: Rodrigo Vivi <[email protected]>
> Cc: Mukesh Ojha <[email protected]>
> Cc: Johannes Berg <[email protected]>
> Cc: Jonathan Cavitt <[email protected]>
> Signed-off-by: José Roberto de Souza <[email protected]>


All patches look solid. Apply ack to all:
Acked-by: Jonathan Cavitt <[email protected]>
-Jonathan Cavitt


> ---
> drivers/base/devcoredump.c | 23 +++++++++++++++++++++++
> include/linux/devcoredump.h | 5 +++++
> 2 files changed, 28 insertions(+)
>
> diff --git a/drivers/base/devcoredump.c b/drivers/base/devcoredump.c
> index 7e2d1f0d903a6..82aeb09b3d1b5 100644
> --- a/drivers/base/devcoredump.c
> +++ b/drivers/base/devcoredump.c
> @@ -304,6 +304,29 @@ static ssize_t devcd_read_from_sgtable(char *buffer, loff_t offset,
> offset);
> }
>
> +/**
> + * dev_coredump_put - remove device coredump
> + * @dev: the struct device for the crashed device
> + *
> + * dev_coredump_put() removes coredump, if exists, for a given device from
> + * the file system and free its associated data otherwise, does nothing.
> + *
> + * It is useful for modules that do not want to keep coredump
> + * available after its unload.
> + */
> +void dev_coredump_put(struct device *dev)
> +{
> + struct device *existing;
> +
> + existing = class_find_device(&devcd_class, NULL, dev,
> + devcd_match_failing);
> + if (existing) {
> + devcd_free(existing, NULL);
> + put_device(existing);
> + }
> +}
> +EXPORT_SYMBOL_GPL(dev_coredump_put);
> +
> /**
> * dev_coredumpm - create device coredump with read/free methods
> * @dev: the struct device for the crashed device
> diff --git a/include/linux/devcoredump.h b/include/linux/devcoredump.h
> index c008169ed2c6f..c8f7eb6cc1915 100644
> --- a/include/linux/devcoredump.h
> +++ b/include/linux/devcoredump.h
> @@ -63,6 +63,8 @@ void dev_coredumpm(struct device *dev, struct module *owner,
>
> void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> size_t datalen, gfp_t gfp);
> +
> +void dev_coredump_put(struct device *dev);
> #else
> static inline void dev_coredumpv(struct device *dev, void *data,
> size_t datalen, gfp_t gfp)
> @@ -85,6 +87,9 @@ static inline void dev_coredumpsg(struct device *dev, struct scatterlist *table,
> {
> _devcd_free_sgtable(table);
> }
> +static inline void dev_coredump_put(struct device *dev)
> +{
> +}
> #endif /* CONFIG_DEV_COREDUMP */
>
> #endif /* __DEVCOREDUMP_H */
> --
> 2.44.0
>
>

2024-02-28 17:06:37

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

> Current 5-minute timeout may be too short for users to search and
> understand what needs to be done to capture coredump to report bugs.

Conceptually, I'm not sure I understand this. Users should probably have
a script to capture coredumps to a file in the filesystem, possibly with
additional data such as 'dmesg' at the time of the dump.

Having this stick around longer in core kernel memory (not even
swappable) seems like a bad idea?

What kind of timeout were you thinking? Maybe you'd want 10 minutes? An
hour?

Also, then, why should the timeout be device-specific? If the user is
going to need time to find stuff, then surely that applies regardless of
the device?

So ... I guess I don't really like this, and don't really see how it
makes sense. Arguably, 5 minutes even is too long, not too short,
because you should have scripting that captures it, writes it to disk,
and all that can happen in the space of seconds, rather than minutes.
It's trivial to write such a script with a udev trigger or similar.

If we wanted to, we could even have a script that not only captures it
to disk, but also deletes it again from disk after a day or something,
so if you didn't care you don't get things accumulating. But I don't see
why the kernel should need to hang on to all the (possibly big) core
dump in RAM, for whatever time. And I also don't like the device-
dependency very much, TBH.



But if we do go there eventually:

> +void dev_coredumpm(struct device *dev, struct module *owner,
> + void *data, size_t datalen, gfp_t gfp,
> + ssize_t (*read)(char *buffer, loff_t offset, size_t count,
> + void *data, size_t datalen),
> + void (*free)(void *data))
> +{
> + dev_coredumpm_timeout(dev, owner, data, datalen, gfp, read, free,
> + DEVCD_TIMEOUT);
> +}
> EXPORT_SYMBOL_GPL(dev_coredumpm);

This could be a trivial static inline now, if you just put DEVCD_TIMEOUT
into the header file. Seems better than exporting another whole function
for it. Then you also don't need the no-op version of it.

johannes


2024-02-28 17:24:13

by José Roberto de Souza

[permalink] [raw]
Subject: [PATCH v2 3/4] drm/xe: Remove devcoredump during driver release

This will remove devcoredump from file system and free its resources
during driver unload.

This fix the driver unload after gpu hang happened, otherwise this
it would report that Xe KMD is still in use and it would leave the
kernel in a state that Xe KMD can't be unload without a reboot.

Cc: Rodrigo Vivi <[email protected]>
Cc: Jonathan Cavitt <[email protected]>
Signed-off-by: José Roberto de Souza <[email protected]>
---
drivers/gpu/drm/xe/xe_devcoredump.c | 13 ++++++++++++-
drivers/gpu/drm/xe/xe_devcoredump.h | 5 +++++
drivers/gpu/drm/xe/xe_device.c | 4 ++++
3 files changed, 21 insertions(+), 1 deletion(-)

diff --git a/drivers/gpu/drm/xe/xe_devcoredump.c b/drivers/gpu/drm/xe/xe_devcoredump.c
index 68d3d623a05bf..3fef81396fe8a 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.c
+++ b/drivers/gpu/drm/xe/xe_devcoredump.c
@@ -9,6 +9,8 @@
#include <linux/devcoredump.h>
#include <generated/utsrelease.h>

+#include <drm/drm_managed.h>
+
#include "xe_device.h"
#include "xe_exec_queue.h"
#include "xe_force_wake.h"
@@ -231,5 +233,14 @@ void xe_devcoredump(struct xe_sched_job *job)
dev_coredumpm(xe->drm.dev, THIS_MODULE, coredump, 0, GFP_KERNEL,
xe_devcoredump_read, xe_devcoredump_free);
}
-#endif

+static void xe_driver_devcoredump_fini(struct drm_device *drm, void *arg)
+{
+ dev_coredump_put(drm->dev);
+}
+
+int xe_devcoredump_init(struct xe_device *xe)
+{
+ return drmm_add_action_or_reset(&xe->drm, xe_driver_devcoredump_fini, xe);
+}
+#endif
diff --git a/drivers/gpu/drm/xe/xe_devcoredump.h b/drivers/gpu/drm/xe/xe_devcoredump.h
index df8671f0b5eb2..9eba67f37234f 100644
--- a/drivers/gpu/drm/xe/xe_devcoredump.h
+++ b/drivers/gpu/drm/xe/xe_devcoredump.h
@@ -11,10 +11,15 @@ struct xe_sched_job;

#ifdef CONFIG_DEV_COREDUMP
void xe_devcoredump(struct xe_sched_job *job);
+int xe_devcoredump_init(struct xe_device *xe);
#else
static inline void xe_devcoredump(struct xe_sched_job *job)
{
}
+static inline int xe_devcoredump_init(struct xe_device *xe)
+{
+ return 0;
+}
#endif

#endif
diff --git a/drivers/gpu/drm/xe/xe_device.c b/drivers/gpu/drm/xe/xe_device.c
index 919ad88f0495a..22be76537c7da 100644
--- a/drivers/gpu/drm/xe/xe_device.c
+++ b/drivers/gpu/drm/xe/xe_device.c
@@ -20,6 +20,7 @@
#include "regs/xe_regs.h"
#include "xe_bo.h"
#include "xe_debugfs.h"
+#include "xe_devcoredump.h"
#include "xe_dma_buf.h"
#include "xe_drm_client.h"
#include "xe_drv.h"
@@ -502,6 +503,9 @@ int xe_device_probe(struct xe_device *xe)
return err;
}

+ err = xe_devcoredump_init(xe);
+ if (err)
+ return err;
err = drmm_add_action_or_reset(&xe->drm, xe_driver_flr_fini, xe);
if (err)
return err;
--
2.44.0


2024-02-28 18:10:48

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Wed, 2024-02-28 at 18:05 +0100, Johannes Berg wrote:
> > Current 5-minute timeout may be too short for users to search and
> > understand what needs to be done to capture coredump to report bugs.
>
> Conceptually, I'm not sure I understand this. Users should probably have
> a script to capture coredumps to a file in the filesystem, possibly with
> additional data such as 'dmesg' at the time of the dump.
>
> Having this stick around longer in core kernel memory (not even
> swappable) seems like a bad idea?
>
> What kind of timeout were you thinking? Maybe you'd want 10 minutes? An
> hour?
>
> Also, then, why should the timeout be device-specific? If the user is
> going to need time to find stuff, then surely that applies regardless of
> the device?
>
> So ... I guess I don't really like this, and don't really see how it
> makes sense. Arguably, 5 minutes even is too long, not too short,
> because you should have scripting that captures it, writes it to disk,
> and all that can happen in the space of seconds, rather than minutes.
> It's trivial to write such a script with a udev trigger or similar.
>
> If we wanted to, we could even have a script that not only captures it
> to disk, but also deletes it again from disk after a day or something,
> so if you didn't care you don't get things accumulating. But I don't see
> why the kernel should need to hang on to all the (possibly big) core
> dump in RAM, for whatever time. And I also don't like the device-
> dependency very much, TBH.

In my opinion, the timeout should depend on the type of device driver.

In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.

For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
ending a gaming match, watching a YouTube video, etc.).
If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
instructions alone may take inexperienced Linux users more than five minutes.

I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.

>
>
>
> But if we do go there eventually:
>
> > +void dev_coredumpm(struct device *dev, struct module *owner,
> > + void *data, size_t datalen, gfp_t gfp,
> > + ssize_t (*read)(char *buffer, loff_t offset, size_t count,
> > + void *data, size_t datalen),
> > + void (*free)(void *data))
> > +{
> > + dev_coredumpm_timeout(dev, owner, data, datalen, gfp, read, free,
> > + DEVCD_TIMEOUT);
> > +}
> >  EXPORT_SYMBOL_GPL(dev_coredumpm);
>
> This could be a trivial static inline now, if you just put DEVCD_TIMEOUT
> into the header file. Seems better than exporting another whole function
> for it. Then you also don't need the no-op version of it.

works for me too, will wait an ack in the above explanation.

>
> johannes
>

2024-03-01 08:48:55

by Johannes Berg

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
>
> In my opinion, the timeout should depend on the type of device driver.
>
> In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
>
> For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
> ending a gaming match, watching a YouTube video, etc.).
> If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
> instructions alone may take inexperienced Linux users more than five minutes.

That's all not wrong, but I don't see why you wouldn't automate this
even on end user machines? I feel you're boxing the problem in by
wanting to solve it entirely in the kernel?

> I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.

At an hour now, people will probably start arguing that "indefinitely"
is about right? But at that point you're probably back to persisting
them on disk anyway? Or maybe glitches happen during logout/shutdown ...

Anyway, I don't want to block this because I just don't care enough
about how you do things, but I think the kernel is the wrong place to
solve this problem... The intent here was to give some userspace time to
grab it (and yes for that 5 minutes is already way too long), not the
users. That's also part of the reason we only hold on to a single
instance, since I didn't want it to keep consuming more and more memory
for it if happens repeatedly.

johannes

2024-03-04 14:59:10

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Fri, 2024-03-01 at 09:38 +0100, Johannes Berg wrote:
> > On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
> > > >
> > > > In my opinion, the timeout should depend on the type of device driver.
> > > >
> > > > In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
> > > >
> > > > For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
> > > > ending a gaming match, watching a YouTube video, etc.).
> > > > If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
> > > > instructions alone may take inexperienced Linux users more than five minutes.
> >
> > That's all not wrong, but I don't see why you wouldn't automate this
> > even on end user machines? I feel you're boxing the problem in by
> > wanting to solve it entirely in the kernel?

The other part of the stack that we provide are the libraries implementing Vulkan and OpenGL APIs, I don't think we could ship scripts that needs
elevated privileges to read and store coredump.

> >
> > > > I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.
> >
> > At an hour now, people will probably start arguing that "indefinitely"
> > is about right? But at that point you're probably back to persisting
> > them on disk anyway? Or maybe glitches happen during logout/shutdown ...

i915 driver don't use coredump and it persist the error dump in memory until user frees it or reboot it and we got no complains.

> >
> > Anyway, I don't want to block this because I just don't care enough
> > about how you do things, but I think the kernel is the wrong place to
> > solve this problem... The intent here was to give some userspace time to
> > grab it (and yes for that 5 minutes is already way too long), not the
> > users. That's also part of the reason we only hold on to a single
> > instance, since I didn't want it to keep consuming more and more memory
> > for it if happens repeatedly.
> >

okay so will move forward with other version applying your suggestion to make dev_coredumpm() static inline and move to the header.

thank you for the feedback

> > johannes

2024-03-04 23:56:52

by Lucas De Marchi

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Mon, Mar 04, 2024 at 02:29:03PM +0000, Jose Souza wrote:
>On Fri, 2024-03-01 at 09:38 +0100, Johannes Berg wrote:
>> > On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
>> > > >
>> > > > In my opinion, the timeout should depend on the type of device driver.
>> > > >
>> > > > In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
>> > > >
>> > > > For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
>> > > > ending a gaming match, watching a YouTube video, etc.).
>> > > > If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
>> > > > instructions alone may take inexperienced Linux users more than five minutes.
>> >
>> > That's all not wrong, but I don't see why you wouldn't automate this
>> > even on end user machines? I feel you're boxing the problem in by
>> > wanting to solve it entirely in the kernel?
>
>The other part of the stack that we provide are the libraries implementing Vulkan and OpenGL APIs, I don't think we could ship scripts that needs
>elevated privileges to read and store coredump.

it's still a very valid point though. Why are we doing this only on
kernel side or mesa side rather than doing it in the proper place? As
Johannes said, this could very well be automated via udev rules.
Distros automate getting the coredump already with systemd-coredump and
the like. Why wouldn't we do it similarly for GPU? Handling this at
the proper place you leave the policy there for "how long to retain the
log", "maximum size", "rotation", etc.... outside of the kernel.

For the purposes of reporting a bug, wouldn't it be better to instruct
users to get the log that was saved to disk so they don't risk losing
it? I view the timeout more as a "protection" from the kernel side to
not waste memory if the complete stack is not in place. It shoudln't
be viewed as a timeout for how long the *user* will take to get the log
and create bug reports.

Lucas De Marchi

>
>> >
>> > > > I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.
>> >
>> > At an hour now, people will probably start arguing that "indefinitely"
>> > is about right? But at that point you're probably back to persisting
>> > them on disk anyway? Or maybe glitches happen during logout/shutdown ...
>
>i915 driver don't use coredump and it persist the error dump in memory until user frees it or reboot it and we got no complains.
>
>> >
>> > Anyway, I don't want to block this because I just don't care enough
>> > about how you do things, but I think the kernel is the wrong place to
>> > solve this problem... The intent here was to give some userspace time to
>> > grab it (and yes for that 5 minutes is already way too long), not the
>> > users. That's also part of the reason we only hold on to a single
>> > instance, since I didn't want it to keep consuming more and more memory
>> > for it if happens repeatedly.
>> >
>
>okay so will move forward with other version applying your suggestion to make dev_coredumpm() static inline and move to the header.
>
>thank you for the feedback
>
>> > johannes
>

2024-03-05 14:22:39

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Mon, 2024-03-04 at 17:55 -0600, Lucas De Marchi wrote:
> On Mon, Mar 04, 2024 at 02:29:03PM +0000, Jose Souza wrote:
> > On Fri, 2024-03-01 at 09:38 +0100, Johannes Berg wrote:
> > > > On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
> > > > > >
> > > > > > In my opinion, the timeout should depend on the type of device driver.
> > > > > >
> > > > > > In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
> > > > > >
> > > > > > For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
> > > > > > ending a gaming match, watching a YouTube video, etc.).
> > > > > > If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
> > > > > > instructions alone may take inexperienced Linux users more than five minutes.
> > > >
> > > > That's all not wrong, but I don't see why you wouldn't automate this
> > > > even on end user machines? I feel you're boxing the problem in by
> > > > wanting to solve it entirely in the kernel?
> >
> > The other part of the stack that we provide are the libraries implementing Vulkan and OpenGL APIs, I don't think we could ship scripts that needs
> > elevated privileges to read and store coredump.
>
> it's still a very valid point though. Why are we doing this only on
> kernel side or mesa side rather than doing it in the proper place? As
> Johannes said, this could very well be automated via udev rules.
> Distros automate getting the coredump already with systemd-coredump and
> the like. Why wouldn't we do it similarly for GPU? Handling this at
> the proper place you leave the policy there for "how long to retain the
> log", "maximum size", "rotation", etc.... outside of the kernel.

Where and how would this udev rules be distributed?
There is portable way to do that for distros that don't ship with systemd?

>
> For the purposes of reporting a bug, wouldn't it be better to instruct
> users to get the log that was saved to disk so they don't risk losing
> it? I view the timeout more as a "protection" from the kernel side to
> not waste memory if the complete stack is not in place. It shoudln't
> be viewed as a timeout for how long the *user* will take to get the log
> and create bug reports.
>
> Lucas De Marchi
>
> >
> > > >
> > > > > > I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.
> > > >
> > > > At an hour now, people will probably start arguing that "indefinitely"
> > > > is about right? But at that point you're probably back to persisting
> > > > them on disk anyway? Or maybe glitches happen during logout/shutdown ...
> >
> > i915 driver don't use coredump and it persist the error dump in memory until user frees it or reboot it and we got no complains.
> >
> > > >
> > > > Anyway, I don't want to block this because I just don't care enough
> > > > about how you do things, but I think the kernel is the wrong place to
> > > > solve this problem... The intent here was to give some userspace time to
> > > > grab it (and yes for that 5 minutes is already way too long), not the
> > > > users. That's also part of the reason we only hold on to a single
> > > > instance, since I didn't want it to keep consuming more and more memory
> > > > for it if happens repeatedly.
> > > >
> >
> > okay so will move forward with other version applying your suggestion to make dev_coredumpm() static inline and move to the header.
> >
> > thank you for the feedback
> >
> > > > johannes
> >

2024-03-05 15:22:48

by Lucas De Marchi

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Tue, Mar 05, 2024 at 02:21:45PM +0000, Jose Souza wrote:
>On Mon, 2024-03-04 at 17:55 -0600, Lucas De Marchi wrote:
>> On Mon, Mar 04, 2024 at 02:29:03PM +0000, Jose Souza wrote:
>> > On Fri, 2024-03-01 at 09:38 +0100, Johannes Berg wrote:
>> > > > On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
>> > > > > >
>> > > > > > In my opinion, the timeout should depend on the type of device driver.
>> > > > > >
>> > > > > > In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
>> > > > > >
>> > > > > > For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
>> > > > > > ending a gaming match, watching a YouTube video, etc.).
>> > > > > > If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
>> > > > > > instructions alone may take inexperienced Linux users more than five minutes.
>> > > >
>> > > > That's all not wrong, but I don't see why you wouldn't automate this
>> > > > even on end user machines? I feel you're boxing the problem in by
>> > > > wanting to solve it entirely in the kernel?
>> >
>> > The other part of the stack that we provide are the libraries implementing Vulkan and OpenGL APIs, I don't think we could ship scripts that needs
>> > elevated privileges to read and store coredump.
>>
>> it's still a very valid point though. Why are we doing this only on
>> kernel side or mesa side rather than doing it in the proper place? As
>> Johannes said, this could very well be automated via udev rules.
>> Distros automate getting the coredump already with systemd-coredump and
>> the like. Why wouldn't we do it similarly for GPU? Handling this at
>> the proper place you leave the policy there for "how long to retain the
>> log", "maximum size", "rotation", etc.... outside of the kernel.
>
>Where and how would this udev rules be distributed?

it depends on where you implement such a logic to collect gpu coredump.
It might be a new project, it might be a daemon from mesa itself, it
might be extending systemd-coredump. Your decision on where to
implement it will influence what's the reach it will have.

>There is portable way to do that for distros that don't ship with systemd?

If you do it in one place, people who care can probably replicate to
other environments.

Lucas De Marchi

>
>>
>> For the purposes of reporting a bug, wouldn't it be better to instruct
>> users to get the log that was saved to disk so they don't risk losing
>> it? I view the timeout more as a "protection" from the kernel side to
>> not waste memory if the complete stack is not in place. It shoudln't
>> be viewed as a timeout for how long the *user* will take to get the log
>> and create bug reports.
>>
>> Lucas De Marchi
>>
>> >
>> > > >
>> > > > > > I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.
>> > > >
>> > > > At an hour now, people will probably start arguing that "indefinitely"
>> > > > is about right? But at that point you're probably back to persisting
>> > > > them on disk anyway? Or maybe glitches happen during logout/shutdown ...
>> >
>> > i915 driver don't use coredump and it persist the error dump in memory until user frees it or reboot it and we got no complains.
>> >
>> > > >
>> > > > Anyway, I don't want to block this because I just don't care enough
>> > > > about how you do things, but I think the kernel is the wrong place to
>> > > > solve this problem... The intent here was to give some userspace time to
>> > > > grab it (and yes for that 5 minutes is already way too long), not the
>> > > > users. That's also part of the reason we only hold on to a single
>> > > > instance, since I didn't want it to keep consuming more and more memory
>> > > > for it if happens repeatedly.
>> > > >
>> >
>> > okay so will move forward with other version applying your suggestion to make dev_coredumpm() static inline and move to the header.
>> >
>> > thank you for the feedback
>> >
>> > > > johannes
>> >
>

2024-03-05 15:39:11

by José Roberto de Souza

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Tue, 2024-03-05 at 09:22 -0600, Lucas De Marchi wrote:
> On Tue, Mar 05, 2024 at 02:21:45PM +0000, Jose Souza wrote:
> > On Mon, 2024-03-04 at 17:55 -0600, Lucas De Marchi wrote:
> > > On Mon, Mar 04, 2024 at 02:29:03PM +0000, Jose Souza wrote:
> > > > On Fri, 2024-03-01 at 09:38 +0100, Johannes Berg wrote:
> > > > > > On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
> > > > > > > >
> > > > > > > > In my opinion, the timeout should depend on the type of device driver.
> > > > > > > >
> > > > > > > > In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
> > > > > > > >
> > > > > > > > For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
> > > > > > > > ending a gaming match, watching a YouTube video, etc.).
> > > > > > > > If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
> > > > > > > > instructions alone may take inexperienced Linux users more than five minutes.
> > > > > >
> > > > > > That's all not wrong, but I don't see why you wouldn't automate this
> > > > > > even on end user machines? I feel you're boxing the problem in by
> > > > > > wanting to solve it entirely in the kernel?
> > > >
> > > > The other part of the stack that we provide are the libraries implementing Vulkan and OpenGL APIs, I don't think we could ship scripts that needs
> > > > elevated privileges to read and store coredump.
> > >
> > > it's still a very valid point though. Why are we doing this only on
> > > kernel side or mesa side rather than doing it in the proper place? As
> > > Johannes said, this could very well be automated via udev rules.
> > > Distros automate getting the coredump already with systemd-coredump and
> > > the like. Why wouldn't we do it similarly for GPU? Handling this at
> > > the proper place you leave the policy there for "how long to retain the
> > > log", "maximum size", "rotation", etc.... outside of the kernel.
> >
> > Where and how would this udev rules be distributed?
>
> it depends on where you implement such a logic to collect gpu coredump.
> It might be a new project, it might be a daemon from mesa itself, it
> might be extending systemd-coredump. Your decision on where to
> implement it will influence what's the reach it will have.

Don't make sense to be in Mesa, compute and media stacks also needs it.

>
> > There is portable way to do that for distros that don't ship with systemd?
>
> If you do it in one place, people who care can probably replicate to
> other environments.

But then the 5 min timeout is still problematic.

In my opinion we can have this automation, make it store codedump in disk, do the dump rotation... but also have a 1 hour timeout.
The automation can write "0" to devcoredump/data and free the dump from memory for the distros that supports this automation.

>
> Lucas De Marchi
>
> >
> > >
> > > For the purposes of reporting a bug, wouldn't it be better to instruct
> > > users to get the log that was saved to disk so they don't risk losing
> > > it? I view the timeout more as a "protection" from the kernel side to
> > > not waste memory if the complete stack is not in place. It shoudln't
> > > be viewed as a timeout for how long the *user* will take to get the log
> > > and create bug reports.
> > >
> > > Lucas De Marchi
> > >
> > > >
> > > > > >
> > > > > > > > I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.
> > > > > >
> > > > > > At an hour now, people will probably start arguing that "indefinitely"
> > > > > > is about right? But at that point you're probably back to persisting
> > > > > > them on disk anyway? Or maybe glitches happen during logout/shutdown ...
> > > >
> > > > i915 driver don't use coredump and it persist the error dump in memory until user frees it or reboot it and we got no complains.
> > > >
> > > > > >
> > > > > > Anyway, I don't want to block this because I just don't care enough
> > > > > > about how you do things, but I think the kernel is the wrong place to
> > > > > > solve this problem... The intent here was to give some userspace time to
> > > > > > grab it (and yes for that 5 minutes is already way too long), not the
> > > > > > users. That's also part of the reason we only hold on to a single
> > > > > > instance, since I didn't want it to keep consuming more and more memory
> > > > > > for it if happens repeatedly.
> > > > > >
> > > >
> > > > okay so will move forward with other version applying your suggestion to make dev_coredumpm() static inline and move to the header.
> > > >
> > > > thank you for the feedback
> > > >
> > > > > > johannes
> > > >
> >

2024-03-08 14:53:40

by Lucas De Marchi

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Tue, Mar 05, 2024 at 03:38:58PM +0000, Jose Souza wrote:
>On Tue, 2024-03-05 at 09:22 -0600, Lucas De Marchi wrote:
>> On Tue, Mar 05, 2024 at 02:21:45PM +0000, Jose Souza wrote:
>> > On Mon, 2024-03-04 at 17:55 -0600, Lucas De Marchi wrote:
>> > > On Mon, Mar 04, 2024 at 02:29:03PM +0000, Jose Souza wrote:
>> > > > On Fri, 2024-03-01 at 09:38 +0100, Johannes Berg wrote:
>> > > > > > On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
>> > > > > > > >
>> > > > > > > > In my opinion, the timeout should depend on the type of device driver.
>> > > > > > > >
>> > > > > > > > In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
>> > > > > > > >
>> > > > > > > > For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
>> > > > > > > > ending a gaming match, watching a YouTube video, etc.).
>> > > > > > > > If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
>> > > > > > > > instructions alone may take inexperienced Linux users more than five minutes.
>> > > > > >
>> > > > > > That's all not wrong, but I don't see why you wouldn't automate this
>> > > > > > even on end user machines? I feel you're boxing the problem in by
>> > > > > > wanting to solve it entirely in the kernel?
>> > > >
>> > > > The other part of the stack that we provide are the libraries implementing Vulkan and OpenGL APIs, I don't think we could ship scripts that needs
>> > > > elevated privileges to read and store coredump.
>> > >
>> > > it's still a very valid point though. Why are we doing this only on
>> > > kernel side or mesa side rather than doing it in the proper place? As
>> > > Johannes said, this could very well be automated via udev rules.
>> > > Distros automate getting the coredump already with systemd-coredump and
>> > > the like. Why wouldn't we do it similarly for GPU? Handling this at
>> > > the proper place you leave the policy there for "how long to retain the
>> > > log", "maximum size", "rotation", etc.... outside of the kernel.
>> >
>> > Where and how would this udev rules be distributed?
>>
>> it depends on where you implement such a logic to collect gpu coredump.
>> It might be a new project, it might be a daemon from mesa itself, it
>> might be extending systemd-coredump. Your decision on where to
>> implement it will influence what's the reach it will have.
>
>Don't make sense to be in Mesa, compute and media stacks also needs it.

so what? A project can't have something that is useful to other
projects? Anyway, having it in mesa is just one of the possibilities I
mentinoned.

>
>>
>> > There is portable way to do that for distros that don't ship with systemd?
>>
>> If you do it in one place, people who care can probably replicate to
>> other environments.
>
>But then the 5 min timeout is still problematic.
>
>In my opinion we can have this automation, make it store codedump in disk, do the dump rotation... but also have a 1 hour timeout.
>The automation can write "0" to devcoredump/data and free the dump from memory for the distros that supports this automation.

IMO it should not be treated as advanced automation, but rather the
normal way to collect dev coredump. It's much more useful to the end user
than documenting "oh, if you see a glitch on the screen, hurry up you
have X min to look at file /path/to/log to get it from the kernel. And
btw the glitch couldn be something else that does not generate a
coredump, so if you don't have it, it's normal)".

It's not up to me to decide though. If maintainers think it's ok due to
be a small change with no dire consequences, then fine.

Lucas De Marchi

>
>>
>> Lucas De Marchi
>>
>> >
>> > >
>> > > For the purposes of reporting a bug, wouldn't it be better to instruct
>> > > users to get the log that was saved to disk so they don't risk losing
>> > > it? I view the timeout more as a "protection" from the kernel side to
>> > > not waste memory if the complete stack is not in place. It shoudln't
>> > > be viewed as a timeout for how long the *user* will take to get the log
>> > > and create bug reports.
>> > >
>> > > Lucas De Marchi
>> > >
>> > > >
>> > > > > >
>> > > > > > > > I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.
>> > > > > >
>> > > > > > At an hour now, people will probably start arguing that "indefinitely"
>> > > > > > is about right? But at that point you're probably back to persisting
>> > > > > > them on disk anyway? Or maybe glitches happen during logout/shutdown ...
>> > > >
>> > > > i915 driver don't use coredump and it persist the error dump in memory until user frees it or reboot it and we got no complains.
>> > > >
>> > > > > >
>> > > > > > Anyway, I don't want to block this because I just don't care enough
>> > > > > > about how you do things, but I think the kernel is the wrong place to
>> > > > > > solve this problem... The intent here was to give some userspace time to
>> > > > > > grab it (and yes for that 5 minutes is already way too long), not the
>> > > > > > users. That's also part of the reason we only hold on to a single
>> > > > > > instance, since I didn't want it to keep consuming more and more memory
>> > > > > > for it if happens repeatedly.
>> > > > > >
>> > > >
>> > > > okay so will move forward with other version applying your suggestion to make dev_coredumpm() static inline and move to the header.
>> > > >
>> > > > thank you for the feedback
>> > > >
>> > > > > > johannes
>> > > >
>> >
>

2024-03-08 15:14:52

by Rodrigo Vivi

[permalink] [raw]
Subject: Re: [PATCH v2 2/4] devcoredump: Add dev_coredumpm_timeout()

On Fri, Mar 08, 2024 at 08:53:24AM -0600, Lucas De Marchi wrote:
> On Tue, Mar 05, 2024 at 03:38:58PM +0000, Jose Souza wrote:
> > On Tue, 2024-03-05 at 09:22 -0600, Lucas De Marchi wrote:
> > > On Tue, Mar 05, 2024 at 02:21:45PM +0000, Jose Souza wrote:
> > > > On Mon, 2024-03-04 at 17:55 -0600, Lucas De Marchi wrote:
> > > > > On Mon, Mar 04, 2024 at 02:29:03PM +0000, Jose Souza wrote:
> > > > > > On Fri, 2024-03-01 at 09:38 +0100, Johannes Berg wrote:
> > > > > > > > On Wed, 2024-02-28 at 17:56 +0000, Souza, Jose wrote:
> > > > > > > > > >
> > > > > > > > > > In my opinion, the timeout should depend on the type of device driver.
> > > > > > > > > >
> > > > > > > > > > In the case of server-class Ethernet cards, where corporate users automate most tasks, five minutes might even be considered excessive.
> > > > > > > > > >
> > > > > > > > > > For our case, GPUs, users might experience minor glitches and only search for what happened after finishing their current task (writing an email,
> > > > > > > > > > ending a gaming match, watching a YouTube video, etc.).
> > > > > > > > > > If they land on https://drm.pages.freedesktop.org/intel-docs/how-to-file-i915-bugs.html or the future Xe version of that page, following the
> > > > > > > > > > instructions alone may take inexperienced Linux users more than five minutes.
> > > > > > > >
> > > > > > > > That's all not wrong, but I don't see why you wouldn't automate this
> > > > > > > > even on end user machines? I feel you're boxing the problem in by
> > > > > > > > wanting to solve it entirely in the kernel?
> > > > > >
> > > > > > The other part of the stack that we provide are the libraries implementing Vulkan and OpenGL APIs, I don't think we could ship scripts that needs
> > > > > > elevated privileges to read and store coredump.
> > > > >
> > > > > it's still a very valid point though. Why are we doing this only on
> > > > > kernel side or mesa side rather than doing it in the proper place? As
> > > > > Johannes said, this could very well be automated via udev rules.
> > > > > Distros automate getting the coredump already with systemd-coredump and
> > > > > the like. Why wouldn't we do it similarly for GPU? Handling this at
> > > > > the proper place you leave the policy there for "how long to retain the
> > > > > log", "maximum size", "rotation", etc.... outside of the kernel.
> > > >
> > > > Where and how would this udev rules be distributed?

Perhaps we can have igt tool to distribute the udev rule?

> > >
> > > it depends on where you implement such a logic to collect gpu coredump.
> > > It might be a new project, it might be a daemon from mesa itself, it
> > > might be extending systemd-coredump. Your decision on where to
> > > implement it will influence what's the reach it will have.
> >
> > Don't make sense to be in Mesa, compute and media stacks also needs it.
>
> so what? A project can't have something that is useful to other
> projects? Anyway, having it in mesa is just one of the possibilities I
> mentinoned.

There's one big case that I don't see in this discussion here but some
folks might be wondering about.

It might not be a matter of where and how we distribute the rule or how
we capture the log.

We might have cases where the dump is extremely huge: 2GB+ if we are
setting to capture all the textures and everything else.
And only the very first hang is useful for debugging.

So, in this case we perhaps want to avoid extra huge captures.
Okay, most of that could be scripted out with udev. If the dump
is huge don't write 1 back to data so you don't keep capturing
more, but then after 5 minutes we will have more huge captures
that we could likely be avoided to start with if we had the option
to do so.

This would possibly be one argument for a single capture or longer
timeout before coredump removal.

But well, I believe this case only happen in debug machines by devs
so maybe not a so strong case in the end of the day.

>
> >
> > >
> > > > There is portable way to do that for distros that don't ship with systemd?
> > >
> > > If you do it in one place, people who care can probably replicate to
> > > other environments.
> >
> > But then the 5 min timeout is still problematic.
> >
> > In my opinion we can have this automation, make it store codedump in disk, do the dump rotation... but also have a 1 hour timeout.
> > The automation can write "0" to devcoredump/data and free the dump from memory for the distros that supports this automation.
>
> IMO it should not be treated as advanced automation, but rather the
> normal way to collect dev coredump. It's much more useful to the end user
> than documenting "oh, if you see a glitch on the screen, hurry up you
> have X min to look at file /path/to/log to get it from the kernel. And
> btw the glitch couldn be something else that does not generate a
> coredump, so if you don't have it, it's normal)".
>
> It's not up to me to decide though. If maintainers think it's ok due to
> be a small change with no dire consequences, then fine.
>
> Lucas De Marchi
>
> >
> > >
> > > Lucas De Marchi
> > >
> > > >
> > > > >
> > > > > For the purposes of reporting a bug, wouldn't it be better to instruct
> > > > > users to get the log that was saved to disk so they don't risk losing
> > > > > it? I view the timeout more as a "protection" from the kernel side to
> > > > > not waste memory if the complete stack is not in place. It shoudln't
> > > > > be viewed as a timeout for how long the *user* will take to get the log
> > > > > and create bug reports.
> > > > >
> > > > > Lucas De Marchi
> > > > >
> > > > > >
> > > > > > > >
> > > > > > > > > > I have set the timeout to one hour in the Xe driver, but this could increase if we start receiving user complaints.
> > > > > > > >
> > > > > > > > At an hour now, people will probably start arguing that "indefinitely"
> > > > > > > > is about right? But at that point you're probably back to persisting
> > > > > > > > them on disk anyway? Or maybe glitches happen during logout/shutdown ...
> > > > > >
> > > > > > i915 driver don't use coredump and it persist the error dump in memory until user frees it or reboot it and we got no complains.
> > > > > >
> > > > > > > >
> > > > > > > > Anyway, I don't want to block this because I just don't care enough
> > > > > > > > about how you do things, but I think the kernel is the wrong place to
> > > > > > > > solve this problem... The intent here was to give some userspace time to
> > > > > > > > grab it (and yes for that 5 minutes is already way too long), not the
> > > > > > > > users. That's also part of the reason we only hold on to a single
> > > > > > > > instance, since I didn't want it to keep consuming more and more memory
> > > > > > > > for it if happens repeatedly.
> > > > > > > >
> > > > > >
> > > > > > okay so will move forward with other version applying your suggestion to make dev_coredumpm() static inline and move to the header.
> > > > > >
> > > > > > thank you for the feedback
> > > > > >
> > > > > > > > johannes
> > > > > >
> > > >
> >