2020-04-30 17:03:55

by Doug Anderson

[permalink] [raw]
Subject: [PATCH] kgdboc: Be a bit more robust about handling earlycon leaving

The original implementation of kgdboc_earlycon included a comment
about how it was impossible to get notified about the boot console
going away without making changes to the Linux core. Since folks
often don't want to change the Linux core for kgdb's purposes, the
kgdboc_earlycon implementation did a bit of polling to figure out when
the boot console went away.

It turns out, though, that it is possible to get notified about the
boot console going away. The solution is either clever or a hack
depending on your viewpoint. ...or, perhaps, a clever hack. All we
need to do is head-patch the "exit" routine of the boot console. We
know that "struct console" must be writable because it has a "next"
pointer in it, so we can just put our own exit routine in, do our
stuff, and then call back to the original.

This works great to get notified about the boot console going away.
The (slight) problem is that in the context of the boot console's exit
routine we can't call tty_find_polling_driver(). We solve this by
kicking off some work on the system_wq when we get notified and this
works pretty well.

This is all still not perfect but seems as good you'll be able to do
without more intrusive changes to the Linux core. It feels like the
window where the debugger isn't available should be tiny on all
systems now, even those systems that manage to init their tty earlier.

Signed-off-by: Douglas Anderson <[email protected]>
---
This is based upon Daniel's recent patch [1] which is, in turn, based
on my recent series [2]. In theory this patch could be folded into
the patches of the original series, but presumably it will be less
churn / spamming of people to just treat it as a patch atop. I wish I
had thought of this trick earlier, but I suppose these kinds of things
only occur to people at 4 am in bed when they can't sleep.

[1] https://lore.kernel.org/r/[email protected]
[2] https://lore.kernel.org/r/[email protected]

drivers/tty/serial/kgdboc.c | 112 ++++++++++++++++++++++++++----------
1 file changed, 81 insertions(+), 31 deletions(-)

diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
index 596213272ec3..4bef5e8ab991 100644
--- a/drivers/tty/serial/kgdboc.c
+++ b/drivers/tty/serial/kgdboc.c
@@ -22,6 +22,7 @@
#include <linux/module.h>
#include <linux/platform_device.h>
#include <linux/serial_core.h>
+#include <linux/workqueue.h>

#define MAX_CONFIG_LEN 40

@@ -46,6 +47,7 @@ static struct platform_device *kgdboc_pdev;
#ifdef CONFIG_KGDB_SERIAL_CONSOLE
static struct kgdb_io kgdboc_earlycon_io_ops;
struct console *earlycon;
+int (*earlycon_orig_exit)(struct console *con);
#else /* ! CONFIG_KGDB_SERIAL_CONSOLE */
#define earlycon NULL
#endif /* ! CONFIG_KGDB_SERIAL_CONSOLE */
@@ -161,20 +163,9 @@ static bool is_earlycon_still_valid(void)
return false;
}

-static void cleanup_earlycon_if_invalid(void)
-{
- console_lock();
- if (earlycon && !is_earlycon_still_valid()) {
- pr_warn("earlycon vanished; unregistering\n");
- cleanup_earlycon();
- }
- console_unlock();
-}
-
#else /* ! CONFIG_KGDB_SERIAL_CONSOLE */

static inline void cleanup_earlycon(void) { ; }
-static inline void cleanup_earlycon_if_invalid(void) { ; }

#endif /* ! CONFIG_KGDB_SERIAL_CONSOLE */

@@ -251,14 +242,6 @@ static int configure_kgdboc(void)
kgdboc_unregister_kbd();
configured = 0;

- /*
- * Each time we run configure_kgdboc() but don't find a console, use
- * that as a chance to validate that our earlycon didn't vanish on
- * us. If it vanished we should unregister which will disable kgdb
- * if we're the last I/O module.
- */
- cleanup_earlycon_if_invalid();
-
return err;
}

@@ -481,15 +464,11 @@ static void kgdboc_earlycon_put_char(u8 chr)
static void kgdboc_earlycon_pre_exp_handler(void)
{
/*
- * We don't get notified when the boot console is unregistered.
- * Double-check when we enter the debugger. Unfortunately we
- * can't really unregister ourselves now, so we panic. We rely
- * on kgdb's ability to detect re-entrancy to make the panic
- * take effect.
- *
- * NOTE: if you're here in the lull when the real console has
- * replaced the boot console but our init hasn't run yet it's
- * possible that the "keep_bootcon" argument may help.
+ * There's a small window where the boot console exited but
+ * kgdb_earlycon is still registered because
+ * kgdboc_earlycon_exit_work_fn() hasn't run yet. See the comments
+ * in that function. The window should be tiny so hopefully nobody
+ * will ever hit this panic() but it's better to be safe than sorry.
*/
if (earlycon && !is_earlycon_still_valid())
panic("KGDB earlycon vanished and nothing replaced it\n");
@@ -509,10 +488,75 @@ static struct kgdb_io kgdboc_earlycon_io_ops = {
.is_console = true,
};

+static void kgdboc_earlycon_exit_work_fn(struct work_struct *work)
+{
+ /*
+ * Often this function races with init_kgdboc() since (due to probe
+ * ordering) init_kgdboc() often gets called right after serial drivers
+ * have registered. It doesn't matter which one wins. Both places try
+ * to configure because, even though they _often_ happen at nearly the
+ * same time, the two functions are not _guaranteed_ to happen
+ * at nearly the same time. Maybe, for instance, someone managed to
+ * init their main tty at an earlier initcall level. In such a case
+ * we'd run much earlier than init_kgdboc().
+ */
+ mutex_lock(&config_mutex);
+
+ /* See if the real kgdboc is ready */
+ if (configured != 1)
+ configure_kgdboc();
+
+ /* Cleanup earlycon even if real kgdboc wasn't ready */
+ cleanup_earlycon();
+
+ mutex_unlock(&config_mutex);
+}
+
+DECLARE_WORK(kgdboc_earlycon_exit_work, kgdboc_earlycon_exit_work_fn);
+
#define MAX_CONSOLE_NAME_LEN (sizeof((struct console *) 0)->name)
static char kgdboc_earlycon_param[MAX_CONSOLE_NAME_LEN] __initdata;
static bool kgdboc_earlycon_late_enable __initdata;

+int kgdboc_earlycon_trap_exit(struct console *con)
+{
+ int ret = 0;
+
+ /*
+ * The earlycon is gone. Presumably a real tty for the main kgdboc
+ * is ready for us. Ideally we'd register right here right now but
+ * we're being called under some locks. Queue some work so we can
+ * do it ASAP.
+ *
+ * NOTE: there's a bit of time between now and when our worker will
+ * get called and the earlycon is gone, though testing on one system
+ * showed the worker starting even before our function exited. The
+ * possibility of a little bit of time is why the
+ * kgdboc_earlycon_pre_exp_handler() function double-checks that
+ * earlycon is valid.
+ *
+ * In theory we would be "safer" if we unregistered kgdb_earlycon
+ * here and then just registered the main kgdboc in our worker. We
+ * don't do that because the debug core is expecting earlycon to be
+ * _replaced_ by the main kgdboc. If instead we do an
+ * unregister/reregister and someone had gdb attached then the debug
+ * core will throw a hissyfit. The worst thing that will happen if
+ * we try to drop in the debugger (because we hit a breakpoint or
+ * we crashed) and then we'll panic. ...but what else should we have
+ * done between these two points in time?
+ */
+ queue_work(system_wq, &kgdboc_earlycon_exit_work);
+
+ /* Our trap has done its work. Disarm it now */
+ if (earlycon_orig_exit) {
+ con->exit = earlycon_orig_exit;
+ earlycon_orig_exit = NULL;
+ ret = con->exit(con);
+ }
+
+ return ret;
+}
+
static int __init kgdboc_earlycon_init(char *opt)
{
struct console *con;
@@ -530,7 +574,6 @@ static int __init kgdboc_earlycon_init(char *opt)
(!opt || !opt[0] || strcmp(con->name, opt) == 0))
break;
}
- console_unlock();

if (!con) {
/*
@@ -551,7 +594,7 @@ static int __init kgdboc_earlycon_init(char *opt)
} else {
pr_info("Couldn't find kgdb earlycon\n");
}
- return 0;
+ goto unlock;
}

earlycon = con;
@@ -559,9 +602,16 @@ static int __init kgdboc_earlycon_init(char *opt)
if (kgdb_register_io_module(&kgdboc_earlycon_io_ops) != 0) {
earlycon = NULL;
pr_info("Failed to register kgdb with earlycon\n");
- return 0;
+ } else {
+ /* Trap exit so we know when earlycon goes away. */
+ earlycon_orig_exit = earlycon->exit;
+ earlycon->exit = kgdboc_earlycon_trap_exit;
}

+unlock:
+ console_unlock();
+
+ /* Non-zero means malformed option so we always return zero */
return 0;
}
early_param("kgdboc_earlycon", kgdboc_earlycon_init);
--
2.26.2.303.gf8c07b1a785-goog


2020-05-01 11:51:39

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] kgdboc: Be a bit more robust about handling earlycon leaving

On Thu, Apr 30, 2020 at 09:59:09AM -0700, Douglas Anderson wrote:
> The original implementation of kgdboc_earlycon included a comment
> about how it was impossible to get notified about the boot console
> going away without making changes to the Linux core. Since folks
> often don't want to change the Linux core for kgdb's purposes, the
> kgdboc_earlycon implementation did a bit of polling to figure out when
> the boot console went away.
>
> It turns out, though, that it is possible to get notified about the
> boot console going away. The solution is either clever or a hack
> depending on your viewpoint. ...or, perhaps, a clever hack. All we
> need to do is head-patch the "exit" routine of the boot console. We
> know that "struct console" must be writable because it has a "next"
> pointer in it, so we can just put our own exit routine in, do our
> stuff, and then call back to the original.

I think I'm in the hack camp on this one!


> This works great to get notified about the boot console going away.
> The (slight) problem is that in the context of the boot console's exit
> routine we can't call tty_find_polling_driver().

I presume this is something to do with the tty_mutex?


> We solve this by
> kicking off some work on the system_wq when we get notified and this
> works pretty well.

There are some problems with the workqueue approach.

Firstly, its runs too early on many systems (esp. those that register
the console from a console initcall. kgdboc cannot find the tty, I think
because the console is registered a long time before the corresponding
tty comes up.

Secondly I am seeing warnings related to the tty_mutex where the
might_sleep() machinery ends up flushing the active workqueue.

[ 39.298332] ------------[ cut here ]------------
[ 39.298332] WARNING: CPU: 0 PID: 5 at kernel/workqueue.c:3033 __flush_work+00
[ 39.298332] Modules linked in:
[ 39.298332] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.7.0-rc3+ #47
[ 39.298332] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-204
[ 39.298332] Workqueue: events kgdboc_earlycon_exit_work_fn
[ 39.298332] RIP: 0010:__flush_work+0x19c/0x1c0
[ 39.298332] Code: 4c 8b 6d 20 e9 06 ff ff ff 41 c6 04 24 00 fb 45 31 f6 eb 8f
[ 39.298332] RSP: 0018:ffff993500033dd0 EFLAGS: 00010246
[ 39.298332] RAX: 0000000000000000 RBX: ffffffffadcd0720 RCX: 0000000000000001
[ 39.298332] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffadcd0820
[ 39.298332] RBP: ffff8a633ec299c0 R08: 0000000000000000 R09: 0000000000000001
[ 39.298332] R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffed
[ 39.298332] R13: ffff8a633e408840 R14: 0000000000000000 R15: ffff8a633e408840
[ 39.298332] FS: 0000000000000000(0000) GS:ffff8a633ec00000(0000) knlGS:00000
[ 39.298332] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 39.298332] CR2: ffff8a6333201000 CR3: 0000000032a0a000 CR4: 00000000000006f0
[ 39.298332] Call Trace:
[ 39.298332] ? _cond_resched+0x10/0x20
[ 39.298332] ? mutex_lock+0x9/0x30
[ 39.298332] ? tty_find_polling_driver+0x134/0x1a0
[ 39.298332] configure_kgdboc+0x12d/0x1c0
[ 39.298332] kgdboc_earlycon_exit_work_fn+0x1a/0x40
[ 39.298332] process_one_work+0x1d3/0x380
[ 39.298332] worker_thread+0x45/0x3c0
[ 39.298332] kthread+0xf6/0x130
[ 39.298332] ? process_one_work+0x380/0x380
[ 39.298332] ? kthread_park+0x80/0x80
[ 39.298332] ret_from_fork+0x22/0x40
[ 39.298332] ---[ end trace 1190f578d6e11204 ]---
[ 39.298338] KGDB: Unregistered I/O driver kgdboc_earlycon, debugger disabled


Daniel.

2020-05-01 14:00:06

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] kgdboc: Be a bit more robust about handling earlycon leaving

On Fri, May 01, 2020 at 12:49:43PM +0100, Daniel Thompson wrote:
> On Thu, Apr 30, 2020 at 09:59:09AM -0700, Douglas Anderson wrote:
> > The original implementation of kgdboc_earlycon included a comment
> > about how it was impossible to get notified about the boot console
> > going away without making changes to the Linux core. Since folks
> > often don't want to change the Linux core for kgdb's purposes, the
> > kgdboc_earlycon implementation did a bit of polling to figure out when
> > the boot console went away.
> >
> > It turns out, though, that it is possible to get notified about the
> > boot console going away. The solution is either clever or a hack
> > depending on your viewpoint. ...or, perhaps, a clever hack. All we
> > need to do is head-patch the "exit" routine of the boot console. We
> > know that "struct console" must be writable because it has a "next"
> > pointer in it, so we can just put our own exit routine in, do our
> > stuff, and then call back to the original.
>
> I think I'm in the hack camp on this one!
>
>
> > This works great to get notified about the boot console going away.
> > The (slight) problem is that in the context of the boot console's exit
> > routine we can't call tty_find_polling_driver().
>
> I presume this is something to do with the tty_mutex?
>
>
> > We solve this by
> > kicking off some work on the system_wq when we get notified and this
> > works pretty well.
>
> There are some problems with the workqueue approach.

... so did a couple of experiments to avoid the workqueue.

It occured to me that, since we have interfered with deinit() then the
console hasn't actually been uninitialized meaning we could still use it.
This does exposes us to risks similar to keep_bootcon but in exchange
there is no window where kgdb is broken (and no need to panic).

My prototype is minimal but I did wonder about ripping out all the
code to defend against removal of the earlycon and simply keep the
earlycon around until a new kgdbio handler is installed.


Daniel.


diff --git a/drivers/tty/serial/kgdboc.c b/drivers/tty/serial/kgdboc.c
index 596213272ec3..05d58f9f38b1 100644
--- a/drivers/tty/serial/kgdboc.c
+++ b/drivers/tty/serial/kgdboc.c
@@ -46,6 +46,7 @@ static struct platform_device *kgdboc_pdev;
#ifdef CONFIG_KGDB_SERIAL_CONSOLE
static struct kgdb_io kgdboc_earlycon_io_ops;
struct console *earlycon;
+static int (*earlycon_orig_exit)(struct console *con);
#else /* ! CONFIG_KGDB_SERIAL_CONSOLE */
#define earlycon NULL
#endif /* ! CONFIG_KGDB_SERIAL_CONSOLE */
@@ -478,25 +479,28 @@ static void kgdboc_earlycon_put_char(u8 chr)
earlycon->write(earlycon, &chr, 1);
}

-static void kgdboc_earlycon_pre_exp_handler(void)
+static int kgdboc_earlycon_deferred_exit(struct console *con)
{
/*
- * We don't get notified when the boot console is unregistered.
- * Double-check when we enter the debugger. Unfortunately we
- * can't really unregister ourselves now, so we panic. We rely
- * on kgdb's ability to detect re-entrancy to make the panic
- * take effect.
- *
- * NOTE: if you're here in the lull when the real console has
- * replaced the boot console but our init hasn't run yet it's
- * possible that the "keep_bootcon" argument may help.
+ * restoring the exit function ensures the earlycon is
+ * deinitialized when/if we find a suitable tty
*/
- if (earlycon && !is_earlycon_still_valid())
- panic("KGDB earlycon vanished and nothing replaced it\n");
+ con->exit = earlycon_orig_exit;
+
+ return 0;
}

static void kgdboc_earlycon_deinit(void)
{
+ if (!earlycon)
+ return;
+
+ if (earlycon->exit == kgdboc_earlycon_deferred_exit) {
+ earlycon->exit = earlycon_orig_exit;
+ } else if (earlycon->exit) {
+ earlycon->exit(earlycon);
+ }
+
earlycon = NULL;
}

@@ -504,7 +508,6 @@ static struct kgdb_io kgdboc_earlycon_io_ops = {
.name = "kgdboc_earlycon",
.read_char = kgdboc_earlycon_get_char,
.write_char = kgdboc_earlycon_put_char,
- .pre_exception = kgdboc_earlycon_pre_exp_handler,
.deinit = kgdboc_earlycon_deinit,
.is_console = true,
};
@@ -562,6 +565,9 @@ static int __init kgdboc_earlycon_init(char *opt)
return 0;
}

+ earlycon_orig_exit = con->exit;
+ con->exit = kgdboc_earlycon_deferred_exit;
+
return 0;
}
early_param("kgdboc_earlycon", kgdboc_earlycon_init);

2020-05-01 17:37:49

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH] kgdboc: Be a bit more robust about handling earlycon leaving

Hi,

On Fri, May 1, 2020 at 4:49 AM Daniel Thompson
<[email protected]> wrote:
>
> On Thu, Apr 30, 2020 at 09:59:09AM -0700, Douglas Anderson wrote:
> > The original implementation of kgdboc_earlycon included a comment
> > about how it was impossible to get notified about the boot console
> > going away without making changes to the Linux core. Since folks
> > often don't want to change the Linux core for kgdb's purposes, the
> > kgdboc_earlycon implementation did a bit of polling to figure out when
> > the boot console went away.
> >
> > It turns out, though, that it is possible to get notified about the
> > boot console going away. The solution is either clever or a hack
> > depending on your viewpoint. ...or, perhaps, a clever hack. All we
> > need to do is head-patch the "exit" routine of the boot console. We
> > know that "struct console" must be writable because it has a "next"
> > pointer in it, so we can just put our own exit routine in, do our
> > stuff, and then call back to the original.
>
> I think I'm in the hack camp on this one!

LOL. We can drop it if need be and wait to see if someone actually
ends up in the dead-zone. Though I gotta earn the name "kernel
hacker" somehow, no? ;-)


> > This works great to get notified about the boot console going away.
> > The (slight) problem is that in the context of the boot console's exit
> > routine we can't call tty_find_polling_driver().
>
> I presume this is something to do with the tty_mutex?

Actually, no.

trying to acquire lock:
ffffff81b34436a8 (&port->mutex){+.+.}-{3:3}, at: uart_poll_init+0x70/0x164
but task is already holding lock:
ffffff81b34436a8 (&port->mutex){+.+.}-{3:3}, at: uart_add_one_port+0x84/0x49c

Here's the stack (I just put a call to kgdboc_earlycon_exit_work_fn()
straight into kgdboc_earlycon_trap_exit() to reproduce which is why it
still looks like there's a worker:

uart_poll_init+0x70/0x164
tty_find_polling_driver+0x130/0x18c
configure_kgdboc+0x9c/0x164
kgdboc_earlycon_exit_work_fn+0x30/0x5c
kgdboc_earlycon_trap_exit+0x1c/0x4c
unregister_console+0xd4/0x100
register_console+0x374/0x39c
uart_add_one_port+0x418/0x49c
qcom_geni_serial_probe+0x230/0x330

If we continue on this patch I'll add it to the commit message.


> > We solve this by
> > kicking off some work on the system_wq when we get notified and this
> > works pretty well.
>
> There are some problems with the workqueue approach.
>
> Firstly, its runs too early on many systems (esp. those that register
> the console from a console initcall. kgdboc cannot find the tty, I think
> because the console is registered a long time before the corresponding
> tty comes up.

Ah, I see. So the normal console comes along an inits and kicks the
boot console out but the tty isn't there yet. :( I guess my solution
only works if uart_add_one_port() is used and we hit this case in
uart_configure_port():

/*
* If this driver supports console, and it hasn't been
* successfully registered yet, try to re-register it.
* It may be that the port was not available.
*/

...if you've done something to register the console earlier then
you're outta luck. I'm not sure there's a ton we can do here. I
don't think we can really transition over to running kgdb on the main
console because, if nothing else, the write() function for those tend
to grab "uport->lock" which doesn't seem like it's always safe. Then
there'd be the problem of adding read() to all these consoles and
assuming it's non-blocking.

I feel like maybe the best that can be done is, now that earlycon
works as well as it does, to suggest that people _stop_ trying to
register their main console so early. ;-)

(OK, now that I've read your 2nd message I guess there is something we
can do: keep using the boot console).


In general, though, my patch shouldn't necessarily make anything
_worse_ for you. When the boot console's exit() routine is called
then you really can't use it any more. It shouldn't hurt to check if
the TTY is available so we can switch right away. I guess if your
case is common, though, we'd maybe want to get rid of the
cleanup_earlycon() in kgdboc_earlycon_exit_work_fn() because if you
attached gdb in kgdb_earlycon the debug_core will throw its hissyfit.
We can just cross our fingers and hope the TTY comes along before we
end up in the debugger again.


> Secondly I am seeing warnings related to the tty_mutex where the
> might_sleep() machinery ends up flushing the active workqueue.
>
> [ 39.298332] ------------[ cut here ]------------
> [ 39.298332] WARNING: CPU: 0 PID: 5 at kernel/workqueue.c:3033 __flush_work+00
> [ 39.298332] Modules linked in:
> [ 39.298332] CPU: 0 PID: 5 Comm: kworker/0:0 Not tainted 5.7.0-rc3+ #47
> [ 39.298332] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS ?-204
> [ 39.298332] Workqueue: events kgdboc_earlycon_exit_work_fn
> [ 39.298332] RIP: 0010:__flush_work+0x19c/0x1c0
> [ 39.298332] Code: 4c 8b 6d 20 e9 06 ff ff ff 41 c6 04 24 00 fb 45 31 f6 eb 8f
> [ 39.298332] RSP: 0018:ffff993500033dd0 EFLAGS: 00010246
> [ 39.298332] RAX: 0000000000000000 RBX: ffffffffadcd0720 RCX: 0000000000000001
> [ 39.298332] RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffffadcd0820
> [ 39.298332] RBP: ffff8a633ec299c0 R08: 0000000000000000 R09: 0000000000000001
> [ 39.298332] R10: 000000000000000a R11: f000000000000000 R12: 00000000ffffffed
> [ 39.298332] R13: ffff8a633e408840 R14: 0000000000000000 R15: ffff8a633e408840
> [ 39.298332] FS: 0000000000000000(0000) GS:ffff8a633ec00000(0000) knlGS:00000
> [ 39.298332] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
> [ 39.298332] CR2: ffff8a6333201000 CR3: 0000000032a0a000 CR4: 00000000000006f0
> [ 39.298332] Call Trace:
> [ 39.298332] ? _cond_resched+0x10/0x20
> [ 39.298332] ? mutex_lock+0x9/0x30
> [ 39.298332] ? tty_find_polling_driver+0x134/0x1a0
> [ 39.298332] configure_kgdboc+0x12d/0x1c0
> [ 39.298332] kgdboc_earlycon_exit_work_fn+0x1a/0x40
> [ 39.298332] process_one_work+0x1d3/0x380
> [ 39.298332] worker_thread+0x45/0x3c0
> [ 39.298332] kthread+0xf6/0x130
> [ 39.298332] ? process_one_work+0x380/0x380
> [ 39.298332] ? kthread_park+0x80/0x80
> [ 39.298332] ret_from_fork+0x22/0x40
> [ 39.298332] ---[ end trace 1190f578d6e11204 ]---
> [ 39.298338] KGDB: Unregistered I/O driver kgdboc_earlycon, debugger disabled

This is weird. Why don't I see this? Oh, I see. It's because your
console is replacing the boot console so early so "workqueue_init"
hasn't run yet. Mine happens much, much later.

...I can try to dig more if we want to continue going down this path,
but in general it should be fine to grab a mutex on a worker, right?
...and the workers shouldn't start running until it's safe to run? Is
this just a race where we don't set "wq_online = true" early enough
and pending work has run or something? Maybe if you drop into the
debugger at the time of this warning you'll find that some other task
is running and somewhere midway though workqueue_init()?

-Doug

2020-05-01 17:38:16

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH] kgdboc: Be a bit more robust about handling earlycon leaving

Hi,

On Fri, May 1, 2020 at 6:32 AM Daniel Thompson
<[email protected]> wrote:
>
> On Fri, May 01, 2020 at 12:49:43PM +0100, Daniel Thompson wrote:
> > On Thu, Apr 30, 2020 at 09:59:09AM -0700, Douglas Anderson wrote:
> > > The original implementation of kgdboc_earlycon included a comment
> > > about how it was impossible to get notified about the boot console
> > > going away without making changes to the Linux core. Since folks
> > > often don't want to change the Linux core for kgdb's purposes, the
> > > kgdboc_earlycon implementation did a bit of polling to figure out when
> > > the boot console went away.
> > >
> > > It turns out, though, that it is possible to get notified about the
> > > boot console going away. The solution is either clever or a hack
> > > depending on your viewpoint. ...or, perhaps, a clever hack. All we
> > > need to do is head-patch the "exit" routine of the boot console. We
> > > know that "struct console" must be writable because it has a "next"
> > > pointer in it, so we can just put our own exit routine in, do our
> > > stuff, and then call back to the original.
> >
> > I think I'm in the hack camp on this one!
> >
> >
> > > This works great to get notified about the boot console going away.
> > > The (slight) problem is that in the context of the boot console's exit
> > > routine we can't call tty_find_polling_driver().
> >
> > I presume this is something to do with the tty_mutex?
> > > We solve this by
> > > kicking off some work on the system_wq when we get notified and this
> > > works pretty well.
> >
> > There are some problems with the workqueue approach.
>
> ... so did a couple of experiments to avoid the workqueue.
>
> It occured to me that, since we have interfered with deinit() then the
> console hasn't actually been uninitialized meaning we could still use it.
> This does exposes us to risks similar to keep_bootcon but in exchange
> there is no window where kgdb is broken (and no need to panic).
>
> My prototype is minimal but I did wonder about ripping out all the
> code to defend against removal of the earlycon and simply keep the
> earlycon around until a new kgdbio handler is installed.

It took me a little while, but I finally see what you're saying.
You're saying that we keep calling into the boot console even though
it's no longer in the list of consoles. Then we temporarily disable
the boot console's exit routine until kgdb_earlycon() is done. (side
note: the exit routine was recently added and probably most consoles
don't use it).

OK, that doesn't seem totally insane. It actually works OK for you?

It's probably at least worth a warning in the log if we detect that
we're using the boot console and it's not in the console list anymore.
Then if kgdb starts misbehaving someone might have a clue.

If your solution is OK we might also want to remove the call to
cleanup_earlycon_if_invalid() in configure_kgdboc() too.

I think you might win the "hackiest solution" prize, but your solution
definitely does seem better because I can't think of any other good
way to handle people whose consoles register a long time before their
tty. ;-)


-Doug

2020-05-04 11:55:57

by Daniel Thompson

[permalink] [raw]
Subject: Re: [PATCH] kgdboc: Be a bit more robust about handling earlycon leaving

On Fri, May 01, 2020 at 10:36:14AM -0700, Doug Anderson wrote:
> Hi,
>
> On Fri, May 1, 2020 at 6:32 AM Daniel Thompson
> <[email protected]> wrote:
> >
> > On Fri, May 01, 2020 at 12:49:43PM +0100, Daniel Thompson wrote:
> > > On Thu, Apr 30, 2020 at 09:59:09AM -0700, Douglas Anderson wrote:
> > > > The original implementation of kgdboc_earlycon included a comment
> > > > about how it was impossible to get notified about the boot console
> > > > going away without making changes to the Linux core. Since folks
> > > > often don't want to change the Linux core for kgdb's purposes, the
> > > > kgdboc_earlycon implementation did a bit of polling to figure out when
> > > > the boot console went away.
> > > >
> > > > It turns out, though, that it is possible to get notified about the
> > > > boot console going away. The solution is either clever or a hack
> > > > depending on your viewpoint. ...or, perhaps, a clever hack. All we
> > > > need to do is head-patch the "exit" routine of the boot console. We
> > > > know that "struct console" must be writable because it has a "next"
> > > > pointer in it, so we can just put our own exit routine in, do our
> > > > stuff, and then call back to the original.
> > >
> > > I think I'm in the hack camp on this one!
> > >
> > >
> > > > This works great to get notified about the boot console going away.
> > > > The (slight) problem is that in the context of the boot console's exit
> > > > routine we can't call tty_find_polling_driver().
> > >
> > > I presume this is something to do with the tty_mutex?
> > > > We solve this by
> > > > kicking off some work on the system_wq when we get notified and this
> > > > works pretty well.
> > >
> > > There are some problems with the workqueue approach.
> >
> > ... so did a couple of experiments to avoid the workqueue.
> >
> > It occured to me that, since we have interfered with deinit() then the
> > console hasn't actually been uninitialized meaning we could still use it.
> > This does exposes us to risks similar to keep_bootcon but in exchange
> > there is no window where kgdb is broken (and no need to panic).
> >
> > My prototype is minimal but I did wonder about ripping out all the
> > code to defend against removal of the earlycon and simply keep the
> > earlycon around until a new kgdbio handler is installed.
>
> It took me a little while, but I finally see what you're saying.
> You're saying that we keep calling into the boot console even though
> it's no longer in the list of consoles. Then we temporarily disable
> the boot console's exit routine until kgdb_earlycon() is done. (side
> note: the exit routine was recently added and probably most consoles
> don't use it).

Certainly none of the devices with a read() method have an exit().


> OK, that doesn't seem totally insane. It actually works OK for you?

I tested on qemu/x86-64 (8250) and qemu/arm64 (pl011). In both cases it
works well.


> It's probably at least worth a warning in the log if we detect that
> we're using the boot console and it's not in the console list anymore.
> Then if kgdb starts misbehaving someone might have a clue.

Yes, I'll add that.


> If your solution is OK we might also want to remove the call to
> cleanup_earlycon_if_invalid() in configure_kgdboc() too.

That's what I thought, yes. Although it might be best to handle that
by ripping it out of the original patch set.


> I think you might win the "hackiest solution" prize, but your solution
> definitely does seem better because I can't think of any other good
> way to handle people whose consoles register a long time before their
> tty. ;-)

That's not a prize I was especially anxious to win...

However the results seem quite pleasing from a user point of view:
*if* we hit a breakpoint then the system really will be leaving no
stone unturned in its attempt to have talk (and listen) to the user.

If we want things to avoid hacking at the console structure we could
provide a direct function call API from earlycon to kgdboc so we limit
earlycon_kgdboc so it can *only* attach to the earlycon (and either
defer the exit() or leave a comment in earlycon so that if exit() were
ever added it doesn't break kgdboc).


Daniel.

2020-05-07 20:20:56

by Doug Anderson

[permalink] [raw]
Subject: Re: [PATCH] kgdboc: Be a bit more robust about handling earlycon leaving

Hi,

On Mon, May 4, 2020 at 4:53 AM Daniel Thompson
<[email protected]> wrote:
>
> On Fri, May 01, 2020 at 10:36:14AM -0700, Doug Anderson wrote:
> > Hi,
> >
> > On Fri, May 1, 2020 at 6:32 AM Daniel Thompson
> > <[email protected]> wrote:
> > >
> > > On Fri, May 01, 2020 at 12:49:43PM +0100, Daniel Thompson wrote:
> > > > On Thu, Apr 30, 2020 at 09:59:09AM -0700, Douglas Anderson wrote:
> > > > > The original implementation of kgdboc_earlycon included a comment
> > > > > about how it was impossible to get notified about the boot console
> > > > > going away without making changes to the Linux core. Since folks
> > > > > often don't want to change the Linux core for kgdb's purposes, the
> > > > > kgdboc_earlycon implementation did a bit of polling to figure out when
> > > > > the boot console went away.
> > > > >
> > > > > It turns out, though, that it is possible to get notified about the
> > > > > boot console going away. The solution is either clever or a hack
> > > > > depending on your viewpoint. ...or, perhaps, a clever hack. All we
> > > > > need to do is head-patch the "exit" routine of the boot console. We
> > > > > know that "struct console" must be writable because it has a "next"
> > > > > pointer in it, so we can just put our own exit routine in, do our
> > > > > stuff, and then call back to the original.
> > > >
> > > > I think I'm in the hack camp on this one!
> > > >
> > > >
> > > > > This works great to get notified about the boot console going away.
> > > > > The (slight) problem is that in the context of the boot console's exit
> > > > > routine we can't call tty_find_polling_driver().
> > > >
> > > > I presume this is something to do with the tty_mutex?
> > > > > We solve this by
> > > > > kicking off some work on the system_wq when we get notified and this
> > > > > works pretty well.
> > > >
> > > > There are some problems with the workqueue approach.
> > >
> > > ... so did a couple of experiments to avoid the workqueue.
> > >
> > > It occured to me that, since we have interfered with deinit() then the
> > > console hasn't actually been uninitialized meaning we could still use it.
> > > This does exposes us to risks similar to keep_bootcon but in exchange
> > > there is no window where kgdb is broken (and no need to panic).
> > >
> > > My prototype is minimal but I did wonder about ripping out all the
> > > code to defend against removal of the earlycon and simply keep the
> > > earlycon around until a new kgdbio handler is installed.
> >
> > It took me a little while, but I finally see what you're saying.
> > You're saying that we keep calling into the boot console even though
> > it's no longer in the list of consoles. Then we temporarily disable
> > the boot console's exit routine until kgdb_earlycon() is done. (side
> > note: the exit routine was recently added and probably most consoles
> > don't use it).
>
> Certainly none of the devices with a read() method have an exit().
>
>
> > OK, that doesn't seem totally insane. It actually works OK for you?
>
> I tested on qemu/x86-64 (8250) and qemu/arm64 (pl011). In both cases it
> works well.
>
>
> > It's probably at least worth a warning in the log if we detect that
> > we're using the boot console and it's not in the console list anymore.
> > Then if kgdb starts misbehaving someone might have a clue.
>
> Yes, I'll add that.
>
>
> > If your solution is OK we might also want to remove the call to
> > cleanup_earlycon_if_invalid() in configure_kgdboc() too.
>
> That's what I thought, yes. Although it might be best to handle that
> by ripping it out of the original patch set.

I've incorporated ideas from my patch and yours into a v4 patchset of
the original series. Note that I didn't include your earlycon
deferral patchset [1] in my series which means it'll need to be
rebased. Hopefully this is OK since I think the rebase will be easy,
but yell if you want an extra pair of eyes on it.

[1] https://lore.kernel.org/r/[email protected]


-Doug