2024-03-17 21:28:03

by Nam Cao

[permalink] [raw]
Subject: Re: Boot hang with SiFive PLIC when routing I2C-HID level-triggered interrupts

Cc: HID folks

On 14/Mar/2024 Eva Kurchatova wrote:
> If an I2C-HID controller level-triggered IRQ line is routed directly as
> a PLIC IRQ, and we spam input early enough in kernel boot process
> (Somewhere between initializing NET, ALSA subsystems and before
> i2c-hid driver init), then there is a chance of kernel locking up
> completely and not going any further.
>
> There are no kernel messages printed with all the IRQ, task hang
> debugging enabled - other than (sometimes) it reports sched RT
> throttling after a few seconds. Basic timer interrupt handling is
> intact - fbdev tty cursor is still blinking.
>
> It appears that in such a case the I2C-HID IRQ line is raised; PLIC
> notifies the (single) boot system hart, kernel claims the IRQ and
> immediately completes it by writing to CLAIM/COMPLETE register.
> No access to the I2C controller (OpenCores) or I2C-HID registers
> is made, so the HID report is never consumed and IRQ line stays
> raised forever. The kernel endlessly claims & completes IRQs
> without doing any work with the device. It doesn't always end up this
> way; sometimes boot process completes and there are no signs of
> interrupt storm or stuck IRQ processing afterwards.

It seems I2C HID's interrupt handler (i2c_hid_irq) returns immediately if
I2C_HID_READ_PENDING is set. This flag is supposed to be cleared in
i2c_hid_xfer(), but since the (threaded) interrupt handler runs at higher
priority, the flag is never cleared. So we have a lock-up: interrupt
handler won't do anything unless the flag is cleared, but the clearing of
this flag is done in a lower priority task which never gets scheduled while
the interrupt handler is active.

There is RT throttling to prevent RT tasks from locking up the system like
this. I don't know much about scheduling stuffs, so I am not really sure
why RT throttling does not work. I think because RT throttling triggers
when RT tasks take too much CPU time, but in this case hard interrupt
handlers take lots of CPU time too (~50% according to my measurement), so
RT throttling doesn't trigger often enough (in this case, it triggers once
and never again). Again, I don't know much about scheduler so I may be
talking nonsense here.

The flag I2C_HID_READ_PENDING seems to be used to make sure that only 1
I2C operation can happen at a time. But this seems pointless, because I2C
subsystem already takes care of this. So I think we can just remove it.

Can you give the below patch a try?

diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
index 2735cd585af0..799ad0ef9c4a 100644
--- a/drivers/hid/i2c-hid/i2c-hid-core.c
+++ b/drivers/hid/i2c-hid/i2c-hid-core.c
@@ -64,7 +64,6 @@
/* flags */
#define I2C_HID_STARTED 0
#define I2C_HID_RESET_PENDING 1
-#define I2C_HID_READ_PENDING 2

#define I2C_HID_PWR_ON 0x00
#define I2C_HID_PWR_SLEEP 0x01
@@ -190,15 +189,10 @@ static int i2c_hid_xfer(struct i2c_hid *ihid,
msgs[n].len = recv_len;
msgs[n].buf = recv_buf;
n++;
-
- set_bit(I2C_HID_READ_PENDING, &ihid->flags);
}

ret = i2c_transfer(client->adapter, msgs, n);

- if (recv_len)
- clear_bit(I2C_HID_READ_PENDING, &ihid->flags);
-
if (ret != n)
return ret < 0 ? ret : -EIO;

@@ -566,9 +560,6 @@ static irqreturn_t i2c_hid_irq(int irq, void *dev_id)
{
struct i2c_hid *ihid = dev_id;

- if (test_bit(I2C_HID_READ_PENDING, &ihid->flags))
- return IRQ_HANDLED;
-
i2c_hid_get_input(ihid);

return IRQ_HANDLED;


2024-03-18 08:48:50

by Eva Kurchatova

[permalink] [raw]
Subject: Re: Boot hang with SiFive PLIC when routing I2C-HID level-triggered interrupts

On Sun, Mar 17, 2024 at 11:27 PM Nam Cao <[email protected]> wrote:
>
> Cc: HID folks
>
> On 14/Mar/2024 Eva Kurchatova wrote:
> > If an I2C-HID controller level-triggered IRQ line is routed directly as
> > a PLIC IRQ, and we spam input early enough in kernel boot process
> > (Somewhere between initializing NET, ALSA subsystems and before
> > i2c-hid driver init), then there is a chance of kernel locking up
> > completely and not going any further.
> >
> > There are no kernel messages printed with all the IRQ, task hang
> > debugging enabled - other than (sometimes) it reports sched RT
> > throttling after a few seconds. Basic timer interrupt handling is
> > intact - fbdev tty cursor is still blinking.
> >
> > It appears that in such a case the I2C-HID IRQ line is raised; PLIC
> > notifies the (single) boot system hart, kernel claims the IRQ and
> > immediately completes it by writing to CLAIM/COMPLETE register.
> > No access to the I2C controller (OpenCores) or I2C-HID registers
> > is made, so the HID report is never consumed and IRQ line stays
> > raised forever. The kernel endlessly claims & completes IRQs
> > without doing any work with the device. It doesn't always end up this
> > way; sometimes boot process completes and there are no signs of
> > interrupt storm or stuck IRQ processing afterwards.
>
> It seems I2C HID's interrupt handler (i2c_hid_irq) returns immediately if
> I2C_HID_READ_PENDING is set. This flag is supposed to be cleared in
> i2c_hid_xfer(), but since the (threaded) interrupt handler runs at higher
> priority, the flag is never cleared. So we have a lock-up: interrupt
> handler won't do anything unless the flag is cleared, but the clearing of
> this flag is done in a lower priority task which never gets scheduled while
> the interrupt handler is active.
>
> There is RT throttling to prevent RT tasks from locking up the system like
> this. I don't know much about scheduling stuffs, so I am not really sure
> why RT throttling does not work. I think because RT throttling triggers
> when RT tasks take too much CPU time, but in this case hard interrupt
> handlers take lots of CPU time too (~50% according to my measurement), so
> RT throttling doesn't trigger often enough (in this case, it triggers once
> and never again). Again, I don't know much about scheduler so I may be
> talking nonsense here.
>
> The flag I2C_HID_READ_PENDING seems to be used to make sure that only 1
> I2C operation can happen at a time. But this seems pointless, because I2C
> subsystem already takes care of this. So I think we can just remove it.
>
> Can you give the below patch a try?
>
> diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
> index 2735cd585af0..799ad0ef9c4a 100644
> --- a/drivers/hid/i2c-hid/i2c-hid-core.c
> +++ b/drivers/hid/i2c-hid/i2c-hid-core.c
> @@ -64,7 +64,6 @@
> /* flags */
> #define I2C_HID_STARTED 0
> #define I2C_HID_RESET_PENDING 1
> -#define I2C_HID_READ_PENDING 2
>
> #define I2C_HID_PWR_ON 0x00
> #define I2C_HID_PWR_SLEEP 0x01
> @@ -190,15 +189,10 @@ static int i2c_hid_xfer(struct i2c_hid *ihid,
> msgs[n].len = recv_len;
> msgs[n].buf = recv_buf;
> n++;
> -
> - set_bit(I2C_HID_READ_PENDING, &ihid->flags);
> }
>
> ret = i2c_transfer(client->adapter, msgs, n);
>
> - if (recv_len)
> - clear_bit(I2C_HID_READ_PENDING, &ihid->flags);
> -
> if (ret != n)
> return ret < 0 ? ret : -EIO;
>
> @@ -566,9 +560,6 @@ static irqreturn_t i2c_hid_irq(int irq, void *dev_id)
> {
> struct i2c_hid *ihid = dev_id;
>
> - if (test_bit(I2C_HID_READ_PENDING, &ihid->flags))
> - return IRQ_HANDLED;
> -
> i2c_hid_get_input(ihid);
>
> return IRQ_HANDLED;

Patch applied cleanly on top of 6.7.9, builds OK (No warns, etc).

This indeed fixes the hang completely.

I modified RVVM to send millions of keystroke events per second,
and put `reboot` as a service hook in the guest. It has been continuously
rebooting without a hitch for the last 30 minutes or so (Full boot takes
around 2 seconds), whereas unpatched 6.7.9 hangs almost immediately
in such conditions (Reverted your patch & rebuilt to be sure).

Thank you very much for this! Hope to see it upstreamed soon

2024-03-18 09:01:25

by Nam Cao

[permalink] [raw]
Subject: Re: Boot hang with SiFive PLIC when routing I2C-HID level-triggered interrupts

On 18/Mar/2024 Eva Kurchatova wrote:
> On Sun, Mar 17, 2024 at 11:27 PM Nam Cao <[email protected]> wrote:
> > It seems I2C HID's interrupt handler (i2c_hid_irq) returns immediately if
> > I2C_HID_READ_PENDING is set. This flag is supposed to be cleared in
> > i2c_hid_xfer(), but since the (threaded) interrupt handler runs at higher
> > priority, the flag is never cleared. So we have a lock-up: interrupt
> > handler won't do anything unless the flag is cleared, but the clearing of
> > this flag is done in a lower priority task which never gets scheduled while
> > the interrupt handler is active.
> >
> > There is RT throttling to prevent RT tasks from locking up the system like
> > this. I don't know much about scheduling stuffs, so I am not really sure
> > why RT throttling does not work. I think because RT throttling triggers
> > when RT tasks take too much CPU time, but in this case hard interrupt
> > handlers take lots of CPU time too (~50% according to my measurement), so
> > RT throttling doesn't trigger often enough (in this case, it triggers once
> > and never again). Again, I don't know much about scheduler so I may be
> > talking nonsense here.
> >
> > The flag I2C_HID_READ_PENDING seems to be used to make sure that only 1
> > I2C operation can happen at a time. But this seems pointless, because I2C
> > subsystem already takes care of this. So I think we can just remove it.
> >
> > Can you give the below patch a try?
> >
> > diff --git a/drivers/hid/i2c-hid/i2c-hid-core.c b/drivers/hid/i2c-hid/i2c-hid-core.c
> > index 2735cd585af0..799ad0ef9c4a 100644
> > --- a/drivers/hid/i2c-hid/i2c-hid-core.c
> > +++ b/drivers/hid/i2c-hid/i2c-hid-core.c
> > @@ -64,7 +64,6 @@
> > /* flags */
> > #define I2C_HID_STARTED 0
> > #define I2C_HID_RESET_PENDING 1
> > -#define I2C_HID_READ_PENDING 2
> >
> > #define I2C_HID_PWR_ON 0x00
> > #define I2C_HID_PWR_SLEEP 0x01
> > @@ -190,15 +189,10 @@ static int i2c_hid_xfer(struct i2c_hid *ihid,
> > msgs[n].len = recv_len;
> > msgs[n].buf = recv_buf;
> > n++;
> > -
> > - set_bit(I2C_HID_READ_PENDING, &ihid->flags);
> > }
> >
> > ret = i2c_transfer(client->adapter, msgs, n);
> >
> > - if (recv_len)
> > - clear_bit(I2C_HID_READ_PENDING, &ihid->flags);
> > -
> > if (ret != n)
> > return ret < 0 ? ret : -EIO;
> >
> > @@ -566,9 +560,6 @@ static irqreturn_t i2c_hid_irq(int irq, void *dev_id)
> > {
> > struct i2c_hid *ihid = dev_id;
> >
> > - if (test_bit(I2C_HID_READ_PENDING, &ihid->flags))
> > - return IRQ_HANDLED;
> > -
> > i2c_hid_get_input(ihid);
> >
> > return IRQ_HANDLED;
>
> Patch applied cleanly on top of 6.7.9, builds OK (No warns, etc).
>
> This indeed fixes the hang completely.

Nice! I assume I can add
Reported-and-tested-by: Eva Kurchatova <[email protected]>
to the patch?

> I modified RVVM to send millions of keystroke events per second,
> and put `reboot` as a service hook in the guest. It has been continuously
> rebooting without a hitch for the last 30 minutes or so (Full boot takes
> around 2 seconds), whereas unpatched 6.7.9 hangs almost immediately
> in such conditions (Reverted your patch & rebuilt to be sure).
>
> Thank you very much for this! Hope to see it upstreamed soon

Thank you for the report, your investigation helped a lot.

I am still confused why RT throttling doesn't unstuck the kernel in this
case. I will consult some people and investigate more on this. But I think
this patch is good on its own, so I will send a proper patch shortly.

Best regards,
Nam

2024-03-18 10:25:55

by Eva Kurchatova

[permalink] [raw]
Subject: Re: Boot hang with SiFive PLIC when routing I2C-HID level-triggered interrupts

On Mon, Mar 18, 2024 at 11:01 AM Nam Cao <[email protected]> wrote:
> Nice! I assume I can add
> Reported-and-tested-by: Eva Kurchatova <[email protected]>
> to the patch?
>

Yes.
You can also add link to upstream RVVM repo if anyone is interested in
reproduction. This RVVM patch applied to 0.6 makes a keystroke storm:
(window_update() is called on each display redraw and has access to hid_kb)

diff --git a/src/devices/fb_window.c b/src/devices/fb_window.c
index f170e2d..17e2519 100644
--- a/src/devices/fb_window.c
+++ b/src/devices/fb_window.c
@@ -77,6 +77,11 @@ static const uint8_t rvvm_logo_pix[] = {

static void window_update(rvvm_mmio_dev_t* device)
{
+ fb_window_t* win = device->data;
+ for (size_t i=0; i<100000; ++i) {
+ hid_keyboard_press(win->keyboard, HID_KEY_LEFTCTRL);
+ hid_keyboard_release(win->keyboard, HID_KEY_LEFTCTRL);
+ }
fb_window_update((fb_window_t*)device->data);
}

> I am still confused why RT throttling doesn't unstuck the kernel in this
> case. I will consult some people and investigate more on this. But I think
> this patch is good on its own, so I will send a proper patch shortly.
>
> Best regards,
> Nam

RT throttling kicked in *very* rarely, in most cases where the unpatched
kernel was actually stuck RT throttling wasn't reported at all.

It didn't hang every time either, so it's possible that RT throttling
helped sometimes, but not enough to always recover from such a loop:
[incoming IRQ]->[IRQ claimed]->[no handling]->[IRQ completion (which
immediately triggers phase 1 again)]

Best regards,
Eva