2023-07-11 02:17:54

by Krister Johansen

[permalink] [raw]
Subject: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff

The ENA adapters on our instances occasionally reset. Once recently
logged a UBSAN failure to console in the process:

UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13
shift exponent 32 is too large for 32-bit type 'unsigned int'
CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117
Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
Workqueue: ena ena_fw_reset_device [ena]
Call Trace:
<TASK>
dump_stack_lvl+0x4a/0x63
dump_stack+0x10/0x16
ubsan_epilogue+0x9/0x36
__ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e
? __const_udelay+0x43/0x50
ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena]
wait_for_reset_state+0x54/0xa0 [ena]
ena_com_dev_reset+0xc8/0x110 [ena]
ena_down+0x3fe/0x480 [ena]
ena_destroy_device+0xeb/0xf0 [ena]
ena_fw_reset_device+0x30/0x50 [ena]
process_one_work+0x22b/0x3d0
worker_thread+0x4d/0x3f0
? process_one_work+0x3d0/0x3d0
kthread+0x12a/0x150
? set_kthread_struct+0x50/0x50
ret_from_fork+0x22/0x30
</TASK>

Apparently, the reset delays are getting so large they can trigger a
UBSAN panic.

Looking at the code, the current timeout is capped at 5000us. Using a
base value of 100us, the current code will overflow after (1<<29). Even
at values before 32, this function wraps around, perhaps
unintentionally.

Cap the value of the exponent used for this backoff at (1<<16) which is
larger than currently necessary, but large enough to support bigger
values in the future.

Cc: [email protected]
Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time")
Signed-off-by: Krister Johansen <[email protected]>
---
drivers/net/ethernet/amazon/ena/ena_com.c | 3 +++
1 file changed, 3 insertions(+)

diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c
index 451c3a1b6255..633b321d7fdd 100644
--- a/drivers/net/ethernet/amazon/ena/ena_com.c
+++ b/drivers/net/ethernet/amazon/ena/ena_com.c
@@ -35,6 +35,8 @@

#define ENA_REGS_ADMIN_INTR_MASK 1

+#define ENA_MAX_BACKOFF_DELAY_EXP 16U
+
#define ENA_MIN_ADMIN_POLL_US 100

#define ENA_MAX_ADMIN_POLL_US 5000
@@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct ena_com_admin_queue *admin_queue,

static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us)
{
+ exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
delay_us = min_t(u32, delay_us * (1U << exp), ENA_MAX_ADMIN_POLL_US);
usleep_range(delay_us, 2 * delay_us);
--
2.25.1



2023-07-11 07:50:24

by Leon Romanovsky

[permalink] [raw]
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff

On Mon, Jul 10, 2023 at 06:36:21PM -0700, Krister Johansen wrote:
> The ENA adapters on our instances occasionally reset. Once recently
> logged a UBSAN failure to console in the process:
>
> UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13
> shift exponent 32 is too large for 32-bit type 'unsigned int'
> CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117
> Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
> Workqueue: ena ena_fw_reset_device [ena]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x4a/0x63
> dump_stack+0x10/0x16
> ubsan_epilogue+0x9/0x36
> __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e
> ? __const_udelay+0x43/0x50
> ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena]
> wait_for_reset_state+0x54/0xa0 [ena]
> ena_com_dev_reset+0xc8/0x110 [ena]
> ena_down+0x3fe/0x480 [ena]
> ena_destroy_device+0xeb/0xf0 [ena]
> ena_fw_reset_device+0x30/0x50 [ena]
> process_one_work+0x22b/0x3d0
> worker_thread+0x4d/0x3f0
> ? process_one_work+0x3d0/0x3d0
> kthread+0x12a/0x150
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x22/0x30
> </TASK>
>
> Apparently, the reset delays are getting so large they can trigger a
> UBSAN panic.
>
> Looking at the code, the current timeout is capped at 5000us. Using a
> base value of 100us, the current code will overflow after (1<<29). Even
> at values before 32, this function wraps around, perhaps
> unintentionally.
>
> Cap the value of the exponent used for this backoff at (1<<16) which is
> larger than currently necessary, but large enough to support bigger
> values in the future.
>
> Cc: [email protected]
> Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time")
> Signed-off-by: Krister Johansen <[email protected]>
> ---
> drivers/net/ethernet/amazon/ena/ena_com.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c b/drivers/net/ethernet/amazon/ena/ena_com.c
> index 451c3a1b6255..633b321d7fdd 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> @@ -35,6 +35,8 @@
>
> #define ENA_REGS_ADMIN_INTR_MASK 1
>
> +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
> +
> #define ENA_MIN_ADMIN_POLL_US 100
>
> #define ENA_MAX_ADMIN_POLL_US 5000
> @@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct ena_com_admin_queue *admin_queue,
>
> static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us)
> {
> + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);

Thanks,
Reviewed-by: Leon Romanovsky <[email protected]>

2023-07-11 19:01:04

by Shay Agroskin

[permalink] [raw]
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff


Krister Johansen <[email protected]> writes:

> The ENA adapters on our instances occasionally reset. Once
> recently
> logged a UBSAN failure to console in the process:
>
> UBSAN: shift-out-of-bounds in
> build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13
> shift exponent 32 is too large for 32-bit type 'unsigned int'
> CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not
> tainted 5.15.117
> Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
> Workqueue: ena ena_fw_reset_device [ena]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x4a/0x63
> dump_stack+0x10/0x16
> ubsan_epilogue+0x9/0x36
> __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e
> ? __const_udelay+0x43/0x50
> ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena]
> wait_for_reset_state+0x54/0xa0 [ena]
> ena_com_dev_reset+0xc8/0x110 [ena]
> ena_down+0x3fe/0x480 [ena]
> ena_destroy_device+0xeb/0xf0 [ena]
> ena_fw_reset_device+0x30/0x50 [ena]
> process_one_work+0x22b/0x3d0
> worker_thread+0x4d/0x3f0
> ? process_one_work+0x3d0/0x3d0
> kthread+0x12a/0x150
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x22/0x30
> </TASK>
>
> Apparently, the reset delays are getting so large they can
> trigger a
> UBSAN panic.
>
> Looking at the code, the current timeout is capped at 5000us.
> Using a
> base value of 100us, the current code will overflow after
> (1<<29). Even
> at values before 32, this function wraps around, perhaps
> unintentionally.
>
> Cap the value of the exponent used for this backoff at (1<<16)
> which is
> larger than currently necessary, but large enough to support
> bigger
> values in the future.
>
> Cc: [email protected]
> Fixes: 4bb7f4cf60e3 ("net: ena: reduce driver load time")
> Signed-off-by: Krister Johansen <[email protected]>
> ---
> drivers/net/ethernet/amazon/ena/ena_com.c | 3 +++
> 1 file changed, 3 insertions(+)
>
> diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c
> b/drivers/net/ethernet/amazon/ena/ena_com.c
> index 451c3a1b6255..633b321d7fdd 100644
> --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> @@ -35,6 +35,8 @@
>
> #define ENA_REGS_ADMIN_INTR_MASK 1
>
> +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
> +
> #define ENA_MIN_ADMIN_POLL_US 100
>
> #define ENA_MAX_ADMIN_POLL_US 5000
> @@ -536,6 +538,7 @@ static int
> ena_com_comp_status_to_errno(struct ena_com_admin_queue
> *admin_queue,
>
> static void ena_delay_exponential_backoff_us(u32 exp, u32
> delay_us)
> {
> + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
> delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
> delay_us = min_t(u32, delay_us * (1U << exp),
> ENA_MAX_ADMIN_POLL_US);
> usleep_range(delay_us, 2 * delay_us);

Hi, thanks for submitting this patch (:

Going over the logic here, the driver sleeps for `delay_us`
micro-seconds in each iteration that this function gets called.

For an exp = 14 it'd sleep (I added units notation)
delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) /
(1000000) s = 1.6 s

For an exp = 15 it'd sleep
(10 * (2 ^ 15)) / (1000000) = 3.2s

To even get close to an overflow value, say exp=29 the driver
would sleep in a single iteration
53687 s = 14.9 hours.

The driver should stop trying to get a response from the device
after a timeout period received from the device which is 3 seconds
by default.

The point being, it seems very unlikely to hit this overflow. Did
you experience it or was the issue discovered by a static analyzer
?

Regarding the patch itself, I don't mind adding it since exp=16
limit should be more than enough to wait for the device's
response.
Reviewed-by: Shay Agroskin <[email protected]>

Thanks,
Shay

2023-07-11 23:03:43

by Krister Johansen

[permalink] [raw]
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff

On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote:
>
> Krister Johansen <[email protected]> writes:
>
> > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c
> > b/drivers/net/ethernet/amazon/ena/ena_com.c
> > index 451c3a1b6255..633b321d7fdd 100644
> > --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> > @@ -35,6 +35,8 @@
> > #define ENA_REGS_ADMIN_INTR_MASK 1
> > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
> > +
> > #define ENA_MIN_ADMIN_POLL_US 100
> > #define ENA_MAX_ADMIN_POLL_US 5000
> > @@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct
> > ena_com_admin_queue *admin_queue,
> > static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us)
> > {
> > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
> > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
> > delay_us = min_t(u32, delay_us * (1U << exp), ENA_MAX_ADMIN_POLL_US);
> > usleep_range(delay_us, 2 * delay_us);
>
> Hi, thanks for submitting this patch (:

Absolutely; thanks for the review!

> Going over the logic here, the driver sleeps for `delay_us` micro-seconds in
> each iteration that this function gets called.
>
> For an exp = 14 it'd sleep (I added units notation)
> delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) / (1000000) s
> = 1.6 s
>
> For an exp = 15 it'd sleep
> (10 * (2 ^ 15)) / (1000000) = 3.2s
>
> To even get close to an overflow value, say exp=29 the driver would sleep in
> a single iteration
> 53687 s = 14.9 hours.
>
> The driver should stop trying to get a response from the device after a
> timeout period received from the device which is 3 seconds by default.
>
> The point being, it seems very unlikely to hit this overflow. Did you
> experience it or was the issue discovered by a static analyzer ?

No, no use of fuzzing or static analysis. This was hit on a production
instance that was having ENA trouble.

I'm apparently reading the code differently. I thought this line:

> > delay_us = min_t(u32, delay_us * (1U << exp), ENA_MAX_ADMIN_POLL_US);

Was going to cap that delay_us at (delay_us * (1U << exp)) or
5000us, whichever is smaller. By that measure, if delay_us is 100 and
ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped after
exp = 6, correct? By my estimate, that puts it at between 160ms and
320ms of sleeping before one could hit this problem.

I went and pulled the logs out of the archive and have the following
timeline. This is seconds from boot as reported by dmesg:

11244.226583 - ena warns TX not completed on time, 10112000 usecs since
last napi execution, missing tx timeout val of 5000 msec

11245.190453 - netdev watchdog fires

11245.190781 - ena records Transmit timeout
11245.250739 - ena records Trigger reset on

11246.812620 - UBSAN message to console

11248.590441 - ena reports Reset inidication didn't turn off
11250.633545 - ena reports failure to reset device
12013.529338 - last logline before new boot

While the difference between the panic and the trigger reset is more
than 320ms, it is definitely on the order of seconds instead of hours.

> Regarding the patch itself, I don't mind adding it since exp=16 limit should
> be more than enough to wait for the device's response.
> Reviewed-by: Shay Agroskin <[email protected]>

Thanks,

-K

2023-07-12 23:04:10

by patchwork-bot+netdevbpf

[permalink] [raw]
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff

Hello:

This patch was applied to netdev/net.git (main)
by Jakub Kicinski <[email protected]>:

On Mon, 10 Jul 2023 18:36:21 -0700 you wrote:
> The ENA adapters on our instances occasionally reset. Once recently
> logged a UBSAN failure to console in the process:
>
> UBSAN: shift-out-of-bounds in build/linux/drivers/net/ethernet/amazon/ena/ena_com.c:540:13
> shift exponent 32 is too large for 32-bit type 'unsigned int'
> CPU: 28 PID: 70012 Comm: kworker/u72:2 Kdump: loaded not tainted 5.15.117
> Hardware name: Amazon EC2 c5d.9xlarge/, BIOS 1.0 10/16/2017
> Workqueue: ena ena_fw_reset_device [ena]
> Call Trace:
> <TASK>
> dump_stack_lvl+0x4a/0x63
> dump_stack+0x10/0x16
> ubsan_epilogue+0x9/0x36
> __ubsan_handle_shift_out_of_bounds.cold+0x61/0x10e
> ? __const_udelay+0x43/0x50
> ena_delay_exponential_backoff_us.cold+0x16/0x1e [ena]
> wait_for_reset_state+0x54/0xa0 [ena]
> ena_com_dev_reset+0xc8/0x110 [ena]
> ena_down+0x3fe/0x480 [ena]
> ena_destroy_device+0xeb/0xf0 [ena]
> ena_fw_reset_device+0x30/0x50 [ena]
> process_one_work+0x22b/0x3d0
> worker_thread+0x4d/0x3f0
> ? process_one_work+0x3d0/0x3d0
> kthread+0x12a/0x150
> ? set_kthread_struct+0x50/0x50
> ret_from_fork+0x22/0x30
> </TASK>
>
> [...]

Here is the summary with links:
- [net] net: ena: fix shift-out-of-bounds in exponential backoff
https://git.kernel.org/netdev/net/c/1e9cb763e9ba

You are awesome, thank you!
--
Deet-doot-dot, I am a bot.
https://korg.docs.kernel.org/patchwork/pwbot.html



2023-07-13 07:59:18

by Shay Agroskin

[permalink] [raw]
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff


Krister Johansen <[email protected]> writes:

>
> On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote:
>>
>> Krister Johansen <[email protected]> writes:
>>
>> > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c
>> > b/drivers/net/ethernet/amazon/ena/ena_com.c
>> > index 451c3a1b6255..633b321d7fdd 100644
>> > --- a/drivers/net/ethernet/amazon/ena/ena_com.c
>> > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
>> > @@ -35,6 +35,8 @@
>> > #define ENA_REGS_ADMIN_INTR_MASK 1
>> > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
>> > +
>> > #define ENA_MIN_ADMIN_POLL_US 100
>> > #define ENA_MAX_ADMIN_POLL_US 5000
>> > @@ -536,6 +538,7 @@ static int
>> > ena_com_comp_status_to_errno(struct
>> > ena_com_admin_queue *admin_queue,
>> > static void ena_delay_exponential_backoff_us(u32 exp, u32
>> > delay_us)
>> > {
>> > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
>> > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
>> > delay_us = min_t(u32, delay_us * (1U << exp),
>> > ENA_MAX_ADMIN_POLL_US);
>> > usleep_range(delay_us, 2 * delay_us);
>>
>> Hi, thanks for submitting this patch (:
>
> Absolutely; thanks for the review!
>
>> Going over the logic here, the driver sleeps for `delay_us`
>> micro-seconds in
>> each iteration that this function gets called.
>>
>> For an exp = 14 it'd sleep (I added units notation)
>> delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) /
>> (1000000) s
>> = 1.6 s
>>
>> For an exp = 15 it'd sleep
>> (10 * (2 ^ 15)) / (1000000) = 3.2s
>>
>> To even get close to an overflow value, say exp=29 the driver
>> would sleep in
>> a single iteration
>> 53687 s = 14.9 hours.
>>
>> The driver should stop trying to get a response from the device
>> after a
>> timeout period received from the device which is 3 seconds by
>> default.
>>
>> The point being, it seems very unlikely to hit this
>> overflow. Did you
>> experience it or was the issue discovered by a static analyzer
>> ?
>
> No, no use of fuzzing or static analysis. This was hit on a
> production
> instance that was having ENA trouble.
>
> I'm apparently reading the code differently. I thought this
> line:
>
>> > delay_us = min_t(u32, delay_us * (1U << exp),
>> > ENA_MAX_ADMIN_POLL_US);
>
> Was going to cap that delay_us at (delay_us * (1U << exp)) or
> 5000us, whichever is smaller. By that measure, if delay_us is
> 100 and
> ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped
> after
> exp = 6, correct? By my estimate, that puts it at between 160ms
> and
> 320ms of sleeping before one could hit this problem.
>
> I went and pulled the logs out of the archive and have the
> following
> timeline. This is seconds from boot as reported by dmesg:
>
> 11244.226583 - ena warns TX not completed on time, 10112000
> usecs since
> last napi execution, missing tx timeout val of 5000 msec
>
> 11245.190453 - netdev watchdog fires
>
> 11245.190781 - ena records Transmit timeout
> 11245.250739 - ena records Trigger reset on
>
> 11246.812620 - UBSAN message to console
>
> 11248.590441 - ena reports Reset inidication didn't turn off
> 11250.633545 - ena reports failure to reset device
> 12013.529338 - last logline before new boot
>
> While the difference between the panic and the trigger reset is
> more
> than 320ms, it is definitely on the order of seconds instead of
> hours.
>

Yup you're right. I was so entangled in my exponent calculations
that I completely missed the min_t expression there.

That's quite an awkward design to be honest, I hope to submit a
re-write for it in one of the releases.

Thanks again for the work you've put into it

2023-07-13 15:58:18

by David Laight

[permalink] [raw]
Subject: RE: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff

From: Leon Romanovsky
> Sent: 11 July 2023 08:26
...
> > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
> > +
> > #define ENA_MIN_ADMIN_POLL_US 100
> >
> > #define ENA_MAX_ADMIN_POLL_US 5000
> > @@ -536,6 +538,7 @@ static int ena_com_comp_status_to_errno(struct ena_com_admin_queue *admin_queue,
> >
> > static void ena_delay_exponential_backoff_us(u32 exp, u32 delay_us)
> > {
> > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);

This shouldn't need to be a min_t()

David

-
Registered Address Lakeside, Bramley Road, Mount Farm, Milton Keynes, MK1 1PT, UK
Registration No: 1397386 (Wales)


2023-07-14 00:19:01

by Krister Johansen

[permalink] [raw]
Subject: Re: [PATCH net] net: ena: fix shift-out-of-bounds in exponential backoff

On Thu, Jul 13, 2023 at 10:46:55AM +0300, Shay Agroskin wrote:
>
> Krister Johansen <[email protected]> writes:
>
> >
> > On Tue, Jul 11, 2023 at 08:47:32PM +0300, Shay Agroskin wrote:
> > >
> > > Krister Johansen <[email protected]> writes:
> > >
> > > > diff --git a/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > b/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > index 451c3a1b6255..633b321d7fdd 100644
> > > > --- a/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > +++ b/drivers/net/ethernet/amazon/ena/ena_com.c
> > > > @@ -35,6 +35,8 @@
> > > > #define ENA_REGS_ADMIN_INTR_MASK 1
> > > > +#define ENA_MAX_BACKOFF_DELAY_EXP 16U
> > > > +
> > > > #define ENA_MIN_ADMIN_POLL_US 100
> > > > #define ENA_MAX_ADMIN_POLL_US 5000
> > > > @@ -536,6 +538,7 @@ static int >
> > > ena_com_comp_status_to_errno(struct
> > > > ena_com_admin_queue *admin_queue,
> > > > static void ena_delay_exponential_backoff_us(u32 exp, u32 >
> > > delay_us)
> > > > {
> > > > + exp = min_t(u32, exp, ENA_MAX_BACKOFF_DELAY_EXP);
> > > > delay_us = max_t(u32, ENA_MIN_ADMIN_POLL_US, delay_us);
> > > > delay_us = min_t(u32, delay_us * (1U << exp), >
> > > ENA_MAX_ADMIN_POLL_US);
> > > > usleep_range(delay_us, 2 * delay_us);
> > >
> > > Hi, thanks for submitting this patch (:
> >
> > Absolutely; thanks for the review!
> >
> > > Going over the logic here, the driver sleeps for `delay_us`
> > > micro-seconds in
> > > each iteration that this function gets called.
> > >
> > > For an exp = 14 it'd sleep (I added units notation)
> > > delay_us * (2 ^ exp) us = 100 * (2 ^ 14) us = (10 * (2 ^ 14)) /
> > > (1000000) s
> > > = 1.6 s
> > >
> > > For an exp = 15 it'd sleep
> > > (10 * (2 ^ 15)) / (1000000) = 3.2s
> > >
> > > To even get close to an overflow value, say exp=29 the driver would
> > > sleep in
> > > a single iteration
> > > 53687 s = 14.9 hours.
> > >
> > > The driver should stop trying to get a response from the device
> > > after a
> > > timeout period received from the device which is 3 seconds by
> > > default.
> > >
> > > The point being, it seems very unlikely to hit this overflow. Did
> > > you
> > > experience it or was the issue discovered by a static analyzer ?
> >
> > No, no use of fuzzing or static analysis. This was hit on a production
> > instance that was having ENA trouble.
> >
> > I'm apparently reading the code differently. I thought this line:
> >
> > > > delay_us = min_t(u32, delay_us * (1U << exp), >
> > > ENA_MAX_ADMIN_POLL_US);
> >
> > Was going to cap that delay_us at (delay_us * (1U << exp)) or
> > 5000us, whichever is smaller. By that measure, if delay_us is 100 and
> > ENA_MAX_ADMIN_POLL_US is 5000, this should start getting capped after
> > exp = 6, correct? By my estimate, that puts it at between 160ms and
> > 320ms of sleeping before one could hit this problem.
> >
> > I went and pulled the logs out of the archive and have the following
> > timeline. This is seconds from boot as reported by dmesg:
> >
> > 11244.226583 - ena warns TX not completed on time, 10112000 usecs
> > since
> > last napi execution, missing tx timeout val of 5000 msec
> >
> > 11245.190453 - netdev watchdog fires
> >
> > 11245.190781 - ena records Transmit timeout
> > 11245.250739 - ena records Trigger reset on
> >
> > 11246.812620 - UBSAN message to console
> >
> > 11248.590441 - ena reports Reset inidication didn't turn off
> > 11250.633545 - ena reports failure to reset device
> > 12013.529338 - last logline before new boot
> >
> > While the difference between the panic and the trigger reset is more
> > than 320ms, it is definitely on the order of seconds instead of hours.
> >
>
> Yup you're right. I was so entangled in my exponent calculations that I
> completely missed the min_t expression there.
>
> That's quite an awkward design to be honest, I hope to submit a re-write for
> it in one of the releases.
>
> Thanks again for the work you've put into it

Totally welcome. I appreciate the speedy reviews from you and Leon.

Thanks again,

-K