2022-01-17 17:13:52

by Ziyang Xuan (William)

[permalink] [raw]
Subject: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

When receive a FF, the current code logic does not consider the real
so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will
make so->rx accessed by multiple receiving processes concurrently.

The following syz problem is one of the scenarios. so->rx.len is
changed by isotp_rcv_ff() during isotp_rcv_cf(), so->rx.len equals
0 before alloc_skb() and equals 4096 after alloc_skb(). That will
trigger skb_over_panic() in skb_put().

=======================================================
CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc8-syzkaller #0
RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:113
Call Trace:
<TASK>
skb_over_panic net/core/skbuff.c:118 [inline]
skb_put.cold+0x24/0x24 net/core/skbuff.c:1990
isotp_rcv_cf net/can/isotp.c:570 [inline]
isotp_rcv+0xa38/0x1e30 net/can/isotp.c:668
deliver net/can/af_can.c:574 [inline]
can_rcv_filter+0x445/0x8d0 net/can/af_can.c:635
can_receive+0x31d/0x580 net/can/af_can.c:665
can_rcv+0x120/0x1c0 net/can/af_can.c:696
__netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5465
__netif_receive_skb+0x24/0x1b0 net/core/dev.c:5579

Check so->rx.state equals ISOTP_IDLE firstly in isotp_rcv_ff().
Make sure so->rx idle when receive another new packet. And set
so->rx.state to ISOTP_IDLE after whole packet being received.

Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol")
Reported-by: [email protected]
Signed-off-by: Ziyang Xuan <[email protected]>
---
net/can/isotp.c | 28 +++++++++++++++++-----------
1 file changed, 17 insertions(+), 11 deletions(-)

diff --git a/net/can/isotp.c b/net/can/isotp.c
index df6968b28bf4..a4b174f860f3 100644
--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -443,8 +443,10 @@ static int isotp_rcv_ff(struct sock *sk, struct canfd_frame *cf, int ae)
int off;
int ff_pci_sz;

+ if (so->rx.state != ISOTP_IDLE)
+ return 0;
+
hrtimer_cancel(&so->rxtimer);
- so->rx.state = ISOTP_IDLE;

/* get the used sender LL_DL from the (first) CAN frame data length */
so->rx.ll_dl = padlen(cf->len);
@@ -518,8 +520,6 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
so->lastrxcf_tstamp = skb->tstamp;
}

- hrtimer_cancel(&so->rxtimer);
-
/* CFs are never longer than the FF */
if (cf->len > so->rx.ll_dl)
return 1;
@@ -531,15 +531,15 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
return 1;
}

+ hrtimer_cancel(&so->rxtimer);
+
if ((cf->data[ae] & 0x0F) != so->rx.sn) {
/* wrong sn detected - report 'illegal byte sequence' */
sk->sk_err = EILSEQ;
if (!sock_flag(sk, SOCK_DEAD))
sk_error_report(sk);

- /* reset rx state */
- so->rx.state = ISOTP_IDLE;
- return 1;
+ goto err_out;
}
so->rx.sn++;
so->rx.sn %= 16;
@@ -551,21 +551,18 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
}

if (so->rx.idx >= so->rx.len) {
- /* we are done */
- so->rx.state = ISOTP_IDLE;
-
if ((so->opt.flags & ISOTP_CHECK_PADDING) &&
check_pad(so, cf, i + 1, so->opt.rxpad_content)) {
/* malformed PDU - report 'not a data message' */
sk->sk_err = EBADMSG;
if (!sock_flag(sk, SOCK_DEAD))
sk_error_report(sk);
- return 1;
+ goto err_out;
}

nskb = alloc_skb(so->rx.len, gfp_any());
if (!nskb)
- return 1;
+ goto err_out;

memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
so->rx.len);
@@ -573,6 +570,10 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
nskb->tstamp = skb->tstamp;
nskb->dev = skb->dev;
isotp_rcv_skb(nskb, sk);
+
+ /* we are done */
+ so->rx.state = ISOTP_IDLE;
+
return 0;
}

@@ -591,6 +592,11 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
/* we reached the specified blocksize so->rxfc.bs */
isotp_send_fc(sk, ae, ISOTP_FC_CTS);
return 0;
+
+err_out:
+ /* reset rx state */
+ so->rx.state = ISOTP_IDLE;
+ return 1;
}

static void isotp_rcv(struct sk_buff *skb, void *data)
--
2.25.1


2022-01-19 22:26:38

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

Hi,

the referenced syzbot issue has already been fixed in upstream here:

https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776

("can: isotp: convert struct tpcon::{idx,len} to unsigned int")

Additionally this fix changes some behaviour that is required by the ISO
16765-2 specification (see below).

On 17.01.22 13:01, Ziyang Xuan wrote:
> When receive a FF, the current code logic does not consider the real
> so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will
> make so->rx accessed by multiple receiving processes concurrently.

This is intentionally. "multiple receiving processes" are not allowed
resp. specified by ISO 15765-2.

> The following syz problem is one of the scenarios. so->rx.len is
> changed by isotp_rcv_ff() during isotp_rcv_cf(), so->rx.len equals
> 0 before alloc_skb() and equals 4096 after alloc_skb(). That will
> trigger skb_over_panic() in skb_put().
>
> =======================================================
> CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc8-syzkaller #0
> RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:113
> Call Trace:
> <TASK>
> skb_over_panic net/core/skbuff.c:118 [inline]
> skb_put.cold+0x24/0x24 net/core/skbuff.c:1990
> isotp_rcv_cf net/can/isotp.c:570 [inline]
> isotp_rcv+0xa38/0x1e30 net/can/isotp.c:668
> deliver net/can/af_can.c:574 [inline]
> can_rcv_filter+0x445/0x8d0 net/can/af_can.c:635
> can_receive+0x31d/0x580 net/can/af_can.c:665
> can_rcv+0x120/0x1c0 net/can/af_can.c:696
> __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5465
> __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5579
>
> Check so->rx.state equals ISOTP_IDLE firstly in isotp_rcv_ff().
> Make sure so->rx idle when receive another new packet. And set
> so->rx.state to ISOTP_IDLE after whole packet being received.
>
> Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol")
> Reported-by: [email protected]
> Signed-off-by: Ziyang Xuan <[email protected]>
> ---
> net/can/isotp.c | 28 +++++++++++++++++-----------
> 1 file changed, 17 insertions(+), 11 deletions(-)
>
> diff --git a/net/can/isotp.c b/net/can/isotp.c
> index df6968b28bf4..a4b174f860f3 100644
> --- a/net/can/isotp.c
> +++ b/net/can/isotp.c
> @@ -443,8 +443,10 @@ static int isotp_rcv_ff(struct sock *sk, struct canfd_frame *cf, int ae)
> int off;
> int ff_pci_sz;
>
> + if (so->rx.state != ISOTP_IDLE)
> + return 0;
> +
> hrtimer_cancel(&so->rxtimer);
> - so->rx.state = ISOTP_IDLE;

No matter in which receiving state we receive a first frame (FF) we are
required to start a fresh reception process and/or terminate the current
attempt.

Best regards,
Oliver

>
> /* get the used sender LL_DL from the (first) CAN frame data length */
> so->rx.ll_dl = padlen(cf->len);
> @@ -518,8 +520,6 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
> so->lastrxcf_tstamp = skb->tstamp;
> }
>
> - hrtimer_cancel(&so->rxtimer);
> -
> /* CFs are never longer than the FF */
> if (cf->len > so->rx.ll_dl)
> return 1;
> @@ -531,15 +531,15 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
> return 1;
> }
>
> + hrtimer_cancel(&so->rxtimer);
> +
> if ((cf->data[ae] & 0x0F) != so->rx.sn) {
> /* wrong sn detected - report 'illegal byte sequence' */
> sk->sk_err = EILSEQ;
> if (!sock_flag(sk, SOCK_DEAD))
> sk_error_report(sk);
>
> - /* reset rx state */
> - so->rx.state = ISOTP_IDLE;
> - return 1;
> + goto err_out;
> }
> so->rx.sn++;
> so->rx.sn %= 16;
> @@ -551,21 +551,18 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
> }
>
> if (so->rx.idx >= so->rx.len) {
> - /* we are done */
> - so->rx.state = ISOTP_IDLE;
> -
> if ((so->opt.flags & ISOTP_CHECK_PADDING) &&
> check_pad(so, cf, i + 1, so->opt.rxpad_content)) {
> /* malformed PDU - report 'not a data message' */
> sk->sk_err = EBADMSG;
> if (!sock_flag(sk, SOCK_DEAD))
> sk_error_report(sk);
> - return 1;
> + goto err_out;
> }
>
> nskb = alloc_skb(so->rx.len, gfp_any());
> if (!nskb)
> - return 1;
> + goto err_out;
>
> memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
> so->rx.len);
> @@ -573,6 +570,10 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
> nskb->tstamp = skb->tstamp;
> nskb->dev = skb->dev;
> isotp_rcv_skb(nskb, sk);
> +
> + /* we are done */
> + so->rx.state = ISOTP_IDLE;
> +
> return 0;
> }
>
> @@ -591,6 +592,11 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
> /* we reached the specified blocksize so->rxfc.bs */
> isotp_send_fc(sk, ae, ISOTP_FC_CTS);
> return 0;
> +
> +err_out:
> + /* reset rx state */
> + so->rx.state = ISOTP_IDLE;
> + return 1;
> }
>
> static void isotp_rcv(struct sk_buff *skb, void *data)

2022-01-20 09:10:24

by Ziyang Xuan (William)

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

> Hi,
>
> the referenced syzbot issue has already been fixed in upstream here:
>
> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776
>
> ("can: isotp: convert struct tpcon::{idx,len} to unsigned int")
>
> Additionally this fix changes some behaviour that is required by the ISO 16765-2 specification (see below).
>
> On 17.01.22 13:01, Ziyang Xuan wrote:
>> When receive a FF, the current code logic does not consider the real
>> so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will
>> make so->rx accessed by multiple receiving processes concurrently.
>
> This is intentionally. "multiple receiving processes" are not allowed resp. specified by ISO 15765-2.

Does it can be a network attack?

It receives packets from network, but unexpected packets order make server panic.

>
>> The following syz problem is one of the scenarios. so->rx.len is
>> changed by isotp_rcv_ff() during isotp_rcv_cf(), so->rx.len equals
>> 0 before alloc_skb() and equals 4096 after alloc_skb(). That will
>> trigger skb_over_panic() in skb_put().
>>
>> =======================================================
>> CPU: 1 PID: 19 Comm: ksoftirqd/1 Not tainted 5.16.0-rc8-syzkaller #0
>> RIP: 0010:skb_panic+0x16c/0x16e net/core/skbuff.c:113
>> Call Trace:
>>   <TASK>
>>   skb_over_panic net/core/skbuff.c:118 [inline]
>>   skb_put.cold+0x24/0x24 net/core/skbuff.c:1990
>>   isotp_rcv_cf net/can/isotp.c:570 [inline]
>>   isotp_rcv+0xa38/0x1e30 net/can/isotp.c:668
>>   deliver net/can/af_can.c:574 [inline]
>>   can_rcv_filter+0x445/0x8d0 net/can/af_can.c:635
>>   can_receive+0x31d/0x580 net/can/af_can.c:665
>>   can_rcv+0x120/0x1c0 net/can/af_can.c:696
>>   __netif_receive_skb_one_core+0x114/0x180 net/core/dev.c:5465
>>   __netif_receive_skb+0x24/0x1b0 net/core/dev.c:5579
>>
>> Check so->rx.state equals ISOTP_IDLE firstly in isotp_rcv_ff().
>> Make sure so->rx idle when receive another new packet. And set
>> so->rx.state to ISOTP_IDLE after whole packet being received.
>>
>> Fixes: e057dd3fc20f ("can: add ISO 15765-2:2016 transport protocol")
>> Reported-by: [email protected]
>> Signed-off-by: Ziyang Xuan <[email protected]>
>> ---
>>   net/can/isotp.c | 28 +++++++++++++++++-----------
>>   1 file changed, 17 insertions(+), 11 deletions(-)
>>
>> diff --git a/net/can/isotp.c b/net/can/isotp.c
>> index df6968b28bf4..a4b174f860f3 100644
>> --- a/net/can/isotp.c
>> +++ b/net/can/isotp.c
>> @@ -443,8 +443,10 @@ static int isotp_rcv_ff(struct sock *sk, struct canfd_frame *cf, int ae)
>>       int off;
>>       int ff_pci_sz;
>>   +    if (so->rx.state != ISOTP_IDLE)
>> +        return 0;
>> +
>>       hrtimer_cancel(&so->rxtimer);
>> -    so->rx.state = ISOTP_IDLE;
>
> No matter in which receiving state we receive a first frame (FF) we are required to start a fresh reception process and/or terminate the current attempt.
>
> Best regards,
> Oliver
>
>>         /* get the used sender LL_DL from the (first) CAN frame data length */
>>       so->rx.ll_dl = padlen(cf->len);
>> @@ -518,8 +520,6 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
>>           so->lastrxcf_tstamp = skb->tstamp;
>>       }
>>   -    hrtimer_cancel(&so->rxtimer);
>> -
>>       /* CFs are never longer than the FF */
>>       if (cf->len > so->rx.ll_dl)
>>           return 1;
>> @@ -531,15 +531,15 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
>>               return 1;
>>       }
>>   +    hrtimer_cancel(&so->rxtimer);
>> +
>>       if ((cf->data[ae] & 0x0F) != so->rx.sn) {
>>           /* wrong sn detected - report 'illegal byte sequence' */
>>           sk->sk_err = EILSEQ;
>>           if (!sock_flag(sk, SOCK_DEAD))
>>               sk_error_report(sk);
>>   -        /* reset rx state */
>> -        so->rx.state = ISOTP_IDLE;
>> -        return 1;
>> +        goto err_out;
>>       }
>>       so->rx.sn++;
>>       so->rx.sn %= 16;
>> @@ -551,21 +551,18 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
>>       }
>>         if (so->rx.idx >= so->rx.len) {
>> -        /* we are done */
>> -        so->rx.state = ISOTP_IDLE;
>> -
>>           if ((so->opt.flags & ISOTP_CHECK_PADDING) &&
>>               check_pad(so, cf, i + 1, so->opt.rxpad_content)) {
>>               /* malformed PDU - report 'not a data message' */
>>               sk->sk_err = EBADMSG;
>>               if (!sock_flag(sk, SOCK_DEAD))
>>                   sk_error_report(sk);
>> -            return 1;
>> +            goto err_out;
>>           }
>>             nskb = alloc_skb(so->rx.len, gfp_any());
>>           if (!nskb)
>> -            return 1;
>> +            goto err_out;
>>             memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
>>                  so->rx.len);
>> @@ -573,6 +570,10 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
>>           nskb->tstamp = skb->tstamp;
>>           nskb->dev = skb->dev;
>>           isotp_rcv_skb(nskb, sk);
>> +
>> +        /* we are done */
>> +        so->rx.state = ISOTP_IDLE;
>> +
>>           return 0;
>>       }
>>   @@ -591,6 +592,11 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
>>       /* we reached the specified blocksize so->rxfc.bs */
>>       isotp_send_fc(sk, ae, ISOTP_FC_CTS);
>>       return 0;
>> +
>> +err_out:
>> +    /* reset rx state */
>> +    so->rx.state = ISOTP_IDLE;
>> +    return 1;
>>   }
>>     static void isotp_rcv(struct sk_buff *skb, void *data)
> .

2022-01-20 12:57:01

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem



On 18.01.22 13:46, Ziyang Xuan (William) wrote:
>> Hi,
>>
>> the referenced syzbot issue has already been fixed in upstream here:
>>
>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776
>>
>> ("can: isotp: convert struct tpcon::{idx,len} to unsigned int")
>>
>> Additionally this fix changes some behaviour that is required by the ISO 16765-2 specification (see below).
>>
>> On 17.01.22 13:01, Ziyang Xuan wrote:
>>> When receive a FF, the current code logic does not consider the real
>>> so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will
>>> make so->rx accessed by multiple receiving processes concurrently.
>>
>> This is intentionally. "multiple receiving processes" are not allowed resp. specified by ISO 15765-2.
>
> Does it can be a network attack?

Yes. You can see it like this. The ISO 15765-2 protocol is an unreliable
UDP-like datagram protocol and the session layer takes care about
timeouts and packet lost.

If you want to disturb that protocol you can also send PDUs with
out-of-sync packet counters which will make the receiver drop the
communication attempt.

This is 'CAN-style' ... as usually the bus is very reliable. Security
and reliable communication is done on top of these protocols.

> It receives packets from network, but unexpected packets order make server panic.

Haha, no :-)

Unexpected packets should not make the server panic but only drop the
communication process.

In the case pointed out by syzbot the unsigned 32 bit length information
was stored in a signed 32 bit integer which caused a sanity check to fail.

This is now fixed with the patch from Marc.

Regards,
Oliver

2022-01-21 21:06:42

by Ziyang Xuan (William)

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

> On 18.01.22 13:46, Ziyang Xuan (William) wrote:
>>> Hi,
>>>
>>> the referenced syzbot issue has already been fixed in upstream here:
>>>
>>> https://git.kernel.org/pub/scm/linux/kernel/git/netdev/net.git/commit/?id=5f33a09e769a9da0482f20a6770a342842443776
>>>
>>> ("can: isotp: convert struct tpcon::{idx,len} to unsigned int")
>>>
>>> Additionally this fix changes some behaviour that is required by the ISO 16765-2 specification (see below).
>>>
>>> On 17.01.22 13:01, Ziyang Xuan wrote:
>>>> When receive a FF, the current code logic does not consider the real
>>>> so->rx.state but set so->rx.state to ISOTP_IDLE directly. That will
>>>> make so->rx accessed by multiple receiving processes concurrently.
>>>
>>> This is intentionally. "multiple receiving processes" are not allowed resp. specified by ISO 15765-2.
>>
>> Does it can be a network attack?
>
> Yes. You can see it like this. The ISO 15765-2 protocol is an unreliable UDP-like datagram protocol and the session layer takes care about timeouts and packet lost.
>
> If you want to disturb that protocol you can also send PDUs with out-of-sync packet counters which will make the receiver drop the communication attempt.
>
> This is 'CAN-style' ... as usually the bus is very reliable. Security and reliable communication is done on top of these protocols.
>
>> It receives packets from network, but unexpected packets order make server panic.
>
> Haha, no :-)
>
> Unexpected packets should not make the server panic but only drop the communication process.

I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem.
So I tried to find the root cause for panic and gave my solution.

Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger
tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem.

so->rx.len is 0 after the following logic in isotp_rcv_ff():

/* get the FF_DL */
so->rx.len = (cf->data[ae] & 0x0F) << 8;
so->rx.len += cf->data[ae + 1];

so->rx.len is 4096 after the following logic in isotp_rcv_ff():

/* FF_DL = 0 => get real length from next 4 bytes */
so->rx.len = cf->data[ae + 2] << 24;
so->rx.len += cf->data[ae + 3] << 16;
so->rx.len += cf->data[ae + 4] << 8;
so->rx.len += cf->data[ae + 5];

so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following
skb_put() will trigger panic.

The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf().

[ 150.605776][ C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096
[ 150.611477][ C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL>
[ 150.615837][ C6] ------------[ cut here ]------------
[ 150.617238][ C6] kernel BUG at net/core/skbuff.c:113!

> In the case pointed out by syzbot the unsigned 32 bit length information was stored in a signed 32 bit integer which caused a sanity check to fail.
>
> This is now fixed with the patch from Marc.
>
> Regards,
> Oliver
> .

2022-01-21 21:09:59

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem


On 20.01.22 07:24, Ziyang Xuan (William) wrote:

> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem.
> So I tried to find the root cause for panic and gave my solution.
>
> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger
> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem.
>
> so->rx.len is 0 after the following logic in isotp_rcv_ff():
>
> /* get the FF_DL */
> so->rx.len = (cf->data[ae] & 0x0F) << 8;
> so->rx.len += cf->data[ae + 1];
>
> so->rx.len is 4096 after the following logic in isotp_rcv_ff():
>
> /* FF_DL = 0 => get real length from next 4 bytes */
> so->rx.len = cf->data[ae + 2] << 24;
> so->rx.len += cf->data[ae + 3] << 16;
> so->rx.len += cf->data[ae + 4] << 8;
> so->rx.len += cf->data[ae + 5];
>

In these cases the values 0 could be the minimum value in so->rx.len -
but e.g. the value 0 can not show up in isotp_rcv_cf() as this function
requires so->rx.state to be ISOTP_WAIT_DATA.

And when so->rx.len is 0 in isotp_rcv_ff() this check

if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl)
return 1;

will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the
end. So after that above check we are still in ISOTP_IDLE state.

Or did I miss something here?

> so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following
> skb_put() will trigger panic.
>
> The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf().
>
> [ 150.605776][ C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096


But so->rx_len is not a value that is modified by alloc_skb():

nskb = alloc_skb(so->rx.len, gfp_any());
if (!nskb)
return 1;

memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
so->rx.len);


Can you send your debug modification changes please?

Best regards,
Oliver

> [ 150.611477][ C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL>
> [ 150.615837][ C6] ------------[ cut here ]------------
> [ 150.617238][ C6] kernel BUG at net/core/skbuff.c:113!
>

2022-01-21 21:40:44

by Ziyang Xuan (William)

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

>
> On 20.01.22 07:24, Ziyang Xuan (William) wrote:
>
>> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem.
>> So I tried to find the root cause for panic and gave my solution.
>>
>> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger
>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem.
>>
>> so->rx.len is 0 after the following logic in isotp_rcv_ff():
>>
>> /* get the FF_DL */
>> so->rx.len = (cf->data[ae] & 0x0F) << 8;
>> so->rx.len += cf->data[ae + 1];
>>
>> so->rx.len is 4096 after the following logic in isotp_rcv_ff():
>>
>> /* FF_DL = 0 => get real length from next 4 bytes */
>> so->rx.len = cf->data[ae + 2] << 24;
>> so->rx.len += cf->data[ae + 3] << 16;
>> so->rx.len += cf->data[ae + 4] << 8;
>> so->rx.len += cf->data[ae + 5];
>>
>
> In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA.

Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence:

isotp_rcv_cf()
if (so->rx.state != ISOTP_WAIT_DATA) [false]
isotp_rcv_ff()
so->rx.state = ISOTP_IDLE
/* get the FF_DL */ [so->rx.len == 0]
alloc_skb() [so->rx.len == 0]
/* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096]
skb_put(nskb, so->rx.len) [so->rx.len == 4096]
skb_over_panic()

>
> And when so->rx.len is 0 in isotp_rcv_ff() this check
>
> if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl)
>         return 1;
>
> will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the end. So after that above check we are still in ISOTP_IDLE state.
>
> Or did I miss something here?
>
>> so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following
>> skb_put() will trigger panic.
>>
>> The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf().
>>
>> [  150.605776][    C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096
>
>
> But so->rx_len is not a value that is modified by alloc_skb():
>
>                 nskb = alloc_skb(so->rx.len, gfp_any());
>                 if (!nskb)
>                         return 1;
>
>                 memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
>                        so->rx.len);
>
>
> Can you send your debug modification changes please?

My reproducing debug as attachment and following:

diff --git a/net/can/isotp.c b/net/can/isotp.c
index df6968b28bf4..8b12d63b4d59 100644
--- a/net/can/isotp.c
+++ b/net/can/isotp.c
@@ -119,8 +119,8 @@ enum {
};

struct tpcon {
- int idx;
- int len;
+ unsigned int idx;
+ unsigned int len;
u32 state;
u8 bs;
u8 sn;
@@ -505,6 +505,7 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
struct isotp_sock *so = isotp_sk(sk);
struct sk_buff *nskb;
int i;
+ bool unexpection = false;

if (so->rx.state != ISOTP_WAIT_DATA)
return 0;
@@ -562,11 +563,13 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
sk_error_report(sk);
return 1;
}
-
+ if (so->rx.len == 0)
+ unexpection = true;
nskb = alloc_skb(so->rx.len, gfp_any());
if (!nskb)
return 1;
-
+ if (unexpection)
+ printk("%s: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: %u\n", __func__, so->rx.len);
memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
so->rx.len);


>
> Best regards,
> Oliver
>
>> [  150.611477][    C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL>
>> [  150.615837][    C6] ------------[ cut here ]------------
>> [  150.617238][    C6] kernel BUG at net/core/skbuff.c:113!
>>
>
> .


Attachments:
0001-can-isotp-debug-for-reproducing-isotp_rcv-panic.patch (1.34 kB)

2022-01-21 22:17:51

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem



On 20.01.22 12:28, Ziyang Xuan (William) wrote:
>>
>> On 20.01.22 07:24, Ziyang Xuan (William) wrote:
>>
>>> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem.
>>> So I tried to find the root cause for panic and gave my solution.
>>>
>>> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger
>>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem.
>>>
>>> so->rx.len is 0 after the following logic in isotp_rcv_ff():
>>>
>>> /* get the FF_DL */
>>> so->rx.len = (cf->data[ae] & 0x0F) << 8;
>>> so->rx.len += cf->data[ae + 1];
>>>
>>> so->rx.len is 4096 after the following logic in isotp_rcv_ff():
>>>
>>> /* FF_DL = 0 => get real length from next 4 bytes */
>>> so->rx.len = cf->data[ae + 2] << 24;
>>> so->rx.len += cf->data[ae + 3] << 16;
>>> so->rx.len += cf->data[ae + 4] << 8;
>>> so->rx.len += cf->data[ae + 5];
>>>
>>
>> In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA.
>
> Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence:

o_O

Sorry but the receive path is not designed to handle concurrent
receptions that would run isotp_rcv_cf() and isotp_rcv_ff() simultaneously.

> isotp_rcv_cf()
> if (so->rx.state != ISOTP_WAIT_DATA) [false]
> isotp_rcv_ff()
> so->rx.state = ISOTP_IDLE
> /* get the FF_DL */ [so->rx.len == 0]
> alloc_skb() [so->rx.len == 0]
> /* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096]
> skb_put(nskb, so->rx.len) [so->rx.len == 4096]
> skb_over_panic()
>

Even though this case is not possible with a real CAN bus due to the CAN
frame transmission times we could introduce some locking (or dropping of
concurrent CAN frames) in isotp_rcv() - but this code runs in net
softirq context ...

Regards,
Oliver


>>
>> And when so->rx.len is 0 in isotp_rcv_ff() this check
>>
>> if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl)
>>         return 1;
>>
>> will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the end. So after that above check we are still in ISOTP_IDLE state.
>>
>> Or did I miss something here?
>>
>>> so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following
>>> skb_put() will trigger panic.
>>>
>>> The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf().
>>>
>>> [  150.605776][    C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096
>>
>>
>> But so->rx_len is not a value that is modified by alloc_skb():
>>
>>                 nskb = alloc_skb(so->rx.len, gfp_any());
>>                 if (!nskb)
>>                         return 1;
>>
>>                 memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
>>                        so->rx.len);
>>
>>
>> Can you send your debug modification changes please?
>
> My reproducing debug as attachment and following:
>
> diff --git a/net/can/isotp.c b/net/can/isotp.c
> index df6968b28bf4..8b12d63b4d59 100644
> --- a/net/can/isotp.c
> +++ b/net/can/isotp.c
> @@ -119,8 +119,8 @@ enum {
> };
>
> struct tpcon {
> - int idx;
> - int len;
> + unsigned int idx;
> + unsigned int len;
> u32 state;
> u8 bs;
> u8 sn;
> @@ -505,6 +505,7 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
> struct isotp_sock *so = isotp_sk(sk);
> struct sk_buff *nskb;
> int i;
> + bool unexpection = false;
>
> if (so->rx.state != ISOTP_WAIT_DATA)
> return 0;
> @@ -562,11 +563,13 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
> sk_error_report(sk);
> return 1;
> }
> -
> + if (so->rx.len == 0)
> + unexpection = true;
> nskb = alloc_skb(so->rx.len, gfp_any());
> if (!nskb)
> return 1;
> -
> + if (unexpection)
> + printk("%s: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: %u\n", __func__, so->rx.len);
> memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
> so->rx.len);
>
>
>>
>> Best regards,
>> Oliver
>>
>>> [  150.611477][    C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL>
>>> [  150.615837][    C6] ------------[ cut here ]------------
>>> [  150.617238][    C6] kernel BUG at net/core/skbuff.c:113!
>>>
>>
>> .

2022-01-22 00:30:04

by Ziyang Xuan (William)

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

>
> On 20.01.22 12:28, Ziyang Xuan (William) wrote:
>>>
>>> On 20.01.22 07:24, Ziyang Xuan (William) wrote:
>>>
>>>> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem.
>>>> So I tried to find the root cause for panic and gave my solution.
>>>>
>>>> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger
>>>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem.
>>>>
>>>> so->rx.len is 0 after the following logic in isotp_rcv_ff():
>>>>
>>>> /* get the FF_DL */
>>>> so->rx.len = (cf->data[ae] & 0x0F) << 8;
>>>> so->rx.len += cf->data[ae + 1];
>>>>
>>>> so->rx.len is 4096 after the following logic in isotp_rcv_ff():
>>>>
>>>> /* FF_DL = 0 => get real length from next 4 bytes */
>>>> so->rx.len = cf->data[ae + 2] << 24;
>>>> so->rx.len += cf->data[ae + 3] << 16;
>>>> so->rx.len += cf->data[ae + 4] << 8;
>>>> so->rx.len += cf->data[ae + 5];
>>>>
>>>
>>> In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA.
>>
>> Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence:
>
> o_O
>
> Sorry but the receive path is not designed to handle concurrent receptions that would run isotp_rcv_cf() and isotp_rcv_ff() simultaneously.
>
>> isotp_rcv_cf()
>> if (so->rx.state != ISOTP_WAIT_DATA) [false]
>>                         isotp_rcv_ff()
>>                         so->rx.state = ISOTP_IDLE
>>                         /* get the FF_DL */ [so->rx.len == 0]
>> alloc_skb() [so->rx.len == 0]
>>                         /* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096]
>> skb_put(nskb, so->rx.len) [so->rx.len == 4096]
>> skb_over_panic()
>>
>
> Even though this case is not possible with a real CAN bus due to the CAN frame transmission times we could introduce some locking (or dropping of concurrent CAN frames) in isotp_rcv() - but this code runs in net softirq context ...
>

I thought the kernel code logic should make sure the kernel availability no matter what happens in
user space code. And tx path has considered so->tx race condition actually but rx path for so->rx.

> Regards,
> Oliver
>
>
>>>
>>> And when so->rx.len is 0 in isotp_rcv_ff() this check
>>>
>>> if (so->rx.len + ae + off + ff_pci_sz < so->rx.ll_dl)
>>>          return 1;
>>>
>>> will return from isotp_rcv_ff() before ISOTP_WAIT_DATA is set at the end. So after that above check we are still in ISOTP_IDLE state.
>>>
>>> Or did I miss something here?
>>>
>>>> so->rx.len is 0 before alloc_skb() and is 4096 after alloc_skb() in isotp_rcv_cf(). The following
>>>> skb_put() will trigger panic.
>>>>
>>>> The following log is my reproducing log with Marc's commit and my debug modification in isotp_rcv_cf().
>>>>
>>>> [  150.605776][    C6] isotp_rcv_cf: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: 4096
>>>
>>>
>>> But so->rx_len is not a value that is modified by alloc_skb():
>>>
>>>                  nskb = alloc_skb(so->rx.len, gfp_any());
>>>                  if (!nskb)
>>>                          return 1;
>>>
>>>                  memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
>>>                         so->rx.len);
>>>
>>>
>>> Can you send your debug modification changes please?
>>
>> My reproducing debug as attachment and following:
>>
>> diff --git a/net/can/isotp.c b/net/can/isotp.c
>> index df6968b28bf4..8b12d63b4d59 100644
>> --- a/net/can/isotp.c
>> +++ b/net/can/isotp.c
>> @@ -119,8 +119,8 @@ enum {
>>   };
>>
>>   struct tpcon {
>> -       int idx;
>> -       int len;
>> +       unsigned int idx;
>> +       unsigned int len;
>>          u32 state;
>>          u8 bs;
>>          u8 sn;
>> @@ -505,6 +505,7 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
>>          struct isotp_sock *so = isotp_sk(sk);
>>          struct sk_buff *nskb;
>>          int i;
>> +       bool unexpection = false;
>>
>>          if (so->rx.state != ISOTP_WAIT_DATA)
>>                  return 0;
>> @@ -562,11 +563,13 @@ static int isotp_rcv_cf(struct sock *sk, struct canfd_frame *cf, int ae,
>>                                  sk_error_report(sk);
>>                          return 1;
>>                  }
>> -
>> +               if (so->rx.len == 0)
>> +                       unexpection = true;
>>                  nskb = alloc_skb(so->rx.len, gfp_any());
>>                  if (!nskb)
>>                          return 1;
>> -
>> +               if (unexpection)
>> +                       printk("%s: before alloc_skb so->rc.len: 0, after alloc_skb so->rx.len: %u\n", __func__, so->rx.len);
>>                  memcpy(skb_put(nskb, so->rx.len), so->rx.buf,
>>                         so->rx.len);
>>
>>
>>>
>>> Best regards,
>>> Oliver
>>>
>>>> [  150.611477][    C6] skbuff: skb_over_panic: text:ffffffff881ff7be len:4096 put:4096 head:ffff88807f93a800 data:ffff88807f93a800 tail:0x1000 end:0xc0 dev:<NULL>
>>>> [  150.615837][    C6] ------------[ cut here ]------------
>>>> [  150.617238][    C6] kernel BUG at net/core/skbuff.c:113!
>>>>
>>>
>>> .
> .

2022-01-28 16:53:21

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

Hello Ziyang Xuan,

On 21.01.22 02:50, Ziyang Xuan (William) wrote:
>>
>> On 20.01.22 12:28, Ziyang Xuan (William) wrote:
>>>>
>>>> On 20.01.22 07:24, Ziyang Xuan (William) wrote:
>>>>
>>>>> I have reproduced the syz problem with Marc's commit, the commit can not fix the panic problem.
>>>>> So I tried to find the root cause for panic and gave my solution.
>>>>>
>>>>> Marc's commit just fix the condition that packet size bigger than INT_MAX which trigger
>>>>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in the syz problem.
>>>>>
>>>>> so->rx.len is 0 after the following logic in isotp_rcv_ff():
>>>>>
>>>>> /* get the FF_DL */
>>>>> so->rx.len = (cf->data[ae] & 0x0F) << 8;
>>>>> so->rx.len += cf->data[ae + 1];
>>>>>
>>>>> so->rx.len is 4096 after the following logic in isotp_rcv_ff():
>>>>>
>>>>> /* FF_DL = 0 => get real length from next 4 bytes */
>>>>> so->rx.len = cf->data[ae + 2] << 24;
>>>>> so->rx.len += cf->data[ae + 3] << 16;
>>>>> so->rx.len += cf->data[ae + 4] << 8;
>>>>> so->rx.len += cf->data[ae + 5];
>>>>>
>>>>
>>>> In these cases the values 0 could be the minimum value in so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf() as this function requires so->rx.state to be ISOTP_WAIT_DATA.
>>>
>>> Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are concurrent for the same isotp_sock as following sequence:
>>
>> o_O
>>
>> Sorry but the receive path is not designed to handle concurrent receptions that would run isotp_rcv_cf() and isotp_rcv_ff() simultaneously.
>>
>>> isotp_rcv_cf()
>>> if (so->rx.state != ISOTP_WAIT_DATA) [false]
>>>                         isotp_rcv_ff()
>>>                         so->rx.state = ISOTP_IDLE
>>>                         /* get the FF_DL */ [so->rx.len == 0]
>>> alloc_skb() [so->rx.len == 0]
>>>                         /* FF_DL = 0 => get real length from next 4 bytes */ [so->rx.len == 4096]
>>> skb_put(nskb, so->rx.len) [so->rx.len == 4096]
>>> skb_over_panic()
>>>
>>
>> Even though this case is not possible with a real CAN bus due to the CAN frame transmission times we could introduce some locking (or dropping of concurrent CAN frames) in isotp_rcv() - but this code runs in net softirq context ...
>>

As discussed off-list I added a spin_lock() in isotp_rcv() as
https://www.kernel.org/doc/htmldocs/kernel-locking/lock-softirqs.html
suggests.

Please give this patch[1] a try in your test setup.

Many thanks,
Oliver

[1]:
https://lore.kernel.org/linux-can/[email protected]/T/

2022-01-30 12:58:20

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

Answering myself ...

I've seen the frame processing sometimes freezes for one second when
stressing the isotp_rcv() from multiple sources. This finally freezes
the entire softirq which is either not good and not needed as we only
need to fix this race for stress tests - and not for real world usage
that does not create this case.

Therefore I created a V2 patch which uses the spin_trylock() to simply
drop the incomming frame in the race condition.

https://lore.kernel.org/linux-can/[email protected]/T/

Please take a look, if it also fixes the issue in your test setup.

Thanks & best regards,
Oliver

On 27.01.22 20:44, Oliver Hartkopp wrote:
> Hello Ziyang Xuan,
>
> On 21.01.22 02:50, Ziyang Xuan (William) wrote:
>>>
>>> On 20.01.22 12:28, Ziyang Xuan (William) wrote:
>>>>>
>>>>> On 20.01.22 07:24, Ziyang Xuan (William) wrote:
>>>>>
>>>>>> I have reproduced the syz problem with Marc's commit, the commit
>>>>>> can not fix the panic problem.
>>>>>> So I tried to find the root cause for panic and gave my solution.
>>>>>>
>>>>>> Marc's commit just fix the condition that packet size bigger than
>>>>>> INT_MAX which trigger
>>>>>> tpcon::{idx,len} integer overflow, but the packet size is 4096 in
>>>>>> the syz problem.
>>>>>>
>>>>>> so->rx.len is 0 after the following logic in isotp_rcv_ff():
>>>>>>
>>>>>> /* get the FF_DL */
>>>>>> so->rx.len = (cf->data[ae] & 0x0F) << 8;
>>>>>> so->rx.len += cf->data[ae + 1];
>>>>>>
>>>>>> so->rx.len is 4096 after the following logic in isotp_rcv_ff():
>>>>>>
>>>>>> /* FF_DL = 0 => get real length from next 4 bytes */
>>>>>> so->rx.len = cf->data[ae + 2] << 24;
>>>>>> so->rx.len += cf->data[ae + 3] << 16;
>>>>>> so->rx.len += cf->data[ae + 4] << 8;
>>>>>> so->rx.len += cf->data[ae + 5];
>>>>>>
>>>>>
>>>>> In these cases the values 0 could be the minimum value in
>>>>> so->rx.len - but e.g. the value 0 can not show up in isotp_rcv_cf()
>>>>> as this function requires so->rx.state to be ISOTP_WAIT_DATA.
>>>>
>>>> Consider the scenario that isotp_rcv_cf() and isotp_rcv_cf() are
>>>> concurrent for the same isotp_sock as following sequence:
>>>
>>> o_O
>>>
>>> Sorry but the receive path is not designed to handle concurrent
>>> receptions that would run isotp_rcv_cf() and isotp_rcv_ff()
>>> simultaneously.
>>>
>>>> isotp_rcv_cf()
>>>> if (so->rx.state != ISOTP_WAIT_DATA) [false]
>>>>                          isotp_rcv_ff()
>>>>                          so->rx.state = ISOTP_IDLE
>>>>                          /* get the FF_DL */ [so->rx.len == 0]
>>>> alloc_skb() [so->rx.len == 0]
>>>>                          /* FF_DL = 0 => get real length from next 4
>>>> bytes */ [so->rx.len == 4096]
>>>> skb_put(nskb, so->rx.len) [so->rx.len == 4096]
>>>> skb_over_panic()
>>>>
>>>
>>> Even though this case is not possible with a real CAN bus due to the
>>> CAN frame transmission times we could introduce some locking (or
>>> dropping of concurrent CAN frames) in isotp_rcv() - but this code
>>> runs in net softirq context ...
>>>
>
> As discussed off-list I added a spin_lock() in isotp_rcv() as
> https://www.kernel.org/doc/htmldocs/kernel-locking/lock-softirqs.html
> suggests.
>
> Please give this patch[1] a try in your test setup.
>
> Many thanks,
> Oliver
>
> [1]:
> https://lore.kernel.org/linux-can/[email protected]/T/
>

2022-01-30 14:34:32

by Marc Kleine-Budde

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

On 28.01.2022 08:56:19, Oliver Hartkopp wrote:
> I've seen the frame processing sometimes freezes for one second when
> stressing the isotp_rcv() from multiple sources. This finally freezes
> the entire softirq which is either not good and not needed as we only
> need to fix this race for stress tests - and not for real world usage
> that does not create this case.

Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled?

>
> Therefore I created a V2 patch which uses the spin_trylock() to simply drop
> the incomming frame in the race condition.
>
> https://lore.kernel.org/linux-can/[email protected]/T/
>
> Please take a look, if it also fixes the issue in your test setup.

regards,
Marc

--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung West/Dortmund | Phone: +49-231-2826-924 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |


Attachments:
(No filename) (1.00 kB)
signature.asc (499.00 B)
Download all attachments

2022-01-30 14:47:32

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem



On 28.01.22 09:07, Marc Kleine-Budde wrote:
> On 28.01.2022 08:56:19, Oliver Hartkopp wrote:
>> I've seen the frame processing sometimes freezes for one second when
>> stressing the isotp_rcv() from multiple sources. This finally freezes
>> the entire softirq which is either not good and not needed as we only
>> need to fix this race for stress tests - and not for real world usage
>> that does not create this case.
>
> Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled?

In kernel config? I enabled almost everything with LOCKing ;-)


CONFIG_LOCKDEP_SUPPORT=y

CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y

CONFIG_ASN1=y
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
CONFIG_MUTEX_SPIN_ON_OWNER=y
CONFIG_RWSEM_SPIN_ON_OWNER=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
CONFIG_FREEZER=y

CONFIG_HWSPINLOCK=y

CONFIG_I8253_LOCK=y

#
# Debug Oops, Lockups and Hangs
#
# CONFIG_PANIC_ON_OOPS is not set
CONFIG_PANIC_ON_OOPS_VALUE=0
CONFIG_PANIC_TIMEOUT=0
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
CONFIG_DETECT_HUNG_TASK=y
CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120 <--- the isotp timeout is 1000 ms
what I can observe here with v1 patch

# CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
# CONFIG_WQ_WATCHDOG is not set
# CONFIG_TEST_LOCKUP is not set
# end of Debug Oops, Lockups and Hangs

Only this debugging stuff is not really enabled:

#
# Lock Debugging (spinlocks, mutexes, etc...)
#
CONFIG_LOCK_DEBUGGING_SUPPORT=y
# CONFIG_PROVE_LOCKING is not set
# CONFIG_LOCK_STAT is not set
# CONFIG_DEBUG_RT_MUTEXES is not set
# CONFIG_DEBUG_SPINLOCK is not set
# CONFIG_DEBUG_MUTEXES is not set
# CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
# CONFIG_DEBUG_RWSEMS is not set
# CONFIG_DEBUG_LOCK_ALLOC is not set
# CONFIG_DEBUG_ATOMIC_SLEEP is not set
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_WW_MUTEX_SELFTEST is not set
# CONFIG_SCF_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set
# end of Lock Debugging (spinlocks, mutexes, etc...)

Would this help to be enabled for this test (of the v1 patch?

Best regards,
Oliver

2022-01-30 16:16:00

by Marc Kleine-Budde

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

On 28.01.2022 09:32:40, Oliver Hartkopp wrote:
>
>
> On 28.01.22 09:07, Marc Kleine-Budde wrote:
> > On 28.01.2022 08:56:19, Oliver Hartkopp wrote:
> > > I've seen the frame processing sometimes freezes for one second when
> > > stressing the isotp_rcv() from multiple sources. This finally freezes
> > > the entire softirq which is either not good and not needed as we only
> > > need to fix this race for stress tests - and not for real world usage
> > > that does not create this case.
> >
> > Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled?

> In kernel config? I enabled almost everything with LOCKing ;-)
>
>
> CONFIG_LOCKDEP_SUPPORT=y
>
> CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
>
> CONFIG_ASN1=y
> CONFIG_UNINLINE_SPIN_UNLOCK=y
> CONFIG_ARCH_SUPPORTS_ATOMIC_RMW=y
> CONFIG_MUTEX_SPIN_ON_OWNER=y
> CONFIG_RWSEM_SPIN_ON_OWNER=y
> CONFIG_LOCK_SPIN_ON_OWNER=y
> CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
> CONFIG_QUEUED_SPINLOCKS=y
> CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
> CONFIG_QUEUED_RWLOCKS=y
> CONFIG_ARCH_HAS_NON_OVERLAPPING_ADDRESS_SPACE=y
> CONFIG_ARCH_HAS_SYNC_CORE_BEFORE_USERMODE=y
> CONFIG_ARCH_HAS_SYSCALL_WRAPPER=y
> CONFIG_FREEZER=y
>
> CONFIG_HWSPINLOCK=y
>
> CONFIG_I8253_LOCK=y
>
> #
> # Debug Oops, Lockups and Hangs
> #
> # CONFIG_PANIC_ON_OOPS is not set
> CONFIG_PANIC_ON_OOPS_VALUE=0
> CONFIG_PANIC_TIMEOUT=0
> CONFIG_LOCKUP_DETECTOR=y
> CONFIG_SOFTLOCKUP_DETECTOR=y
> # CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
> CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
> CONFIG_HARDLOCKUP_DETECTOR_PERF=y
> CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
> CONFIG_HARDLOCKUP_DETECTOR=y
> # CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
> CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
> CONFIG_DETECT_HUNG_TASK=y
> CONFIG_DEFAULT_HUNG_TASK_TIMEOUT=120 <--- the isotp timeout is 1000 ms what
> I can observe here with v1 patch
>
> # CONFIG_BOOTPARAM_HUNG_TASK_PANIC is not set
> CONFIG_BOOTPARAM_HUNG_TASK_PANIC_VALUE=0
> # CONFIG_WQ_WATCHDOG is not set
> # CONFIG_TEST_LOCKUP is not set
> # end of Debug Oops, Lockups and Hangs
>
> Only this debugging stuff is not really enabled:
>
> #
> # Lock Debugging (spinlocks, mutexes, etc...)
> #
> CONFIG_LOCK_DEBUGGING_SUPPORT=y
> # CONFIG_PROVE_LOCKING is not set
CONFIG_PROVE_LOCKING=y

Marc

> # CONFIG_LOCK_STAT is not set
> # CONFIG_DEBUG_RT_MUTEXES is not set
> # CONFIG_DEBUG_SPINLOCK is not set
> # CONFIG_DEBUG_MUTEXES is not set
> # CONFIG_DEBUG_WW_MUTEX_SLOWPATH is not set
> # CONFIG_DEBUG_RWSEMS is not set
> # CONFIG_DEBUG_LOCK_ALLOC is not set
> # CONFIG_DEBUG_ATOMIC_SLEEP is not set
> # CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
> # CONFIG_LOCK_TORTURE_TEST is not set
> # CONFIG_WW_MUTEX_SELFTEST is not set
> # CONFIG_SCF_TORTURE_TEST is not set
> # CONFIG_CSD_LOCK_WAIT_DEBUG is not set
> # end of Lock Debugging (spinlocks, mutexes, etc...)
>
> Would this help to be enabled for this test (of the v1 patch?
>
> Best regards,
> Oliver
>
>

--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung West/Dortmund | Phone: +49-231-2826-924 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |


Attachments:
(No filename) (3.22 kB)
signature.asc (499.00 B)
Download all attachments

2022-01-31 11:10:54

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

Hello Marc, hello William,

On 28.01.22 09:46, Marc Kleine-Budde wrote:
> On 28.01.2022 09:32:40, Oliver Hartkopp wrote:
>>
>>
>> On 28.01.22 09:07, Marc Kleine-Budde wrote:
>>> On 28.01.2022 08:56:19, Oliver Hartkopp wrote:
>>>> I've seen the frame processing sometimes freezes for one second when
>>>> stressing the isotp_rcv() from multiple sources. This finally freezes
>>>> the entire softirq which is either not good and not needed as we only
>>>> need to fix this race for stress tests - and not for real world usage
>>>> that does not create this case.
>>>
>>> Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled?


>> #
>> # Lock Debugging (spinlocks, mutexes, etc...)
>> #
>> CONFIG_LOCK_DEBUGGING_SUPPORT=y
>> # CONFIG_PROVE_LOCKING is not set
> CONFIG_PROVE_LOCKING=y

Now enabled even more locking (seen relevant kernel config at the end).

It turns out that there is no visible difference when using spin_lock()
or spin_trylock().

I only got some of these kernel log entries

Jan 28 11:13:14 silver kernel: [ 2396.323211] perf: interrupt took too
long (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250
Jan 28 11:25:49 silver kernel: [ 3151.172773] perf: interrupt took too
long (3188 > 3186), lowering kernel.perf_event_max_sample_rate to 62500
Jan 28 11:45:24 silver kernel: [ 4325.583328] perf: interrupt took too
long (4009 > 3985), lowering kernel.perf_event_max_sample_rate to 49750
Jan 28 12:15:46 silver kernel: [ 6148.238246] perf: interrupt took too
long (5021 > 5011), lowering kernel.perf_event_max_sample_rate to 39750
Jan 28 13:01:45 silver kernel: [ 8907.303715] perf: interrupt took too
long (6285 > 6276), lowering kernel.perf_event_max_sample_rate to 31750

But I get these sporadically anyway. No other LOCKDEP splat.

At least the issue reported by William should be fixed now - but I'm
still unclear whether spin_lock() or spin_trylock() is the best approach
here in the NET_RX softirq?!?

Best regards,
Oliver


$ grep LOCK .config | grep -v BLOCK | grep -v CLOCK
CONFIG_LOCKDEP_SUPPORT=y
# CONFIG_PM_WAKELOCKS is not set
CONFIG_HAVE_HARDLOCKUP_DETECTOR_PERF=y
# CONFIG_LOCK_EVENT_COUNTS is not set
CONFIG_UNINLINE_SPIN_UNLOCK=y
CONFIG_LOCK_SPIN_ON_OWNER=y
CONFIG_ARCH_USE_QUEUED_SPINLOCKS=y
CONFIG_QUEUED_SPINLOCKS=y
CONFIG_ARCH_USE_QUEUED_RWLOCKS=y
CONFIG_QUEUED_RWLOCKS=y
CONFIG_SPLIT_PTLOCK_CPUS=4
CONFIG_ARCH_ENABLE_SPLIT_PMD_PTLOCK=y
CONFIG_PCI_LOCKLESS_CONFIG=y
# CONFIG_DRM_DEBUG_MODESET_LOCK is not set
CONFIG_HWSPINLOCK=y
CONFIG_I8253_LOCK=y
CONFIG_FILE_LOCKING=y
# CONFIG_SECURITY_LOCKDOWN_LSM is not set
# CONFIG_SECURITY_LANDLOCK is not set
# CONFIG_CRYPTO_DEV_PADLOCK is not set
CONFIG_ARCH_USE_CMPXCHG_LOCKREF=y
CONFIG_LOCKUP_DETECTOR=y
CONFIG_SOFTLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC_VALUE=0
CONFIG_HARDLOCKUP_DETECTOR_PERF=y
CONFIG_HARDLOCKUP_CHECK_TIMESTAMP=y
CONFIG_HARDLOCKUP_DETECTOR=y
# CONFIG_BOOTPARAM_HARDLOCKUP_PANIC is not set
CONFIG_BOOTPARAM_HARDLOCKUP_PANIC_VALUE=0
# CONFIG_TEST_LOCKUP is not set
CONFIG_LOCK_DEBUGGING_SUPPORT=y
CONFIG_PROVE_LOCKING=y
CONFIG_PROVE_RAW_LOCK_NESTING=y
# CONFIG_LOCK_STAT is not set
CONFIG_DEBUG_SPINLOCK=y
CONFIG_DEBUG_LOCK_ALLOC=y
CONFIG_LOCKDEP=y
CONFIG_LOCKDEP_BITS=15
CONFIG_LOCKDEP_CHAINS_BITS=16
CONFIG_LOCKDEP_STACK_TRACE_BITS=19
CONFIG_LOCKDEP_STACK_TRACE_HASH_BITS=14
CONFIG_LOCKDEP_CIRCULAR_QUEUE_BITS=12
CONFIG_DEBUG_LOCKDEP=y
# CONFIG_DEBUG_LOCKING_API_SELFTESTS is not set
# CONFIG_LOCK_TORTURE_TEST is not set
# CONFIG_CSD_LOCK_WAIT_DEBUG is not set

2022-02-08 13:42:51

by Marc Kleine-Budde

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

On 28.01.2022 15:48:05, Oliver Hartkopp wrote:
> Hello Marc, hello William,
>
> On 28.01.22 09:46, Marc Kleine-Budde wrote:
> > On 28.01.2022 09:32:40, Oliver Hartkopp wrote:
> > >
> > >
> > > On 28.01.22 09:07, Marc Kleine-Budde wrote:
> > > > On 28.01.2022 08:56:19, Oliver Hartkopp wrote:
> > > > > I've seen the frame processing sometimes freezes for one second when
> > > > > stressing the isotp_rcv() from multiple sources. This finally freezes
> > > > > the entire softirq which is either not good and not needed as we only
> > > > > need to fix this race for stress tests - and not for real world usage
> > > > > that does not create this case.
> > > >
> > > > Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled?
>
>
> > > #
> > > # Lock Debugging (spinlocks, mutexes, etc...)
> > > #
> > > CONFIG_LOCK_DEBUGGING_SUPPORT=y
> > > # CONFIG_PROVE_LOCKING is not set
> > CONFIG_PROVE_LOCKING=y
>
> Now enabled even more locking (seen relevant kernel config at the end).
>
> It turns out that there is no visible difference when using spin_lock() or
> spin_trylock().
>
> I only got some of these kernel log entries
>
> Jan 28 11:13:14 silver kernel: [ 2396.323211] perf: interrupt took too long
> (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250
> Jan 28 11:25:49 silver kernel: [ 3151.172773] perf: interrupt took too long
> (3188 > 3186), lowering kernel.perf_event_max_sample_rate to 62500
> Jan 28 11:45:24 silver kernel: [ 4325.583328] perf: interrupt took too long
> (4009 > 3985), lowering kernel.perf_event_max_sample_rate to 49750
> Jan 28 12:15:46 silver kernel: [ 6148.238246] perf: interrupt took too long
> (5021 > 5011), lowering kernel.perf_event_max_sample_rate to 39750
> Jan 28 13:01:45 silver kernel: [ 8907.303715] perf: interrupt took too long
> (6285 > 6276), lowering kernel.perf_event_max_sample_rate to 31750
>
> But I get these sporadically anyway. No other LOCKDEP splat.
>
> At least the issue reported by William should be fixed now - but I'm still
> unclear whether spin_lock() or spin_trylock() is the best approach here in
> the NET_RX softirq?!?

With the !spin_trylock() -> return you are saying if something
concurrent happens, drop it. This doesn't sound correct.

Marc

--
Pengutronix e.K. | Marc Kleine-Budde |
Embedded Linux | https://www.pengutronix.de |
Vertretung West/Dortmund | Phone: +49-231-2826-924 |
Amtsgericht Hildesheim, HRA 2686 | Fax: +49-5121-206917-5555 |


Attachments:
(No filename) (2.52 kB)
signature.asc (499.00 B)
Download all attachments

2022-02-09 12:54:04

by Oliver Hartkopp

[permalink] [raw]
Subject: Re: [PATCH net] can: isotp: isotp_rcv_cf(): fix so->rx race problem

Hi Marc,

On 07.02.22 09:11, Marc Kleine-Budde wrote:
> On 28.01.2022 15:48:05, Oliver Hartkopp wrote:
>> Hello Marc, hello William,
>>
>> On 28.01.22 09:46, Marc Kleine-Budde wrote:
>>> On 28.01.2022 09:32:40, Oliver Hartkopp wrote:
>>>>
>>>>
>>>> On 28.01.22 09:07, Marc Kleine-Budde wrote:
>>>>> On 28.01.2022 08:56:19, Oliver Hartkopp wrote:
>>>>>> I've seen the frame processing sometimes freezes for one second when
>>>>>> stressing the isotp_rcv() from multiple sources. This finally freezes
>>>>>> the entire softirq which is either not good and not needed as we only
>>>>>> need to fix this race for stress tests - and not for real world usage
>>>>>> that does not create this case.
>>>>>
>>>>> Hmmm, this doesn't sound good. Can you test with LOCKDEP enabled?
>>
>>
>>>> #
>>>> # Lock Debugging (spinlocks, mutexes, etc...)
>>>> #
>>>> CONFIG_LOCK_DEBUGGING_SUPPORT=y
>>>> # CONFIG_PROVE_LOCKING is not set
>>> CONFIG_PROVE_LOCKING=y
>>
>> Now enabled even more locking (seen relevant kernel config at the end).
>>
>> It turns out that there is no visible difference when using spin_lock() or
>> spin_trylock().
>>
>> I only got some of these kernel log entries
>>
>> Jan 28 11:13:14 silver kernel: [ 2396.323211] perf: interrupt took too long
>> (2549 > 2500), lowering kernel.perf_event_max_sample_rate to 78250
>> Jan 28 11:25:49 silver kernel: [ 3151.172773] perf: interrupt took too long
>> (3188 > 3186), lowering kernel.perf_event_max_sample_rate to 62500
>> Jan 28 11:45:24 silver kernel: [ 4325.583328] perf: interrupt took too long
>> (4009 > 3985), lowering kernel.perf_event_max_sample_rate to 49750
>> Jan 28 12:15:46 silver kernel: [ 6148.238246] perf: interrupt took too long
>> (5021 > 5011), lowering kernel.perf_event_max_sample_rate to 39750
>> Jan 28 13:01:45 silver kernel: [ 8907.303715] perf: interrupt took too long
>> (6285 > 6276), lowering kernel.perf_event_max_sample_rate to 31750
>>
>> But I get these sporadically anyway. No other LOCKDEP splat.
>>
>> At least the issue reported by William should be fixed now - but I'm still
>> unclear whether spin_lock() or spin_trylock() is the best approach here in
>> the NET_RX softirq?!?
>
> With the !spin_trylock() -> return you are saying if something
> concurrent happens, drop it. This doesn't sound correct.

Yes, I had the same feeling and did some extensive load tests using both
variants.

It turned out the standard spin_lock() works excellent to fix the issue.

Thanks for taking it for upstream here:
https://lore.kernel.org/linux-can/[email protected]/T/#t

Best regards,
Oliver