2021-02-12 18:20:14

by Daniel Wagner

[permalink] [raw]
Subject: [PATCH] nvme-tcp: Check if request has started before processing it

blk_mq_tag_to_rq() will always return a request if the command_id is
in the valid range. Check if the request has been started. If we
blindly process the request we might double complete a request which
can be fatal.

Signed-off-by: Daniel Wagner <[email protected]>
---

This patch is against nvme-5.12.

There is one blk_mq_tag_to_rq() in nvme_tcp_recv_ddgst() which I
didn't update as I am not sure if it's also needed.

py-crash> bt
#0 0xffffffffa76a33de in arch_atomic_try_cmpxchg (new=<optimized out>, old=<optimized out>, v=<optimized out>)
at ../arch/x86/include/asm/atomic.h:200
#1 atomic_try_cmpxchg (new=<optimized out>, old=<optimized out>, v=<optimized out>) at ../include/asm-generic/atomic-instrumented.h:695
#2 queued_spin_lock (lock=<optimized out>) at ../include/asm-generic/qspinlock.h:78
#3 do_raw_spin_lock_flags (flags=<optimized out>, lock=<optimized out>) at ../include/linux/spinlock.h:193
#4 __raw_spin_lock_irqsave (lock=<optimized out>) at ../include/linux/spinlock_api_smp.h:119
#5 _raw_spin_lock_irqsave (lock=0x8 <__UNIQUE_ID_license257+8>) at ../kernel/locking/spinlock.c:159
#6 0xffffffffa6eea418 in complete (x=0x0 <__UNIQUE_ID_license257>) at ../kernel/sched/completion.c:32
#7 0xffffffffa721f99c in blk_mq_force_complete_rq (rq=0x8 <__UNIQUE_ID_license257+8>) at ../block/blk-mq.c:634
#8 0xffffffffa721fa0a in blk_mq_complete_request (rq=<optimized out>) at ../block/blk-mq.c:672
#9 0xffffffffc0b092ef in nvme_end_request (result=..., status=<optimized out>, req=<optimized out>) at ../drivers/nvme/host/nvme.h:477
#10 nvme_tcp_process_nvme_cqe (cqe=<optimized out>, queue=<optimized out>) at ../drivers/nvme/host/tcp.c:485
rq = 0xffff948b840d0000
hdr = <optimized out>
ret = 0
queue = 0xffff949501dd8110
result = 0
#11 nvme_tcp_handle_comp (pdu=<optimized out>, queue=<optimized out>) at ../drivers/nvme/host/tcp.c:542
#12 nvme_tcp_recv_pdu (len=<optimized out>, offset=<optimized out>, skb=<optimized out>, queue=<optimized out>) at ../drivers/nvme/host/tcp.c:660
#13 nvme_tcp_recv_skb (desc=<optimized out>, skb=<optimized out>, offset=24, len=0) at ../drivers/nvme/host/tcp.c:805
#14 0xffffffffa7598af5 in tcp_read_sock
(sk=0x8 <__UNIQUE_ID_license257+8>, desc=0xa <__UNIQUE_ID_license257+10>, recv_actor=0x1 <__UNIQUE_ID_license257+1>) at ../net/ipv4/tcp.c:1645
#15 0xffffffffc0b075b8 in nvme_tcp_try_recv (queue=0xffff949501dd8110) at ../drivers/nvme/host/tcp.c:1102
#16 0xffffffffc0b08fc7 in nvme_tcp_io_work (w=0xffff949501dd8118) at ../drivers/nvme/host/tcp.c:1126
#17 0xffffffffa6eba4e4 in process_one_work (worker=0xffff948d1b633ec0, work=0xffff949501dd8118) at ../kernel/workqueue.c:2273
#18 0xffffffffa6eba6fd in worker_thread (__worker=0xffff948d1b633ec0) at ../kernel/workqueue.c:2419
#19 0xffffffffa6ec0a3d in kthread (_create=0xffff948d1b618ec0) at ../kernel/kthread.c:268
#20 0xffffffffa7800215 in ret_from_fork () at ../arch/x86/entry/entry_64.S:351

py-crash> p /x ((struct request*)0xffff948b840d0000)->state
$2 = 0x2

drivers/nvme/host/tcp.c | 8 ++++----
1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/drivers/nvme/host/tcp.c b/drivers/nvme/host/tcp.c
index 69f59d2c5799..4bec705ce8e6 100644
--- a/drivers/nvme/host/tcp.c
+++ b/drivers/nvme/host/tcp.c
@@ -485,7 +485,7 @@ static int nvme_tcp_process_nvme_cqe(struct nvme_tcp_queue *queue,
struct request *rq;

rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), cqe->command_id);
- if (!rq) {
+ if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag 0x%x not found\n",
nvme_tcp_queue_id(queue), cqe->command_id);
@@ -506,7 +506,7 @@ static int nvme_tcp_handle_c2h_data(struct nvme_tcp_queue *queue,
struct request *rq;

rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
- if (!rq) {
+ if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag %#x not found\n",
nvme_tcp_queue_id(queue), pdu->command_id);
@@ -610,7 +610,7 @@ static int nvme_tcp_handle_r2t(struct nvme_tcp_queue *queue,
int ret;

rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
- if (!rq) {
+ if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag %#x not found\n",
nvme_tcp_queue_id(queue), pdu->command_id);
@@ -696,7 +696,7 @@ static int nvme_tcp_recv_data(struct nvme_tcp_queue *queue, struct sk_buff *skb,
struct request *rq;

rq = blk_mq_tag_to_rq(nvme_tcp_tagset(queue), pdu->command_id);
- if (!rq) {
+ if (!rq || !blk_mq_request_started(rq)) {
dev_err(queue->ctrl->ctrl.device,
"queue %d tag %#x not found\n",
nvme_tcp_queue_id(queue), pdu->command_id);
--
2.29.2


2021-02-12 21:04:27

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

> blk_mq_tag_to_rq() will always return a request if the command_id is
> in the valid range. Check if the request has been started. If we
> blindly process the request we might double complete a request which
> can be fatal.

How did you get to this one? did the controller send a completion for
a completed/bogus request?

2021-02-12 21:14:03

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Fri, Feb 12, 2021 at 12:58:27PM -0800, Sagi Grimberg wrote:
> > blk_mq_tag_to_rq() will always return a request if the command_id is
> > in the valid range. Check if the request has been started. If we
> > blindly process the request we might double complete a request which
> > can be fatal.
>
> How did you get to this one? did the controller send a completion for
> a completed/bogus request?

If that is the case, then that must mean it's possible the driver could
have started the command id just before the bogus completion check. Data
iorruption, right?

2021-02-12 21:53:38

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it


>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>> in the valid range. Check if the request has been started. If we
>>> blindly process the request we might double complete a request which
>>> can be fatal.
>>
>> How did you get to this one? did the controller send a completion for
>> a completed/bogus request?
>
> If that is the case, then that must mean it's possible the driver could
> have started the command id just before the bogus completion check. Data
> iorruption, right?

Yes, which is why I don't think this check is very useful..

2021-02-13 08:48:24

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 2/12/21 7:17 PM, Daniel Wagner wrote:
> blk_mq_tag_to_rq() will always return a request if the command_id is
> in the valid range. Check if the request has been started. If we
> blindly process the request we might double complete a request which
> can be fatal.
>
> Signed-off-by: Daniel Wagner <[email protected]>
> ---
>
> This patch is against nvme-5.12.
>
> There is one blk_mq_tag_to_rq() in nvme_tcp_recv_ddgst() which I
> didn't update as I am not sure if it's also needed.
>
I guess it is; this patch is essentially a protection against invalid
frames, and as such affects all places.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2021-02-13 08:49:09

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 2/12/21 10:49 PM, Sagi Grimberg wrote:
>
>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>> in the valid range. Check if the request has been started. If we
>>>> blindly process the request we might double complete a request which
>>>> can be fatal.
>>>
>>> How did you get to this one? did the controller send a completion for
>>> a completed/bogus request?
>>
>> If that is the case, then that must mean it's possible the driver could
>> have started the command id just before the bogus completion check. Data
>> iorruption, right?
>
> Yes, which is why I don't think this check is very useful..

I actually view that as a valid protection against spoofed frames.
Without it it's easy to crash the machine by injecting fake completions
with random command ids.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2021-02-15 10:50:45

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Sat, Feb 13, 2021 at 09:46:41AM +0100, Hannes Reinecke wrote:
> On 2/12/21 10:49 PM, Sagi Grimberg wrote:
> >
> > > > > blk_mq_tag_to_rq() will always return a request if the command_id is
> > > > > in the valid range. Check if the request has been started. If we
> > > > > blindly process the request we might double complete a request which
> > > > > can be fatal.
> > > >
> > > > How did you get to this one? did the controller send a completion for
> > > > a completed/bogus request?
> > >
> > > If that is the case, then that must mean it's possible the driver could
> > > have started the command id just before the bogus completion check. Data
> > > iorruption, right?

'during TCP LIF toggles and aggr relocates' testing the host
crashes. TBH, I do not really know what is happening or what the test
does. Still trying to figure out what's going on. I was just very
surprised how much the code trusts the other side to behave correctly.

> > Yes, which is why I don't think this check is very useful..
>
> I actually view that as a valid protection against spoofed frames.
> Without it it's easy to crash the machine by injecting fake completions with
> random command ids.

In this test scenario it's not even a spoofed frame; maybe just a
confused controller.

2021-02-15 21:27:47

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it


>>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>>> in the valid range. Check if the request has been started. If we
>>>>> blindly process the request we might double complete a request which
>>>>> can be fatal.
>>>>
>>>> How did you get to this one? did the controller send a completion for
>>>> a completed/bogus request?
>>>
>>> If that is the case, then that must mean it's possible the driver could
>>> have started the command id just before the bogus completion check. Data
>>> iorruption, right?
>>
>> Yes, which is why I don't think this check is very useful..
>
> I actually view that as a valid protection against spoofed frames.
> Without it it's easy to crash the machine by injecting fake completions
> with random command ids.

And this doesn't help because the command can have been easily reused
and started... What is this protecting against? Note that none of the
other transports checks that, why should tcp?

2021-02-15 21:34:05

by Sagi Grimberg

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it


>>>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>>>> in the valid range. Check if the request has been started. If we
>>>>>> blindly process the request we might double complete a request which
>>>>>> can be fatal.
>>>>>
>>>>> How did you get to this one? did the controller send a completion for
>>>>> a completed/bogus request?
>>>>
>>>> If that is the case, then that must mean it's possible the driver could
>>>> have started the command id just before the bogus completion check. Data
>>>> iorruption, right?
>
> 'during TCP LIF toggles and aggr relocates' testing the host
> crashes. TBH, I do not really know what is happening or what the test
> does. Still trying to figure out what's going on.

Well, I think we should probably figure out why that is happening first.

> I was just very
> surprised how much the code trusts the other side to behave correctly./

What does pci/rdma/fc does differently? What does scsi do here?

>>> Yes, which is why I don't think this check is very useful..
>>
>> I actually view that as a valid protection against spoofed frames.
>> Without it it's easy to crash the machine by injecting fake completions with
>> random command ids.
>
> In this test scenario it's not even a spoofed frame; maybe just a
> confused controller.

Maybe... I am still not sure how this patch helps here though...

2021-02-16 08:53:23

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 2/15/21 10:23 PM, Sagi Grimberg wrote:
>
>>>>>> blk_mq_tag_to_rq() will always return a request if the command_id is
>>>>>> in the valid range. Check if the request has been started. If we
>>>>>> blindly process the request we might double complete a request which
>>>>>> can be fatal.
>>>>>
>>>>> How did you get to this one? did the controller send a completion for
>>>>> a completed/bogus request?
>>>>
>>>> If that is the case, then that must mean it's possible the driver could
>>>> have started the command id just before the bogus completion check.
>>>> Data
>>>> iorruption, right?
>>>
>>> Yes, which is why I don't think this check is very useful..
>>
>> I actually view that as a valid protection against spoofed frames.
>> Without it it's easy to crash the machine by injecting fake
>> completions with random command ids.
>
> And this doesn't help because the command can have been easily reused
> and started... What is this protecting against? Note that none of the
> other transports checks that, why should tcp?

Because it's particularly easy to spoof packets on tcp.
All other nvme-of transports are layered on top of other transports
which do some sanity checks already, so it becomes really hard to inject
invalid NVMe-oF frames for those.
NVMe-TCP has none of these protections, making it really easy to inject
faulty frames (or, heaven forbid, running a packet fuzzer).

And crashing the machine on invalid frames is always a bad idea; I would
have expected NVMe-TCP to drop them.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2021-02-26 12:38:00

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
> Well, I think we should probably figure out why that is happening first.

I got my hands on a tcpdump trace. I've trimmed it to this:


No. Time Source Destination Protocol Length Info
1 0.000000 10.228.194.30 10.228.38.214 NVMe 138 NVMe Write
2 0.000285 10.228.38.214 10.228.194.30 NVMe/TCP 90 Ready To Transfer
3 0.000591 10.228.194.30 10.228.38.214 NVMe 4186 NVMe Write: Data
4 0.000673 10.228.38.214 10.228.194.30 TCP 66 4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 TSecr=1497295579
5 0.002140 10.228.38.214 10.228.194.30 NVMe 90 NVMe Write: Response
6 0.002511 10.228.194.30 10.228.38.175 NVMe 138 NVMe Write
7 0.002812 10.228.38.175 10.228.194.30 NVMe/TCP 90 Ready To Transfer
8 0.003006 10.228.194.30 10.228.38.175 NVMe 4186 NVMe Write: Data
9 0.003098 10.228.38.175 10.228.194.30 TCP 66 4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 TSecr=3601034207
10 0.004420 10.228.38.175 10.228.194.30 NVMe 90 NVMe Write: Response
11 0.004890 10.228.38.214 10.228.194.30 NVMe/TCP 90
12 0.004969 10.228.38.175 10.228.194.30 NVMe/TCP 90


The last few seconds contain only normal writes and suddenly the host
receives two invalid packets. From what I see the host doesn't misbehave
at all. I wonder if it would be possible to detect the invalid packet by
locking at the PDU header only. If this would be possible we could
discard it early and do not try to use the invalid command id.

Here the last two packets with details:


No. Time Source Destination Protocol Length Info
11 0.004890 10.228.38.214 10.228.194.30 NVMe/TCP 90

Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Encapsulation type: Ethernet (1)
Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1614100568.780574000 seconds
[Time delta from previous captured frame: 0.000470000 seconds]
[Time delta from previous displayed frame: 0.000470000 seconds]
[Time since reference or first frame: 0.004890000 seconds]
Frame Number: 11
Frame Length: 90 bytes (720 bits)
Capture Length: 90 bytes (720 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
[Coloring Rule Name: TCP]
[Coloring Rule String: tcp]
Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
0000 00.. = Differentiated Services Codepoint: Default (0)
.... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
Total Length: 76
Identification: 0x0000 (0)
Flags: 0x40, Don't fragment
0... .... = Reserved bit: Not set
.1.. .... = Don't fragment: Set
..0. .... = More fragments: Not set
Fragment Offset: 0
Time to Live: 64
Protocol: TCP (6)
Header Checksum: 0x0000 [validation disabled]
[Header checksum status: Unverified]
Source Address: 10.228.38.214
Destination Address: 10.228.194.30
Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, Len: 24
Source Port: 4420
Destination Port: 46909
[Stream index: 2]
[TCP Segment Len: 24]
Sequence Number: 1 (relative sequence number)
Sequence Number (raw): 4175488220
[Next Sequence Number: 25 (relative sequence number)]
Acknowledgment Number: 1 (relative ack number)
Acknowledgment number (raw): 2966903626
1000 .... = Header Length: 32 bytes (8)
Flags: 0x018 (PSH, ACK)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced (CWR): Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...1 .... = Acknowledgment: Set
.... .... 1... = Push: Set
.... .... .0.. = Reset: Not set
.... .... ..0. = Syn: Not set
.... .... ...0 = Fin: Not set
[TCP Flags: ·······AP···]
Window: 257
[Calculated window size: 257]
[Window size scaling factor: -1 (unknown)]
Checksum: 0xfefa [unverified]
[Checksum Status: Unverified]
Urgent Pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
TCP Option - No-Operation (NOP)
Kind: No-Operation (1)
TCP Option - No-Operation (NOP)
Kind: No-Operation (1)
TCP Option - Timestamps: TSval 4211986351, TSecr 1497291787
Kind: Time Stamp Option (8)
Length: 10
Timestamp value: 4211986351
Timestamp echo reply: 1497291787
[SEQ/ACK analysis]
[Bytes in flight: 24]
[Bytes sent since last PSH flag: 24]
[Timestamps]
[Time since first frame in this TCP stream: 0.000000000 seconds]
[Time since previous frame in this TCP stream: 0.000000000 seconds]
TCP payload (24 bytes)
[PDU Size: 24]
NVM Express Fabrics TCP
Pdu Type: CapsuleResponse (5)
Pdu Specific Flags: 0x00
.... ...0 = PDU Header Digest: Not set
.... ..0. = PDU Data Digest: Not set
.... .0.. = PDU Data Last: Not set
.... 0... = PDU Data Success: Not set
Pdu Header Length: 24
Pdu Data Offset: 0
Packet Length: 24
Unknown Data: 02000400000000001c0000001f000000

0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 d6 0a e4 .L..@.@.....&...
0020 c2 1e 11 44 b7 3d f8 e0 e4 dc b0 d7 5b 4a 80 18 ...D.=......[J..
0030 01 01 fe fa 00 00 01 01 08 0a fb 0d cf af 59 3e ..............Y>
0040 dc 0b 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ................
0050 00 00 1c 00 00 00 1f 00 00 00 ..........

No. Time Source Destination Protocol Length Info
12 0.004969 10.228.38.175 10.228.194.30 NVMe/TCP 90

Frame 12: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
Encapsulation type: Ethernet (1)
Arrival Time: Feb 23, 2021 18:16:08.780653000 CET
[Time shift for this packet: 0.000000000 seconds]
Epoch Time: 1614100568.780653000 seconds
[Time delta from previous captured frame: 0.000079000 seconds]
[Time delta from previous displayed frame: 0.000079000 seconds]
[Time since reference or first frame: 0.004969000 seconds]
Frame Number: 12
Frame Length: 90 bytes (720 bits)
Capture Length: 90 bytes (720 bits)
[Frame is marked: False]
[Frame is ignored: False]
[Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
[Coloring Rule Name: TCP]
[Coloring Rule String: tcp]
Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
.... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
.... ...0 .... .... .... .... = IG bit: Individual address (unicast)
Type: IPv4 (0x0800)
Internet Protocol Version 4, Src: 10.228.38.175, Dst: 10.228.194.30
0100 .... = Version: 4
.... 0101 = Header Length: 20 bytes (5)
Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
0000 00.. = Differentiated Services Codepoint: Default (0)
.... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
Total Length: 76
Identification: 0x0000 (0)
Flags: 0x40, Don't fragment
0... .... = Reserved bit: Not set
.1.. .... = Don't fragment: Set
..0. .... = More fragments: Not set
Fragment Offset: 0
Time to Live: 64
Protocol: TCP (6)
Header Checksum: 0x0000 [validation disabled]
[Header checksum status: Unverified]
Source Address: 10.228.38.175
Destination Address: 10.228.194.30
Transmission Control Protocol, Src Port: 4420, Dst Port: 34895, Seq: 1, Ack: 1, Len: 24
Source Port: 4420
Destination Port: 34895
[Stream index: 3]
[TCP Segment Len: 24]
Sequence Number: 1 (relative sequence number)
Sequence Number (raw): 3092812012
[Next Sequence Number: 25 (relative sequence number)]
Acknowledgment Number: 1 (relative ack number)
Acknowledgment number (raw): 2384147181
1000 .... = Header Length: 32 bytes (8)
Flags: 0x018 (PSH, ACK)
000. .... .... = Reserved: Not set
...0 .... .... = Nonce: Not set
.... 0... .... = Congestion Window Reduced (CWR): Not set
.... .0.. .... = ECN-Echo: Not set
.... ..0. .... = Urgent: Not set
.... ...1 .... = Acknowledgment: Set
.... .... 1... = Push: Set
.... .... .0.. = Reset: Not set
.... .... ..0. = Syn: Not set
.... .... ...0 = Fin: Not set
[TCP Flags: ·······AP···]
Window: 257
[Calculated window size: 257]
[Window size scaling factor: -1 (unknown)]
Checksum: 0xfed3 [unverified]
[Checksum Status: Unverified]
Urgent Pointer: 0
Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
TCP Option - No-Operation (NOP)
Kind: No-Operation (1)
TCP Option - No-Operation (NOP)
Kind: No-Operation (1)
TCP Option - Timestamps: TSval 3874335934, TSecr 3601030412
Kind: Time Stamp Option (8)
Length: 10
Timestamp value: 3874335934
Timestamp echo reply: 3601030412
[SEQ/ACK analysis]
[Bytes in flight: 24]
[Bytes sent since last PSH flag: 24]
[Timestamps]
[Time since first frame in this TCP stream: 0.000000000 seconds]
[Time since previous frame in this TCP stream: 0.000000000 seconds]
TCP payload (24 bytes)
[PDU Size: 24]
NVM Express Fabrics TCP
Pdu Type: CapsuleResponse (5)
Pdu Specific Flags: 0x00
.... ...0 = PDU Header Digest: Not set
.... ..0. = PDU Data Digest: Not set
.... .0.. = PDU Data Last: Not set
.... 0... = PDU Data Success: Not set
Pdu Header Length: 24
Pdu Data Offset: 0
Packet Length: 24
Unknown Data: 02000400000000001b0000001f000000

0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4 .L..@.@.....&...
0020 c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18 ...D.O.X....2...
0030 01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3 ................
0040 5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ]...............
0050 00 00 1b 00 00 00 1f 00 00 00 ..........

2021-02-26 12:57:32

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 2/26/21 1:35 PM, Daniel Wagner wrote:
> On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
>> Well, I think we should probably figure out why that is happening first.
>
> I got my hands on a tcpdump trace. I've trimmed it to this:
>
>
> No. Time Source Destination Protocol Length Info
> 1 0.000000 10.228.194.30 10.228.38.214 NVMe 138 NVMe Write
> 2 0.000285 10.228.38.214 10.228.194.30 NVMe/TCP 90 Ready To Transfer
> 3 0.000591 10.228.194.30 10.228.38.214 NVMe 4186 NVMe Write: Data
> 4 0.000673 10.228.38.214 10.228.194.30 TCP 66 4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 TSecr=1497295579
> 5 0.002140 10.228.38.214 10.228.194.30 NVMe 90 NVMe Write: Response
> 6 0.002511 10.228.194.30 10.228.38.175 NVMe 138 NVMe Write
> 7 0.002812 10.228.38.175 10.228.194.30 NVMe/TCP 90 Ready To Transfer
> 8 0.003006 10.228.194.30 10.228.38.175 NVMe 4186 NVMe Write: Data
> 9 0.003098 10.228.38.175 10.228.194.30 TCP 66 4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 TSecr=3601034207
> 10 0.004420 10.228.38.175 10.228.194.30 NVMe 90 NVMe Write: Response
> 11 0.004890 10.228.38.214 10.228.194.30 NVMe/TCP 90
> 12 0.004969 10.228.38.175 10.228.194.30 NVMe/TCP 90
>
>
> The last few seconds contain only normal writes and suddenly the host
> receives two invalid packets. From what I see the host doesn't misbehave
> at all. I wonder if it would be possible to detect the invalid packet by
> locking at the PDU header only. If this would be possible we could
> discard it early and do not try to use the invalid command id.
>
> Here the last two packets with details:
>
>
> No. Time Source Destination Protocol Length Info
> 11 0.004890 10.228.38.214 10.228.194.30 NVMe/TCP 90
>
> Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> Encapsulation type: Ethernet (1)
> Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
> [Time shift for this packet: 0.000000000 seconds]
> Epoch Time: 1614100568.780574000 seconds
> [Time delta from previous captured frame: 0.000470000 seconds]
> [Time delta from previous displayed frame: 0.000470000 seconds]
> [Time since reference or first frame: 0.004890000 seconds]
> Frame Number: 11
> Frame Length: 90 bytes (720 bits)
> Capture Length: 90 bytes (720 bits)
> [Frame is marked: False]
> [Frame is ignored: False]
> [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> [Coloring Rule Name: TCP]
> [Coloring Rule String: tcp]
> Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> Type: IPv4 (0x0800)
> Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
> 0100 .... = Version: 4
> .... 0101 = Header Length: 20 bytes (5)
> Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> 0000 00.. = Differentiated Services Codepoint: Default (0)
> .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
> Total Length: 76
> Identification: 0x0000 (0)
> Flags: 0x40, Don't fragment
> 0... .... = Reserved bit: Not set
> .1.. .... = Don't fragment: Set
> ..0. .... = More fragments: Not set
> Fragment Offset: 0
> Time to Live: 64
> Protocol: TCP (6)
> Header Checksum: 0x0000 [validation disabled]
> [Header checksum status: Unverified]
> Source Address: 10.228.38.214
> Destination Address: 10.228.194.30
> Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, Len: 24
> Source Port: 4420
> Destination Port: 46909
> [Stream index: 2]
> [TCP Segment Len: 24]
> Sequence Number: 1 (relative sequence number)
> Sequence Number (raw): 4175488220
> [Next Sequence Number: 25 (relative sequence number)]
> Acknowledgment Number: 1 (relative ack number)
> Acknowledgment number (raw): 2966903626
> 1000 .... = Header Length: 32 bytes (8)
> Flags: 0x018 (PSH, ACK)
> 000. .... .... = Reserved: Not set
> ...0 .... .... = Nonce: Not set
> .... 0... .... = Congestion Window Reduced (CWR): Not set
> .... .0.. .... = ECN-Echo: Not set
> .... ..0. .... = Urgent: Not set
> .... ...1 .... = Acknowledgment: Set
> .... .... 1... = Push: Set
> .... .... .0.. = Reset: Not set
> .... .... ..0. = Syn: Not set
> .... .... ...0 = Fin: Not set
> [TCP Flags: ·······AP···]
> Window: 257
> [Calculated window size: 257]
> [Window size scaling factor: -1 (unknown)]
> Checksum: 0xfefa [unverified]
> [Checksum Status: Unverified]
> Urgent Pointer: 0
> Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
> TCP Option - No-Operation (NOP)
> Kind: No-Operation (1)
> TCP Option - No-Operation (NOP)
> Kind: No-Operation (1)
> TCP Option - Timestamps: TSval 4211986351, TSecr 1497291787
> Kind: Time Stamp Option (8)
> Length: 10
> Timestamp value: 4211986351
> Timestamp echo reply: 1497291787
> [SEQ/ACK analysis]
> [Bytes in flight: 24]
> [Bytes sent since last PSH flag: 24]
> [Timestamps]
> [Time since first frame in this TCP stream: 0.000000000 seconds]
> [Time since previous frame in this TCP stream: 0.000000000 seconds]
> TCP payload (24 bytes)
> [PDU Size: 24]
> NVM Express Fabrics TCP
> Pdu Type: CapsuleResponse (5)
> Pdu Specific Flags: 0x00
> .... ...0 = PDU Header Digest: Not set
> .... ..0. = PDU Data Digest: Not set
> .... .0.. = PDU Data Last: Not set
> .... 0... = PDU Data Success: Not set
> Pdu Header Length: 24
> Pdu Data Offset: 0
> Packet Length: 24
> Unknown Data: 02000400000000001c0000001f000000
>
> 0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
> 0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 d6 0a e4 .L..@.@.....&...
> 0020 c2 1e 11 44 b7 3d f8 e0 e4 dc b0 d7 5b 4a 80 18 ...D.=......[J..
> 0030 01 01 fe fa 00 00 01 01 08 0a fb 0d cf af 59 3e ..............Y>
> 0040 dc 0b 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ................
> 0050 00 00 1c 00 00 00 1f 00 00 00 ..........
>
> No. Time Source Destination Protocol Length Info
> 12 0.004969 10.228.38.175 10.228.194.30 NVMe/TCP 90
>
> Frame 12: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> Encapsulation type: Ethernet (1)
> Arrival Time: Feb 23, 2021 18:16:08.780653000 CET
> [Time shift for this packet: 0.000000000 seconds]
> Epoch Time: 1614100568.780653000 seconds
> [Time delta from previous captured frame: 0.000079000 seconds]
> [Time delta from previous displayed frame: 0.000079000 seconds]
> [Time since reference or first frame: 0.004969000 seconds]
> Frame Number: 12
> Frame Length: 90 bytes (720 bits)
> Capture Length: 90 bytes (720 bits)
> [Frame is marked: False]
> [Frame is ignored: False]
> [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> [Coloring Rule Name: TCP]
> [Coloring Rule String: tcp]
> Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> Type: IPv4 (0x0800)
> Internet Protocol Version 4, Src: 10.228.38.175, Dst: 10.228.194.30
> 0100 .... = Version: 4
> .... 0101 = Header Length: 20 bytes (5)
> Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> 0000 00.. = Differentiated Services Codepoint: Default (0)
> .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
> Total Length: 76
> Identification: 0x0000 (0)
> Flags: 0x40, Don't fragment
> 0... .... = Reserved bit: Not set
> .1.. .... = Don't fragment: Set
> ..0. .... = More fragments: Not set
> Fragment Offset: 0
> Time to Live: 64
> Protocol: TCP (6)
> Header Checksum: 0x0000 [validation disabled]
> [Header checksum status: Unverified]
> Source Address: 10.228.38.175
> Destination Address: 10.228.194.30
> Transmission Control Protocol, Src Port: 4420, Dst Port: 34895, Seq: 1, Ack: 1, Len: 24
> Source Port: 4420
> Destination Port: 34895
> [Stream index: 3]
> [TCP Segment Len: 24]
> Sequence Number: 1 (relative sequence number)
> Sequence Number (raw): 3092812012
> [Next Sequence Number: 25 (relative sequence number)]
> Acknowledgment Number: 1 (relative ack number)
> Acknowledgment number (raw): 2384147181
> 1000 .... = Header Length: 32 bytes (8)
> Flags: 0x018 (PSH, ACK)
> 000. .... .... = Reserved: Not set
> ...0 .... .... = Nonce: Not set
> .... 0... .... = Congestion Window Reduced (CWR): Not set
> .... .0.. .... = ECN-Echo: Not set
> .... ..0. .... = Urgent: Not set
> .... ...1 .... = Acknowledgment: Set
> .... .... 1... = Push: Set
> .... .... .0.. = Reset: Not set
> .... .... ..0. = Syn: Not set
> .... .... ...0 = Fin: Not set
> [TCP Flags: ·······AP···]
> Window: 257
> [Calculated window size: 257]
> [Window size scaling factor: -1 (unknown)]
> Checksum: 0xfed3 [unverified]
> [Checksum Status: Unverified]
> Urgent Pointer: 0
> Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
> TCP Option - No-Operation (NOP)
> Kind: No-Operation (1)
> TCP Option - No-Operation (NOP)
> Kind: No-Operation (1)
> TCP Option - Timestamps: TSval 3874335934, TSecr 3601030412
> Kind: Time Stamp Option (8)
> Length: 10
> Timestamp value: 3874335934
> Timestamp echo reply: 3601030412
> [SEQ/ACK analysis]
> [Bytes in flight: 24]
> [Bytes sent since last PSH flag: 24]
> [Timestamps]
> [Time since first frame in this TCP stream: 0.000000000 seconds]
> [Time since previous frame in this TCP stream: 0.000000000 seconds]
> TCP payload (24 bytes)
> [PDU Size: 24]
> NVM Express Fabrics TCP
> Pdu Type: CapsuleResponse (5)
> Pdu Specific Flags: 0x00
> .... ...0 = PDU Header Digest: Not set
> .... ..0. = PDU Data Digest: Not set
> .... .0.. = PDU Data Last: Not set
> .... 0... = PDU Data Success: Not set
> Pdu Header Length: 24
> Pdu Data Offset: 0
> Packet Length: 24
> Unknown Data: 02000400000000001b0000001f000000
>
> 0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
> 0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4 .L..@.@.....&...
> 0020 c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18 ...D.O.X....2...
> 0030 01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3 ................
> 0040 5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ]...............
> 0050 00 00 1b 00 00 00 1f 00 00 00 ..........
>
As I suspected, we did receive an invalid frame.
Data digest would have saved us, but then it's not enabled.

So we do need to check if the request is valid before processing it.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2021-02-26 16:16:01

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:
> On 2/26/21 1:35 PM, Daniel Wagner wrote:
> > On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
> > > Well, I think we should probably figure out why that is happening first.
> >
> > I got my hands on a tcpdump trace. I've trimmed it to this:
> >
> >
> > No. Time Source Destination Protocol Length Info
> > 1 0.000000 10.228.194.30 10.228.38.214 NVMe 138 NVMe Write
> > 2 0.000285 10.228.38.214 10.228.194.30 NVMe/TCP 90 Ready To Transfer
> > 3 0.000591 10.228.194.30 10.228.38.214 NVMe 4186 NVMe Write: Data
> > 4 0.000673 10.228.38.214 10.228.194.30 TCP 66 4420 → 58535 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2655324576 TSecr=1497295579
> > 5 0.002140 10.228.38.214 10.228.194.30 NVMe 90 NVMe Write: Response
> > 6 0.002511 10.228.194.30 10.228.38.175 NVMe 138 NVMe Write
> > 7 0.002812 10.228.38.175 10.228.194.30 NVMe/TCP 90 Ready To Transfer
> > 8 0.003006 10.228.194.30 10.228.38.175 NVMe 4186 NVMe Write: Data
> > 9 0.003098 10.228.38.175 10.228.194.30 TCP 66 4420 → 51241 [ACK] Seq=25 Ack=4193 Win=241 Len=0 TSval=2183410196 TSecr=3601034207
> > 10 0.004420 10.228.38.175 10.228.194.30 NVMe 90 NVMe Write: Response
> > 11 0.004890 10.228.38.214 10.228.194.30 NVMe/TCP 90
> > 12 0.004969 10.228.38.175 10.228.194.30 NVMe/TCP 90
> >
> >
> > The last few seconds contain only normal writes and suddenly the host
> > receives two invalid packets. From what I see the host doesn't misbehave
> > at all. I wonder if it would be possible to detect the invalid packet by
> > locking at the PDU header only. If this would be possible we could
> > discard it early and do not try to use the invalid command id.
> >
> > Here the last two packets with details:
> >
> >
> > No. Time Source Destination Protocol Length Info
> > 11 0.004890 10.228.38.214 10.228.194.30 NVMe/TCP 90
> >
> > Frame 11: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> > Encapsulation type: Ethernet (1)
> > Arrival Time: Feb 23, 2021 18:16:08.780574000 CET
> > [Time shift for this packet: 0.000000000 seconds]
> > Epoch Time: 1614100568.780574000 seconds
> > [Time delta from previous captured frame: 0.000470000 seconds]
> > [Time delta from previous displayed frame: 0.000470000 seconds]
> > [Time since reference or first frame: 0.004890000 seconds]
> > Frame Number: 11
> > Frame Length: 90 bytes (720 bits)
> > Capture Length: 90 bytes (720 bits)
> > [Frame is marked: False]
> > [Frame is ignored: False]
> > [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> > [Coloring Rule Name: TCP]
> > [Coloring Rule String: tcp]
> > Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> > Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> > Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> > .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> > .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> > Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> > Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> > .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> > .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> > Type: IPv4 (0x0800)
> > Internet Protocol Version 4, Src: 10.228.38.214, Dst: 10.228.194.30
> > 0100 .... = Version: 4
> > .... 0101 = Header Length: 20 bytes (5)
> > Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> > 0000 00.. = Differentiated Services Codepoint: Default (0)
> > .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
> > Total Length: 76
> > Identification: 0x0000 (0)
> > Flags: 0x40, Don't fragment
> > 0... .... = Reserved bit: Not set
> > .1.. .... = Don't fragment: Set
> > ..0. .... = More fragments: Not set
> > Fragment Offset: 0
> > Time to Live: 64
> > Protocol: TCP (6)
> > Header Checksum: 0x0000 [validation disabled]
> > [Header checksum status: Unverified]
> > Source Address: 10.228.38.214
> > Destination Address: 10.228.194.30
> > Transmission Control Protocol, Src Port: 4420, Dst Port: 46909, Seq: 1, Ack: 1, Len: 24
> > Source Port: 4420
> > Destination Port: 46909
> > [Stream index: 2]
> > [TCP Segment Len: 24]
> > Sequence Number: 1 (relative sequence number)
> > Sequence Number (raw): 4175488220
> > [Next Sequence Number: 25 (relative sequence number)]
> > Acknowledgment Number: 1 (relative ack number)
> > Acknowledgment number (raw): 2966903626
> > 1000 .... = Header Length: 32 bytes (8)
> > Flags: 0x018 (PSH, ACK)
> > 000. .... .... = Reserved: Not set
> > ...0 .... .... = Nonce: Not set
> > .... 0... .... = Congestion Window Reduced (CWR): Not set
> > .... .0.. .... = ECN-Echo: Not set
> > .... ..0. .... = Urgent: Not set
> > .... ...1 .... = Acknowledgment: Set
> > .... .... 1... = Push: Set
> > .... .... .0.. = Reset: Not set
> > .... .... ..0. = Syn: Not set
> > .... .... ...0 = Fin: Not set
> > [TCP Flags: ·······AP···]
> > Window: 257
> > [Calculated window size: 257]
> > [Window size scaling factor: -1 (unknown)]
> > Checksum: 0xfefa [unverified]
> > [Checksum Status: Unverified]
> > Urgent Pointer: 0
> > Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
> > TCP Option - No-Operation (NOP)
> > Kind: No-Operation (1)
> > TCP Option - No-Operation (NOP)
> > Kind: No-Operation (1)
> > TCP Option - Timestamps: TSval 4211986351, TSecr 1497291787
> > Kind: Time Stamp Option (8)
> > Length: 10
> > Timestamp value: 4211986351
> > Timestamp echo reply: 1497291787
> > [SEQ/ACK analysis]
> > [Bytes in flight: 24]
> > [Bytes sent since last PSH flag: 24]
> > [Timestamps]
> > [Time since first frame in this TCP stream: 0.000000000 seconds]
> > [Time since previous frame in this TCP stream: 0.000000000 seconds]
> > TCP payload (24 bytes)
> > [PDU Size: 24]
> > NVM Express Fabrics TCP
> > Pdu Type: CapsuleResponse (5)
> > Pdu Specific Flags: 0x00
> > .... ...0 = PDU Header Digest: Not set
> > .... ..0. = PDU Data Digest: Not set
> > .... .0.. = PDU Data Last: Not set
> > .... 0... = PDU Data Success: Not set
> > Pdu Header Length: 24
> > Pdu Data Offset: 0
> > Packet Length: 24
> > Unknown Data: 02000400000000001c0000001f000000
> >
> > 0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
> > 0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 d6 0a e4 .L..@.@.....&...
> > 0020 c2 1e 11 44 b7 3d f8 e0 e4 dc b0 d7 5b 4a 80 18 ...D.=......[J..
> > 0030 01 01 fe fa 00 00 01 01 08 0a fb 0d cf af 59 3e ..............Y>
> > 0040 dc 0b 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ................
> > 0050 00 00 1c 00 00 00 1f 00 00 00 ..........
> >
> > No. Time Source Destination Protocol Length Info
> > 12 0.004969 10.228.38.175 10.228.194.30 NVMe/TCP 90
> >
> > Frame 12: 90 bytes on wire (720 bits), 90 bytes captured (720 bits)
> > Encapsulation type: Ethernet (1)
> > Arrival Time: Feb 23, 2021 18:16:08.780653000 CET
> > [Time shift for this packet: 0.000000000 seconds]
> > Epoch Time: 1614100568.780653000 seconds
> > [Time delta from previous captured frame: 0.000079000 seconds]
> > [Time delta from previous displayed frame: 0.000079000 seconds]
> > [Time since reference or first frame: 0.004969000 seconds]
> > Frame Number: 12
> > Frame Length: 90 bytes (720 bits)
> > Capture Length: 90 bytes (720 bits)
> > [Frame is marked: False]
> > [Frame is ignored: False]
> > [Protocols in frame: eth:ethertype:ip:tcp:nvme-tcp]
> > [Coloring Rule Name: TCP]
> > [Coloring Rule String: tcp]
> > Ethernet II, Src: IntelCor_41:16:c0 (b4:96:91:41:16:c0), Dst: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> > Destination: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> > Address: Cisco_9f:f5:a8 (00:00:0c:9f:f5:a8)
> > .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> > .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> > Source: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> > Address: IntelCor_41:16:c0 (b4:96:91:41:16:c0)
> > .... ..0. .... .... .... .... = LG bit: Globally unique address (factory default)
> > .... ...0 .... .... .... .... = IG bit: Individual address (unicast)
> > Type: IPv4 (0x0800)
> > Internet Protocol Version 4, Src: 10.228.38.175, Dst: 10.228.194.30
> > 0100 .... = Version: 4
> > .... 0101 = Header Length: 20 bytes (5)
> > Differentiated Services Field: 0x00 (DSCP: CS0, ECN: Not-ECT)
> > 0000 00.. = Differentiated Services Codepoint: Default (0)
> > .... ..00 = Explicit Congestion Notification: Not ECN-Capable Transport (0)
> > Total Length: 76
> > Identification: 0x0000 (0)
> > Flags: 0x40, Don't fragment
> > 0... .... = Reserved bit: Not set
> > .1.. .... = Don't fragment: Set
> > ..0. .... = More fragments: Not set
> > Fragment Offset: 0
> > Time to Live: 64
> > Protocol: TCP (6)
> > Header Checksum: 0x0000 [validation disabled]
> > [Header checksum status: Unverified]
> > Source Address: 10.228.38.175
> > Destination Address: 10.228.194.30
> > Transmission Control Protocol, Src Port: 4420, Dst Port: 34895, Seq: 1, Ack: 1, Len: 24
> > Source Port: 4420
> > Destination Port: 34895
> > [Stream index: 3]
> > [TCP Segment Len: 24]
> > Sequence Number: 1 (relative sequence number)
> > Sequence Number (raw): 3092812012
> > [Next Sequence Number: 25 (relative sequence number)]
> > Acknowledgment Number: 1 (relative ack number)
> > Acknowledgment number (raw): 2384147181
> > 1000 .... = Header Length: 32 bytes (8)
> > Flags: 0x018 (PSH, ACK)
> > 000. .... .... = Reserved: Not set
> > ...0 .... .... = Nonce: Not set
> > .... 0... .... = Congestion Window Reduced (CWR): Not set
> > .... .0.. .... = ECN-Echo: Not set
> > .... ..0. .... = Urgent: Not set
> > .... ...1 .... = Acknowledgment: Set
> > .... .... 1... = Push: Set
> > .... .... .0.. = Reset: Not set
> > .... .... ..0. = Syn: Not set
> > .... .... ...0 = Fin: Not set
> > [TCP Flags: ·······AP···]
> > Window: 257
> > [Calculated window size: 257]
> > [Window size scaling factor: -1 (unknown)]
> > Checksum: 0xfed3 [unverified]
> > [Checksum Status: Unverified]
> > Urgent Pointer: 0
> > Options: (12 bytes), No-Operation (NOP), No-Operation (NOP), Timestamps
> > TCP Option - No-Operation (NOP)
> > Kind: No-Operation (1)
> > TCP Option - No-Operation (NOP)
> > Kind: No-Operation (1)
> > TCP Option - Timestamps: TSval 3874335934, TSecr 3601030412
> > Kind: Time Stamp Option (8)
> > Length: 10
> > Timestamp value: 3874335934
> > Timestamp echo reply: 3601030412
> > [SEQ/ACK analysis]
> > [Bytes in flight: 24]
> > [Bytes sent since last PSH flag: 24]
> > [Timestamps]
> > [Time since first frame in this TCP stream: 0.000000000 seconds]
> > [Time since previous frame in this TCP stream: 0.000000000 seconds]
> > TCP payload (24 bytes)
> > [PDU Size: 24]
> > NVM Express Fabrics TCP
> > Pdu Type: CapsuleResponse (5)
> > Pdu Specific Flags: 0x00
> > .... ...0 = PDU Header Digest: Not set
> > .... ..0. = PDU Data Digest: Not set
> > .... .0.. = PDU Data Last: Not set
> > .... 0... = PDU Data Success: Not set
> > Pdu Header Length: 24
> > Pdu Data Offset: 0
> > Packet Length: 24
> > Unknown Data: 02000400000000001b0000001f000000
> >
> > 0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
> > 0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4 .L..@.@.....&...
> > 0020 c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18 ...D.O.X....2...
> > 0030 01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3 ................
> > 0040 5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ]...............
> > 0050 00 00 1b 00 00 00 1f 00 00 00 ..........
> >
> As I suspected, we did receive an invalid frame.
> Data digest would have saved us, but then it's not enabled.
>
> So we do need to check if the request is valid before processing it.

That's just addressing a symptom. You can't fully verify the request is
valid this way because the host could have started the same command ID
the very moment before the code checks it, incorrectly completing an
in-flight command and getting data corruption.

2021-02-26 16:45:15

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 2/26/21 5:13 PM, Keith Busch wrote:
> On Fri, Feb 26, 2021 at 01:54:00PM +0100, Hannes Reinecke wrote:
>> On 2/26/21 1:35 PM, Daniel Wagner wrote:
>>> On Mon, Feb 15, 2021 at 01:29:45PM -0800, Sagi Grimberg wrote:
>>>> Well, I think we should probably figure out why that is happening first.
>>>
>>> I got my hands on a tcpdump trace. I've trimmed it to this:
>>>
[ .. ]
>>> NVM Express Fabrics TCP
>>> Pdu Type: CapsuleResponse (5)
>>> Pdu Specific Flags: 0x00
>>> .... ...0 = PDU Header Digest: Not set
>>> .... ..0. = PDU Data Digest: Not set
>>> .... .0.. = PDU Data Last: Not set
>>> .... 0... = PDU Data Success: Not set
>>> Pdu Header Length: 24
>>> Pdu Data Offset: 0
>>> Packet Length: 24
>>> Unknown Data: 02000400000000001b0000001f000000
>>>
>>> 0000 00 00 0c 9f f5 a8 b4 96 91 41 16 c0 08 00 45 00 .........A....E.
>>> 0010 00 4c 00 00 40 00 40 06 00 00 0a e4 26 af 0a e4 .L..@.@.....&...
>>> 0020 c2 1e 11 44 88 4f b8 58 90 ec 8e 1b 32 ed 80 18 ...D.O.X....2...
>>> 0030 01 01 fe d3 00 00 01 01 08 0a e6 ed ac be d6 a3 ................
>>> 0040 5d 0c 05 00 18 00 18 00 00 00 02 00 04 00 00 00 ]...............
>>> 0050 00 00 1b 00 00 00 1f 00 00 00 ..........
>>>
>> As I suspected, we did receive an invalid frame.
>> Data digest would have saved us, but then it's not enabled.
>>
>> So we do need to check if the request is valid before processing it.
>
> That's just addressing a symptom. You can't fully verify the request is
> valid this way because the host could have started the same command ID
> the very moment before the code checks it, incorrectly completing an
> in-flight command and getting data corruption.
>
Oh, I am fully aware.

Bad frames are just that, bad frames.
We can only fully validate that when digests are enabled, but I gather
that controllers sending out bad frames wouldn't want to enable digests,
either. So relying on that is possibly not an option.

So really what I'm trying to avoid is the host crashing on a bad frame.
That kind of thing always resonates bad with customers.
And tripping over an uninitialized command is just too stupid IMO.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2021-02-26 17:21:07

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Fri, Feb 26, 2021 at 05:42:46PM +0100, Hannes Reinecke wrote:
> On 2/26/21 5:13 PM, Keith Busch wrote:
> >
> > That's just addressing a symptom. You can't fully verify the request is
> > valid this way because the host could have started the same command ID
> > the very moment before the code checks it, incorrectly completing an
> > in-flight command and getting data corruption.
> >
> Oh, I am fully aware.
>
> Bad frames are just that, bad frames.
> We can only fully validate that when digests are enabled, but I gather that
> controllers sending out bad frames wouldn't want to enable digests, either.
> So relying on that is possibly not an option.
>
> So really what I'm trying to avoid is the host crashing on a bad frame.
> That kind of thing always resonates bad with customers.
> And tripping over an uninitialized command is just too stupid IMO.

Crashing is bad, silent data corruption is worse. Is there truly no
defense against that? If not, why should anyone rely on this?

2021-03-01 16:12:35

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >> Crashing is bad, silent data corruption is worse. Is there truly no
> >> defense against that? If not, why should anyone rely on this?
> >
> > If we receive an response for which we don't have a started request, we
> > know that something is wrong. Couldn't we in just reset the connection
> > in this case? We don't have to pretend nothing has happened and
> > continuing normally. This would avoid a host crash and would not create
> > (more) data corruption. Or I am just too naive?
> >
> This is actually a sensible solution.
> Please send a patch for that.

Is a bad frame a problem that can be resolved with a reset?

Even if so, the reset doesn't indicate to the user if previous commands
completed with bad data, so it still seems unreliable.

2021-03-01 19:37:34

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 3/1/21 5:05 PM, Keith Busch wrote:
> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
>> On 3/1/21 2:26 PM, Daniel Wagner wrote:
>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>>>> Crashing is bad, silent data corruption is worse. Is there truly no
>>>> defense against that? If not, why should anyone rely on this?
>>>
>>> If we receive an response for which we don't have a started request, we
>>> know that something is wrong. Couldn't we in just reset the connection
>>> in this case? We don't have to pretend nothing has happened and
>>> continuing normally. This would avoid a host crash and would not create
>>> (more) data corruption. Or I am just too naive?
>>>
>> This is actually a sensible solution.
>> Please send a patch for that.
>
> Is a bad frame a problem that can be resolved with a reset?
>
> Even if so, the reset doesn't indicate to the user if previous commands
> completed with bad data, so it still seems unreliable.
>
We need to distinguish two cases here.
The one is use receiving a frame with an invalid tag, leading to a
crash. This can be easily resolved by issuing a reset, as clearly the
command was garbage and we need to invoke error handling (which is reset).

The other case is us receiving a frame with a _duplicate_ tag, ie a tag
which is _currently_ valid. This is a case which will fail _even now_,
as we have simply no way of detecting this.

So what again do we miss by fixing the first case?
Apart from a system which does _not_ crash?

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), Geschäftsführer: Felix Imendörffer

2021-03-02 20:39:46

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 3/1/21 9:59 PM, Keith Busch wrote:
> On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
>> On 3/1/21 5:05 PM, Keith Busch wrote:
>>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
>>>> On 3/1/21 2:26 PM, Daniel Wagner wrote:
>>>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>>>>>> Crashing is bad, silent data corruption is worse. Is there truly no
>>>>>> defense against that? If not, why should anyone rely on this?
>>>>>
>>>>> If we receive an response for which we don't have a started request, we
>>>>> know that something is wrong. Couldn't we in just reset the connection
>>>>> in this case? We don't have to pretend nothing has happened and
>>>>> continuing normally. This would avoid a host crash and would not create
>>>>> (more) data corruption. Or I am just too naive?
>>>>>
>>>> This is actually a sensible solution.
>>>> Please send a patch for that.
>>>
>>> Is a bad frame a problem that can be resolved with a reset?
>>>
>>> Even if so, the reset doesn't indicate to the user if previous commands
>>> completed with bad data, so it still seems unreliable.
>>>
>> We need to distinguish two cases here.
>> The one is use receiving a frame with an invalid tag, leading to a crash.
>> This can be easily resolved by issuing a reset, as clearly the command was
>> garbage and we need to invoke error handling (which is reset).
>>
>> The other case is us receiving a frame with a _duplicate_ tag, ie a tag
>> which is _currently_ valid. This is a case which will fail _even now_, as we
>> have simply no way of detecting this.
>>
>> So what again do we miss by fixing the first case?
>> Apart from a system which does _not_ crash?
>
> I'm just saying each case is a symptom of the same problem. The only
> difference from observing one vs the other is a race with the host's
> dispatch. And since you're proposing this patch, it sounds like this
> condition does happen on tcp compared to other transports where we don't
> observe it. I just thought the implication that data corruption happens
> is a alarming.
>
Oh yes, it is.
But sadly TCP inherently suffers from this, as literally anyone can
spoof frames on the network.
Other transports like RDMA or FC do not suffer to that extend as
spoofing frames there is far more elaborate, and not really possible
without dedicated hardware equipment.

That's why there is header and data digest; that will protect you
against accidental frame corruption (as this case clearly is; the
remainder of the frame is filled with zeroes).
It would not protect you against deliberate frame corruption; that's why
there is TPAR 8010 (TLS encryption for NVMe-TCP).

Be it as it may, none of these methods are in use here, and none of
these methods can be made mandatory. So we need to deal with the case at
hand.

And in my opinion crashing is the _worst_ options of all.
Tear the connection down, reset the thing, whatever.

But do not crash.
Customers tend to have a very dim view on crashing machines, and have a
very limited capacity for being susceptible to our reasoning in these cases.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

2021-03-03 03:27:15

by Daniel Wagner

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> Crashing is bad, silent data corruption is worse. Is there truly no
> defense against that? If not, why should anyone rely on this?

If we receive an response for which we don't have a started request, we
know that something is wrong. Couldn't we in just reset the connection
in this case? We don't have to pretend nothing has happened and
continuing normally. This would avoid a host crash and would not create
(more) data corruption. Or I am just too naive?

2021-03-03 03:40:02

by Hannes Reinecke

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On 3/1/21 2:26 PM, Daniel Wagner wrote:
> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
>> Crashing is bad, silent data corruption is worse. Is there truly no
>> defense against that? If not, why should anyone rely on this?
>
> If we receive an response for which we don't have a started request, we
> know that something is wrong. Couldn't we in just reset the connection
> in this case? We don't have to pretend nothing has happened and
> continuing normally. This would avoid a host crash and would not create
> (more) data corruption. Or I am just too naive?
>
This is actually a sensible solution.
Please send a patch for that.

Cheers,

Hannes
--
Dr. Hannes Reinecke Kernel Storage Architect
[email protected] +49 911 74053 688
SUSE Software Solutions Germany GmbH, Maxfeldstr. 5, 90409 Nürnberg
HRB 36809 (AG Nürnberg), GF: Felix Imendörffer

2021-03-04 05:23:01

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> On 3/1/21 5:05 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> > > On 3/1/21 2:26 PM, Daniel Wagner wrote:
> > > > On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> > > > > Crashing is bad, silent data corruption is worse. Is there truly no
> > > > > defense against that? If not, why should anyone rely on this?
> > > >
> > > > If we receive an response for which we don't have a started request, we
> > > > know that something is wrong. Couldn't we in just reset the connection
> > > > in this case? We don't have to pretend nothing has happened and
> > > > continuing normally. This would avoid a host crash and would not create
> > > > (more) data corruption. Or I am just too naive?
> > > >
> > > This is actually a sensible solution.
> > > Please send a patch for that.
> >
> > Is a bad frame a problem that can be resolved with a reset?
> >
> > Even if so, the reset doesn't indicate to the user if previous commands
> > completed with bad data, so it still seems unreliable.
> >
> We need to distinguish two cases here.
> The one is use receiving a frame with an invalid tag, leading to a crash.
> This can be easily resolved by issuing a reset, as clearly the command was
> garbage and we need to invoke error handling (which is reset).
>
> The other case is us receiving a frame with a _duplicate_ tag, ie a tag
> which is _currently_ valid. This is a case which will fail _even now_, as we
> have simply no way of detecting this.
>
> So what again do we miss by fixing the first case?
> Apart from a system which does _not_ crash?

I'm just saying each case is a symptom of the same problem. The only
difference from observing one vs the other is a race with the host's
dispatch. And since you're proposing this patch, it sounds like this
condition does happen on tcp compared to other transports where we don't
observe it. I just thought the implication that data corruption happens
is a alarming.

2021-03-04 20:29:46

by Keith Busch

[permalink] [raw]
Subject: Re: [PATCH] nvme-tcp: Check if request has started before processing it

On Tue, Mar 02, 2021 at 08:18:40AM +0100, Hannes Reinecke wrote:
> On 3/1/21 9:59 PM, Keith Busch wrote:
> > On Mon, Mar 01, 2021 at 05:53:25PM +0100, Hannes Reinecke wrote:
> >> On 3/1/21 5:05 PM, Keith Busch wrote:
> >>> On Mon, Mar 01, 2021 at 02:55:30PM +0100, Hannes Reinecke wrote:
> >>>> On 3/1/21 2:26 PM, Daniel Wagner wrote:
> >>>>> On Sat, Feb 27, 2021 at 02:19:01AM +0900, Keith Busch wrote:
> >>>>>> Crashing is bad, silent data corruption is worse. Is there truly no
> >>>>>> defense against that? If not, why should anyone rely on this?
> >>>>>
> >>>>> If we receive an response for which we don't have a started request, we
> >>>>> know that something is wrong. Couldn't we in just reset the connection
> >>>>> in this case? We don't have to pretend nothing has happened and
> >>>>> continuing normally. This would avoid a host crash and would not create
> >>>>> (more) data corruption. Or I am just too naive?
> >>>>>
> >>>> This is actually a sensible solution.
> >>>> Please send a patch for that.
> >>>
> >>> Is a bad frame a problem that can be resolved with a reset?
> >>>
> >>> Even if so, the reset doesn't indicate to the user if previous commands
> >>> completed with bad data, so it still seems unreliable.
> >>>
> >> We need to distinguish two cases here.
> >> The one is use receiving a frame with an invalid tag, leading to a crash.
> >> This can be easily resolved by issuing a reset, as clearly the command was
> >> garbage and we need to invoke error handling (which is reset).
> >>
> >> The other case is us receiving a frame with a _duplicate_ tag, ie a tag
> >> which is _currently_ valid. This is a case which will fail _even now_, as we
> >> have simply no way of detecting this.
> >>
> >> So what again do we miss by fixing the first case?
> >> Apart from a system which does _not_ crash?
> >
> > I'm just saying each case is a symptom of the same problem. The only
> > difference from observing one vs the other is a race with the host's
> > dispatch. And since you're proposing this patch, it sounds like this
> > condition does happen on tcp compared to other transports where we don't
> > observe it. I just thought the implication that data corruption happens
> > is a alarming.
> >
> Oh yes, it is.
> But sadly TCP inherently suffers from this, as literally anyone can
> spoof frames on the network.
> Other transports like RDMA or FC do not suffer to that extend as
> spoofing frames there is far more elaborate, and not really possible
> without dedicated hardware equipment.
>
> That's why there is header and data digest; that will protect you
> against accidental frame corruption (as this case clearly is; the
> remainder of the frame is filled with zeroes).
> It would not protect you against deliberate frame corruption; that's why
> there is TPAR 8010 (TLS encryption for NVMe-TCP).
>
> Be it as it may, none of these methods are in use here, and none of
> these methods can be made mandatory. So we need to deal with the case at
> hand.
>
> And in my opinion crashing is the _worst_ options of all.
> Tear the connection down, reset the thing, whatever.
>
> But do not crash.
> Customers tend to have a very dim view on crashing machines, and have a
> very limited capacity for being susceptible to our reasoning in these cases.

I was pushing the data corruption angle because fixing that should
address both cases. If there's really nothing you can do about
corruption, your approach here makes sense, and I defer to Sagi and
Christoph for inclusion.

I still wouldn't trust my data to storage behaving this way, though. :)