Hi,
> If you know anything about the iSCSI RFC draft and how storage truly works:
> Cisco gets it wrong; they do not believe in supporting the full RFC.
> So you get ERL=0, and now they have turned off the "Header and Data Digests",
> which is equal to turning off the iCRC in ATA, or the CRC in SCSI between the
> controller and the device. For those people who think removing the
> checksum test for the integrity of the data and command operations is fine:
> you get what you deserve.
Ever heard of TCP checksums? Ever heard of ethernet checksums? Which
transport doesn't use checksums nowadays? The digest only makes sense if
you can generate it for free in hardware or for debugging; otherwise
it's only a waste of CPU time. This makes the complete ERL 1 irrelevant
for a software implementation. With block devices you can even get away
with just ERL 0 to implement transparent recovery.
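For reference, the digest being discarded here is a CRC32C. A minimal,
unoptimized bitwise sketch of it (reflected form; the table-driven or
hardware versions are what you'd actually run) shows exactly where the
CPU time goes:

    #include <stdint.h>
    #include <stddef.h>

    /* CRC32C (Castagnoli), the polynomial the iSCSI drafts pick for
     * the header and data digests.  Bitwise, reflected form; a real
     * implementation would be table-driven or done by the NIC. */
    uint32_t crc32c(const uint8_t *buf, size_t len)
    {
        uint32_t crc = 0xFFFFFFFFu;

        while (len--) {
            crc ^= *buf++;
            for (int i = 0; i < 8; i++)
                crc = (crc >> 1) ^ (0x82F63B78u & -(crc & 1u));
        }
        return ~crc;
    }

Whether that per-byte loop is "free" is purely a question of whether
your NIC computes it for you.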
bye, Roman
On Mon, 6 Jan 2003, Roman Zippel wrote:
> Hi,
>
> > If you know anything about the iSCSI RFC draft and how storage truly works:
> > Cisco gets it wrong; they do not believe in supporting the full RFC.
> > So you get ERL=0, and now they have turned off the "Header and Data Digests",
> > which is equal to turning off the iCRC in ATA, or the CRC in SCSI between the
> > controller and the device. For those people who think removing the
> > checksum test for the integrity of the data and command operations is fine:
> > you get what you deserve.
>
> Ever heard of TCP checksums? Ever heard of ethernet checksums? Which
> transport doesn't use checksums nowadays? The digest only makes sense if
> you can generate it for free in hardware or for debugging; otherwise
> it's only a waste of CPU time. This makes the complete ERL 1 irrelevant
> for a software implementation. With block devices you can even get away
> with just ERL 0 to implement transparent recovery.
Please continue to think of TCP checksums as valid for a data transport;
your data will be gone soon enough.
Initiator == Controller
Target == Disk
iSCSI == cable or ribbon
Please turn off the CRC on your disk drive and see if you still have data.
Cheers,
Andre Hedrick, CTO & Founder
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/
Hmm. The problem here is that there is a nontrivial probability that a
packet can pass both the ethernet and TCP checksums and still not be right,
given the gigantic volumes of data that iSCSI is intended to be used with.
Back up a 100 terabyte array and the chance of at least one undetected
error is more than 1%, back of the envelope.
Ethernet and TCP were both designed to be cheap to evaluate, not the
absolute last word in integrity. There is a move underway to provide an
optional stronger TCP digest for IPv6, and if used with that then there is
no need for the iSCSI digest. Otherwise, well, play dice with the data.
Loaded in your favour, but still dice.
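To make the envelope explicit -- every input below is an assumption --
suppose corruption sneaks in where no CRC covers it (say, in a router's
buffers after the frame check has passed) at a rate of one packet in
10^8, and that the 16-bit TCP checksum lets a corrupted packet through
with probability about 2^-16:

    #include <stdio.h>
    #include <math.h>

    int main(void)
    {
        double total_bytes  = 100e12;         /* the 100 TB backup        */
        double payload      = 1500.0;         /* data bytes per packet    */
        double corrupt_rate = 1e-8;           /* ASSUMED: corruption that */
                                              /* no link CRC ever covers  */
        double tcp_miss     = 1.0 / 65536.0;  /* ~2^-16 checksum escape   */

        double npkts   = total_bytes / payload;
        double p_dirty = 1.0 - pow(1.0 - corrupt_rate * tcp_miss, npkts);

        printf("packets sent:                 %.3g\n", npkts);
        printf("P(>=1 undetected corruption): %.4f\n", p_dirty);
        return 0;
    }

Those numbers put the 100 TB backup at right about a 1% chance of
silently corrupted data; a dirtier path scales it up proportionally.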
Andrew
--On Monday, January 06, 2003 17:51:13 +0100 Roman Zippel
<[email protected]> wrote:
> Hi,
>
>> If you know anything about the iSCSI RFC draft and how storage truly works:
>> Cisco gets it wrong; they do not believe in supporting the full RFC.
>> So you get ERL=0, and now they have turned off the "Header and Data Digests",
>> which is equal to turning off the iCRC in ATA, or the CRC in SCSI between the
>> controller and the device. For those people who think removing the
>> checksum test for the integrity of the data and command operations is fine:
>> you get what you deserve.
>
> Ever heard of TCP checksums? Ever heard of ethernet checksums? Which
> transport doesn't use checksums nowadays? The digest only makes sense if
> you can generate it for free in hardware or for debugging; otherwise
> it's only a waste of CPU time. This makes the complete ERL 1 irrelevant
> for a software implementation. With block devices you can even get away
> with just ERL 0 to implement transparent recovery.
>
> bye, Roman
>
On Tue, Jan 07, 2003 at 01:39:38PM +1300, Andrew McGregor wrote:
> Hmm. The problem here is that there is a nontrivial probability that a
> packet can pass both the ethernet and TCP checksums and still not be right,
> given the gigantic volumes of data that iSCSI is intended to be used with.
> Back up a 100 terabyte array and the chance of at least one undetected
> error is more than 1%, back of the envelope.
What was the underlying error rate and distribution you assumed? I
figure if it were high enough to get to your 1%, you'd have such high
retry rates (and resulting throughput loss) that the operator would
notice his LAN was broken weeks before said transfer completed.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
On Mon, 06 Jan 2003 22:20:46 CST, Oliver Xymoron said:
> What was the underlying error rate and distribution you assumed? I
> figure if it were high enough to get to your 1%, you'd have such high
> retry rates (and resulting throughput loss) that the operator would
> notice his LAN was broken weeks before said transfer completed.
The average ISP wouldn't notice things were broken unless enough magic
smoke escaped to cause a Halon dump.
Consider as evidence the following NANOG presentation:
http://www.nanog.org/mtg-0210/wessels.html
Some *98* percent of all queries at one of the root nameservers over a 24-hour
period were broken in some way. And there wasn't even a DDoS in progress
at the time...
Also, I think Andrew was computing the chances that *SOME* packet in the
100T would be mangled in an undetected fashion, so 99% of the time all 100T
would be OK, but 1% of the time there was some subtle block mangling some
dozens of terabytes into the transfer. Given that the TCP slow-start code
is currently busticated for gigabit and higher (it takes *hours* without a
packet drop to get the window open *all* the way - there are IETF drafts
in progress about this), it's quite possible that you'd not notice packet
drops due to error among all the congestion drops kicking the window size
down.....
--
Valdis Kletnieks
Computer Systems Senior Engineer
Virginia Tech
[email protected] wrote:
> is currently busticated for gigabit and higher (it takes *hours* without a
> packet drop to get the window open *all* the way
Don't use 2.0.21 for gigabit traffic :-)
(2.0.21 and earlier initialized ssthresh to zero, so the window only ever
grew linearly, which would indeed cause the behaviour you're describing.)
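To see the size of that effect, here is a toy per-RTT model (numbers
invented; the target window matches the 10 Gbps examples elsewhere in
this thread): below ssthresh the window doubles every round trip, above
it it grows by one segment per round trip, so ssthresh = 0 makes the
whole ramp-up linear.

    #include <stdio.h>

    /* Round trips needed to grow cwnd from 1 to `target` segments:
     * doubling below ssthresh (slow start), +1 segment per RTT above
     * it (congestion avoidance).  Toy model: no losses, no delayed
     * ACKs. */
    static long rtts_to_open(long ssthresh, long target)
    {
        long cwnd = 1, rtts = 0;

        while (cwnd < target) {
            cwnd = (cwnd < ssthresh) ? 2 * cwnd : cwnd + 1;
            rtts++;
        }
        return rtts;
    }

    int main(void)
    {
        long target = 35000;    /* ~10 Gbps * 250 ms in 9k segments */

        printf("ssthresh huge: %ld RTTs\n", rtts_to_open(1L << 30, target));
        printf("ssthresh = 0:  %ld RTTs\n", rtts_to_open(0, target));
        return 0;
    }

That's 16 round trips versus about 35,000; at a 250 ms RTT the latter
is about two and a half hours.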
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Tue, 07 Jan 2003 03:16:38 -0300, Werner Almesberger said:
> [email protected] wrote:
> > is currently busticated for gigabit and higher (it takes *hours* without a
> > packet drop to get the window open *all* the way
>
> Don't use 2.0.21 for gigabit traffic :-)
>
> (2.0.21 and earlier initialized ssthresh to zero, so the window only ever
> grew linearly, which would indeed cause the behaviour you're describing.)
That's not the problem. The problem is that TCP slow-start itself (and some of
the related congestion control stuff) has some issues scaling to the very high
end. Floyd (of RFC3168 fame) has done some work in the area:
from http://www.ietf.org/internet-drafts/draft-floyd-tcp-highspeed-01.txt
Abstract
This document proposes HighSpeed TCP, a modification to TCP's
congestion control mechanism for use with TCP connections with large
congestion windows. The congestion control mechanisms of the current
Standard TCP constrains the congestion windows that can be achieved
by TCP in realistic environments. For example, for a Standard TCP
connection with 1500-byte packets and a 100 ms round-trip time,
achieving a steady-state throughput of 10 Gbps would require an
average congestion window of 83,333 segments, and a packet drop rate
of at most one congestion event every 5,000,000,000 packets (or
equivalently, at most one congestion event every 1 2/3 hours). This
is widely acknowledged as an unrealistic constraint. To address this
limitation of TCP, this document proposes HighSpeed TCP, and solicits
experimentation and feedback from the wider community.
Also http://www.ietf.org/internet-drafts/draft-floyd-tcp-slowstart-01.txt
Abstract
This note proposes a modification for TCP's slow-start for use with
TCP connections with large congestion windows. For TCP connections
that are able to use congestion windows of thousands (or tens of
thousands) of MSS-sized segments (for MSS the sender's MAXIMUM
SEGMENT SIZE), the current slow-start procedure can result in
increasing the congestion window by thousands of segments in a single
round-trip time. Such an increase can easily result in thousands of
packets being dropped in one round-trip time. This is often counter-
productive for the TCP flow itself, and is also hard on the rest of
the traffic sharing the congested link. This note proposes Limited
Slow-Start as an optional mechanism for limiting the number of
segments by which the congestion window is increased for one window
of data during slow-start, in order to improve performance for TCP
connections with large congestion windows.
At 12:38 AM 7/01/2003 -0500, [email protected] wrote:
> > What was the underlying error rate and distribution you assumed? I
> > figure if it were high enough to get to your 1%, you'd have such high
> > retry rates (and resulting throughput loss) that the operator would
> > notice his LAN was broken weeks before said transfer completed.
>
>The average ISP wouldn't notice things were broken unless enough magic
>smoke escaped to cause a Halon dump.
>
>Consider as evidence the following NANOG presentation:
>http://www.nanog.org/mtg-0210/wessels.html
>
>Some *98* percent of all queries at one of the root nameservers over a 24-hour
>period were broken in some way.
Please don't confuse issues.
I think you just epitomized the quote: "there are lies, damn lies, and
statistics".
You're trying to say that because there is some broken/buggy nameserver
code out there, the claimed error rate for TCP is correct?
cheers,
lincoln.
On Tue, 07 Jan 2003 17:45:03 +1100, Lincoln Dale said:
> At 12:38 AM 7/01/2003 -0500, [email protected] wrote:
> > > What was the underlying error rate and distribution you assumed? I
> > > figure if it were high enough to get to your 1%, you'd have such high
> > > retry rates (and resulting throughput loss) that the operator would
> > > notice his LAN was broken weeks before said transfer completed.
> >
> >The average ISP wouldn't notice things were broken unless enough magic
> >smoke escaped to cause a Halon dump.
> >
> >Consider as evidence the following NANOG presentation:
> >http://www.nanog.org/mtg-0210/wessels.html
> >
> >Some *98* percent of all queries at one of the root nameservers over a
> >24-hour period were broken in some way.
>
> Please don't confuse issues.
> I think you just epitomized the quote: "there are lies, damn lies, and
> statistics".
>
> You're trying to say that because there is some broken/buggy nameserver
> code out there, the claimed error rate for TCP is correct?
No, I'm saying the assertion that "the operator would notice his LAN was broken"
is incorrect.
[email protected] wrote:
> That's not the problem. The problem is that TCP slow-start itself (and some of
> the related congestion control stuff) has some issues scaling to the very high
> end.
I'm very well aware of that ;-) But what you wrote was:
> it takes *hours* without a
> packet drop to get the window open *all* the way
Or did you mean "after" instead of "without" ? Or maybe "into
equilibrium" instead of "the window open ..." ? (After all, the
window isn't only open, but it's been blown off its hinges.)
In any case, your statement accurately describes a somewhat
surprising quirk in Linux TCP performance as of only a bit more
than six years ago :)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Tue, 07 Jan 2003 04:08:29 -0300, Werner Almesberger said:
> [email protected] wrote:
> > it takes *hours* without a
> > packet drop to get the window open *all* the way
>
> Or did you mean "after" instead of "without" ? Or maybe "into
> equilibrium" instead of "the window open ..." ? (After all, the
> window isn't only open, but it's been blown off its hinges.)
"without". Let's say it takes 4 hours to recover from a drop, and
you have another one 3 hours into recovery - it will now take more than
one more hour to recover.
"into equilibrium fully open". It's easy enough to see it in equilibrium
(more or less) not fully open... ;)
> In any case, your statement accurately describes a somewhat
> surprising quirk in Linux TCP performance as of only a bit more
> than six years ago :)
OK, I tuned in late - are you saying that the 6-year-old Linux quirk
happened to have the same symptoms as Floyd's current work, or that
the slow-start tweaks were deliberately designed in 6 years ago, or that a fix for
the quirk accidentally did the same thing as Floyd's stuff?
The whole slow-start/ack/retransmit has been chewed over so many times in the
last 20 years that it's hard to keep track of which vendors picked up which
tweaks when, and which vendors accidentally invented them again, and which
vendors invented the tweaks independently and didn't publicize them more....
/Valdis
[email protected] wrote:
> "without". Let's say it takes 4 hours to recover from a drop, and
> you have another one 3 hours into recovery - it will now take more than
> one more hour to recover.
Ah, now I understand what you meant. I read that as "even if there is
no drop at all, it can take hours".
> The whole slow-start/ack/retransmit has been chewed over so many times in the
> last 20 years that it's hard to keep track of which vendors picked up which
> tweaks when, and which vendors accidentally invented them again, and which
> vendors invented the tweaks independently and didn't publicize them more....
... or figure out which combination of RFCs, I-Ds, and ad hoc genius
makes up Linux TCP, yes ;-)
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
On Tue, 07 Jan 2003 05:14:41 -0300, Werner Almesberger said:
> ... or figure out which combination of RFCs, I-Ds, and ad hoc genius
> makes up Linux TCP, yes ;-)
That's the easy part. :) You then get to take a cross-product of that and
its interactions with whatever combination of RFC/I-D/ad-crockery are in
the 4 different Solaris releases, the 2 or 3 AIX releases, the Tru64 releases,
the myriad different Microsoft-based servers - and that's just the 10K square
feet in our machine room. A lot of our gear keeps trying to talk to the
outside world, where ALL bets are off. ;)
Does anybody think it would be worthwhile to collate a document of which
RFCs/I-Ds are supported, and to what extent (the MUST/SHOULD/MAY stuff)?
Or is there one already and my 3:30AM search for same is missing it? ;)
/Valdis
On Tue, Jan 07, 2003 at 01:39:38PM +1300, Andrew McGregor wrote:
> Ethernet and TCP were both designed to be cheap to evaluate, not the
> absolute last word in integrity. There is a move underway to provide an
> optional stronger TCP digest for IPv6, and if used with that then there is
> no need for the iSCSI digest. Otherwise, well, play dice with the data.
> Loaded in your favour, but still dice.
Ethernet's checksum is a standard crc32, with all the usual good
properties and, at least on FE and lower, at most 1500 bytes of payload.
So it's quite reasonable. TCP's checksum, though, is crap.
I'm not entirely sure how crc32 would behave on jumbo frames.
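For reference, TCP's checksum is just the one's-complement sum of the
segment taken 16 bits at a time (RFC 1071); a minimal sketch:

    #include <stdint.h>
    #include <stddef.h>

    /* Internet checksum (RFC 1071): one's-complement sum of the data
     * taken 16 bits at a time, carries folded back in.  Addition is
     * commutative, so any reshuffling of 16-bit words goes unseen. */
    uint16_t inet_csum(const uint8_t *buf, size_t len)
    {
        uint32_t sum = 0;

        while (len > 1) {
            sum += (uint32_t)buf[0] << 8 | buf[1];
            buf += 2;
            len -= 2;
        }
        if (len)                      /* odd trailing byte */
            sum += (uint32_t)buf[0] << 8;
        while (sum >> 16)             /* fold the carries  */
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }

Everything folds down to 16 bits and addition is commutative, which is
most of why it's weak.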
OG.
No no, that's a 1% chance that some packet in the 100 terabytes is broken.
But actually, it's not that hard to construct a perturbation to the packet
that will beat both the ethernet and TCP checksums (I gave an example that
beats TCP earlier). That kind of change is not likely for random bit
errors, but is quite likely to occur in just slightly marginal hardware.
Partial packet duplication or byte reordering on the highly ordered data
patterns you find in filesystem metadata could be really bad.
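That kind of perturbation is easy to demonstrate: swap any two aligned
16-bit words and the one's-complement sum is unchanged, so the TCP
checksum still verifies (and the ethernet FCS is recomputed hop by hop,
so it offers no end-to-end protection against damage inside a box).
A self-contained sketch:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* One's-complement sum over 16-bit words, as in TCP/IP (RFC 1071). */
    static uint16_t csum(const uint16_t *w, size_t nwords)
    {
        uint32_t sum = 0;

        while (nwords--)
            sum += *w++;
        while (sum >> 16)
            sum = (sum & 0xFFFF) + (sum >> 16);
        return (uint16_t)~sum;
    }

    int main(void)
    {
        uint16_t good[8] = { 10, 20, 30, 40, 50, 60, 70, 80 };
        uint16_t bad[8];

        memcpy(bad, good, sizeof(good));
        bad[2] = good[5];   /* swap two aligned 16-bit words -- the */
        bad[5] = good[2];   /* kind of reordering marginal hardware */
                            /* can produce                          */

        printf("intact:    %04x\n", csum(good, 8));
        printf("reordered: %04x\n", csum(bad, 8));  /* same value */
        return 0;
    }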
As I say, while debugging one crypto protocol I've seen this happen for
real: twice in about 10000 packets, on an otherwise apparently perfectly
fine LAN. I suspected bad cabling, and changed it, but it's hard to tell
that anything has changed. That shows that my 1% is probably quite
conservative for that particular link.
Internet protocols (changing to IETF hat now) are supposed to work on the
global internet, and that means iSCSI has to be engineered to work on the
worst links imaginable, because sometime, somewhere, someone's data is
going to cross a really broken backup link that they have no way of knowing
has just come on. Possibly it's wireless, where packet corruption due to
undetected collisions happens quite frequently.
Andre routinely tests it with the IBM team in Israel, with his end in
California.
Andrew
--On Monday, January 06, 2003 22:20:46 -0600 Oliver Xymoron
<[email protected]> wrote:
> On Tue, Jan 07, 2003 at 01:39:38PM +1300, Andrew McGregor wrote:
>> Hmm. The problem here is that there is a nontrivial probability that a
>> packet can pass both the ethernet and TCP checksums and still not be
>> right, given the gigantic volumes of data that iSCSI is intended to be
>> used with. Back up a 100 terabyte array and the chance of at least one
>> undetected error is more than 1%, back of the envelope.
>
> What was the underlying error rate and distribution you assumed? I
> figure if it were high enough to get to your 1%, you'd have such high
> retry rates (and resulting throughput loss) that the operator would
> notice his LAN was broken weeks before said transfer completed.
>
> --
> "Love the dolphins," she advised him. "Write by W.A.S.T.E.."
On Tue, 2003-01-07 at 00:39, Andrew McGregor wrote:
> optional stronger TCP digest for IPv6, and if used with that then there is
> no need for the iSCSI digest. Otherwise, well, play dice with the data.
> Loaded in your favour, but still dice.
There is no need for the iSCSI digest anyway. You can use IP-AH to achieve
precisely the same thing, with strong guarantees, and already in a
standards-compliant manner.
Alan
No, he really meant without. I don't know if Valdis saw the presentation
that went with that draft, but I did (last IETF in Yokohama). The example
was a 10Gbps link with a 250ms RTT (plausibly, a trans-pacific cable
lambda). There are tens of thousands of 9k packets in the window (yep, 100
*megabytes* in flight in the cable!), and it does take several hours with
exactly zero drops to get the connection to fill the 10Gbps. After one
dropped packet, it can take an hour to get back to full utilisation. The
graphs are really startling to look at :-)
That quirk just meant the numbers were off by a few orders of magnitude.
If anyone wants to look at that further, I think there are URLs in the
draft. If not, I can dig them out of the proceedings.
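The arithmetic is easy to redo in outline, using the talk's rough
parameters: the window has to cover the bandwidth-delay product, and
standard Reno halves it on a single loss, then claws back one segment
per round trip:

    #include <stdio.h>

    int main(void)
    {
        double rate = 10e9 / 8;   /* 10 Gbps, in bytes/s             */
        double rtt  = 0.25;       /* 250 ms trans-pacific round trip */
        double mss  = 9000.0;     /* 9k jumbo frames                 */

        /* Segments that must be in flight to fill the pipe. */
        double window = rate * rtt / mss;

        /* Standard Reno: one loss halves cwnd, which then grows
         * back by about one segment per round trip. */
        double recovery = (window / 2.0) * rtt;

        printf("window:   %.0f segments in flight\n", window);
        printf("recovery: %.0f s (~%.1f h) after a single drop\n",
               recovery, recovery / 3600.0);
        return 0;
    }

That's tens of thousands of segments in flight and over an hour lost to
one drop, which is the startling part of the graphs.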
Andrew
--On Tuesday, January 07, 2003 04:08:29 -0300 Werner Almesberger
<[email protected]> wrote:
> [email protected] wrote:
>> That's not the problem. The problem is that TCP slow-start itself (and
>> some of the related congestion control stuff) has some issues scaling to
>> the very high end.
>
> I'm very well aware of that ;-) But what you wrote was:
>
>> it takes *hours* without a
>> packet drop to get the window open *all* the way
>
> Or did you mean "after" instead of "without" ? Or maybe "into
> equilibrium" instead of "the window open ..." ? (After all, the
> window isn't only open, but it's been blown off its hinges.)
>
> In any case, your statement accurately describes a somewhat
> surprising quirk in Linux TCP performance as of only a bit more
> than six years ago :)
>
> - Werner
>
> --
>
> _________________________________________________________________________
> / Werner Almesberger, Buenos Aires, Argentina [email protected]
> /
> /_http://www.almesberger.net/____________________________________________/
Or ESP, with or without encryption as well.
But that does not achieve quite the same thing, because the iSCSI digest is
another lightweight checksum, albeit stronger than most, and does not
provide authentication. So AH or ESP is stronger, but slower.
Maybe Cisco are assuming another layer deals with the errors. However,
getting an interoperable and efficient implementation requires the
capability to do whatever combination is required, along with sensible
defaults.
Andrew
--On Tuesday, January 07, 2003 12:31:18 +0000 Alan Cox
<[email protected]> wrote:
> On Tue, 2003-01-07 at 00:39, Andrew McGregor wrote:
>> optional stronger TCP digest for IPv6, and if used with that then there
>> is no need for the iSCSI digest. Otherwise, well, play dice with the
>> data. Loaded in your favour, but still dice.
>
> There is no need for the iSCSI digest anyway. You can use IP-AH to
> achieve precisely the same thing, with strong guarantees, and already in
> a standards-compliant manner.
>
> Alan
On Mon, Jan 06, 2003 at 17:51:13 +0100, Roman Zippel wrote:
> Hi,
>
> > If you know anything about the iSCSI RFC draft and how storage truly works:
> > Cisco gets it wrong; they do not believe in supporting the full RFC.
> > So you get ERL=0, and now they have turned off the "Header and Data Digests",
> > which is equal to turning off the iCRC in ATA, or the CRC in SCSI between the
> > controller and the device. For those people who think removing the
> > checksum test for the integrity of the data and command operations is fine:
> > you get what you deserve.
>
> Ever heard of TCP checksums? Ever heard of ethernet checksums? Which
> transport doesn't use checksums nowadays? The digest only makes sense if
> you can generate it for free in hardware or for debugging; otherwise
> it's only a waste of CPU time. This makes the complete ERL 1 irrelevant
> for a software implementation. With block devices you can even get away
> with just ERL 0 to implement transparent recovery.
>
Some others already stated that TCP checksums aren't reliable enough.
I'll add the following to these remarks:
Two years ago we installed a Linux-based VPN for one of our customers. We got
anomaly reports where the symptoms were "SMB transfers died unexpectedly".
We suspected a bug in the Windows 2000 SMB client code that talked to the
Samba server across the VPN link (very large files over a slow and long
path; not exactly the usual environment for SMB client code designed for
the LAN) until I tested scp transfers between the two routers.
Scheduling scp transfers and md5 checks over a whole day revealed that some
files were corrupted.
We guessed that one router along the path recomputed TCP checksums
unconditionally (even if the packets arrived corrupted), and this was
confirmed by other customers of the same ISP.
Next time we will use a more robust VPN than vtun :)
If you want to use iSCSI in an uncontrolled environment you can't trust the
data transport. This is sad, but many vendors violate RFCs on a regular
basis.
I've even learned recently that one vendor developed, quite some time ago, a
TCP stack that pre-ACKed packets in order to speed up transfers!
As you can guess, they didn't have much success out of their labs...
LB
On Tue, 2003-01-07 at 12:31, Andrew McGregor wrote:
> Or ESP, with or without encryption as well.
>
> But that does not achieve quite the same thing, because the iSCSI digest is
> another lightweight checksum, albeit stronger than most, and does not
> provide authentication. So AH or ESP is stronger, but slower.
AH permits multiple digests; they also happen to correspond to the
hardware-accelerated ones on things like the 3c990...
On Wed, Jan 08, 2003 at 01:31:36AM +1300, Andrew McGregor wrote:
> Or ESP, with or without encryption as well.
>
> But that does not achieve quite the same thing, because the iSCSI digest is
> another lightweight checksum, albeit stronger than most, and does not
> provide authentication. So AH or ESP is stronger, but slower.
>
> Maybe Cisco are assuming another layer deals with the errors. However,
> getting an interoperable and efficient implementation requires the
> capability to do whatever combination is required, along with sensible
> defaults.
Actually, as I pointed out before, the current Cisco iSCSI driver does
support CRC (32-bit), though it's presumably off by default.
--
"Love the dolphins," she advised him. "Write by W.A.S.T.E.."
[email protected] wrote:
> Does anybody think it would be worthwhile to collate a document of which
> RFCs/I-Ds are supported, and to what extent (the MUST/SHOULD/MAY stuff)?
> Or is there one already and my 3:30AM search for same is missing it? ;)
Well, every once in a while, that accumulated wisdom gets
summarized and written down in some RFC. Of course, it's been a
while since e.g. RFC2581 ...
What might be useful at some point is to document Linux TCP, and
how it is linked to all the RFCs, I-Ds, research papers, etc.
- Werner
--
_________________________________________________________________________
/ Werner Almesberger, Buenos Aires, Argentina [email protected] /
/_http://www.almesberger.net/____________________________________________/
Hardware acceleration is the right way to do any of this, agreed :-)
--On Tuesday, January 07, 2003 13:58:50 +0000 Alan Cox
<[email protected]> wrote:
> On Tue, 2003-01-07 at 12:31, Andrew McGregor wrote:
>> Or ESP, with or without encryption as well.
>>
>> But that does not achieve quite the same thing, because the iSCSI digest
>> is another lightweight checksum, albeit stronger than most, and does
>> not provide authentication. So AH or ESP is stronger, but slower.
>
> AH permits multiple digests, they also happen to correspond to the
> hardware accelerated ones on things like the 3c990...