2003-03-26 21:51:25

by Peter T. Breuer

Subject: Re: [PATCH] ENBD for 2.5.64

"Lincoln Dale wrote:"
> >what multipathing and failover accomplish. iSCSI can be shoving bits
> >through multiple TCP connections, or fail over from one TCP connection to
> >another.
>
> while the iSCSI spec has the concept of a "network portal" that can have
> multiple TCP streams for i/o, in the real world, i'm yet to see anything
> actually use those multiple streams.

I'll content myself with mentioning that ENBD has /always/ throughout
its five years of life had automatic failover between channels. Mind
you, I don't think anybody makes use of the multichannel architecture in
practice for the purposes of redundancy (i.e. people using multiple
channels don't pass them through different interfaces or routes, which
is the idea!); they may do it for speed/bandwidth.

But then surely they might as well use channel bonding in the network layer?
I've never tried it, or possibly never figured out how ..
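
(For the record, a minimal bonding setup on Linux apparently looks
something like the commands below -- the interface names and address
are made up, and I haven't verified any of it myself:

    # load the bonding driver (mode 0 = round-robin), then enslave two NICs
    modprobe bonding mode=0 miimon=100
    ifconfig bond0 192.168.1.10 netmask 255.255.255.0 up
    ifenslave bond0 eth0 eth1

That buys bandwidth rather than redundancy, though, unless the slaves
really do take separate paths.)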

> the reason why goes back to how SCSI works. take an ethereal trace of iSCSI
> and you'll see the way that 2 round-trips are used before any typical i/o
> operation (read or write op) occurs.

Hmm.

I have some people telling me that I should pile up network packets
in order to avoid too many interrupts firing on Ge cards, and other
people telling me to send partial packets as soon as possible in order
to avoid buffer buildup. My head spins.

> multiple TCP streams for a given iSCSI session could potentially be used to
> achieve greater performance when the maximum-window-size of a single TCP
> stream is being hit.
> but its quite rare for this to happen.

My considered opinion is that there are way too many variables here for
anyone to make sense of them.


> in reality, if you had multiple TCP streams, its more likely you're doing
> it for high-availability reasons (i.e. multipathing).

Except that in real life most people don't know what they're doing and
they certainly don't know why they're doing it! In particular they
don't seem to get the idea that more redundancy is what they want.

I can almost see why.

But they can be persuaded to run multichannel by being promised more
speed.


> if you're multipathing, the chances are you want to multipath down two
> separate paths to two different iSCSI gateways. (assuming you're talking
> to traditional SAN storage and you're gatewaying into Fibre Channel).

Yes. This is all that really makes sense for redundancy. And make sure
the routing is distinct too.

Then you start having problems maintaining request order across
multiple paths. At least I do. But ENBD does it.
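
One way to do it -- this is a sketch, not ENBD's actual code -- is to
tag each request with a sequence number when it is issued, and hold
replies that arrive early on some channel until their predecessors have
completed:

    /* Sketch: in-order completion over out-of-order channels. */
    #include <stdio.h>

    #define WINDOW 64                  /* max requests in flight (assumed) */

    static struct { unsigned long seq; int done; } window[WINDOW];
    static unsigned long next_to_complete;  /* next seq to finish in order */

    /* Called from whichever channel a reply happens to arrive on. */
    void reply_arrived(unsigned long seq)
    {
            window[seq % WINDOW].seq = seq;
            window[seq % WINDOW].done = 1;

            /* Drain whatever is now contiguous from the head. */
            while (window[next_to_complete % WINDOW].done &&
                   window[next_to_complete % WINDOW].seq == next_to_complete) {
                    window[next_to_complete % WINDOW].done = 0;
                    printf("completing request %lu in order\n",
                           next_to_complete);
                    next_to_complete++;
            }
    }

    int main(void)
    {
            reply_arrived(1);          /* held: 0 has not completed yet */
            reply_arrived(0);          /* completes 0, then releases 1 */
            reply_arrived(2);
            return 0;
    }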

> determining the policy (read-preferred / write-preferred / round-robin /
> ratio-of-i/o / sync-preferred+async-fallback / ...) on how those paths are
> used is most definitely something that should NEVER be in the kernel.

ENBD doesn't have any problem - it uses all the channels, on demand.
Each userspace daemon runs a different channel, and each daemon picks
up requests to service as soon as it is free and there are any pending.
The kernel does not dictate which channel handles which request. It's
async.
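
A crude illustration of the idea -- again a sketch only, not ENBD
source, and the names are invented -- is a handful of channel daemons
pulling work on demand from one shared queue:

    /* Sketch: demand-driven pickup by several channel daemons. */
    #include <pthread.h>
    #include <stdio.h>

    #define NREQ      16               /* pending requests (assumed) */
    #define NCHANNELS 3                /* channels/daemons (assumed) */

    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
    static int next_req;               /* next request to hand out */

    static void *channel_daemon(void *arg)
    {
            long channel = (long)arg;

            for (;;) {
                    int req;

                    pthread_mutex_lock(&lock);
                    req = (next_req < NREQ) ? next_req++ : -1;
                    pthread_mutex_unlock(&lock);

                    if (req < 0)
                            break;     /* queue drained in this toy example */
                    printf("channel %ld servicing request %d\n",
                           channel, req);
            }
            return NULL;
    }

    int main(void)
    {
            pthread_t tid[NCHANNELS];
            long i;

            for (i = 0; i < NCHANNELS; i++)
                    pthread_create(&tid[i], NULL, channel_daemon, (void *)i);
            for (i = 0; i < NCHANNELS; i++)
                    pthread_join(tid[i], NULL);
            return 0;
    }

Whichever daemon is free takes the next request; nothing assigns
requests to channels in advance.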

(iscsi stream over tcp)
> 5 minutes output rate 929091696 bits/sec, 116136462 bytes/sec,
> 80679 frames/sec
> ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^

Very impressive. I think the most that's been seen over ENBD is 60MB/s
sustained, across Ge.

> not bad for a single TCP stream and a software iSCSI stack. :-)
> (kernel is 2.4.20)

Ditto.


Peter


2003-03-26 23:41:48

by Lincoln Dale

Subject: Re: [PATCH] ENBD for 2.5.64

Hi Peter,

At 11:02 PM 26/03/2003 +0100, Peter T. Breuer wrote:
>I'll content myself with mentioning that ENBD has /always/ throughout
>its five years of life had automatic failover between channels. Mind
>you, I don't think anybody makes use of the multichannel architecture in
>practice for the purposes of redundancy (i.e. people using multiple
>channels don't pass them through different interfaces or routes, which
>is the idea!); they may do it for speed/bandwidth.
>
>But then surely they might as well use channel bonding in the network layer?
>I've never tried it, or possibly never figured out how ..

"channel bonding" can handle cases whereby you lose a single NIC or port --
but typically channeling means that you need multiple paths into a single
ethernet switch.
single ethernet switch = single point of failure.

hence, from a high-availability (HA) perspective, you're better off
connecting N NICs into N switches -- and then load-balance (multipath)
across those.

an interesting side-note is that channel-bonding doesn't necessarily mean
higher performance.
i haven't looked at linux's channel-bonding, but many NICs on higher-end
servers offer this as an option, and when enabled, you end up with multiple
NICs with the same MAC address. typically only one NIC is used for one
direction of traffic.

> > the reason why goes back to how SCSI works. take an ethereal trace of iSCSI
> > and you'll see the way that 2 round-trips are used before any typical i/o
> > operation (read or write op) occurs.
>
>Hmm.
>I have some people telling me that I should pile up network packets
>in order to avoid too many interrupts firing on Ge cards, and other
>people telling me to send partial packets as soon as possible in order
>to avoid buffer buildup. My head spins.

:-)
most "storage" people care more about latency than they do about raw
performance. coalescing packets = bad for latency.

i figure there has to be a middle ground somewhere -- implement both and have
it as a config option.

decent GE cards will do coalescing themselves anyway.
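
for what it's worth, on cards and drivers that support it the coalescing
behaviour can usually be tweaked from userspace with ethtool -- the
interface name and the numbers below are made up, just to show the knobs:

    # show the current coalescing settings
    ethtool -c eth0
    # interrupt per frame: best latency, most interrupts
    ethtool -C eth0 rx-usecs 0 rx-frames 1
    # batch frames per interrupt: fewest interrupts, worst latency
    ethtool -C eth0 rx-usecs 125 rx-frames 32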


cheers,

lincoln.

2003-03-26 23:57:17

by Peter T. Breuer

Subject: Re: [PATCH] ENBD for 2.5.64

"Lincoln Dale wrote:"
> Hi Peter,

Hi!

> decent GE cards will do coalescing themselves anyway.

From what I confusedly remember of my last interchange with someone
convinced that packet coalescing (or lack of it, I forget which)
was the root of all evil, it's "all because" there's some magic limit
of 8K interrupts per second somewhere, and at 1.5KB per packet, that
would be only 12MB/s. So Ge cards wait after each interrupt to see if
there's more coming, so that they can handle more than one packet per
interrupt.
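
(Spelling out the arithmetic: 8000 interrupts/s x 1500 bytes/packet =
12,000,000 bytes/s, i.e. about 12MB/s if every packet costs an
interrupt. Coalescing, say, 8 packets per interrupt would stretch the
same interrupt budget to roughly 96MB/s -- presumably the point of all
the waiting.)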

Apparently that means that if you have a two-way interchange in
your protocol at a low level, they wait at the end of each half of
the exchange, even though you can't proceed until they decide to stop
listening and start working. And the result is a severe slowdown.

In my naive opinion, that should make ENBD's architecture (in which all
the channels going through the same NIC nevertheless work independently
and asynchronously) have an advantage, because pipelining effects
will fill the slack time in one channel's protocol with
activity from other channels.

But surely the number of channels required to fill up the waiting time
would be astronomical? Oh well.

Anyway, my head still spins.

The point is that none of this is as easy or straightforward as it
seems. I suspect that pure storage people like andre will make a real
mess of the networking considerations. It's just not easy.

Peter