2003-03-25 20:42:27

by Peter T. Breuer

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

"Justin Cormack wrote:"
> On Tue, 2003-03-25 at 17:27, Peter T. Breuer wrote:
> > > ENBD is not a replacement for NBD - the two are alternatives, aimed
> > > at different niches. ENBD is a sort of heavyweight industrial NBD. It
> > > does many more things and has a different architecture. Kernel NBD is
> > > like a stripped down version of ENBD. Both should be in the kernel.
>
> hmm, I would argue that nbd is ok, as it is a nice lightweight block
> device (though I have not been able to use it, due to the fact that I
> can never find a userspace and kernel that work together), while ENBD
> should be replaced by iSCSI, which is now a real IETF standard and can
> burn CDs across the net and all that extra stuff.

It's not a bad idea. But ENBD in particular can use any transport,
precisely because its networking is done in userspace. One only has to
write a stream.c module for it that implements

read
write
open
close

(There are currently implementations for three transports, including
TCP, of course.)
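
For illustration, the TCP case might reduce to something like this (a
minimal sketch with invented names; ENBD's actual stream interface may
differ):

/* stream_tcp.c -- hypothetical sketch of a userspace transport module.
 * A transport is just four operations over some medium; the networking
 * lives entirely in userspace, so any stream will do.
 */
#include <stddef.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

struct stream_ops {
    int     (*open)(const char *addr, int port);
    int     (*close)(int fd);
    ssize_t (*read)(int fd, void *buf, size_t len);
    ssize_t (*write)(int fd, const void *buf, size_t len);
};

static int tcp_open(const char *addr, int port)
{
    struct sockaddr_in sa;
    int fd = socket(AF_INET, SOCK_STREAM, 0);

    if (fd < 0)
        return -1;
    memset(&sa, 0, sizeof(sa));
    sa.sin_family = AF_INET;
    sa.sin_port = htons(port);
    sa.sin_addr.s_addr = inet_addr(addr);
    if (connect(fd, (struct sockaddr *)&sa, sizeof(sa)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}

/* read/write/close are already stream-shaped for TCP sockets. */
static struct stream_ops tcp_stream = {
    .open  = tcp_open,
    .close = close,
    .read  = read,
    .write = write,
};

Swapping in a different transport means swapping that table; the rest
of the code never knows.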

> And I am intending to write an iscsi client sometime, but it got
> delayed. The server stuff is already available from 3com.

Possibly, but ENBD is designed to fail :-). And networks fail.
What will your iscsi implementation do when somebody resets the
router? All those issues are handled by ENBD. ENBD breaks off and
reconnects automatically. It reacts right to removable media.

I should also have said that ENBD has the following features (I said I
forgot some!)

9) it drops into a mode where it md5sums both ends and skips writes
of equal blocks, if that's faster. It moves in and out of this mode
automatically. This helps RAID resyncs (2* overspeed is common on
100BT nets, that is 20MB/s.).

10) integration with RAID - it advises RAID correctly of its state
and does hot add and remove correctly (well, you need my patches to
raid, but there you are ...).
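
For what it's worth, the decision in (9) might look roughly like this
(a sketch only; the wire helpers request_remote_md5() and send_block()
are invented here, not ENBD's real functions):

#include <stdint.h>
#include <string.h>

/* hypothetical helpers: hash a block, ask the far end for its hash,
 * and ship a block over the wire */
extern void md5sum(const void *buf, size_t len, uint8_t digest[16]);
extern int request_remote_md5(int sock, uint64_t blkno, uint8_t digest[16]);
extern int send_block(int sock, uint64_t blkno, const void *buf, size_t len);

int write_block(int sock, uint64_t blkno, const void *buf, size_t len)
{
    uint8_t local[16], remote[16];

    md5sum(buf, len, local);
    if (request_remote_md5(sock, blkno, remote) < 0)
        return send_block(sock, blkno, buf, len); /* can't compare: ship it */

    if (memcmp(local, remote, 16) == 0)
        return 0;                      /* blocks equal: skip the transfer */

    return send_block(sock, blkno, buf, len);     /* differ: ship it */
}

During a RAID resync most blocks already match, which is why this can
run faster than the wire.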

Of course, if somebody wants me to make ENBD appear as a SCSI device
instead of a generic block device, I guess I could do that. Except
that, yecch, I have seen the SCSI code, and I do not understand it.

Another good idea is to make the wire protocol iSCSI-compatible. I
have no objection to that.

Peter


2003-03-26 02:29:21

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

Peter T. Breuer wrote:
> "Justin Cormack wrote:"
>>And I am intending to write an iscsi client sometime, but it got
>>delayed. The server stuff is already available from 3com.
>
>
> Possibly, but ENBD is designed to fail :-). And networks fail.
> What will your iscsi implementation do when somebody resets the
> router? All those issues are handled by ENBD. ENBD breaks off and
> reconnects automatically. It reacts right to removable media.


Yeah, iSCSI handles all that and more. It's a behemoth of a
specification. (whether a particular implementation implements all that
stuff correctly is another matter...)

BTW, I'm a big enbd fan :) I like enbd for its _simplicity_ compared
to iSCSI.

Jeff



2003-03-26 05:44:31

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

On Tue, Mar 25, 2003 at 09:40:44PM -0500, Jeff Garzik wrote:
> Peter T. Breuer wrote:
> >"Justin Cormack wrote:"
> >>And I am intending to write an iscsi client sometime, but it got
> >>delayed. The server stuff is already available from 3com.
> >
> >
> >Possibly, but ENBD is designed to fail :-). And networks fail.
> >What will your iscsi implementation do when somebody resets the
> >router? All those issues are handled by ENBD. ENBD breaks off and
> >reconnects automatically. It reacts right to removable media.
>
> Yeah, iSCSI handles all that and more. It's a behemoth of a
> specification. (whether a particular implementation implements all that
> stuff correctly is another matter...)

Indeed, there are iSCSI implementations that do multipath and
failover.

Both iSCSI and ENBD currently have issues with pending writes during
network outages. The current I/O layer fails to report failed writes
to fsync and friends.

> BTW, I'm a big enbd fan :) I like enbd for its _simplicity_ compared
> to iSCSI.

Definitely. The iSCSI protocol is more powerful but _much_ more
complex than ENBD. I've spent two years working on iSCSI but guess
which I use at home..

--
Matt Mackall : http://www.selenic.com : of or relating to the moon

2003-03-26 06:20:07

by Peter T. Breuer

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

"A month of sundays ago Matt Mackall wrote:"
> On Tue, Mar 25, 2003 at 09:40:44PM -0500, Jeff Garzik wrote:
> > Peter T. Breuer wrote:
> > >"Justin Cormack wrote:"
> > >>And I am intending to write an iscsi client sometime, but it got
> > >>delayed. The server stuff is already available from 3com.
> > >
> > >Possibly, but ENBD is designed to fail :-). And networks fail.
> > >What will your iscsi implementation do when somebody resets the
> > >router? All those issues are handled by ENBD. ENBD breaks off and
> > >reconnects automatically. It reacts right to removable media.
> >
> > Yeah, iSCSI handles all that and more. It's a behemoth of a
> > specification. (whether a particular implementation implements all that
> > stuff correctly is another matter...)
>
> Indeed, there are iSCSI implementations that do multipath and
> failover.

Somebody really ought to explain it to me :-). I can't keep up with all
this!

> Both iSCSI and ENBD currently have issues with pending writes during
> network outages. The current I/O layer fails to report failed writes
> to fsync and friends.

ENBD has two (configurable) behaviors here. Perhaps it should have
more. By default it blocks pending reads and writes during times when
the connection is down. It can be configured to error them instead. The
erroring behavior is what you want when running under soft RAID, as
it's RAID that should decide how to treat the requests according to
the overall state of the array, so it needs definite yes/no
information on each request, not a "maybe".

Perhaps in a third mode requests should block and then time out after
about half an hour (or some other configurable value on a continuous
spectrum).
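
A sketch of the three policies side by side (illustrative only, not
the driver's actual code):

#include <errno.h>

enum fail_mode { FAIL_BLOCK, FAIL_ERROR, FAIL_TIMEOUT };

struct dev_state {
    enum fail_mode mode;
    long timeout;               /* used only by FAIL_TIMEOUT */
};

/* What to do with a request while the connection is down. */
int handle_request_while_down(struct dev_state *dev, long waited)
{
    switch (dev->mode) {
    case FAIL_ERROR:            /* under RAID: a definite "no" */
        return -EIO;
    case FAIL_TIMEOUT:          /* block, but only for so long */
        if (waited >= dev->timeout)
            return -EIO;
        /* fall through */
    case FAIL_BLOCK:            /* default: wait for reconnect */
    default:
        return -EAGAIN;         /* requeue and keep waiting */
    }
}

The default (block forever) and the erroring mode are then just the
two ends of the timeout spectrum.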

What I would like is some way of telling how backed up the VM is
against us. If the VM is full of dirty buffers aimed at us, then
I think we should consider erroring instead of blocking. The problem is
that at that point we're likely not getting any requests at all,
because the kernel long ago ran out of the 256 requests it has in
hand to send us.

There is indeed an information disconnect with the VM in those
circumstances that I've never known how to solve. Filesystems are a
problem too, because unless they are mounted sync they will happily
permit writes to a file on a fs on a blocked device, even if that
fills the machine with buffers that can't go anywhere. Among other
things, that will run TCP out of the buffer space it needs to flush
those buffers even if the connection does come back. And even if the
mount is sync, some filesystems (e.g. ext2) still allow unbounded
writing to a blocked device under some circumstances (start two
processes writing to the same file ... the second will write to
buffers).


Peter

2003-03-26 06:37:29

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

> > Both iSCSI and ENBD currently have issues with pending writes during
> > network outages. The current I/O layer fails to report failed writes
> > to fsync and friends.
>
> ENBD has two (configurable) behaviors here. Perhaps it should have
> more. By default it blocks pending reads and writes during times when
> the connection is down. It can be configured to error them instead.

And in this case, the upper layers will silently drop write errors on
current kernels.

Cisco's Linux iSCSI driver has a configurable timeout, defaulting to
'infinite', btw.

> What I would like is some way of telling how backed up the VM is
> against us. If the VM is full of dirty buffers aimed at us, then
> I think we should consider erroring instead of blocking. The problem is
> that at that point we're likely not getting any requests at all,
> because the kernel long ago ran out of the 256 requests it has in
> hand to send us.

Hrrmm. The potential to lose data by surprise here is not terribly
appealing. It might be better to add an accounting mechanism to say
"never go above x dirty pages against block device n" or something of
the sort but you can still get into trouble if you happen to have
hundreds of iSCSI devices each with their own request queue..
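
Such an accounting mechanism might be no more than this (hypothetical;
nothing like it existed in the block layer at the time, and locking is
omitted):

#include <errno.h>

struct bdev_account {
    unsigned long dirty;        /* pages currently dirty against device */
    unsigned long max_dirty;    /* the "never go above x" cap */
};

/* Call before dirtying a page against the device. */
int may_dirty_page(struct bdev_account *b)
{
    if (b->dirty >= b->max_dirty)
        return -EBUSY;          /* caller must block or write back */
    b->dirty++;
    return 0;
}

/* Call when a page is written back or invalidated. */
void page_cleaned(struct bdev_account *b)
{
    b->dirty--;
}

The catch, as above, is that per-device caps don't compose: hundreds
of devices each at their cap can still exhaust memory, so some global
limit is needed too.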

--
Matt Mackall : http://www.selenic.com : of or relating to the moon

2003-03-26 06:51:13

by Andre Hedrick

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

On Tue, 25 Mar 2003, Matt Mackall wrote:

> > Yeah, iSCSI handles all that and more. It's a behemoth of a
> > specification. (whether a particular implementation implements all that
> > stuff correctly is another matter...)
>
> Indeed, there are iSCSI implementations that do multipath and
> failover.
>
> Both iSCSI and ENBD currently have issues with pending writes during
> network outages. The current I/O layer fails to report failed writes
> to fsync and friends.
>
> > BTW, I'm a big enbd fan :) I like enbd for its _simplicity_ compared
> > to iSCSI.
>
> Definitely. The iSCSI protocol is more powerful but _much_ more
> complex than ENBD. I've spent two years working on iSCSI but guess
> which I use at home..

To be totally fair, ENBD/NBD is not a SAN, nor will it ever become a
qualified SAN. I would like to see what happens to your data if you
wrap your ethernet around a ballast resistor, or run it near the
associated light fixture and blink the power/lights. This is where
goofy people run cables in the drop ceilings.

We have almost finalized our initiator to be submitted under OSL/GPL.
This will be a full RFC ERL=2 w/ Sync-n-Steering.

I have seen too much GPL code stolen and could do nothing about it.

While the code is not in binary executable form, it shall be under OSL
only. Only at compile and execution time will it become GPL compliant,
period. This is designed to extend the copyright holder's rights, to
force anyone who uses the code and changes anything to return those
changes to the copyright holder, period. Additional language may be
added to permit non-returned code to exist under extremely heavy
licensing fees, to be used to promote OSL projects and assist any GPL
holder with litigation fees to protect their rights. Too many of us do
not have the means to defend our copyrights; this is one way I can see
to provide a plausible solution to a dirty problem. It may not be the
best answer, but it is doable. It will put some teeth into the license
to stop this from happening again.

Like it or hate it, OSL/GPL looks to be the best match out there.

Regards,

Andre Hedrick, CTO & Founder
iSCSI Software Solutions Provider
http://www.PyXTechnologies.com/

2003-03-26 06:54:22

by Peter T. Breuer

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

"A month of sundays ago Matt Mackall wrote:"
> > > Both iSCSI and ENBD currently have issues with pending writes during
> > > network outages. The current I/O layer fails to report failed writes
> > > to fsync and friends.
> >
> > ENBD has two (configurable) behaviors here. Perhaps it should have
> > more. By default it blocks pending reads and writes during times when
> > the connection is down. It can be configured to error them instead.
>
> And in this case, the upper layers will silently drop write errors on
> current kernels.
>
> Cisco's Linux iSCSI driver has a configurable timeout, defaulting to
> 'infinite', btw.

That corresponds to ENBD's default behavior. Sigh. Guess I'll have to
make it 0-infinity, instead of just 0 or infinity. It's easy enough --
I just need to make it settable in proc (and think about which is the
one line I need to touch ...).
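
Something like this, perhaps, in the style of the 2.4/2.5-era procfs
API (a sketch with invented names, not ENBD's actual code):

#include <linux/kernel.h>
#include <linux/errno.h>
#include <linux/proc_fs.h>
#include <asm/uaccess.h>

static unsigned long enbd_timeout;      /* 0 = error at once;      */
                                        /* very large = "infinite" */

static int enbd_timeout_write(struct file *file, const char *buffer,
                              unsigned long count, void *data)
{
    char buf[16];

    if (count >= sizeof(buf))
        return -EINVAL;
    if (copy_from_user(buf, buffer, count))
        return -EFAULT;
    buf[count] = '\0';
    enbd_timeout = simple_strtoul(buf, NULL, 10);   /* seconds */
    return count;
}

static void enbd_proc_init(void)
{
    struct proc_dir_entry *p =
        create_proc_entry("driver/enbd_timeout", 0644, NULL);

    if (p)
        p->write_proc = enbd_timeout_write;
}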

> Hrrmm. The potential to lose data by surprise here is not terribly
> appealing.

Making the driver "intelligent" is indeed bad news for the more
intelligent admin. I was thinking of making it default to 0 timeout if
it knows it's running under raid, but I have a natural antipathy to
such in-driver decisions. My conscience would be slightly less on
alert if the userspace daemon did the decision-making. I suppose it
could.

> It might be better to add an accounting mechanism to say
> "never go above x dirty pages against block device n" or something of
> the sort but you can still get into trouble if you happen to have
> hundreds of iSCSI devices each with their own request queue..

Well, you can get in trouble if you allow even a single dirty page to
be outstanding to something, and have thousands of those somethings.

That's not the normal situation, however, whereas it is normal to
have a single network device and to be writing pell-mell to it
oblivious to the state of the device itself.

Peter

2003-03-26 07:21:39

by Lincoln Dale

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
> > Yeah, iSCSI handles all that and more. It's a behemoth of a
> > specification. (whether a particular implementation implements all that
> > stuff correctly is another matter...)
>
>Indeed, there are iSCSI implementations that do multipath and
>failover.

iSCSI is a transport.
logically, any "multipathing" and "failover" belongs in a layer above it --
typically as a block-layer function -- and not as a transport-layer function.

multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, DevMapper
-- or in a commercial implementation such as Veritas VxDMP, HDS HDLM, EMC
PowerPath, ...

>Both iSCSI and ENBD currently have issues with pending writes during
>network outages. The current I/O layer fails to report failed writes
>to fsync and friends.

these are not "iSCSI" or "ENBD" issues. these are issues with VFS.


cheers,

lincoln.

2003-03-26 09:49:09

by Lars Marowsky-Bree

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

On 2003-03-26T18:31:31,
Lincoln Dale <[email protected]> said:

> >Indeed, there are iSCSI implementations that do multipath and
> >failover.
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it --

"Multipathing" on iSCSI is actually a layer below - network resiliency is
handled by routing protocols, the switching fabric etc.

> >Both iSCSI and ENBD currently have issues with pending writes during
> >network outages. The current I/O layer fails to report failed writes
> >to fsync and friends.
> these are not "iSCSI" or "ENBD" issues. these are issues with VFS.

Yes, and it is a fairly annoying issue... In particular with ENBD, a partial
write could occur at the block device layer. Now try to report that upwards to
the write() call. Good luck.


Sincerely,
Lars Marowsky-Brée <[email protected]>

--
SuSE Labs - Research & Development, SuSE Linux AG

"If anything can go wrong, it will." "Chance favors the prepared (mind)."
-- Capt. Edward A. Murphy -- Louis Pasteur

2003-03-26 10:07:00

by Andrew Morton

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

Lars Marowsky-Bree <[email protected]> wrote:
>
> In particular with ENBD, a partial write could occur at the block device
> layer. Now try to report that upwards to the write() call. Good luck.

Well, you can't, unless it is an O_SYNC or O_DIRECT write...

But yes, for a normal old write() followed by an fsync() the I/O error
can be lost. We'll fix this for 2.6. I have oxymoron's patches lined
up, but they need a couple of quality hours' worth of thinking yet.
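
The userspace-visible contract, for reference, is just this (plain C,
nothing kernel-specific):

#include <stddef.h>
#include <unistd.h>

/* With a plain buffered write(), a delayed I/O error can only surface
 * at fsync() (or sometimes close()).  If the block layer drops the
 * error, the fsync() below wrongly returns 0 and the caller never
 * learns the data didn't hit the disk.
 */
int write_and_sync(int fd, const void *buf, size_t len)
{
    if (write(fd, buf, len) != (ssize_t)len)   /* may only dirty pages */
        return -1;
    if (fsync(fd) < 0)                         /* the lost-error spot  */
        return -1;
    return 0;
}

O_SYNC and O_DIRECT sidestep the window because the write() itself
waits for (or performs) the I/O.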


2003-03-26 13:37:29

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

Lincoln Dale wrote:
> At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
>
>> > Yeah, iSCSI handles all that and more. It's a behemoth of a
>> > specification. (whether a particular implementation implements all that
>> > stuff correctly is another matter...)
>>
>> Indeed, there are iSCSI implementations that do multipath and
>> failover.
>
>
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it
> -- typically as a block-layer function -- and not as a transport-layer
> function.
>
> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS,
> DevMapper -- or in a commercial implementation such as Veritas VxDMP,
> HDS HDLM, EMC PowerPath, ...

I think you will find that most Linux kernel developers agree w/ you :)

That said, iSCSI error recovery can be considered as supporting some of
what multipathing and failover accomplish. iSCSI can be shoving bits
through multiple TCP connections, or fail over from one TCP connection
to another.


>> Both iSCSI and ENBD currently have issues with pending writes during
>> network outages. The current I/O layer fails to report failed writes
>> to fsync and friends.

...not if your iSCSI implementation is up to spec. ;-)


> these are not "iSCSI" or "ENBD" issues. these are issues with VFS.

VFS+VM. But, agreed.

Jeff



2003-03-26 13:47:09

by Jeff Garzik

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

Andre Hedrick wrote:
> We have almost finalized our initiator to be submitted under OSL/GPL.
> This will be a full RFC ERL=2 w/ Sync-n-Steering.


That's pretty good news.

Also, on a tangent, I'll mention that I have been won over WRT OSL:
with its tighter "lawyerspeak" and mutual patent defense clauses, I
consider OSL to be a "better GPL" license.

I would in fact love to see the Linux kernel relicensed under OSL. I
think that would close some "holes" that exist with the GPL, and give
us better legal standing. But relicensing the kernel would be a huge
political undertaking, and I sure as hell don't have the energy, even
if it were possible. No idea if Linus, Alan, Andrew, or any of the
other major contributors would go for it, either.

Jeff, the radical


2003-03-26 15:58:25

by Matt Mackall

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

On Wed, Mar 26, 2003 at 06:31:31PM +1100, Lincoln Dale wrote:
> At 11:55 PM 25/03/2003 -0600, Matt Mackall wrote:
> >> Yeah, iSCSI handles all that and more. It's a behemoth of a
> >> specification. (whether a particular implementation implements all that
> >> stuff correctly is another matter...)
> >
> >Indeed, there are iSCSI implementations that do multipath and
> >failover.
>
> iSCSI is a transport.
> logically, any "multipathing" and "failover" belongs in a layer above it --
> typically as a block-layer function -- and not as a transport-layer
> function.
>
> multipathing belongs elsewhere -- whether it be in MD, LVM, EVMS, DevMapper
> -- or in a commercial implementation such as Veritas VxDMP, HDS HDLM, EMC
> PowerPath, ...

Funny then that I should be talking about Cisco's driver. :P

iSCSI inherently has more interesting reconnect logic than other block
devices, so it's fairly trivial to throw in recognition of identical
devices discovered on two or more iSCSI targets..

> >Both iSCSI and ENBD currently have issues with pending writes during
> >network outages. The current I/O layer fails to report failed writes
> >to fsync and friends.
>
> these are not "iSCSI" or "ENBD" issues. these are issues with VFS.

Except that the issue simply doesn't show up for anyone else, which is
why it hasn't been fixed yet. Patches are in the works, but they need
more testing:

http://www.selenic.com/linux/write-error-propagation/

--
Matt Mackall : http://www.selenic.com : of or relating to the moon

2003-03-30 19:31:34

by Pavel Machek

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

Hi!

> 9) it drops into a mode where it md5sums both ends and skips writes
> of equal blocks, if that's faster. It moves in and out of this mode
> automatically. This helps RAID resyncs (2* overspeed is common on
> 100BT nets, that is 20MB/s.).

Great way to find md5 collisions, I guess :-).
Pavel
--
Pavel
Written on sharp zaurus, because my Velo1 broke. If you have Velo you don't need...

2003-03-30 20:37:08

by Peter T. Breuer

[permalink] [raw]
Subject: Re: [PATCH] ENBD for 2.5.64

"A month of sundays ago Pavel Machek wrote:"
> Hi!
>
> > 9) it drops into a mode where it md5sums both ends and skips writes
> > of equal blocks, if that's faster. It moves in and out of this mode
> > automatically. This helps RAID resyncs (2* overspeed is common on
> > 100BT nets, that is 20MB/s.).
>
> Great way to find md5 collisions, I guess :-).

Don't worry, I'm not planning on claiming the Turing medal! Or living for
the lifetime of the universe .. :-(.

Peter